
spark | Apache Spark - A unified analytics engine

by apache | Scala | Version: Current | License: Apache-2.0


kandi X-RAY | spark Summary

spark is a Scala library typically used in Big Data, Kafka, Spark, and Hadoop applications. It has no reported bugs, carries a permissive license, and has medium support; however, it has 6 known vulnerabilities. You can download it from GitHub.
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs. Please refer to the build documentation at "Specifying the Hadoop Version and Enabling YARN" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.
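
To check which Hadoop version a given Spark build is talking to, here is a minimal PySpark sketch (it reaches into the JVM through the internal _jvm gateway, so treat it as a convenience check, not a stable API):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
# org.apache.hadoop.util.VersionInfo is the Hadoop class that reports the bundled version
hadoop_version = spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
print(f"Spark {spark.version} is built against Hadoop {hadoop_version}")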

Support

  • spark has a medium active ecosystem.
  • It has 32507 stars and 25453 forks. There are 2076 watchers for this library.
  • It had no major release in the last 12 months.
  • spark has no reported issues. There are 232 open pull requests and 0 closed pull requests.
  • It has a neutral sentiment in the developer community.
  • The latest version of spark is current.

Quality

  • spark has 0 bugs and 0 code smells.

Security

  • spark has 6 vulnerability issues reported (2 critical, 2 high, 2 medium, 0 low).
  • spark code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

License

  • spark is licensed under the Apache-2.0 License. This license is Permissive.
  • Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

  • spark releases are not available. You will need to build from source code and install.
  • Installation instructions are not available. Examples and code snippets are available.
  • It has 958877 lines of code, 58210 functions and 5909 files.
  • It has medium code complexity. Code complexity directly impacts maintainability of the code.

spark Key Features

Apache Spark - A unified analytics engine for large-scale data processing
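
As a quick illustration of that unified engine, here is a minimal local-mode PySpark sketch (not taken from the project documentation): one SparkSession serves the DataFrame, SQL, and RDD APIs over the same data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

df = spark.range(100).withColumnRenamed("id", "n")        # DataFrame API
df.createOrReplaceTempView("numbers")
spark.sql("SELECT sum(n) AS total FROM numbers").show()   # SQL over the same data
print(df.rdd.map(lambda row: row.n).sum())                # dropping down to the RDD API

spark.stop()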

Building Spark

./build/mvn -DskipTests clean package

Interactive Scala Shell

./bin/spark-shell

Interactive Python Shell

./bin/pyspark
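
Inside the shell a SparkSession is already available as spark, so a minimal thing to try there (illustrative only):

data = spark.range(10)
print(data.selectExpr("sum(id) AS total").first()["total"])   # prints 45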

Example Programs

./bin/run-example SparkPi

Running Tests

./dev/run-tests

Why joining structure-identic dataframes gives different results?

df3 = df2.alias('df2').join(df1.alias('df1'), (F.col('df1.c1') == F.col('df2.c2')), 'full')
df3.show()

# Output
# +----+------+----+----+---+------+----+---+
# |  ID|Status|  c1|  c2| ID|Status|  c1| c2|
# +----+------+----+----+---+------+----+---+
# |   4|    ok|null|   A|  1|   bad|   A|  A|
# |null|  null|null|null|  4|    ok|null|  A|
# +----+------+----+----+---+------+----+---+
-----------------------
== Physical Plan ==
*(1) Project [ID#0L, Status#1, c1#2, A AS c2#6]
+- *(1) Scan ExistingRDD[ID#0L,Status#1,c1#2]
== Physical Plan ==
*(1) Project [ID#0L, Status#1, c1#2, A AS c2#6]
+- *(1) Filter (isnotnull(Status#1) AND (Status#1 = ok))
   +- *(1) Scan ExistingRDD[ID#0L,Status#1,c1#2]
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, FullOuter, (c1#2 = A)
:- *(1) Project [ID#0L, Status#1, c1#2, A AS c2#6]
:  +- *(1) Filter (isnotnull(Status#1) AND (Status#1 = ok))
:     +- *(1) Scan ExistingRDD[ID#0L,Status#1,c1#2]
+- BroadcastExchange IdentityBroadcastMode, [id=#75]
   +- *(2) Project [ID#46L, Status#47, c1#48, A AS c2#45]
      +- *(2) Scan ExistingRDD[ID#46L,Status#47,c1#48]
+----+------+----+----+----+------+----+----+
|  ID|Status|  c1|  c2|  ID|Status|  c1|  c2|
+----+------+----+----+----+------+----+----+
|   4|    ok|null|   A|null|  null|null|null|
|null|  null|null|null|   1|   bad|   A|   A|
|null|  null|null|null|   4|    ok|null|   A|
+----+------+----+----+----+------+----+----+
== Physical Plan ==
*(1) Scan ExistingRDD[ID#98L,Status#99,c1#100,c2#101]
== Physical Plan ==
*(1) Filter (isnotnull(Status#124) AND (Status#124 = ok))
+- *(1) Scan ExistingRDD[ID#123L,Status#124,c1#125,c2#126]
df3 = df1.join(df2, (df1.c1 == df2.c2), 'full')
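
The snippets above assume two DataFrames df1 and df2 whose columns match the printed output. Here is a minimal, hypothetical reconstruction for experimenting locally; the literal values and definitions below are guesses inferred from the rows and physical plans shown, not taken from the original question.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

base = spark.createDataFrame([(1, "bad", "A"), (4, "ok", None)], ["ID", "Status", "c1"])
df1 = base.withColumn("c2", F.lit("A"))                                  # guessed definition
df2 = base.where(F.col("Status") == "ok").withColumn("c2", F.lit("A"))   # guessed definition

# Comparing the physical plans is how the differing join results are diagnosed:
df1.join(df2, df1.c1 == df2.c2, "full").explain()
df2.alias("df2").join(df1.alias("df1"), F.col("df1.c1") == F.col("df2.c2"), "full").explain()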

AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>

# First session (pandas 1.3.4): write the DataFrame out as a pickle
import pickle
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 6))

with open("dump_from_v1.3.4.pickle", "wb") as f:
    pickle.dump(df, f)

quit()

# Second session (a different, older pandas): reading the pickle back fails
import pickle

with open("dump_from_v1.3.4.pickle", "rb") as f:
    df = pickle.load(f)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-ff5c218eca92> in <module>
      1 with open("dump_from_v1.3.4.pickle", "rb") as f:
----> 2     df = pickle.load(f)
      3 

AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks' from '/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py'>
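
The traceback above is the classic symptom of a pickle written by a newer pandas being read back by an older one. A minimal sketch of a more portable round-trip (assuming pyarrow or fastparquet is installed) is to exchange the DataFrame via parquet instead of pickle:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 6), columns=[f"c{i}" for i in range(6)])
df.to_parquet("dump.parquet")              # write with one pandas version
df_back = pd.read_parquet("dump.parquet")  # read with another; no pickle internals involved
print(df_back.shape)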

Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0

# Imports as used in a standard AWS Glue PySpark job
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
# Get the current SparkConf that Glue has set
conf = sc.getConf()
# Add the legacy rebase settings for pre-1900 timestamps
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
# Restart the Spark context with the updated configuration
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
# Create the Glue context with the restarted SparkContext
glueContext = GlueContext(sc)
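
Outside of Glue, the same settings can be applied when the SparkSession is built. A minimal plain-PySpark sketch, reusing the legacy configuration keys from the snippet above (exact key names vary across Spark 3.x versions):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pre-1900-timestamps")
    .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
    .config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
    .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
    .config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
    .getOrCreate()
)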

Cannot find conda info. Please verify your conda installation on EMR

wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh  -O /home/hadoop/miniconda.sh \
    && /bin/bash ~/miniconda.sh -b -p $HOME/conda

echo -e '\n export PATH=$HOME/conda/bin:$PATH' >> $HOME/.bashrc && source $HOME/.bashrc


conda config --set always_yes yes --set changeps1 no
conda config -f --add channels conda-forge


conda create -n zoo python=3.7 # "zoo" is conda environment name
conda init bash
source activate zoo
conda install python 3.7.0 -c conda-forge orca 
sudo /home/hadoop/conda/envs/zoo/bin/python3.7 -m pip install virtualenv
"spark.pyspark.python": "/home/hadoop/conda/envs/zoo/bin/python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path": "/home/hadoop/conda/envs/zoo/bin/",
"zeppelin.pyspark.python": "/home/hadoop/conda/bin/python",
"zeppelin.python": "/home/hadoop/conda/bin/python"

How to set Docker Compose `env_file` relative to `.yml` file when multiple `--file` option is used?

  env_file:
    - ${BACKEND_BASE:-.}/.env

Read spark data with column that clashes with partition name

df = spark.read.json("s3://bucket/table/**/*.json")

renamedDF = df.withColumnRenamed("old column name", "new column name")
-----------------------
from pyspark.sql import functions as F

Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
conf = sc._jsc.hadoopConfiguration()

s3_path = "s3://bucket/prefix"
file_cols = ["id", "color", "date"]
partitions_cols = ["company", "service", "date"]

# listing all files for input path
json_files = []
files = Path(s3_path).getFileSystem(conf).listFiles(Path(s3_path), True)

while files.hasNext():
    path = files.next().getPath()
    if path.getName().endswith(".json"):
        json_files.append(path.toString())

df = spark.read.json(json_files) # you can pass here the schema of the files without the partition columns

# renaming file column in if exists in partitions
df = df.select(*[
    F.col(c).alias(c) if c not in partitions_cols else F.col(c).alias(f"file_{c}")
    for c in df.columns
])

# parse partitions from filenames
for p in partitions_cols:
    df = df.withColumn(p, F.regexp_extract(F.input_file_name(), f"/{p}=([^/]+)/", 1))

df.show()

#+-----+----------+---+-------+-------+----------+
#|color| file_date| id|company|service|      date|
#+-----+----------+---+-------+-------+----------+
#|green|2021-08-08|baz|   abcd|    xyz|2021-01-01|
#| blue|2021-12-12|foo|   abcd|    xyz|2021-01-01|
#|  red|2021-10-10|bar|   abcd|    xyz|2021-01-01|
#+-----+----------+---+-------+-------+----------+
-----------------------
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DateType

# schema of json files
schema = StructType([
    StructField('id', StringType(), True),
    StructField('color', StringType(), True),
    StructField('date', DateType(), True)
])

df = sparkSession.read.text('resources') \
    .withColumnRenamed('date', 'partition_date') \
    .withColumn('json', F.from_json(F.col('value'), schema)) \
    .select('company', 'service', 'partition_date', 'json.*') \
    .withColumnRenamed('date', 'file_date') \
    .withColumnRenamed('partition_date', 'date')
{"id": "foo", "color": "blue", "date": "2021-12-12"}
{"id": "bar", "color": "red", "date": "2021-12-13"}
{"id": "kix", "color": "yellow", "date": "2021-12-14"}
{"id": "kaz", "color": "blue", "date": "2021-12-15"}
{"id": "dir", "color": "red", "date": "2021-12-16"}
{"id": "tux", "color": "yellow", "date": "2021-12-17"}
+-------+-------+----------+---+------+----------+
|company|service|      date| id| color| file_date|
+-------+-------+----------+---+------+----------+
|   abcd|    xyz|2021-01-01|kaz|  blue|2021-12-15|
|   abcd|    xyz|2021-01-01|dir|   red|2021-12-16|
|   abcd|    xyz|2021-01-01|tux|yellow|2021-12-17|
|   abcd|    xyz|2021-01-01|foo|  blue|2021-12-12|
|   abcd|    xyz|2021-01-01|bar|   red|2021-12-13|
|   abcd|    xyz|2021-01-01|kix|yellow|2021-12-14|
+-------+-------+----------+---+------+----------+

How do I parse xml documents in Palantir Foundry?

buildscript {
    repositories {
       // some other things
    }

    dependencies {
        classpath "com.palantir.transforms.python:lang-python-gradle-plugin:${transformsLangPythonPluginVersion}"
    }
}

apply plugin: 'com.palantir.transforms.lang.python'
apply plugin: 'com.palantir.transforms.lang.python-defaults'

dependencies {
    condaJars "com.databricks:spark-xml_2.13:0.14.0"
}

// Apply the testing plugin
apply plugin: 'com.palantir.transforms.lang.pytest-defaults'

// ... some other awesome features you should enable
from transforms.api import transform, Output, Input
from transforms.verbs.dataframes import union_many


def read_files(spark_session, paths):
    parsed_dfs = []
    for file_name in paths:
        parsed_df = spark_session.read.format('xml').options(rowTag="tag").load(file_name)
        parsed_dfs += [parsed_df]
    output_df = union_many(*parsed_dfs, how="wide")
    return output_df


@transform(
    the_output=Output("my.awesome.output"),
    the_input=Input("my.awesome.input"),
)
def my_compute_function(the_input, the_output, ctx):
    session = ctx.spark_session
    input_filesystem = the_input.filesystem()
    hadoop_path = input_filesystem.hadoop_path
    files = [hadoop_path + "/" + file_name.path for file_name in input_filesystem.ls()]
    output_df = read_files(session, files)
    the_output.write_dataframe(output_df)
<tag>
<field1>
my_value
</field1>
</tag>
from myproject.datasets import xml_parse_transform
from pkg_resources import resource_filename


def test_parse_xml(spark_session):
    file_path = resource_filename(__name__, "sample.xml")
    parsed_df = xml_parse_transform.read_files(spark_session, [file_path])
    assert parsed_df.count() == 1
    assert set(parsed_df.columns) == {"field1"}

docker build vue3 not compatible with element-ui on node:16-buster-slim

...
COPY package.json /home
RUN npm config set legacy-peer-deps true
RUN npm install --prefix /home

Why is repartition faster than partitionBy in Spark?

spark.range(1000).withColumn("partition", 'id % 100)
    .repartition('partition).write.csv("/tmp/test.csv")
spark.range(1000).withColumn("partition", 'id % 100)
    .write.partitionBy("partition").csv("/tmp/test2.csv")
-----------------------
df = spark.read.format("xml") \
  .options(rowTag="DeviceData") \
  .load(file_path, schema=meter_data) \
.withColumn("partition", hash(col("_DeviceName")).cast("Long") % num_partitions) \
.repartition("partition") \
.write.format("json") \

.write.format("json") \
.partitionBy("partition") \

output_path + "\partition=0\"
output_path + "\partition=1\"
output_path + "\partition=99\"

.coalesce(num_partitions) \
.write.format("json") \
.partitionBy("partition") \

.repartition("partition") \
.write.format("json") \
.partitionBy("partition") \
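
A minimal PySpark sketch of the same comparison as the Scala snippet above (paths and sizes are illustrative, not from the original post):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(1000).withColumn("partition", F.col("id") % 100)

# repartition: one shuffle so rows with the same "partition" value land in the
# same task, so each task writes only a few output files.
df.repartition("partition").write.mode("overwrite").csv("/tmp/test.csv")

# partitionBy alone: no shuffle, so a task may hold rows for many partition
# values and opens one file per value it sees, typically producing many more files.
df.write.mode("overwrite").partitionBy("partition").csv("/tmp/test2.csv")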

Get difference between two version of delta lake table

import uk.co.gresearch.spark.diff.DatasetDiff

df1.diff(df2)
-----------------------
val lastVersion = DeltaTable.forPath(spark, PATH_TO_DELTA_TABLE)
    .history()
    .select(col("version"))
    .collect.toList
    .headOption
    .getOrElse(throw new Exception("Is this table empty ?"))
val addPathList = spark
    .read
    .json(s"ROOT_PATH/_delta_log/0000NUMVERSION.json")
    .where(s"add is not null")
    .select(s"add.path")
    .collect()
    .map(path => formatPath(path.toString))
    .toList
val removePathList = spark
    .read
    .json(s"ROOT_PATH/_delta_log/0000NUMVERSION.json")
    .where(s"remove is not null")
    .select(s"remove.path")
    .collect()
    .map(path => formatPath(path.toString))
    .toList
import org.apache.spark.sql.functions._
val addDF = spark
  .read
  .format("parquet")
  .load(addPathList: _*)
  .withColumn("add_remove", lit("add"))
val removeDF = spark
  .read
  .format("parquet")
  .load(removePathList: _*)
  .withColumn("add_remove", lit("remove"))
addDF.union(removeDF).show()


+----------+----------+
|updatedate|add_remove|
+----------+----------+
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
|      null|       add|
+----------+----------+
only showing top 20 rows
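
An alternative sketch in PySpark that uses Delta time travel plus exceptAll to diff two versions (it assumes delta-spark is configured on the session, and the path and version numbers are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/path/to/delta_table"

v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v1 = spark.read.format("delta").option("versionAsOf", 1).load(path)

added_rows = v1.exceptAll(v0)    # rows that appear in version 1 but not in version 0
removed_rows = v0.exceptAll(v1)  # rows that were in version 0 but are gone in version 1
added_rows.show()
removed_rows.show()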

Community Discussions

Trending Discussions on spark
  • spark-shell throws java.lang.reflect.InvocationTargetException on running
  • Why joining structure-identic dataframes gives different results?
  • AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>
  • Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0
  • NoSuchMethodError on com.fasterxml.jackson.dataformat.xml.XmlMapper.coercionConfigDefaults()
  • Cannot find conda info. Please verify your conda installation on EMR
  • How to set Docker Compose `env_file` relative to `.yml` file when multiple `--file` option is used?
  • Read spark data with column that clashes with partition name
  • How do I parse xml documents in Palantir Foundry?
  • docker build vue3 not compatible with element-ui on node:16-buster-slim

QUESTION

spark-shell throws java.lang.reflect.InvocationTargetException on running

Asked 2022-Apr-01 at 19:53

When I execute run-example SparkPi, for example, it works perfectly, but when I run spark-shell, it throws these exceptions:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/big_data/spark-3.2.0-bin-hadoop3.2-scala2.13/jars/spark-unsafe_2.13-3.2.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.0
      /_/

Using Scala version 2.13.5 (OpenJDK 64-Bit Server VM, Java 11.0.9.1)
Type in expressions to have them evaluated.
Type :help for more information.
21/12/11 19:28:36 ERROR SparkContext: Error initializing SparkContext.
java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        at org.apache.spark.executor.Executor.addReplClassLoaderIfNeeded(Executor.scala:909)
        at org.apache.spark.executor.Executor.<init>(Executor.scala:160)
        at org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:64)
        at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:132)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
        at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
        at scala.Option.getOrElse(Option.scala:201)
        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
        at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
        at $line3.$read$$iw.<init>(<console>:5)
        at $line3.$read.<init>(<console>:4)
        at $line3.$read$.<clinit>(<console>)
        at $line3.$eval$.$print$lzycompute(<synthetic>:6)
        at $line3.$eval$.$print(<synthetic>:5)
        at $line3.$eval.$print(<synthetic>)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
        at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
        at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
        at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
        at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
        at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
        at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
        at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
        at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
        at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
        at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
        at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
        at scala.collection.immutable.List.foreach(List.scala:333)
        at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
        at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
        at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
        at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
        at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
        at org.apache.spark.repl.Main$.doMain(Main.scala:84)
        at org.apache.spark.repl.Main$.main(Main.scala:59)
        at org.apache.spark.repl.Main.main(Main.scala)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes
        at java.base/java.net.URI$Parser.fail(URI.java:2913)
        at java.base/java.net.URI$Parser.checkChars(URI.java:3084)
        at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3166)
        at java.base/java.net.URI$Parser.parse(URI.java:3114)
        at java.base/java.net.URI.<init>(URI.java:600)
        at org.apache.spark.repl.ExecutorClassLoader.<init>(ExecutorClassLoader.scala:57)
        ... 67 more
21/12/11 19:28:36 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
        at org.apache.spark.scheduler.local.LocalSchedulerBackend.org$apache$spark$scheduler$local$LocalSchedulerBackend$$stop(LocalSchedulerBackend.scala:173)
        at org.apache.spark.scheduler.local.LocalSchedulerBackend.stop(LocalSchedulerBackend.scala:144)
        at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:927)
        at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2516)
        at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2086)
        at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1442)
        at org.apache.spark.SparkContext.stop(SparkContext.scala:2086)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:677)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
        at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
        at scala.Option.getOrElse(Option.scala:201)
        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
        at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
        at $line3.$read$$iw.<init>(<console>:5)
        at $line3.$read.<init>(<console>:4)
        at $line3.$read$.<clinit>(<console>)
        at $line3.$eval$.$print$lzycompute(<synthetic>:6)
        at $line3.$eval$.$print(<synthetic>:5)
        at $line3.$eval.$print(<synthetic>)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
        at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
        at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
        at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
        at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
        at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
        at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
        at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
        at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
        at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
        at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
        at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
        at scala.collection.immutable.List.foreach(List.scala:333)
        at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
        at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
        at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
        at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
        at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
        at org.apache.spark.repl.Main$.doMain(Main.scala:84)
        at org.apache.spark.repl.Main$.main(Main.scala:59)
        at org.apache.spark.repl.Main.main(Main.scala)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/12/11 19:28:36 WARN MetricsSystem: Stopping a MetricsSystem that is not running
21/12/11 19:28:36 ERROR Main: Failed to initialize Spark session.
java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        at org.apache.spark.executor.Executor.addReplClassLoaderIfNeeded(Executor.scala:909)
        at org.apache.spark.executor.Executor.<init>(Executor.scala:160)
        at org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:64)
        at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:132)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
        at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
        at scala.Option.getOrElse(Option.scala:201)
        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
        at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
        at $line3.$read$$iw.<init>(<console>:5)
        at $line3.$read.<init>(<console>:4)
        at $line3.$read$.<clinit>(<console>)
        at $line3.$eval$.$print$lzycompute(<synthetic>:6)
        at $line3.$eval$.$print(<synthetic>:5)
        at $line3.$eval.$print(<synthetic>)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
        at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
        at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
        at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
        at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
        at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
        at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
        at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
        at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
        at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
        at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
        at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
        at scala.collection.immutable.List.foreach(List.scala:333)
        at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
        at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
        at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
        at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
        at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
        at org.apache.spark.repl.Main$.doMain(Main.scala:84)
        at org.apache.spark.repl.Main$.main(Main.scala:59)
        at org.apache.spark.repl.Main.main(Main.scala)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes
        at java.base/java.net.URI$Parser.fail(URI.java:2913)
        at java.base/java.net.URI$Parser.checkChars(URI.java:3084)
        at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3166)
        at java.base/java.net.URI$Parser.parse(URI.java:3114)
        at java.base/java.net.URI.<init>(URI.java:600)
        at org.apache.spark.repl.ExecutorClassLoader.<init>(ExecutorClassLoader.scala:57)
        ... 67 more
21/12/11 19:28:36 ERROR Utils: Uncaught exception in thread shutdown-hook-0
java.lang.ExceptionInInitializerError
        at org.apache.spark.executor.Executor.stop(Executor.scala:333)
        at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
        at scala.util.Try$.apply(Try.scala:210)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NullPointerException
        at org.apache.spark.shuffle.ShuffleBlockPusher$.<clinit>(ShuffleBlockPusher.scala:465)
        ... 16 more
21/12/11 19:28:36 WARN ShutdownHookManager: ShutdownHook '' failed, java.util.concurrent.ExecutionException: java.lang.ExceptionInInitializerError
java.util.concurrent.ExecutionException: java.lang.ExceptionInInitializerError
        at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
Caused by: java.lang.ExceptionInInitializerError
        at org.apache.spark.executor.Executor.stop(Executor.scala:333)
        at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
        at scala.util.Try$.apply(Try.scala:210)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NullPointerException
        at org.apache.spark.shuffle.ShuffleBlockPusher$.<clinit>(ShuffleBlockPusher.scala:465)
        ... 16 more

As far as I can see, the failure is caused by Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes, but I don't understand what this means exactly or how to deal with it.
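To see what that message means, here is a minimal, purely illustrative Scala sketch (it uses only java.net.URI, not Spark): the backslash in the Windows path C:\classes is not a legal URI character, so parsing fails at index 42, which is what the ExecutorClassLoader constructor in the trace above is doing when it throws. The object and value names below are made up for the example.

import java.net.{URI, URISyntaxException}

object UriRepro {
  def main(args: Array[String]): Unit = {
    // The same string reported in the error; the backslash before "classes"
    // is the illegal character at index 42.
    val replClassUri = """spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes"""
    try {
      new URI(replClassUri)
    } catch {
      case e: URISyntaxException =>
        // Prints something like:
        // Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes
        println(e.getMessage)
    }
  }
}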

How can I solve this problem?

I use Spark 3.2.0, pre-built for Apache Hadoop 3.3 and later (Scala 2.13).

The JAVA_HOME, HADOOP_HOME, and SPARK_HOME environment variables are set.

ANSWER

Answered 2022-Jan-07 at 15:11

I faced the same problem; I think Spark 3.2 itself is the problem.

I switched to Spark 3.1.2 and it works fine.
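If you do switch versions, a quick sanity check is to confirm from the REPL which Spark and Scala versions the shell is actually running (a small sketch; spark is the SparkSession that spark-shell creates for you, and the versions in the comments are only example output):

// Paste into spark-shell once it starts successfully:
println(s"Spark version: ${spark.version}")                        // e.g. 3.1.2
println(s"Scala version: ${scala.util.Properties.versionString}")  // e.g. version 2.12.x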

Source https://stackoverflow.com/questions/70317481

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

Vulnerabilities

No vulnerabilities reported

Install spark

You can download it from GitHub.

Support

You can find the latest Spark documentation, including a programming guide, on the project web page. This README file only contains basic setup instructions.
