A simple expressive web framework for Java. Spark also has a Kotlin DSL: https://github.com/perwendel/spark-kotlin
<dependency>
    <groupId>com.sparkjava</groupId>
    <artifactId>spark-core</artifactId>
    <version>2.9.3</version>
</dependency>
Why does joining structurally identical DataFrames give different results?
df3 = df2.alias('df2').join(df1.alias('df1'), (F.col('df1.c1') == F.col('df2.c2')), 'full')
df3.show()
# Output
# +----+------+----+----+---+------+----+---+
# | ID|Status| c1| c2| ID|Status| c1| c2|
# +----+------+----+----+---+------+----+---+
# | 4| ok|null| A| 1| bad| A| A|
# |null| null|null|null| 4| ok|null| A|
# +----+------+----+----+---+------+----+---+
-----------------------
== Physical Plan ==
*(1) Project [ID#0L, Status#1, c1#2, A AS c2#6]
+- *(1) Scan ExistingRDD[ID#0L,Status#1,c1#2]
== Physical Plan ==
*(1) Project [ID#0L, Status#1, c1#2, A AS c2#6]
+- *(1) Filter (isnotnull(Status#1) AND (Status#1 = ok))
+- *(1) Scan ExistingRDD[ID#0L,Status#1,c1#2]
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, FullOuter, (c1#2 = A)
:- *(1) Project [ID#0L, Status#1, c1#2, A AS c2#6]
: +- *(1) Filter (isnotnull(Status#1) AND (Status#1 = ok))
: +- *(1) Scan ExistingRDD[ID#0L,Status#1,c1#2]
+- BroadcastExchange IdentityBroadcastMode, [id=#75]
+- *(2) Project [ID#46L, Status#47, c1#48, A AS c2#45]
+- *(2) Scan ExistingRDD[ID#46L,Status#47,c1#48]
+----+------+----+----+----+------+----+----+
| ID|Status| c1| c2| ID|Status| c1| c2|
+----+------+----+----+----+------+----+----+
| 4| ok|null| A|null| null|null|null|
|null| null|null|null| 1| bad| A| A|
|null| null|null|null| 4| ok|null| A|
+----+------+----+----+----+------+----+----+
== Physical Plan ==
*(1) Scan ExistingRDD[ID#98L,Status#99,c1#100,c2#101]
== Physical Plan ==
*(1) Filter (isnotnull(Status#124) AND (Status#124 = ok))
+- *(1) Scan ExistingRDD[ID#123L,Status#124,c1#125,c2#126]
df3 = df1.join(df2, (df1.c1 == df2.c2), 'full')
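Reading the plans above (my interpretation, not part of the original thread): on the derived DataFrames the c2 column is only the literal projection A AS c2, so the optimizer folds the join condition into (c1 = A) and plans a BroadcastNestedLoopJoin, whereas the later plans scan c2 as real data (ExistingRDD[..., c2]) and the join stays an ordinary column comparison. A minimal sketch of the two constructions, with hypothetical data since the original setup code is not shown:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed construction 1: c2 is a literal added with withColumn, and df2 derives from df1.
base = spark.createDataFrame([(1, "bad", "A"), (4, "ok", None)], ["ID", "Status", "c1"])
df1 = base.withColumn("c2", F.lit("A"))
df2 = df1.filter(F.col("Status") == "ok")
df1.join(df2, df1.c1 == df2.c2, "full").explain()
# the join condition should fold to (c1 = A), as in the plans quoted above

# Assumed construction 2: c2 holds the same values but as plain data in the source.
df1b = spark.createDataFrame([(1, "bad", "A", "A"), (4, "ok", None, "A")],
                             ["ID", "Status", "c1", "c2"])
df2b = df1b.filter(F.col("Status") == "ok")
df1b.join(df2b, df1b.c1 == df2b.c2, "full").explain()
# here c2 is a real column, so the condition stays a column-to-column comparison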
AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>
# Session 1: written with pandas 1.3.4
import numpy as np
import pandas as pd
import pickle

df = pd.DataFrame(np.random.rand(3, 6))
with open("dump_from_v1.3.4.pickle", "wb") as f:
    pickle.dump(df, f)
quit()

# Session 2: read back in an environment with an older pandas
import pickle
with open("dump_from_v1.3.4.pickle", "rb") as f:
    df = pickle.load(f)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-2-ff5c218eca92> in <module>
1 with open("dump_from_v1.3.4.pickle", "rb") as f:
----> 2 df = pickle.load(f)
3
AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks' from '/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py'>
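A possible workaround (my sketch, not from the original thread): pandas.core.internals.blocks.new_block only exists in pandas 1.3+, so a DataFrame pickled by pandas 1.3.4 cannot be unpickled by an older pandas. Either upgrade pandas in the reading environment, or re-export the data from an environment that can read the pickle into a version-stable format:
# Run where pandas >= 1.3 is available, then read the exported file anywhere.
import pandas as pd

df = pd.read_pickle("dump_from_v1.3.4.pickle")
df.to_parquet("dump.parquet")            # needs pyarrow or fastparquet
# df.to_csv("dump.csv", index=False)     # alternative with no extra dependency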
Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0
sc = SparkContext()
# Get current sparkconf which is set by glue
conf = sc.getConf()
# add additional spark configurations
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
# Restart spark context
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
# create glue context with the restarted sc
glueContext = GlueContext(sc)
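As an aside (an assumption on my part, not from the quoted answer): these are regular Spark SQL configuration keys, so on an already-running session they can usually also be set without stopping and recreating the context:
# Hedged sketch; spark is the existing SparkSession, e.g. glueContext.spark_session.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")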
Cannot find conda info. Please verify your conda installation on EMR
wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh -O /home/hadoop/miniconda.sh \
&& /bin/bash ~/miniconda.sh -b -p $HOME/conda
echo -e '\n export PATH=$HOME/conda/bin:$PATH' >> $HOME/.bashrc && source $HOME/.bashrc
conda config --set always_yes yes --set changeps1 no
conda config -f --add channels conda-forge
conda create -n zoo python=3.7 # "zoo" is conda environment name
conda init bash
source activate zoo
conda install python 3.7.0 -c conda-forge orca
sudo /home/hadoop/conda/envs/zoo/bin/python3.7 -m pip install virtualenv
"spark.pyspark.python": "/home/hadoop/conda/envs/zoo/bin/python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path": "/home/hadoop/conda/envs/zoo/bin/",
"zeppelin.pyspark.python": "/home/hadoop/conda/bin/python",
"zeppelin.python": "/home/hadoop/conda/bin/python"
How to set Docker Compose `env_file` relative to `.yml` file when multiple `--file` option is used?
env_file:
- ${BACKEND_BASE:-.}/.env
Read spark data with column that clashes with partition name
df = spark.read.json("s3://bucket/table/**/*.json")
renamedDF = df.withColumnRenamed("old column name", "new column name")
-----------------------
from pyspark.sql import functions as F

Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
conf = sc._jsc.hadoopConfiguration()

s3_path = "s3://bucket/prefix"
file_cols = ["id", "color", "date"]
partitions_cols = ["company", "service", "date"]

# listing all files for the input path
json_files = []
files = Path(s3_path).getFileSystem(conf).listFiles(Path(s3_path), True)
while files.hasNext():
    path = files.next().getPath()
    if path.getName().endswith(".json"):
        json_files.append(path.toString())

df = spark.read.json(json_files)  # you can pass here the schema of the files (without the partition columns)

# renaming file columns that clash with partition names
df = df.select(*[
    F.col(c).alias(c) if c not in partitions_cols else F.col(c).alias(f"file_{c}")
    for c in df.columns
])

# parse partitions from filenames
for p in partitions_cols:
    df = df.withColumn(p, F.regexp_extract(F.input_file_name(), f"/{p}=([^/]+)/", 1))

df.show()
#+-----+----------+---+-------+-------+----------+
#|color| file_date| id|company|service| date|
#+-----+----------+---+-------+-------+----------+
#|green|2021-08-08|baz| abcd| xyz|2021-01-01|
#| blue|2021-12-12|foo| abcd| xyz|2021-01-01|
#| red|2021-10-10|bar| abcd| xyz|2021-01-01|
#+-----+----------+---+-------+-------+----------+
-----------------------
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DateType

# schema of the json files
schema = StructType([
    StructField('id', StringType(), True),
    StructField('color', StringType(), True),
    StructField('date', DateType(), True)
])

df = sparkSession.read.text('resources') \
    .withColumnRenamed('date', 'partition_date') \
    .withColumn('json', F.from_json(F.col('value'), schema)) \
    .select('company', 'service', 'partition_date', 'json.*') \
    .withColumnRenamed('date', 'file_date') \
    .withColumnRenamed('partition_date', 'date')
{"id": "foo", "color": "blue", "date": "2021-12-12"}
{"id": "bar", "color": "red", "date": "2021-12-13"}
{"id": "kix", "color": "yellow", "date": "2021-12-14"}
{"id": "kaz", "color": "blue", "date": "2021-12-15"}
{"id": "dir", "color": "red", "date": "2021-12-16"}
{"id": "tux", "color": "yellow", "date": "2021-12-17"}
+-------+-------+----------+---+------+----------+
|company|service| date| id| color| file_date|
+-------+-------+----------+---+------+----------+
| abcd| xyz|2021-01-01|kaz| blue|2021-12-15|
| abcd| xyz|2021-01-01|dir| red|2021-12-16|
| abcd| xyz|2021-01-01|tux|yellow|2021-12-17|
| abcd| xyz|2021-01-01|foo| blue|2021-12-12|
| abcd| xyz|2021-01-01|bar| red|2021-12-13|
| abcd| xyz|2021-01-01|kix|yellow|2021-12-14|
+-------+-------+----------+---+------+----------+
How do I parse xml documents in Palantir Foundry?
buildscript {
    repositories {
        // some other things
    }
    dependencies {
        classpath "com.palantir.transforms.python:lang-python-gradle-plugin:${transformsLangPythonPluginVersion}"
    }
}

apply plugin: 'com.palantir.transforms.lang.python'
apply plugin: 'com.palantir.transforms.lang.python-defaults'

dependencies {
    condaJars "com.databricks:spark-xml_2.13:0.14.0"
}

// Apply the testing plugin
apply plugin: 'com.palantir.transforms.lang.pytest-defaults'
// ... some other awesome features you should enable
from transforms.api import transform, Output, Input
from transforms.verbs.dataframes import union_many


def read_files(spark_session, paths):
    parsed_dfs = []
    for file_name in paths:
        parsed_df = spark_session.read.format('xml').options(rowTag="tag").load(file_name)
        parsed_dfs += [parsed_df]
    output_df = union_many(*parsed_dfs, how="wide")
    return output_df


@transform(
    the_output=Output("my.awesome.output"),
    the_input=Input("my.awesome.input"),
)
def my_compute_function(the_input, the_output, ctx):
    session = ctx.spark_session
    input_filesystem = the_input.filesystem()
    hadoop_path = input_filesystem.hadoop_path
    files = [hadoop_path + "/" + file_name.path for file_name in input_filesystem.ls()]
    output_df = read_files(session, files)
    the_output.write_dataframe(output_df)

<tag>
    <field1>
        my_value
    </field1>
</tag>

from myproject.datasets import xml_parse_transform
from pkg_resources import resource_filename


def test_parse_xml(spark_session):
    file_path = resource_filename(__name__, "sample.xml")
    parsed_df = xml_parse_transform.read_files(spark_session, [file_path])
    assert parsed_df.count() == 1
    assert set(parsed_df.columns) == {"field1"}
docker build vue3 not compatible with element-ui on node:16-buster-slim
...
COPY package.json /home
RUN npm config set legacy-peer-deps true
RUN npm install --prefix /home
Why is repartition faster than partitionBy in Spark?
spark.range(1000).withColumn("partition", 'id % 100)
  .repartition('partition).write.csv("/tmp/test.csv")

spark.range(1000).withColumn("partition", 'id % 100)
  .write.partitionBy("partition").csv("/tmp/test2.csv")
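The usual explanation (my reading, not text from the original post): write.partitionBy alone makes every task write one file per distinct partition value it happens to hold, which produces many small files; repartition("partition") first shuffles so that each partition value lands in a single task. Combining the two is the common pattern, roughly as in this hedged PySpark sketch (the snippets above are Scala; the path is illustrative):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

(spark.range(1000)
    .withColumn("partition", F.col("id") % 100)
    .repartition("partition")          # colocate each partition value in one task
    .write.partitionBy("partition")    # lay files out as partition=<value>/ directories
    .csv("/tmp/test3.csv"))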
-----------------------
df = spark.read.format("xml") \
.options(rowTag="DeviceData") \
.load(file_path, schema=meter_data) \
.withColumn("partition", hash(col("_DeviceName")).cast("Long") % num_partitions) \
.repartition("partition") \
.write.format("json") \
.write.format("json") \
.partitionBy("partition") \
output_path + "\partition=0\"
output_path + "\partition=1\"
output_path + "\partition=99\"
.coalesce(num_partitions) \
.write.format("json") \
.partitionBy("partition") \
.repartition("partition") \
.write.format("json") \
.partitionBy("partition") \
Get the difference between two versions of a Delta Lake table
import uk.co.gresearch.spark.diff.DatasetDiff
df1.diff(df2)
-----------------------
val lastVersion = DeltaTable.forPath(spark, PATH_TO_DELTA_TABLE)
  .history()
  .select(col("version"))
  .collect.toList
  .headOption
  .getOrElse(throw new Exception("Is this table empty ?"))

val addPathList = spark
  .read
  .json(s"ROOT_PATH/_delta_log/0000NUMVERSION.json")
  .where(s"add is not null")
  .select(s"add.path")
  .collect()
  .map(path => formatPath(path.toString))
  .toList

val removePathList = spark
  .read
  .json(s"ROOT_PATH/_delta_log/0000NUMVERSION.json")
  .where(s"remove is not null")
  .select(s"remove.path")
  .collect()
  .map(path => formatPath(path.toString))
  .toList

import org.apache.spark.sql.functions._

val addDF = spark
  .read
  .format("parquet")
  .load(addPathList: _*)
  .withColumn("add_remove", lit("add"))

val removeDF = spark
  .read
  .format("parquet")
  .load(removePathList: _*)
  .withColumn("add_remove", lit("remove"))

addDF.union(removeDF).show()
+----------+----------+
|updatedate|add_remove|
+----------+----------+
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
| null| add|
+----------+----------+
only showing top 20 rows
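Another route (an assumption on my part, not one of the quoted answers): for a Delta table, time travel can load both versions directly, and exceptAll gives the row-level difference:
# Hedged PySpark sketch; the path and version numbers are placeholders.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/path/to/delta/table")
v1 = spark.read.format("delta").option("versionAsOf", 1).load("/path/to/delta/table")

added_rows = v1.exceptAll(v0)      # rows in version 1 that are not in version 0
removed_rows = v0.exceptAll(v1)    # rows in version 0 that are not in version 1
added_rows.show()
removed_rows.show()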
QUESTION
spark-shell throws java.lang.reflect.InvocationTargetException on running
Asked 2022-Apr-01 at 19:53
When I execute run-example SparkPi, for example, it works perfectly, but when I run spark-shell, it throws these exceptions:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/big_data/spark-3.2.0-bin-hadoop3.2-scala2.13/jars/spark-unsafe_2.13-3.2.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.2.0
/_/
Using Scala version 2.13.5 (OpenJDK 64-Bit Server VM, Java 11.0.9.1)
Type in expressions to have them evaluated.
Type :help for more information.
21/12/11 19:28:36 ERROR SparkContext: Error initializing SparkContext.
java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at org.apache.spark.executor.Executor.addReplClassLoaderIfNeeded(Executor.scala:909)
at org.apache.spark.executor.Executor.<init>(Executor.scala:160)
at org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:64)
at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:132)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
at scala.Option.getOrElse(Option.scala:201)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
at $line3.$read$$iw.<init>(<console>:5)
at $line3.$read.<init>(<console>:4)
at $line3.$read$.<clinit>(<console>)
at $line3.$eval$.$print$lzycompute(<synthetic>:6)
at $line3.$eval$.$print(<synthetic>:5)
at $line3.$eval.$print(<synthetic>)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
at scala.collection.immutable.List.foreach(List.scala:333)
at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
at org.apache.spark.repl.Main$.doMain(Main.scala:84)
at org.apache.spark.repl.Main$.main(Main.scala:59)
at org.apache.spark.repl.Main.main(Main.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes
at java.base/java.net.URI$Parser.fail(URI.java:2913)
at java.base/java.net.URI$Parser.checkChars(URI.java:3084)
at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3166)
at java.base/java.net.URI$Parser.parse(URI.java:3114)
at java.base/java.net.URI.<init>(URI.java:600)
at org.apache.spark.repl.ExecutorClassLoader.<init>(ExecutorClassLoader.scala:57)
... 67 more
21/12/11 19:28:36 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
at org.apache.spark.scheduler.local.LocalSchedulerBackend.org$apache$spark$scheduler$local$LocalSchedulerBackend$$stop(LocalSchedulerBackend.scala:173)
at org.apache.spark.scheduler.local.LocalSchedulerBackend.stop(LocalSchedulerBackend.scala:144)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:927)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2516)
at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2086)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1442)
at org.apache.spark.SparkContext.stop(SparkContext.scala:2086)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:677)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
at scala.Option.getOrElse(Option.scala:201)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
at $line3.$read$$iw.<init>(<console>:5)
at $line3.$read.<init>(<console>:4)
at $line3.$read$.<clinit>(<console>)
at $line3.$eval$.$print$lzycompute(<synthetic>:6)
at $line3.$eval$.$print(<synthetic>:5)
at $line3.$eval.$print(<synthetic>)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
at scala.collection.immutable.List.foreach(List.scala:333)
at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
at org.apache.spark.repl.Main$.doMain(Main.scala:84)
at org.apache.spark.repl.Main$.main(Main.scala:59)
at org.apache.spark.repl.Main.main(Main.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/12/11 19:28:36 WARN MetricsSystem: Stopping a MetricsSystem that is not running
21/12/11 19:28:36 ERROR Main: Failed to initialize Spark session.
java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at org.apache.spark.executor.Executor.addReplClassLoaderIfNeeded(Executor.scala:909)
at org.apache.spark.executor.Executor.<init>(Executor.scala:160)
at org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:64)
at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:132)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
at scala.Option.getOrElse(Option.scala:201)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
at $line3.$read$$iw.<init>(<console>:5)
at $line3.$read.<init>(<console>:4)
at $line3.$read$.<clinit>(<console>)
at $line3.$eval$.$print$lzycompute(<synthetic>:6)
at $line3.$eval$.$print(<synthetic>:5)
at $line3.$eval.$print(<synthetic>)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
at scala.collection.immutable.List.foreach(List.scala:333)
at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
at org.apache.spark.repl.Main$.doMain(Main.scala:84)
at org.apache.spark.repl.Main$.main(Main.scala:59)
at org.apache.spark.repl.Main.main(Main.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes
at java.base/java.net.URI$Parser.fail(URI.java:2913)
at java.base/java.net.URI$Parser.checkChars(URI.java:3084)
at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3166)
at java.base/java.net.URI$Parser.parse(URI.java:3114)
at java.base/java.net.URI.<init>(URI.java:600)
at org.apache.spark.repl.ExecutorClassLoader.<init>(ExecutorClassLoader.scala:57)
... 67 more
21/12/11 19:28:36 ERROR Utils: Uncaught exception in thread shutdown-hook-0
java.lang.ExceptionInInitializerError
at org.apache.spark.executor.Executor.stop(Executor.scala:333)
at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at scala.util.Try$.apply(Try.scala:210)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NullPointerException
at org.apache.spark.shuffle.ShuffleBlockPusher$.<clinit>(ShuffleBlockPusher.scala:465)
... 16 more
21/12/11 19:28:36 WARN ShutdownHookManager: ShutdownHook '' failed, java.util.concurrent.ExecutionException: java.lang.ExceptionInInitializerError
java.util.concurrent.ExecutionException: java.lang.ExceptionInInitializerError
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
Caused by: java.lang.ExceptionInInitializerError
at org.apache.spark.executor.Executor.stop(Executor.scala:333)
at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at scala.util.Try$.apply(Try.scala:210)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NullPointerException
at org.apache.spark.shuffle.ShuffleBlockPusher$.<clinit>(ShuffleBlockPusher.scala:465)
... 16 more
As I can see, it is caused by Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes, but I don't understand what that means exactly or how to deal with it.
How can I solve this problem?
I use Spark 3.2.0 Pre-built for Apache Hadoop 3.3 and later (Scala 2.13).
JAVA_HOME, HADOOP_HOME, SPARK_HOME path variables are set.
ANSWER
Answered 2022-Jan-07 at 15:11
I faced the same problem; I think Spark 3.2 is the problem itself. I switched to Spark 3.1.2 and it works fine.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.