Popular New Releases in Spark
elasticsearch
Elasticsearch 8.1.3
xgboost
Release candidate of version 1.6.0
kibana
Kibana 8.1.3
luigi
3.0.3
mlflow
MLflow 1.25.1
Popular Libraries in Spark
by elastic java
59266 NOASSERTION
Free and Open, Distributed, RESTful Search Engine
by apache scala
32507 Apache-2.0
Apache Spark - A unified analytics engine for large-scale data processing
by dmlc c++
22464 Apache-2.0
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
by apache java
21667 Apache-2.0
Mirror of Apache Kafka
by donnemartin python
21519 NOASSERTION
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
by apache java
18609 Apache-2.0
Apache Flink
by elastic typescript
17328 NOASSERTION
Your window into the Elastic Stack
by spotify python
14716 Apache-2.0
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
by prestodb java
13394 Apache-2.0
The official home of the Presto distributed SQL query engine for big data
Trending New libraries in Spark
by airbytehq java
6468 NOASSERTION
Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
by orchest python
2877 AGPL-3.0
Build data pipelines, the easy way 🛠️
by geekyouth scala
1137 GPL-3.0
Shenzhen Metro big data passenger flow analysis system 🚇🚄🌟
by san089 python
832 MIT
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
by huggingface jupyter notebook
824 Apache-2.0
Notebooks using the Hugging Face libraries 🤗
by DataLinkDC java
735 Apache-2.0
Dinky is an out of the box one-stop real-time computing platform dedicated to the construction and practice of Unified Batch & Streaming and Unified Data Lake & Data Warehouse. Based on Apache Flink, Dinky provides the ability to connect many big data frameworks including OLAP and Data Lake.
by fugue-project python
626 Apache-2.0
A unified interface for distributed computing. Fugue executes SQL, Python, and Pandas code on Spark and Dask without any rewrites.
by man-group python
603 AGPL-3.0
Productionise & schedule your Jupyter Notebooks as easily as you wrote them.
by san089 python
521 NOASSERTION
Few projects related to Data Engineering including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
Top Authors in Spark
1: 101 Libraries, 2708
2: 90 Libraries, 154349
3: 42 Libraries, 1369
4: 24 Libraries, 9195
5: 22 Libraries, 16423
6: 21 Libraries, 1557
7: 20 Libraries, 220
8: 20 Libraries, 210
9: 19 Libraries, 165
10: 19 Libraries, 261
Trending Kits in Spark
No Trending Kits are available at this moment for Spark
Trending Discussions on Spark
spark-shell throws java.lang.reflect.InvocationTargetException on running
Why joining structure-identic dataframes gives different results?
AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>
Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0
NoSuchMethodError on com.fasterxml.jackson.dataformat.xml.XmlMapper.coercionConfigDefaults()
Cannot find conda info. Please verify your conda installation on EMR
How to set Docker Compose `env_file` relative to `.yml` file when multiple `--file` option is used?
Read spark data with column that clashes with partition name
How do I parse xml documents in Palantir Foundry?
docker build vue3 not compatible with element-ui on node:16-buster-slim
QUESTION
spark-shell throws java.lang.reflect.InvocationTargetException on running
Asked 2022-Apr-01 at 19:53
When I execute run-example SparkPi, for example, it works perfectly, but when I run spark-shell, it throws these exceptions:
1WARNING: An illegal reflective access operation has occurred
2WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/big_data/spark-3.2.0-bin-hadoop3.2-scala2.13/jars/spark-unsafe_2.13-3.2.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
3WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
4WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
5WARNING: All illegal access operations will be denied in a future release
6Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
7Setting default log level to "WARN".
8To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
9Welcome to
10 ____ __
11 / __/__ ___ _____/ /__
12 _\ \/ _ \/ _ `/ __/ '_/
13 /___/ .__/\_,_/_/ /_/\_\ version 3.2.0
14 /_/
15
16Using Scala version 2.13.5 (OpenJDK 64-Bit Server VM, Java 11.0.9.1)
17Type in expressions to have them evaluated.
18Type :help for more information.
1921/12/11 19:28:36 ERROR SparkContext: Error initializing SparkContext.
20java.lang.reflect.InvocationTargetException
21 at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
22 at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
23 at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
24 at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
25 at org.apache.spark.executor.Executor.addReplClassLoaderIfNeeded(Executor.scala:909)
26 at org.apache.spark.executor.Executor.<init>(Executor.scala:160)
27 at org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:64)
28 at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:132)
29 at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
30 at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
31 at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
32 at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
33 at scala.Option.getOrElse(Option.scala:201)
34 at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
35 at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
36 at $line3.$read$$iw.<init>(<console>:5)
37 at $line3.$read.<init>(<console>:4)
38 at $line3.$read$.<clinit>(<console>)
39 at $line3.$eval$.$print$lzycompute(<synthetic>:6)
40 at $line3.$eval$.$print(<synthetic>:5)
41 at $line3.$eval.$print(<synthetic>)
42 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
43 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
44 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
45 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
46 at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
47 at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
48 at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
49 at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
50 at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
51 at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
52 at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
53 at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
54 at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
55 at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
56 at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
57 at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
58 at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
59 at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
60 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
61 at scala.collection.immutable.List.foreach(List.scala:333)
62 at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
63 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
64 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
65 at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
66 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
67 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
68 at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
69 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
70 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
71 at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
72 at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
73 at org.apache.spark.repl.Main$.doMain(Main.scala:84)
74 at org.apache.spark.repl.Main$.main(Main.scala:59)
75 at org.apache.spark.repl.Main.main(Main.scala)
76 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
77 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
78 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
79 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
80 at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
81 at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
82 at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
83 at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
84 at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
85 at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
86 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
87 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
88Caused by: java.net.URISyntaxException: Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes
89 at java.base/java.net.URI$Parser.fail(URI.java:2913)
90 at java.base/java.net.URI$Parser.checkChars(URI.java:3084)
91 at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3166)
92 at java.base/java.net.URI$Parser.parse(URI.java:3114)
93 at java.base/java.net.URI.<init>(URI.java:600)
94 at org.apache.spark.repl.ExecutorClassLoader.<init>(ExecutorClassLoader.scala:57)
95 ... 67 more
9621/12/11 19:28:36 ERROR Utils: Uncaught exception in thread main
97java.lang.NullPointerException
98 at org.apache.spark.scheduler.local.LocalSchedulerBackend.org$apache$spark$scheduler$local$LocalSchedulerBackend$$stop(LocalSchedulerBackend.scala:173)
99 at org.apache.spark.scheduler.local.LocalSchedulerBackend.stop(LocalSchedulerBackend.scala:144)
100 at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:927)
101 at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2516)
102 at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2086)
103 at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1442)
104 at org.apache.spark.SparkContext.stop(SparkContext.scala:2086)
105 at org.apache.spark.SparkContext.<init>(SparkContext.scala:677)
106 at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
107 at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
108 at scala.Option.getOrElse(Option.scala:201)
109 at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
110 at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
111 at $line3.$read$$iw.<init>(<console>:5)
112 at $line3.$read.<init>(<console>:4)
113 at $line3.$read$.<clinit>(<console>)
114 at $line3.$eval$.$print$lzycompute(<synthetic>:6)
115 at $line3.$eval$.$print(<synthetic>:5)
116 at $line3.$eval.$print(<synthetic>)
117 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
118 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
119 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
120 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
121 at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
122 at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
123 at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
124 at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
125 at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
126 at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
127 at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
128 at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
129 at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
130 at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
131 at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
132 at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
133 at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
134 at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
135 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
136 at scala.collection.immutable.List.foreach(List.scala:333)
137 at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
138 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
139 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
140 at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
141 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
142 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
143 at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
144 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
145 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
146 at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
147 at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
148 at org.apache.spark.repl.Main$.doMain(Main.scala:84)
149 at org.apache.spark.repl.Main$.main(Main.scala:59)
150 at org.apache.spark.repl.Main.main(Main.scala)
151 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
152 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
153 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
154 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
155 at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
156 at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
157 at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
158 at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
159 at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
160 at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
161 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
162 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16321/12/11 19:28:36 WARN MetricsSystem: Stopping a MetricsSystem that is not running
16421/12/11 19:28:36 ERROR Main: Failed to initialize Spark session.
165java.lang.reflect.InvocationTargetException
166 at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
167 at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
168 at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
169 at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
170 at org.apache.spark.executor.Executor.addReplClassLoaderIfNeeded(Executor.scala:909)
171 at org.apache.spark.executor.Executor.<init>(Executor.scala:160)
172 at org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:64)
173 at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:132)
174 at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
175 at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
176 at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
177 at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
178 at scala.Option.getOrElse(Option.scala:201)
179 at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
180 at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
181 at $line3.$read$$iw.<init>(<console>:5)
182 at $line3.$read.<init>(<console>:4)
183 at $line3.$read$.<clinit>(<console>)
184 at $line3.$eval$.$print$lzycompute(<synthetic>:6)
185 at $line3.$eval$.$print(<synthetic>:5)
186 at $line3.$eval.$print(<synthetic>)
187 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
188 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
189 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
190 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
191 at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
192 at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
193 at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
194 at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
195 at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
196 at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
197 at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
198 at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
199 at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
200 at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
201 at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
202 at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
203 at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
204 at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
205 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
206 at scala.collection.immutable.List.foreach(List.scala:333)
207 at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
208 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
209 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
210 at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
211 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
212 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
213 at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
214 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
215 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
216 at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
217 at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
218 at org.apache.spark.repl.Main$.doMain(Main.scala:84)
219 at org.apache.spark.repl.Main$.main(Main.scala:59)
220 at org.apache.spark.repl.Main.main(Main.scala)
221 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
222 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
223 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
224 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
225 at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
226 at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
227 at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
228 at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
229 at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
230 at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
231 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
232 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
233Caused by: java.net.URISyntaxException: Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes
234 at java.base/java.net.URI$Parser.fail(URI.java:2913)
235 at java.base/java.net.URI$Parser.checkChars(URI.java:3084)
236 at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3166)
237 at java.base/java.net.URI$Parser.parse(URI.java:3114)
238 at java.base/java.net.URI.<init>(URI.java:600)
239 at org.apache.spark.repl.ExecutorClassLoader.<init>(ExecutorClassLoader.scala:57)
240 ... 67 more
24121/12/11 19:28:36 ERROR Utils: Uncaught exception in thread shutdown-hook-0
242java.lang.ExceptionInInitializerError
243 at org.apache.spark.executor.Executor.stop(Executor.scala:333)
244 at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
245 at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
246 at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
247 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
248 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
249 at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
250 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
251 at scala.util.Try$.apply(Try.scala:210)
252 at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
253 at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
254 at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
255 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
256 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
257 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
258 at java.base/java.lang.Thread.run(Thread.java:829)
259Caused by: java.lang.NullPointerException
260 at org.apache.spark.shuffle.ShuffleBlockPusher$.<clinit>(ShuffleBlockPusher.scala:465)
261 ... 16 more
26221/12/11 19:28:36 WARN ShutdownHookManager: ShutdownHook '' failed, java.util.concurrent.ExecutionException: java.lang.ExceptionInInitializerError
263java.util.concurrent.ExecutionException: java.lang.ExceptionInInitializerError
264 at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
265 at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
266 at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
267 at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
268Caused by: java.lang.ExceptionInInitializerError
269 at org.apache.spark.executor.Executor.stop(Executor.scala:333)
270 at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
271 at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
272 at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
273 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
274 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
275 at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
276 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
277 at scala.util.Try$.apply(Try.scala:210)
278 at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
279 at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
280 at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
281 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
282 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
283 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
284 at java.base/java.lang.Thread.run(Thread.java:829)
285Caused by: java.lang.NullPointerException
286 at org.apache.spark.shuffle.ShuffleBlockPusher$.<clinit>(ShuffleBlockPusher.scala:465)
287 ... 16 more
288
As far as I can see, it is caused by Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes, but I don't understand what it means exactly or how to deal with it.
How can I solve this problem?
I use Spark 3.2.0, pre-built for Apache Hadoop 3.3 and later (Scala 2.13).
The JAVA_HOME, HADOOP_HOME, and SPARK_HOME path variables are set.
ANSWER
Answered 2022-Jan-07 at 15:11
I faced the same problem; I think Spark 3.2 is the problem itself. I switched to Spark 3.1.2 and it works fine.
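If you do downgrade, a quick sanity check (a minimal sketch added here for illustration, not part of the original answer, assuming a working local PySpark installation) is to start a throwaway session and print the version Spark actually reports:

# Minimal sketch (assumes a working local PySpark installation): start a
# throwaway session and print the version Spark actually reports.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("version-check")
    .getOrCreate()
)
print("Spark version:", spark.version)  # expect "3.1.2" after the downgrade
spark.stop()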
QUESTION
Why joining structure-identic dataframes gives different results?
Asked 2022-Mar-21 at 13:05
Update: the root issue was a bug which was fixed in Spark 3.2.0.
The input df structures are identical in both runs, but the outputs are different. Only the second run returns the desired result (df6). I know I can use aliases for the dataframes, which would return the desired result.
The question: what are the underlying Spark mechanics in creating df3? Spark reads df1.c1 == df2.c2 in the join's on clause, but it evidently does not pay attention to the dataframes provided. What's under the hood there? How can such behaviour be anticipated?
First run (incorrect df3 result):
1data = [
2 (1, 'bad', 'A'),
3 (4, 'ok', None)]
4df1 = spark.createDataFrame(data, ['ID', 'Status', 'c1'])
5df1 = df1.withColumn('c2', F.lit('A'))
6df1.show()
7
8#+---+------+----+---+
9#| ID|Status| c1| c2|
10#+---+------+----+---+
11#| 1| bad| A| A|
12#| 4| ok|null| A|
13#+---+------+----+---+
14
15df2 = df1.filter((F.col('Status') == 'ok'))
16df2.show()
17
18#+---+------+----+---+
19#| ID|Status| c1| c2|
20#+---+------+----+---+
21#| 4| ok|null| A|
22#+---+------+----+---+
23
24df3 = df2.join(df1, (df1.c1 == df2.c2), 'full')
25df3.show()
26
27#+----+------+----+----+----+------+----+----+
28#| ID|Status| c1| c2| ID|Status| c1| c2|
29#+----+------+----+----+----+------+----+----+
30#| 4| ok|null| A|null| null|null|null|
31#|null| null|null|null| 1| bad| A| A|
32#|null| null|null|null| 4| ok|null| A|
33#+----+------+----+----+----+------+----+----+
34
Second run (correct df6 result):
34data = [
35 (1, 'bad', 'A', 'A'),
36 (4, 'ok', None, 'A')]
37df4 = spark.createDataFrame(data, ['ID', 'Status', 'c1', 'c2'])
38df4.show()
39
40#+---+------+----+---+
41#| ID|Status| c1| c2|
42#+---+------+----+---+
43#| 1| bad| A| A|
44#| 4| ok|null| A|
45#+---+------+----+---+
46
47df5 = spark.createDataFrame(data, ['ID', 'Status', 'c1', 'c2']).filter((F.col('Status') == 'ok'))
48df5.show()
49
50#+---+------+----+---+
51#| ID|Status| c1| c2|
52#+---+------+----+---+
53#| 4| ok|null| A|
54#+---+------+----+---+
55
56df6 = df5.join(df4, (df4.c1 == df5.c2), 'full')
57df6.show()
58
59#+----+------+----+----+---+------+----+---+
60#| ID|Status| c1| c2| ID|Status| c1| c2|
61#+----+------+----+----+---+------+----+---+
62#|null| null|null|null| 4| ok|null| A|
63#| 4| ok|null| A| 1| bad| A| A|
64#+----+------+----+----+---+------+----+---+
65
I can see that the physical plans are different in that different join strategies are used internally (BroadcastNestedLoopJoin vs. SortMergeJoin). But this by itself does not explain why the results are different, as they should still be the same regardless of the internal join type.
65df3.explain()
66
67== Physical Plan ==
68BroadcastNestedLoopJoin BuildRight, FullOuter, (c1#23335 = A)
69:- *(1) Project [ID#23333L, Status#23334, c1#23335, A AS c2#23339]
70: +- *(1) Filter (isnotnull(Status#23334) AND (Status#23334 = ok))
71: +- *(1) Scan ExistingRDD[ID#23333L,Status#23334,c1#23335]
72+- BroadcastExchange IdentityBroadcastMode, [id=#9250]
73 +- *(2) Project [ID#23379L, Status#23380, c1#23381, A AS c2#23378]
74 +- *(2) Scan ExistingRDD[ID#23379L,Status#23380,c1#23381]
75
76df6.explain()
77
78== Physical Plan ==
79SortMergeJoin [c2#23459], [c1#23433], FullOuter
80:- *(2) Sort [c2#23459 ASC NULLS FIRST], false, 0
81: +- Exchange hashpartitioning(c2#23459, 200), ENSURE_REQUIREMENTS, [id=#9347]
82: +- *(1) Filter (isnotnull(Status#23457) AND (Status#23457 = ok))
83: +- *(1) Scan ExistingRDD[ID#23456L,Status#23457,c1#23458,c2#23459]
84+- *(4) Sort [c1#23433 ASC NULLS FIRST], false, 0
85 +- Exchange hashpartitioning(c1#23433, 200), ENSURE_REQUIREMENTS, [id=#9352]
86 +- *(3) Scan ExistingRDD[ID#23431L,Status#23432,c1#23433,c2#23434]
87
ANSWER
Answered 2021-Sep-24 at 16:19
Spark for some reason doesn't distinguish your c1 and c2 columns correctly. This is the fix for df3 to get your expected result:
87df3 = df2.alias('df2').join(df1.alias('df1'), (F.col('df1.c1') == F.col('df2.c2')), 'full')
88df3.show()
89
90# Output
91# +----+------+----+----+---+------+----+---+
92# | ID|Status| c1| c2| ID|Status| c1| c2|
93# +----+------+----+----+---+------+----+---+
94# | 4| ok|null| A| 1| bad| A| A|
95# |null| null|null|null| 4| ok|null| A|
96# +----+------+----+----+---+------+----+---+
97
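Another way to avoid the ambiguity, added here only as an illustration (it is not from the original answer and assumes the df1/df2 frames from the question), is to give every column a unique name on each side before joining, so the join condition can only resolve one way:

from pyspark.sql import functions as F

# Hedged sketch: prefix each side's columns with unique names before the join,
# so the join condition can only resolve against the intended DataFrame.
left = df2.select([F.col(c).alias(f"l_{c}") for c in df2.columns])
right = df1.select([F.col(c).alias(f"r_{c}") for c in df1.columns])

df3_alt = left.join(right, F.col("r_c1") == F.col("l_c2"), "full")
df3_alt.show()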
QUESTION
AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>
Asked 2022-Feb-25 at 13:18
I was using pyspark on AWS EMR (4 r5.xlarge instances as 4 workers, each with one executor and 4 cores), and I got AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>. Below is a snippet of the code that threw this error:
1search = SearchEngine(db_file_dir = "/tmp/db")
2conn = sqlite3.connect("/tmp/db/simple_db.sqlite")
3pdf_ = pd.read_sql_query('''select zipcode, lat, lng,
4 bounds_west, bounds_east, bounds_north, bounds_south from
5 simple_zipcode''',conn)
6brd_pdf = spark.sparkContext.broadcast(pdf_)
7conn.close()
8
9
10@udf('string')
11def get_zip_b(lat, lng):
12 pdf = brd_pdf.value
13 out = pdf[(np.array(pdf["bounds_north"]) >= lat) &
14 (np.array(pdf["bounds_south"]) <= lat) &
15 (np.array(pdf['bounds_west']) <= lng) &
16 (np.array(pdf['bounds_east']) >= lng) ]
17 if len(out):
18 min_index = np.argmin( (np.array(out["lat"]) - lat)**2 + (np.array(out["lng"]) - lng)**2)
19 zip_ = str(out["zipcode"].iloc[min_index])
20 else:
21 zip_ = 'bad'
22 return zip_
23
24df = df.withColumn('zipcode', get_zip_b(col("latitude"),col("longitude")))
25
Below is the traceback, where line 102, in get_zip_b refers to pdf = brd_pdf.value:
2521/08/02 06:18:19 WARN TaskSetManager: Lost task 12.0 in stage 7.0 (TID 1814, ip-10-22-17-94.pclc0.merkle.local, executor 6): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
26 File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 605, in main
27 process()
28 File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 597, in process
29 serializer.dump_stream(out_iter, outfile)
30 File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
31 self.serializer.dump_stream(self._batched(iterator), stream)
32 File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
33 for obj in iterator:
34 File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 212, in _batched
35 for item in iterator:
36 File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 450, in mapper
37 result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
38 File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
39 result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
40 File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
41 return lambda *a: f(*a)
42 File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/util.py", line 121, in wrapper
43 return f(*args, **kwargs)
44 File "/mnt/var/lib/hadoop/steps/s-1IBFS0SYWA19Z/Mobile_ID_process_center.py", line 102, in get_zip_b
45 File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 146, in value
46 self._value = self.load_from_path(self._path)
47 File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 123, in load_from_path
48 return self.load(f)
49 File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 129, in load
50 return pickle.load(file)
51AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks' from '/mnt/miniconda/lib/python3.9/site-packages/pandas/core/internals/blocks.py'>
52
Some observations and thought process:
1. After doing some searching online, the AttributeError in pyspark seems to be caused by mismatched pandas versions between the driver and the workers.
2. But I ran the same code on two different datasets: one worked without any errors and the other didn't, which seems very strange and non-deterministic, and it suggests the errors may not be caused by mismatched pandas versions. Otherwise, neither dataset would succeed.
3. I then ran the same code on the successful dataset again, but this time with different Spark configurations, setting spark.driver.memory from 2048M to 4192m, and it threw the AttributeError.
4. In conclusion, I think the AttributeError has something to do with the driver. But I can't tell from the error message how they are related, or how to fix it: AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>.
ANSWER
Answered 2021-Aug-26 at 14:53
I had the same error with pandas 1.3.2 on the server and 1.2 on my client. Downgrading pandas to 1.2 solved the problem.
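A quick way to check for such a mismatch, added here as an illustration (not part of the original answer, and assuming the spark session from the question), is to compare the driver's pandas version with what the executors report:

# Hedged sketch: compare the pandas version on the driver with the versions the
# executors see; a mismatch is what breaks unpickling of the broadcast DataFrame.
import pandas as pd

def executor_pandas_version(_):
    import pandas as pd
    return pd.__version__

driver_version = pd.__version__
executor_versions = set(
    spark.sparkContext.parallelize(range(8), 8)
         .map(executor_pandas_version)
         .collect()
)
print("driver pandas:", driver_version)
print("executor pandas:", executor_versions)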
QUESTION
Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0
Asked 2022-Feb-10 at 13:45
When switching from Glue 2.0 to 3.0, which also means switching from Spark 2.4 to 3.1.1, my jobs started to fail when processing timestamps prior to 1900 with this error:
1An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
2You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous,
3as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+s Proleptic Gregorian calendar.
4See more details in SPARK-31404.
5You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading.
6Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is.
7
I tried everything to set the int96RebaseModeInRead config in Glue, and even contacted Support, but it seems that Glue currently overwrites that flag and you cannot set it yourself.
If anyone knows a workaround, that would be great. Otherwise I will continue with Glue 2.0 and wait for the Glue dev team to fix this.
ANSWER
Answered 2022-Feb-10 at 13:45
I made it work by setting --conf to spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED.
This is a workaround though, and the Glue dev team is working on a fix, although there is no ETA.
Also, this is still very buggy. You cannot call .show() on a DynamicFrame, for example; you need to call it on a DataFrame. Also, all my jobs failed where I call data_frame.rdd.isEmpty(), don't ask me why.
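A minimal sketch of that conversion (illustrative only, not part of the original answer; dynamic_frame here is a hypothetical DynamicFrame produced earlier in the job):

# Hedged sketch: inspect the data through the underlying Spark DataFrame rather
# than calling .show() on the Glue DynamicFrame itself.
df = dynamic_frame.toDF()  # dynamic_frame: hypothetical DynamicFrame from earlier in the job
df.show(5)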
Update 24.11.2021: I reached out to the Glue Dev Team and they told me that this is the intended way of fixing it. There is a workaround that can be done inside of the script though:
7sc = SparkContext()
8# Get current sparkconf which is set by glue
9conf = sc.getConf()
10# add additional spark configurations
11conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
12conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
13conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
14conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
15# Restart spark context
16sc.stop()
17sc = SparkContext.getOrCreate(conf=conf)
18# create glue context with the restarted sc
19glueContext = GlueContext(sc)
20
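Outside of Glue's managed context (for example in a plain Spark 3.1 job), the same settings can presumably be supplied when building the session directly; a minimal sketch under that assumption, not part of the original answer:

# Minimal sketch (assumes plain Spark 3.1.x rather than Glue's managed context):
# apply the same rebase settings when building the session directly.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
    .config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
    .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
    .config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
    .getOrCreate()
)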
QUESTION
NoSuchMethodError on com.fasterxml.jackson.dataformat.xml.XmlMapper.coercionConfigDefaults()
Asked 2022-Feb-09 at 12:31
I'm parsing an XML string to convert it to a JsonNode in Scala using an XmlMapper from the Jackson library. I code in a Databricks notebook, so compilation is done on a cloud cluster. When compiling my code I got the error java.lang.NoSuchMethodError: com.fasterxml.jackson.dataformat.xml.XmlMapper.coercionConfigDefaults()Lcom/fasterxml/jackson/databind/cfg/MutableCoercionConfig; followed by a hundred lines of "at com.databricks. ...".
Maybe I forgot to import something, but to me this looks fine (tell me if I'm wrong):
1import ch.qos.logback.classic._
2import com.typesafe.scalalogging._
3import com.fasterxml.jackson._
4import com.fasterxml.jackson.core._
5import com.fasterxml.jackson.databind.{ObjectMapper, JsonNode}
6import com.fasterxml.jackson.dataformat.xml._
7import com.fasterxml.jackson.module.scala._
8import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
9import java.io._
10import java.time.Instant
11import java.util.concurrent.TimeUnit
12import javax.xml.parsers._
13import okhttp3.{Headers, OkHttpClient, Request, Response, RequestBody, FormBody}
14import okhttp3.OkHttpClient.Builder._
15import org.apache.spark._
16import org.xml.sax._
17
As I'm using Databricks, there's no SBT file for dependencies. Instead, I installed the libs I need directly on the cluster. Here are the ones I'm using:
17com.squareup.okhttp:okhttp:2.7.5
18com.squareup.okhttp3:okhttp:4.9.0
19com.squareup.okhttp3:okhttp:3.14.9
20org.scala-lang.modules:scala-swing_3:3.0.0
21ch.qos.logback:logback-classic:1.2.6
22com.typesafe:scalalogging-slf4j_2.10:1.1.0
23cc.spray.json:spray-json_2.9.1:1.0.1
24com.fasterxml.jackson.module:jackson-module-scala_3:2.13.0
25javax.xml.parsers:jaxp-api:1.4.5
26org.xml.sax:2.0.1
27
The code causing the error is simply (coming from https://www.baeldung.com/jackson-convert-xml-json, Chapter 5):
27val xmlMapper: XmlMapper = new XmlMapper()
28val jsonNode: JsonNode = xmlMapper.readTree(responseBody.getBytes())
29
with responseBody being a String containing an XML document (I previously checked the integrity of the XML). When I remove those two lines, the code works fine.
I've read tons of articles and forum posts, but I can't figure out what's causing my issue. Can someone please help me? Thanks a lot! :)
ANSWER
Answered 2021-Oct-07 at 12:08
Welcome to dependency hell and breaking changes in libraries.
This usually happens when various libraries bring in different versions of the same library. In this case it is Jackson.
java.lang.NoSuchMethodError: com.fasterxml.jackson.dataformat.xml.XmlMapper.coercionConfigDefaults()Lcom/fasterxml/jackson/databind/cfg/MutableCoercionConfig;
means: one library probably requires a Jackson version which has this method, but the version on the classpath does not have it yet, or it was removed because it was deprecated or renamed.
In a case like this it is good to print the dependency tree and check which Jackson version each library requires, and if possible use newer versions of the required libraries.
Solution: use libraries that depend on compatible versions of Jackson. There is no other shortcut.
QUESTION
Cannot find conda info. Please verify your conda installation on EMR
Asked 2022-Feb-05 at 00:17
I am trying to install conda on EMR; below is my bootstrap script. It looks like conda is getting installed, but it is not being added to the environment variables. When I manually update the $PATH variable on the EMR master node, it can identify conda. I want to use conda on Zeppelin.
I also tried adding the config below while launching my EMR instance; however, I still get the error mentioned below.
1 "classification": "spark-env",
2 "properties": {
3 "conda": "/home/hadoop/conda/bin"
4 }
5
5[hadoop@ip-172-30-5-150 ~]$ PATH=/home/hadoop/conda/bin:$PATH
6[hadoop@ip-172-30-5-150 ~]$ conda
7usage: conda [-h] [-V] command ...
8
9conda is a tool for managing and deploying applications, environments and packages.
10
10#!/usr/bin/env bash
11
12
13# Install conda
14wget https://repo.continuum.io/miniconda/Miniconda3-4.2.12-Linux-x86_64.sh -O /home/hadoop/miniconda.sh \
15 && /bin/bash ~/miniconda.sh -b -p $HOME/conda
16
17
18conda config --set always_yes yes --set changeps1 no
19conda install conda=4.2.13
20conda config -f --add channels conda-forge
21rm ~/miniconda.sh
22echo bootstrap_conda.sh completed. PATH now: $PATH
23export PYSPARK_PYTHON="/home/hadoop/conda/bin/python3.5"
24
25echo -e '\nexport PATH=$HOME/conda/bin:$PATH' >> $HOME/.bashrc && source $HOME/.bashrc
26
27
28conda create -n zoo python=3.7 # "zoo" is conda environment name, you can use any name you like.
29conda activate zoo
30sudo pip3 install tensorflow
31sudo pip3 install boto3
32sudo pip3 install botocore
33sudo pip3 install numpy
34sudo pip3 install pandas
35sudo pip3 install scipy
36sudo pip3 install s3fs
37sudo pip3 install matplotlib
38sudo pip3 install -U tqdm
39sudo pip3 install -U scikit-learn
40sudo pip3 install -U scikit-multilearn
41sudo pip3 install xlutils
42sudo pip3 install natsort
43sudo pip3 install pydot
44sudo pip3 install python-pydot
45sudo pip3 install python-pydot-ng
46sudo pip3 install pydotplus
47sudo pip3 install h5py
48sudo pip3 install graphviz
49sudo pip3 install recmetrics
50sudo pip3 install openpyxl
51sudo pip3 install xlrd
52sudo pip3 install xlwt
53sudo pip3 install tensorflow.io
54sudo pip3 install Cython
55sudo pip3 install ray
56sudo pip3 install zoo
57sudo pip3 install analytics-zoo
58sudo pip3 install analytics-zoo[ray]
59#sudo /usr/bin/pip-3.6 install -U imbalanced-learn
60
61
62
ANSWER
Answered 2022-Feb-05 at 00:17
I got conda working by modifying the script as below; the EMR Python versions were colliding with the conda version:
62wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh -O /home/hadoop/miniconda.sh \
63 && /bin/bash ~/miniconda.sh -b -p $HOME/conda
64
65echo -e '\n export PATH=$HOME/conda/bin:$PATH' >> $HOME/.bashrc && source $HOME/.bashrc
66
67
68conda config --set always_yes yes --set changeps1 no
69conda config -f --add channels conda-forge
70
71
72conda create -n zoo python=3.7 # "zoo" is conda environment name
73conda init bash
74source activate zoo
75conda install python 3.7.0 -c conda-forge orca
76sudo /home/hadoop/conda/envs/zoo/bin/python3.7 -m pip install virtualenv
77
and by setting the Zeppelin python and pyspark parameters to:
"spark.pyspark.python": "/home/hadoop/conda/envs/zoo/bin/python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path": "/home/hadoop/conda/envs/zoo/bin/",
"zeppelin.pyspark.python": "/home/hadoop/conda/bin/python",
"zeppelin.python": "/home/hadoop/conda/bin/python"
Orca only supports TF up to 1.5, hence it was not working, as I am using TF2.
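As a quick sanity check (a minimal sketch, not part of the original answer; it assumes a SparkSession is already available, e.g. in a Zeppelin %pyspark paragraph), you can confirm that the executors really pick up the conda environment's interpreter once spark.pyspark.python is set:
# Sketch: print the Python version seen by the driver and by each executor.
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print("driver:", sys.version)
executor_versions = (
    spark.sparkContext
    .parallelize(range(2), 2)          # two dummy partitions, one task each
    .map(lambda _: sys.version)        # runs on the executors
    .distinct()
    .collect()
)
print("executors:", executor_versions)  # expect the conda env's Python on every executor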
QUESTION
How to set Docker Compose `env_file` relative to `.yml` file when multiple `--file` option is used?
Asked 2021-Dec-20 at 18:51
I am trying to set my env_file configuration to be relative to each of the multiple docker-compose.yml file locations, instead of relative to the first docker-compose.yml.
The documentation (https://docs.docker.com/compose/compose-file/compose-file-v3/#env_file) suggests this should be possible:
If you have specified a Compose file with docker-compose -f FILE, paths in env_file are relative to the directory that file is in.
For example, when I issue
docker compose \
  --file docker-compose.yml \
  --file backend/docker-compose.yml \
  --file docker-compose.override.yml up
all of the env_file paths in the second (i.e. backend/docker-compose.yml) and third (i.e. docker-compose.override.yml) files are relative to the location of the first file (i.e. docker-compose.yml).
I would like the env_file settings in each docker-compose.yml file to be relative to the file they are defined in.
Is this possible?
Thank you for your time 🙏
In case you are curious about the context:
I would like to have a backend repo that is self-contained and the backend developer can just work on it without needing the frontend container. The frontend repo will pull in the backend repo as a Git submodule, because the frontend container needs the backend container as a dependency. Here are my 2 repo's:
- backend: https://gitlab.com/starting-spark/porter/backend
- frontend: https://gitlab.com/starting-spark/porter/frontend
The backend is organized like this:
/docker-compose.yml
/docker-compose.override.yml
The frontend is organized like this:
/docker-compose.yml
/docker-compose.override.yml
/backend/                              # pulled in as a Git submodule
/backend/docker-compose.yml
/backend/docker-compose.override.yml
Everything works if I place my env_file inside the docker-compose.override.yml file. The backend's override env_file will be relative to the backend docker-compose.yml, and the frontend's override env_file will be relative to the frontend docker-compose.yml. The frontend will never use the backend's docker-compose.override.yml.
But I wanted to put the backend's env_file setting into the backend's docker-compose.yml instead, so that projects needing the backend container can inherit and just use its defaults. If the depending project wants to override the backend's env_file, it can do so in its own docker-compose.override.yml.
I hope that makes sense.
If there's another pattern to organizing Docker-Compose projects that handles this scenario, please let me know.
- I did want to avoid a mono-repo.
ANSWER
Answered 2021-Dec-20 at 18:51
It turns out that there's already an issue and discussion regarding this:
The thread points out that this is the expected behavior and is documented here: https://docs.docker.com/compose/extends/#understanding-multiple-compose-files
When you use multiple configuration files, you must make sure all paths in the files are relative to the base Compose file (the first Compose file specified with -f). This is required because override files need not be valid Compose files. Override files can contain small fragments of configuration. Tracking which fragment of a service is relative to which path is difficult and confusing, so to keep paths easier to understand, all paths must be defined relative to the base file.
There's a workaround within that discussion that works fairly well: https://github.com/docker/compose/issues/3874#issuecomment-470311052
The workaround is to use an ENV var that has a default:
- ${PROXY:-.}/haproxy/conf:/usr/local/etc/haproxy
Or in my case:
    env_file:
      - ${BACKEND_BASE:-.}/.env
Hope that can be helpful for others 🤞
In case anyone is interested in the full code:
- backend's docker-compose.yml: https://gitlab.com/starting-spark/porter/backend/-/blob/3.4.3/docker-compose.yml#L13-14
- backend's docker-compose.override.yml: https://gitlab.com/starting-spark/porter/backend/-/blob/3.4.3/docker-compose.override.yml#L3-4
- backend's .env: https://gitlab.com/starting-spark/porter/backend/-/blob/3.4.3/.env
- frontend's docker-compose.yml: https://gitlab.com/starting-spark/porter/frontend/-/blob/3.2.2/docker-compose.yml#L5-6
- frontend's docker-compose.override.yml: https://gitlab.com/starting-spark/porter/frontend/-/blob/3.2.2/docker-compose.override.yml#L3-4
- frontend's .env: https://gitlab.com/starting-spark/porter/frontend/-/blob/3.2.2/.env#L16
QUESTION
Read spark data with column that clashes with partition name
Asked 2021-Dec-17 at 16:15
I have the following file paths that we read with partitions on S3:
prefix/company=abcd/service=xyz/date=2021-01-01/file_01.json
prefix/company=abcd/service=xyz/date=2021-01-01/file_02.json
prefix/company=abcd/service=xyz/date=2021-01-01/file_03.json
When I read these with pyspark
self.spark \
    .read \
    .option("basePath", 'prefix') \
    .schema(self.schema) \
    .json(['company=abcd/service=xyz/date=2021-01-01/'])
All the files have the same schema and get loaded in the table as rows. A file could be something like this:
{"id": "foo", "color": "blue", "date": "2021-12-12"}
The issue is that sometimes the files have a date field that clashes with one of my partition columns (date). So I want to know if it is possible to load the files without the partition columns, rename the JSON date column, and then add the partition columns.
Final table would be:
| id  | color | file_date  | company | service | date       |
-------------------------------------------------------------
| foo | blue  | 2021-12-12 | abcd    | xyz     | 2021-01-01 |
| bar | red   | 2021-10-10 | abcd    | xyz     | 2021-01-01 |
| baz | green | 2021-08-08 | abcd    | xyz     | 2021-01-01 |
EDIT:
More information: I sometimes have 5 or 6 partitions, and date is one of them (not the last). I also need to read multiple date partitions at once. The schema that I pass to Spark also contains the partition columns, which makes it more complicated.
I don't control the input data so I need to read as is. I can rename the file columns but not the partition columns.
Would it be possible to add an alias to file columns as we would do when joining 2 dataframes?
Spark 3.1
ANSWER
Answered 2021-Dec-14 at 02:46
Yes, we can read all the JSON files without partition columns. Directly use the parent folder path, and it will load the data from all partitions into the data frame.
After reading the data frame, you can use the withColumnRenamed() function to rename the date field.
Something like the following should work:
df = spark.read.json("s3://bucket/table/**/*.json")

renamedDF = df.withColumnRenamed("old column name", "new column name")
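If you also need the partition values back after reading without them, one option is to rename the conflicting JSON field and re-derive the partition columns from the file path. The sketch below is illustrative only (the bucket, prefix, and partition names are assumptions, not taken from the answer above):
# Sketch: read without partition discovery, rename the clashing "date" field,
# then rebuild the company/service/date columns from the input file path.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .json("s3://bucket/prefix/company=abcd/service=xyz/date=2021-01-01/")  # no basePath, so no partition columns are added
    .withColumnRenamed("date", "file_date")
    .withColumn("_path", F.input_file_name())
    .withColumn("company", F.regexp_extract("_path", r"company=([^/]+)", 1))
    .withColumn("service", F.regexp_extract("_path", r"service=([^/]+)", 1))
    .withColumn("date", F.regexp_extract("_path", r"date=([^/]+)", 1))
    .drop("_path")
)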
QUESTION
How do I parse xml documents in Palantir Foundry?
Asked 2021-Dec-09 at 21:17
I have a set of .xml documents that I want to parse.
I have previously tried to parse them using methods that take the file contents and dump them into a single cell; however, I've noticed this doesn't work in practice, since I'm seeing slower and slower run times, often with one task taking tens of hours to run.
The first transform of mine takes the .xml contents and puts them into a single cell, and a second transform takes this string and uses Python's xml library to parse the string into a document. I'm then able to extract properties from this document and return a DataFrame.
I'm using a UDF to conduct the process of mapping the string contents to the fields I want.
How can I make this faster / work better with large .xml files?
ANSWER
Answered 2021-Dec-09 at 21:17
For this problem, we're going to combine a couple of different techniques to make this code both testable and highly scalable.
Theory
When parsing raw files, you have a couple of options you can consider:
- ❌ You can write your own parser to read bytes from files and convert them into data Spark can understand.
- This is highly discouraged due to the engineering time required and the unscalable architecture. It doesn't take advantage of distributed compute, as you must bring the entire raw file to your parsing method before you can use it. This is not an effective use of your resources.
- ⚠ You can use your own parser library not made for Spark, such as the XML Python library mentioned in the question
- While this is less difficult to accomplish than writing your own parser, it still does not take advantage of distributed computation in Spark. It is easier to get something running, but it will eventually hit a limit of performance because it does not take advantage of low-level Spark functionality only exposed when writing a Spark library.
- ✅ You can use a Spark-native raw file parser
- This is the preferred option in all cases as it takes advantage of low-level Spark functionality and doesn't require you to write your own code. If a low-level Spark parser exists, you should use it.
In our case, we can use the Databricks parser to great effect.
In general, you should also avoid using the .udf method, as it is likely being used in place of good functionality already available in the Spark API. UDFs are not as performant as native methods and should be used only when no other option is available.
A good example of UDFs covering up hidden problems would be string manipulations of column contents; while you technically can use a UDF to do things like splitting and trimming strings, these things already exist in the Spark API and will be orders of magnitude faster than your own code.
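For instance, here is a small illustrative sketch (the column names and data are hypothetical, not from the answer) of doing trim-and-split with built-in functions instead of a Python UDF:
# Sketch: native Spark functions replace a trim/split UDF and avoid the
# per-row Python serialization overhead a UDF would incur.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("  a,b,c  ",), ("d,e  ",)], ["raw"])

cleaned = df.withColumn("parts", F.split(F.trim(F.col("raw")), ","))
cleaned.show(truncate=False)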
Design
Our design is going to use the following:
- Low-level Spark-optimized file parsing done via the Databricks XML Parser
- Test-driven raw file parsing as explained here
First, we need to add the .jar to our spark_session available inside Transforms. Thanks to recent improvements, this argument, when configured, will allow you to use the .jar in both Preview/Test and at full build time. Previously, this would have required a full build, but not so now.
We need to go to our transforms-python/build.gradle file and add 2 blocks of config:
- Enable the pytest plugin
- Enable the condaJars argument and declare the .jar dependency
My /transforms-python/build.gradle now looks like the following:
buildscript {
    repositories {
        // some other things
    }

    dependencies {
        classpath "com.palantir.transforms.python:lang-python-gradle-plugin:${transformsLangPythonPluginVersion}"
    }
}

apply plugin: 'com.palantir.transforms.lang.python'
apply plugin: 'com.palantir.transforms.lang.python-defaults'

dependencies {
    condaJars "com.databricks:spark-xml_2.13:0.14.0"
}

// Apply the testing plugin
apply plugin: 'com.palantir.transforms.lang.pytest-defaults'

// ... some other awesome features you should enable
After applying this config, you'll want to restart your Code Assist session by clicking on the bottom ribbon and hitting Refresh
After refreshing Code Assist, we now have low-level functionality available to parse our .xml files; now we need to test it!
If we adopt the same style of test-driven development as here, we end up with /transforms-python/src/myproject/datasets/xml_parse_transform.py with the following contents:
from transforms.api import transform, Output, Input
from transforms.verbs.dataframes import union_many


def read_files(spark_session, paths):
    parsed_dfs = []
    for file_name in paths:
        parsed_df = spark_session.read.format('xml').options(rowTag="tag").load(file_name)
        parsed_dfs += [parsed_df]
    output_df = union_many(*parsed_dfs, how="wide")
    return output_df


@transform(
    the_output=Output("my.awesome.output"),
    the_input=Input("my.awesome.input"),
)
def my_compute_function(the_input, the_output, ctx):
    session = ctx.spark_session
    input_filesystem = the_input.filesystem()
    hadoop_path = input_filesystem.hadoop_path
    files = [hadoop_path + "/" + file_name.path for file_name in input_filesystem.ls()]
    output_df = read_files(session, files)
    the_output.write_dataframe(output_df)
... an example file /transforms-python/test/myproject/datasets/sample.xml with contents:
<tag>
<field1>
my_value
</field1>
</tag>
And a test file /transforms-python/test/myproject/datasets/test_xml_parse_transform.py:
from myproject.datasets import xml_parse_transform
from pkg_resources import resource_filename


def test_parse_xml(spark_session):
    file_path = resource_filename(__name__, "sample.xml")
    parsed_df = xml_parse_transform.read_files(spark_session, [file_path])
    assert parsed_df.count() == 1
    assert set(parsed_df.columns) == {"field1"}
We now have:
- A distributed-compute, low-level .xml parser that is highly scalable
- A test-driven setup that we can quickly iterate on to get our exact functionality right
Cheers
QUESTION
docker build vue3 not compatible with element-ui on node:16-buster-slim
Asked 2021-Dec-07 at 08:54
- dockerfile:
FROM node:16-buster-slim
RUN apt-get update

WORKDIR /app
COPY package.json /home
RUN npm install --prefix /home
- package.json
{
  "name": "test",
  "version": "0.1.0",
  "private": false,
  "scripts": {
    "dev": "vue-cli-service serve --mode development --host 0.0.0.0",
    "serve": "vue-cli-service serve --mode production --host 0.0.0.0",
    "build": "vue-cli-service build",
    "jest": "vue-cli-service test:unit --watchAll",
    "lint": "vue-cli-service lint"
  },
  "dependencies": {
    "axios": "^0.21.4",
    "core-js": "^3.6.5",
    "dateformat": "^5.0.2",
    "element-plus": "^1.1.0-beta.9",
    "element-ui": "^2.15.6",
    "lib-flexible": "^0.3.2",
    "ol": "^6.6.1",
    "spark-md5": "^3.0.2",
    "vue": "^3.0.0",
    "vue-router": "^4.0.0-0",
    "vuelayers": "^0.11.36",
    "vuex": "^4.0.0-0"
  },
  "devDependencies": {
    "@vue/cli-plugin-babel": "~4.5.0",
    "@vue/cli-plugin-eslint": "~4.5.0",
    "@vue/cli-plugin-router": "~4.5.0",
    "@vue/cli-plugin-unit-jest": "~4.5.0",
    "@vue/cli-plugin-vuex": "~4.5.0",
    "@vue/cli-service": "~4.5.0",
    "@vue/compiler-sfc": "^3.0.0",
    "@vue/eslint-config-standard": "^5.1.2",
    "@vue/test-utils": "^2.0.0-0",
    "babel-eslint": "^10.1.0",
    "babel-plugin-component": "^1.1.1",
    "eslint": "^6.7.2",
    "eslint-plugin-import": "^2.20.2",
    "eslint-plugin-node": "^11.1.0",
    "eslint-plugin-promise": "^4.2.1",
    "eslint-plugin-standard": "^4.0.0",
    "eslint-plugin-vue": "^7.0.0",
    "fs-extra": "^10.0.0",
    "lint-staged": "^9.5.0",
    "mockjs": "^1.1.0",
    "node-sass": "^4.14.1",
    "px2rem-loader": "^0.1.9",
    "sass-loader": "^8.0.2",
    "typescript": "~3.9.3",
    "vue-jest": "^5.0.0-0"
  },
  "gitHooks": {
    "pre-commit": "lint-staged"
  },
  "lint-staged": {
    "*.{js,jsx,vue}": [
      "vue-cli-service lint",
      "git add"
    ]
  }
}
Step: RUN npm install --prefix /home
---> Running in fc08d7e933ed
npm notice
npm notice New patch version of npm available! 8.1.0 -> 8.1.4
npm notice Changelog: https://github.com/npm/cli/releases/tag/v8.1.4
npm notice Run npm install -g npm@8.1.4 to update!
npm notice
npm ERR! code ERESOLVE
npm ERR! ERESOLVE unable to resolve dependency tree
npm ERR!
npm ERR! While resolving: artemis@0.1.0
npm ERR! Found: vue@3.2.22
npm ERR! node_modules/vue
npm ERR! vue@"^3.0.0" from the root project
npm ERR!
npm ERR! Could not resolve dependency:
npm ERR! peer vue@"^2.5.17" from element-ui@2.15.6
npm ERR! node_modules/element-ui
npm ERR! element-ui@"^2.15.6" from the root project
npm ERR!
npm ERR! Fix the upstream dependency conflict, or retry
npm ERR! this command with --force, or --legacy-peer-deps
npm ERR! to accept an incorrect (and potentially broken) dependency resolution.
npm ERR!
npm ERR! See /root/.npm/eresolve-report.txt for a full report.
npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2021-11-25T01_40_08_953Z-debug.log
But this package.json works well on the node:lts-alpine base image.
ANSWER
Answered 2021-Dec-07 at 08:54
It seems that you have problems with peer dependencies. If you set npm to use its legacy dependency resolution logic when installing your packages, you will solve the problem.
Just add this setting to your Dockerfile before running npm install:
...
COPY package.json /home
RUN npm config set legacy-peer-deps true
RUN npm install --prefix /home
To understand what this flag does, I suggest you read this SO answer, where it is explained extensively.
Community Discussions contain sources that include Stack Exchange Network
Tutorials and Learning Resources in Spark
Tutorials and Learning Resources are not available at this moment for Spark