Explore all Hadoop open source software, libraries, packages, source code, cloud functions and APIs.

Popular New Releases in Hadoop

xgboost: Release candidate of version 1.6.0

luigi: 3.0.3

alluxio: Alluxio v2.7.4

hazelcast: v4.1.9

hbase: Apache HBase 2.4.11 is now available for download

Popular Libraries in Hadoop

spark

by apache | Scala | 32507 stars | Apache-2.0

Apache Spark - A unified analytics engine for large-scale data processing

xgboost

by dmlc | C++ | 22464 stars | Apache-2.0

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

kafka

by apache | Java | 21667 stars | Apache-2.0

Mirror of Apache Kafka

data-science-ipython-notebooks

by donnemartin | Python | 21519 stars | license: NOASSERTION

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

flink

by apache | Java | 18609 stars | Apache-2.0

Apache Flink

luigi

by spotify | Python | 14716 stars | Apache-2.0

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
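
To make that description concrete, here is a minimal, hypothetical Luigi task; the task name and output path are made up for illustration, and it assumes luigi is installed:

import luigi

class HelloTask(luigi.Task):
    def output(self):
        # Hypothetical output target used only for this example
        return luigi.LocalTarget("hello.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("hello from luigi\n")

if __name__ == "__main__":
    # local_scheduler=True avoids needing a central luigid daemon for a quick test
    luigi.build([HelloTask()], local_scheduler=True)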

presto

by prestodb | Java | 13394 stars | Apache-2.0

The official home of the Presto distributed SQL query engine for big data

attic-predictionio

by apache | Scala | 12509 stars | Apache-2.0

PredictionIO, a machine learning server for developers and ML engineers.

hadoop

by apache | Java | 12457 stars | Apache-2.0

Apache Hadoop

Trending New libraries in Hadoop

school-of-sre

by linkedin | HTML | 4921 stars | license: NOASSERTION

At LinkedIn, we use this curriculum to onboard our entry-level talent into the SRE role.

SZT-bigdata

by geekyouth | Scala | 1137 stars | GPL-3.0

Shenzhen Metro big data passenger-flow analysis system 🚇🚄🌟

OpenMetadata

by open-metadata | TypeScript | 901 stars | Apache-2.0

Open standard for metadata. A single place to discover, collaborate, and get your data right.

cp-all-in-one

by confluentinc | Shell | 459 stars | license not specified

docker-compose.yml files for cp-all-in-one, cp-all-in-one-community, and cp-all-in-one-cloud

spiderman

by TurboWay | Python | 447 stars | MIT

A general-purpose distributed web crawler framework based on scrapy-redis

relly

by KOBA789 | Rust | 389 stars | MIT

A small RDBMS implementation for learning how an RDBMS works

hyperspace

by microsoft | Scala | 336 stars | Apache-2.0

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.

Prophecis

by WeBankFinTech | Go | 317 stars | Apache-2.0

Prophecis is a one-stop, cloud-native machine learning platform.

ckman

by housepower | Go | 277 stars | Apache-2.0

A tool used to manage and monitor ClickHouse databases.

Top Authors in Hadoop

1. apache: 64 libraries, 148879 stars
2. PacktPublishing: 15 libraries, 131 stars
3. cloudera: 13 libraries, 785 stars
4. linkedin: 10 libraries, 8076 stars
5. hortonworks: 9 libraries, 1005 stars
6. spotify: 8 libraries, 15248 stars
7. prestodb: 8 libraries, 13699 stars
8. pranab: 7 libraries, 421 stars
9. sequenceiq: 7 libraries, 117 stars
10. mckeeh3: 7 libraries, 114 stars


Trending Kits in Hadoop

No trending kits are available for Hadoop at the moment.

Trending Discussions on Hadoop

spark-shell throws java.lang.reflect.InvocationTargetException on running

spark-shell exception org.apache.spark.SparkException: Exception thrown in awaitResult

determine written object paths with Pyspark 3.2.1 + hadoop 3.3.2

How to read a csv file from s3 bucket using pyspark

Hadoop to SQL through SSIS Package : Data incorrect format

How to run spark 3.2.0 on google dataproc?

AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>

Cannot find conda info. Please verify your conda installation on EMR

PySpark runs in YARN client mode but fails in cluster mode for "User did not initialize spark context!"

Where to find spark log in dataproc when running job on cluster mode

QUESTION

spark-shell throws java.lang.reflect.InvocationTargetException on running

Asked 2022-Apr-01 at 19:53

When I execute run-example SparkPi, for example, it works perfectly, but when I run spark-shell, it throws these exceptions:

1WARNING: An illegal reflective access operation has occurred
2WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/big_data/spark-3.2.0-bin-hadoop3.2-scala2.13/jars/spark-unsafe_2.13-3.2.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
3WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
4WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
5WARNING: All illegal access operations will be denied in a future release
6Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
7Setting default log level to "WARN".
8To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
9Welcome to
10      ____              __
11     / __/__  ___ _____/ /__
12    _\ \/ _ \/ _ `/ __/  '_/
13   /___/ .__/\_,_/_/ /_/\_\   version 3.2.0
14      /_/
15
16Using Scala version 2.13.5 (OpenJDK 64-Bit Server VM, Java 11.0.9.1)
17Type in expressions to have them evaluated.
18Type :help for more information.
1921/12/11 19:28:36 ERROR SparkContext: Error initializing SparkContext.
20java.lang.reflect.InvocationTargetException
21        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
22        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
23        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
24        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
25        at org.apache.spark.executor.Executor.addReplClassLoaderIfNeeded(Executor.scala:909)
26        at org.apache.spark.executor.Executor.<init>(Executor.scala:160)
27        at org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:64)
28        at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:132)
29        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
30        at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
31        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
32        at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
33        at scala.Option.getOrElse(Option.scala:201)
34        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
35        at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
36        at $line3.$read$$iw.<init>(<console>:5)
37        at $line3.$read.<init>(<console>:4)
38        at $line3.$read$.<clinit>(<console>)
39        at $line3.$eval$.$print$lzycompute(<synthetic>:6)
40        at $line3.$eval$.$print(<synthetic>:5)
41        at $line3.$eval.$print(<synthetic>)
42        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
43        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
44        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
45        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
46        at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
47        at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
48        at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
49        at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
50        at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
51        at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
52        at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
53        at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
54        at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
55        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
56        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
57        at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
58        at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
59        at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
60        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
61        at scala.collection.immutable.List.foreach(List.scala:333)
62        at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
63        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
64        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
65        at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
66        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
67        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
68        at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
69        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
70        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
71        at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
72        at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
73        at org.apache.spark.repl.Main$.doMain(Main.scala:84)
74        at org.apache.spark.repl.Main$.main(Main.scala:59)
75        at org.apache.spark.repl.Main.main(Main.scala)
76        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
77        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
78        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
79        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
80        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
81        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
82        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
83        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
84        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
85        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
86        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
87        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
88Caused by: java.net.URISyntaxException: Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes
89        at java.base/java.net.URI$Parser.fail(URI.java:2913)
90        at java.base/java.net.URI$Parser.checkChars(URI.java:3084)
91        at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3166)
92        at java.base/java.net.URI$Parser.parse(URI.java:3114)
93        at java.base/java.net.URI.<init>(URI.java:600)
94        at org.apache.spark.repl.ExecutorClassLoader.<init>(ExecutorClassLoader.scala:57)
95        ... 67 more
9621/12/11 19:28:36 ERROR Utils: Uncaught exception in thread main
97java.lang.NullPointerException
98        at org.apache.spark.scheduler.local.LocalSchedulerBackend.org$apache$spark$scheduler$local$LocalSchedulerBackend$$stop(LocalSchedulerBackend.scala:173)
99        at org.apache.spark.scheduler.local.LocalSchedulerBackend.stop(LocalSchedulerBackend.scala:144)
100        at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:927)
101        at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2516)
102        at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2086)
103        at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1442)
104        at org.apache.spark.SparkContext.stop(SparkContext.scala:2086)
105        at org.apache.spark.SparkContext.<init>(SparkContext.scala:677)
106        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
107        at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
108        at scala.Option.getOrElse(Option.scala:201)
109        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
110        at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
111        at $line3.$read$$iw.<init>(<console>:5)
112        at $line3.$read.<init>(<console>:4)
113        at $line3.$read$.<clinit>(<console>)
114        at $line3.$eval$.$print$lzycompute(<synthetic>:6)
115        at $line3.$eval$.$print(<synthetic>:5)
116        at $line3.$eval.$print(<synthetic>)
117        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
118        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
119        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
120        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
121        at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
122        at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
123        at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
124        at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
125        at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
126        at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
127        at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
128        at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
129        at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
130        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
131        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
132        at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
133        at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
134        at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
135        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
136        at scala.collection.immutable.List.foreach(List.scala:333)
137        at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
138        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
139        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
140        at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
141        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
142        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
143        at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
144        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
145        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
146        at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
147        at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
148        at org.apache.spark.repl.Main$.doMain(Main.scala:84)
149        at org.apache.spark.repl.Main$.main(Main.scala:59)
150        at org.apache.spark.repl.Main.main(Main.scala)
151        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
152        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
153        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
154        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
155        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
156        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
157        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
158        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
159        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
160        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
161        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
162        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16321/12/11 19:28:36 WARN MetricsSystem: Stopping a MetricsSystem that is not running
16421/12/11 19:28:36 ERROR Main: Failed to initialize Spark session.
165java.lang.reflect.InvocationTargetException
166        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
167        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
168        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
169        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
170        at org.apache.spark.executor.Executor.addReplClassLoaderIfNeeded(Executor.scala:909)
171        at org.apache.spark.executor.Executor.<init>(Executor.scala:160)
172        at org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:64)
173        at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:132)
174        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
175        at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
176        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
177        at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
178        at scala.Option.getOrElse(Option.scala:201)
179        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
180        at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
181        at $line3.$read$$iw.<init>(<console>:5)
182        at $line3.$read.<init>(<console>:4)
183        at $line3.$read$.<clinit>(<console>)
184        at $line3.$eval$.$print$lzycompute(<synthetic>:6)
185        at $line3.$eval$.$print(<synthetic>:5)
186        at $line3.$eval.$print(<synthetic>)
187        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
188        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
189        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
190        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
191        at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
192        at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
193        at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
194        at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
195        at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
196        at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
197        at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
198        at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
199        at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
200        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
201        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
202        at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
203        at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
204        at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
205        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
206        at scala.collection.immutable.List.foreach(List.scala:333)
207        at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
208        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
209        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
210        at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
211        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
212        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
213        at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
214        at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
215        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
216        at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
217        at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
218        at org.apache.spark.repl.Main$.doMain(Main.scala:84)
219        at org.apache.spark.repl.Main$.main(Main.scala:59)
220        at org.apache.spark.repl.Main.main(Main.scala)
221        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
222        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
223        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
224        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
225        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
226        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
227        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
228        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
229        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
230        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
231        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
232        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
233Caused by: java.net.URISyntaxException: Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes
234        at java.base/java.net.URI$Parser.fail(URI.java:2913)
235        at java.base/java.net.URI$Parser.checkChars(URI.java:3084)
236        at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3166)
237        at java.base/java.net.URI$Parser.parse(URI.java:3114)
238        at java.base/java.net.URI.<init>(URI.java:600)
239        at org.apache.spark.repl.ExecutorClassLoader.<init>(ExecutorClassLoader.scala:57)
240        ... 67 more
24121/12/11 19:28:36 ERROR Utils: Uncaught exception in thread shutdown-hook-0
242java.lang.ExceptionInInitializerError
243        at org.apache.spark.executor.Executor.stop(Executor.scala:333)
244        at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
245        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
246        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
247        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
248        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
249        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
250        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
251        at scala.util.Try$.apply(Try.scala:210)
252        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
253        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
254        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
255        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
256        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
257        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
258        at java.base/java.lang.Thread.run(Thread.java:829)
259Caused by: java.lang.NullPointerException
260        at org.apache.spark.shuffle.ShuffleBlockPusher$.<clinit>(ShuffleBlockPusher.scala:465)
261        ... 16 more
26221/12/11 19:28:36 WARN ShutdownHookManager: ShutdownHook '' failed, java.util.concurrent.ExecutionException: java.lang.ExceptionInInitializerError
263java.util.concurrent.ExecutionException: java.lang.ExceptionInInitializerError
264        at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
265        at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
266        at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
267        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
268Caused by: java.lang.ExceptionInInitializerError
269        at org.apache.spark.executor.Executor.stop(Executor.scala:333)
270        at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
271        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
272        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
273        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
274        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
275        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
276        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
277        at scala.util.Try$.apply(Try.scala:210)
278        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
279        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
280        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
281        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
282        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
283        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
284        at java.base/java.lang.Thread.run(Thread.java:829)
285Caused by: java.lang.NullPointerException
286        at org.apache.spark.shuffle.ShuffleBlockPusher$.<clinit>(ShuffleBlockPusher.scala:465)
287        ... 16 more
288

As far as I can see, it is caused by "Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes", but I don't understand exactly what that means or how to deal with it.

How can I solve this problem?

I use Spark 3.2.0 Pre-built for Apache Hadoop 3.3 and later (Scala 2.13)

The JAVA_HOME, HADOOP_HOME, and SPARK_HOME path variables are set.

ANSWER

Answered 2022-Jan-07 at 15:11

I faced the same problem; I think Spark 3.2 itself is the problem.

I switched to Spark 3.1.2 and it works fine.
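
If it helps, a quick way to confirm which build actually launched is to print the version from a running session. A minimal sketch, run inside pyspark or any script where a SparkSession is available:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)                 # should print 3.1.2 after the downgrade
print(spark.sparkContext.uiWebUrl)   # set only if the context started successfully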

Source https://stackoverflow.com/questions/70317481

QUESTION

spark-shell exception org.apache.spark.SparkException: Exception thrown in awaitResult

Asked 2022-Mar-23 at 09:29

I am facing the error below while starting spark-shell with the YARN master. The shell works with the Spark local master.

1admin@XXXXXX:~$ spark-shell --master yarn 21/11/03 15:51:51 WARN Utils: Your hostname, XXXXXX resolves to a loopback address:
2127.0.1.1; using 192.168.29.57 instead (on interface wifi0) 21/11/03 15:51:51 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 21/11/03 15:52:01 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. Spark context Web UI available at http://XX.XX.XX.XX:4040 Spark context available as 'sc' (master = yarn, app id = application_1635934709971_0001). Spark session available as 'spark'. Welcome to
3      ____              __
4     / __/__  ___ _____/ /__
5    _\ \/ _ \/ _ `/ __/  '_/    /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
6      /_/
7
8Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java
91.8.0_301) Type in expressions to have them evaluated. Type :help for more information.
10
11scala>
12
13scala> 21/11/03 15:52:35 ERROR YarnClientSchedulerBackend: YARN application has exited unexpectedly with state FAILED! Check the YARN application logs for more details. 21/11/03 15:52:35 ERROR YarnClientSchedulerBackend: Diagnostics message: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult:
14        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
15        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
16        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
17        at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
18        at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:515)
19        at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:307)
20        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
21        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
22        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
23        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:780)
24        at java.security.AccessController.doPrivileged(Native Method)
25        at javax.security.auth.Subject.doAs(Subject.java:422)
26        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
27        at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:779)
28        at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
29        at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:804)
30        at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:834)
31        at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) Caused by: java.io.IOException: Failed to connect to /192.168.29.57:33333
32        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
33        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
34        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
35        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
36        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
37        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
38        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
39        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
40        at java.lang.Thread.run(Thread.java:748) Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.29.57:33333 Caused by: java.net.ConnectException: Connection refused
41        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
42        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715)
43        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
44        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
45        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:688)
46        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
47        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
48        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
49        at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
50        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
51        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
52        at java.lang.Thread.run(Thread.java:748)
53
5421/11/03 15:52:35 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
55

Below is spark-defaults.conf

spark.driver.memory              512m
spark.yarn.am.memory             512m
spark.executor.memory            512m
spark.eventLog.enabled true
spark.eventLog.dir file:////home/admin/spark_event_temp
spark.history.fs.logDirectory hdfs://localhost:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.sql.warehouse.dir=file:////home/admin/spark_warehouse
spark.shuffle.service.port              7337
spark.ui.port                           4040
spark.blockManager.port                 31111
spark.driver.blockManager.port          32222
spark.driver.port                       33333

Spark version: spark-2.4.5-bin-hadoop2.7

Hadoop version: hadoop-2.8.5

I can provide more information if needed. Everything is configured on my local machine.

ANSWER

Answered 2022-Mar-23 at 09:29

Adding the following properties to spark-env.sh fixed the issue for me.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/mnt/d/soft/hadoop-2.8.5
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=$SPARK_HOME
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop/
export SPARK_MASTER_HOST=127.0.0.1
export SPARK_LOCAL_IP=127.0.0.1

Source https://stackoverflow.com/questions/69823486

QUESTION

determine written object paths with Pyspark 3.2.1 + hadoop 3.3.2

Asked 2022-Mar-21 at 11:50

When writing dataframes to S3 using the s3a connector, there seems to be no official way of determining the object paths on s3 that were written in the process. What I am trying to achieve is simply determining what objects have been written when writing to s3 (using pyspark 3.2.1 with hadoop 3.3.2 and the directory committer).

The reason this might be useful:

  • partitionBy might add a dynamic number of new paths
  • Spark creates its own "part..." parquet files with cryptic names, and their number depends on the partitioning at write time

With pyspark 3.1.2 and Hadoop 3.2.0 it used to be possible to use the (not officially supported) "_SUCCESS" file, written at the path above the first partition on S3, which contained the paths of all written files. Now, however, the number of paths seems to be limited to 100, so this is no longer an option.

Is there really no official, reasonable way of achieving this task?

ANSWER

Answered 2022-Mar-21 at 11:50

Now, however, the number of paths seems to be limited to 100, so this is no longer an option.

We had to cut that in HADOOP-16570; it was one of the scale problems that surfaced during terasorting at 10-100 TB, where the time to write the _SUCCESS file started to slow down job commit. It was only ever intended for testing. Sorry.

It is just a constant in the source tree. If you were to provide a patch to make it configurable, I'll be happy to review and merge it, provided you follow the "say which AWS endpoint you ran all the tests against, or we ignore your patch" policy.

I don't know where else this information is collected. The Spark driver is told the number of files and their total size by each task commit, but it isn't given the list of paths by the tasks, not as far as I know.

Spark creates its own "part..." parquet files with cryptic names, and their number depends on the partitioning at write time.

The part-0001- bit of the filename comes from the task ID; the bit after it is a UUID created to ensure every filename is unique (see SPARK-8406, "Adding UUID to output file name to avoid accidental overwriting"). You can probably turn that off.
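
For what it's worth, here is a sketch of reading that _SUCCESS manifest from PySpark. It assumes a SparkSession named spark, the S3A committer's JSON _SUCCESS format with a "filenames" field (which is not a public contract and is capped as described above), and a placeholder bucket/prefix:

import json

success_path = "s3a://my-bucket/output/_SUCCESS"   # placeholder output location

# Read the marker through Spark itself so the same S3A credentials and
# configuration apply; wholeTextFiles returns (path, content) pairs, here just one.
manifest_text = spark.sparkContext.wholeTextFiles(success_path).values().first()
manifest = json.loads(manifest_text)

written = manifest.get("filenames", [])
print(len(written), "object paths recorded in the manifest")
for path in written:
    print(path)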

Source https://stackoverflow.com/questions/71554579

QUESTION

How to read a csv file from s3 bucket using pyspark

Asked 2022-Mar-16 at 22:53

I'm using Apache Spark 3.1.0 with Python 3.9.6. I'm trying to read a CSV file from an AWS S3 bucket with something like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
file = "s3://bucket/file.csv"

c = spark.read\
    .csv(file)\
    .count()

print(c)

But I'm getting the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o26.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"

I understand that I need to add extra libraries, but I couldn't find clear information about exactly which ones and in which versions. I've tried adding something like this to my code, but I'm still getting the same error:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'

How can I fix this?

ANSWER

Answered 2021-Aug-25 at 11:11

You need to use hadoop-aws version 3.2.0 for Spark 3. Specifying the hadoop-aws library in --packages is enough to read files from S3.

--packages org.apache.hadoop:hadoop-aws:3.2.0

You need to set the configurations below.

spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<access_key>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret_key>")

After that you can read the CSV file:

spark.read.csv("s3a://bucket/file.csv")
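
Putting the pieces of this answer together, a minimal end-to-end sketch might look like the following; the bucket name and credential values are placeholders, and the s3a settings could equally be supplied via spark-defaults.conf or core-site.xml:

import os
from pyspark.sql import SparkSession

# Must be set before the SparkSession is created so the connector jars are fetched.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.hadoop:hadoop-aws:3.2.0 pyspark-shell"
)

spark = SparkSession.builder.getOrCreate()

# Placeholder credentials; prefer an instance profile or credential provider in practice.
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<access_key>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret_key>")

df = spark.read.csv("s3a://bucket/file.csv")
print(df.count())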

Source https://stackoverflow.com/questions/68921060

QUESTION

Hadoop to SQL through SSIS Package : Data incorrect format

Asked 2022-Mar-13 at 20:05

I am using an ODBC source connected to the Hadoop system and reading the column PONum, whose value is 4400023488, with the data type text stream [DT_TEXT]. The data is converted to a string [DT_WSTR] using a Data Conversion transformation and then inserted into SQL Server using an OLE DB Destination (the destination column's type is a Unicode string, DT_WSTR).

I am able to insert the value into the SQL Server table, but it arrives in an incorrect format, 㐴〰㌵㠵㔹, when the expected value is 4400023488.

Any suggestions?

ANSWER

Answered 2022-Mar-13 at 20:04

I have two suggestions:

  1. Instead of using a Data Conversion transformation, use a derived column that converts the DT_TEXT value to DT_STR before converting it to Unicode:
(DT_WSTR, 4000)(DT_STR, 4000, 1252)[ColumnName]

Make sure that you replace 1252 with the appropriate encoding.

Also, you can use a script component: SSIS: Conversion text stream DT_TEXT to DT_WSTR.

  2. Use the Hadoop SSIS connection manager and an HDFS source instead of ODBC.
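
As an aside, the garbled value in the question is consistent with ASCII digit bytes being reinterpreted as UTF-16: each pair of digit characters collapses into a single CJK code point. A small illustration in Python (this reproduces the effect, not the question's exact output, since that depends on the exact bytes):

raw = "4400023488".encode("ascii")   # b'4400023488'
print(raw.decode("utf-16-le"))       # '㐴〰㈰㐳㠸' - pairs of digits become CJK characters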

Source https://stackoverflow.com/questions/71451745

QUESTION

How to run spark 3.2.0 on google dataproc?

Asked 2022-Mar-10 at 11:46

Currently, Google Dataproc does not offer a Spark 3.2.0 image; the latest available is 3.1.2. I want to use the pandas-on-Spark functionality that Spark released with 3.2.0.
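
For reference, the pandas-on-Spark API that ships with Spark 3.2.0 looks like this; a minimal sketch assuming a working pyspark 3.2.0 installation:

import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(psdf.describe())   # pandas-style API executed on Spark
sdf = psdf.to_spark()    # convert to a regular Spark DataFrame when needed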

I am doing the following steps to use Spark 3.2.0:

  1. Created an environment 'pyspark' locally with pyspark 3.2.0 in it
  2. Exported the environment yaml with conda env export > environment.yaml
  3. Created a Dataproc cluster with this environment.yaml. The cluster gets created correctly and the environment is available on the master and all the workers
  4. Changed the environment variables: export SPARK_HOME=/opt/conda/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark (to point to pyspark 3.2.0), export SPARK_CONF_DIR=/usr/lib/spark/conf (to use Dataproc's config files), and export PYSPARK_PYTHON=/opt/conda/miniconda3/envs/pyspark/bin/python (to make the environment packages available)

Now if I try to run the pyspark shell I get:

121/12/07 01:25:16 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener AppStatusListener threw an exception
2java.lang.NumberFormatException: For input string: "null"
3        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
4        at java.lang.Integer.parseInt(Integer.java:580)
5        at java.lang.Integer.parseInt(Integer.java:615)
6        at scala.collection.immutable.StringLike.toInt(StringLike.scala:304)
7        at scala.collection.immutable.StringLike.toInt$(StringLike.scala:304)
8        at scala.collection.immutable.StringOps.toInt(StringOps.scala:33)
9        at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:1126)
10        at org.apache.spark.status.ProcessSummaryWrapper.<init>(storeTypes.scala:527)
11        at org.apache.spark.status.LiveMiscellaneousProcess.doUpdate(LiveEntity.scala:924)
12        at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50)
13        at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1213)
14        at org.apache.spark.status.AppStatusListener.onMiscellaneousProcessAdded(AppStatusListener.scala:1427)
15        at org.apache.spark.status.AppStatusListener.onOtherEvent(AppStatusListener.scala:113)
16        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
17        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
18        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
19        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
20        at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
21        at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
22        at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
23        at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
24        at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
25        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
26        at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
27        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
28        at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1404)
29        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
30

However, the shell does start even after this, but it does not execute code. For example, I tried running set(sc.parallelize(range(10),10).map(lambda x: socket.gethostname()).collect()) and got:

121/12/07 01:25:16 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener AppStatusListener threw an exception
2java.lang.NumberFormatException: For input string: "null"
3        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
4        at java.lang.Integer.parseInt(Integer.java:580)
5        at java.lang.Integer.parseInt(Integer.java:615)
6        at scala.collection.immutable.StringLike.toInt(StringLike.scala:304)
7        at scala.collection.immutable.StringLike.toInt$(StringLike.scala:304)
8        at scala.collection.immutable.StringOps.toInt(StringOps.scala:33)
9        at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:1126)
10        at org.apache.spark.status.ProcessSummaryWrapper.<init>(storeTypes.scala:527)
11        at org.apache.spark.status.LiveMiscellaneousProcess.doUpdate(LiveEntity.scala:924)
12        at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50)
13        at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1213)
14        at org.apache.spark.status.AppStatusListener.onMiscellaneousProcessAdded(AppStatusListener.scala:1427)
15        at org.apache.spark.status.AppStatusListener.onOtherEvent(AppStatusListener.scala:113)
16        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
17        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
18        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
19        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
20        at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
21        at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
22        at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
23        at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
24        at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
25        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
26        at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
27        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
28        at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1404)
29        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
3021/12/07 01:32:15 WARN org.apache.spark.deploy.yarn.YarnAllocator: Container from a bad node: container_1638782400702_0003_01_000001 on host: monsoon-test1-w-2.us-central1-c.c.monsoon-credittech.internal. Exit status: 1. Diagnostics: [2021-12-07 
3101:32:13.672]Exception from container-launch.
32Container id: container_1638782400702_0003_01_000001
33Exit code: 1
34[2021-12-07 01:32:13.717]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
35Last 4096 bytes of prelaunch.err :
36Last 4096 bytes of stderr :
37ltChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
38        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
39        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
40        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
41        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
42        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
43        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
44        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
45        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
46        at java.lang.Thread.run(Thread.java:748)
4721/12/07 01:31:43 ERROR org.apache.spark.executor.YarnCoarseGrainedExecutorBackend: Executor self-exiting due to : Driver monsoon-test1-m.us-central1-c.c.monsoon-credittech.internal:44367 disassociated! Shutting down.
4821/12/07 01:32:13 WARN org.apache.hadoop.util.ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
49java.util.concurrent.TimeoutException
50        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
51        at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
52        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
5321/12/07 01:32:13 ERROR org.apache.spark.util.Utils: Uncaught exception in thread shutdown-hook-0
54java.lang.InterruptedException
55        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
56        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
57        at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
58        at java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination(Executors.java:675)
59        at org.apache.spark.rpc.netty.MessageLoop.stop(MessageLoop.scala:60)
60        at org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1(Dispatcher.scala:197)
61        at org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1$adapted(Dispatcher.scala:194)
62        at scala.collection.Iterator.foreach(Iterator.scala:943)
63        at scala.collection.Iterator.foreach$(Iterator.scala:943)
64        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
65        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
66        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
67        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
68        at org.apache.spark.rpc.netty.Dispatcher.stop(Dispatcher.scala:194)
69        at org.apache.spark.rpc.netty.NettyRpcEnv.cleanup(NettyRpcEnv.scala:331)
70        at org.apache.spark.rpc.netty.NettyRpcEnv.shutdown(NettyRpcEnv.scala:309)
71        at org.apache.spark.SparkEnv.stop(SparkEnv.scala:96)
72        at org.apache.spark.executor.Executor.stop(Executor.scala:335)
73        at org.apache.spark.executor.Executor.$anonfun$new$2(Executor.scala:76)
74        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
75        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
76        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
77        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
78        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
79        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
80        at scala.util.Try$.apply(Try.scala:213)
81        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
82        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
83        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
84        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
85        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
86        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
87        at java.lang.Thread.run(Thread.java:748)
88

and the same error repeats multiple times before coming to a stop.

What am I doing wrong, and how can I use Spark 3.2.0 on Google Dataproc?

ANSWER

Answered 2022-Jan-15 at 07:17

One can achieve this by:

  1. Create a Dataproc cluster with an environment (your_sample_env) that contains PySpark 3.2 as a package
  2. Modify /usr/lib/spark/conf/spark-env.sh by adding the following at its end:
SPARK_HOME="/opt/conda/miniconda3/envs/your_sample_env/lib/python/site-packages/pyspark"
SPARK_CONF="/usr/lib/spark/conf"


  3. Modify /usr/lib/spark/conf/spark-defaults.conf by commenting out the following configurations:
spark.yarn.jars=local:/usr/lib/spark/jars/*
spark.yarn.unmanagedAM.enabled=true

Now your Spark jobs will use PySpark 3.2.
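As a quick sanity check (my own addition, not part of the original answer), you can open a PySpark shell on the cluster and confirm that the 3.2 runtime and the pandas-on-Spark API are picked up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("version-check").getOrCreate()
print(spark.version)                 # should now report 3.2.x

# pyspark.pandas (pandas-on-Spark) only ships with Spark 3.2.0 and later
import pyspark.pandas as ps
print(ps.DataFrame({"a": [1, 2, 3]})["a"].sum())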

Source https://stackoverflow.com/questions/70254378

QUESTION

AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>

Asked 2022-Feb-25 at 13:18

I was using PySpark on AWS EMR (4 r5.xlarge as 4 workers, each with one executor and 4 cores), and I got AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>. Below is a snippet of the code that threw this error:

import sqlite3

import numpy as np
import pandas as pd
from pyspark.sql.functions import col, udf

# SearchEngine comes from the zip-code lookup package used in this job;
# spark and df are the SparkSession and input DataFrame defined earlier in the script.
search = SearchEngine(db_file_dir="/tmp/db")
conn = sqlite3.connect("/tmp/db/simple_db.sqlite")
pdf_ = pd.read_sql_query('''select zipcode, lat, lng, 
                        bounds_west, bounds_east, bounds_north, bounds_south from 
                        simple_zipcode''', conn)
brd_pdf = spark.sparkContext.broadcast(pdf_)
conn.close()


@udf('string')
def get_zip_b(lat, lng):
    pdf = brd_pdf.value
    out = pdf[(np.array(pdf["bounds_north"]) >= lat) &
              (np.array(pdf["bounds_south"]) <= lat) &
              (np.array(pdf['bounds_west']) <= lng) &
              (np.array(pdf['bounds_east']) >= lng)]
    if len(out):
        min_index = np.argmin((np.array(out["lat"]) - lat)**2 + (np.array(out["lng"]) - lng)**2)
        zip_ = str(out["zipcode"].iloc[min_index])
    else:
        zip_ = 'bad'
    return zip_

df = df.withColumn('zipcode', get_zip_b(col("latitude"), col("longitude")))

Below is the traceback, where line 102, in get_zip_b refers to pdf = brd_pdf.value:

21/08/02 06:18:19 WARN TaskSetManager: Lost task 12.0 in stage 7.0 (TID 1814, ip-10-22-17-94.pclc0.merkle.local, executor 6): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/util.py", line 121, in wrapper
    return f(*args, **kwargs)
  File "/mnt/var/lib/hadoop/steps/s-1IBFS0SYWA19Z/Mobile_ID_process_center.py", line 102, in get_zip_b
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 146, in value
    self._value = self.load_from_path(self._path)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 123, in load_from_path
    return self.load(f)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 129, in load
    return pickle.load(file)
AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks' from '/mnt/miniconda/lib/python3.9/site-packages/pandas/core/internals/blocks.py'>

Some observations and thought process:

1. After doing some searching online, the AttributeError in PySpark seems to be caused by mismatched pandas versions between the driver and the workers.

2. But I ran the same code on two different datasets: one worked without any errors while the other didn't, which seems very strange and non-deterministic, and suggests the error may not be caused by mismatched pandas versions; otherwise, neither dataset would succeed.

3. I then ran the same code on the successful dataset again, but this time with a different Spark configuration, changing spark.driver.memory from 2048m to 4192m, and it threw the AttributeError.

4. In conclusion, I think the AttributeError has something to do with the driver, but I can't tell from the error message how they are related or how to fix it: AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>.

ANSWER

Answered 2021-Aug-26 at 14:53

I had the same error with pandas 1.3.2 on the server while my client had 1.2. Downgrading pandas to 1.2 solved the problem.
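A quick way to check whether the driver and the executors actually disagree on the pandas version (my own sketch, reusing the existing spark session) is:

import pandas as pd

print("driver pandas:", pd.__version__)

def worker_pandas_version(_):
    # imported inside the function, so this reports the executor-side installation
    import pandas
    return pandas.__version__

versions = (spark.sparkContext
                 .parallelize(range(4), 4)
                 .map(worker_pandas_version)
                 .collect())
print("executor pandas:", set(versions))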

Source https://stackoverflow.com/questions/68625748

QUESTION

Cannot find conda info. Please verify your conda installation on EMR

Asked 2022-Feb-05 at 00:17

I am trying to install conda on EMR; my bootstrap script is below. It looks like conda gets installed, but it is not added to the PATH environment variable. When I manually update $PATH on the EMR master node, conda is found. I want to use conda on Zeppelin.

I also tried adding the configuration below while launching my EMR instance, but I still got the same error.

    "classification": "spark-env",
    "properties": {
        "conda": "/home/hadoop/conda/bin"
    }
[hadoop@ip-172-30-5-150 ~]$ PATH=/home/hadoop/conda/bin:$PATH
[hadoop@ip-172-30-5-150 ~]$ conda
usage: conda [-h] [-V] command ...

conda is a tool for managing and deploying applications, environments and packages.
#!/usr/bin/env bash


# Install conda
wget https://repo.continuum.io/miniconda/Miniconda3-4.2.12-Linux-x86_64.sh -O /home/hadoop/miniconda.sh \
    && /bin/bash ~/miniconda.sh -b -p $HOME/conda


conda config --set always_yes yes --set changeps1 no
conda install conda=4.2.13
conda config -f --add channels conda-forge
rm ~/miniconda.sh
echo bootstrap_conda.sh completed. PATH now: $PATH
export PYSPARK_PYTHON="/home/hadoop/conda/bin/python3.5"

echo -e '\nexport PATH=$HOME/conda/bin:$PATH' >> $HOME/.bashrc && source $HOME/.bashrc


conda create -n zoo python=3.7 # "zoo" is conda environment name, you can use any name you like.
conda activate zoo
sudo pip3 install tensorflow
sudo pip3 install boto3
sudo pip3 install botocore
sudo pip3 install numpy
sudo pip3 install pandas
sudo pip3 install scipy
sudo pip3 install s3fs
sudo pip3 install matplotlib
sudo pip3 install -U tqdm
sudo pip3 install -U scikit-learn
sudo pip3 install -U scikit-multilearn
sudo pip3 install xlutils
sudo pip3 install natsort
sudo pip3 install pydot
sudo pip3 install python-pydot
sudo pip3 install python-pydot-ng
sudo pip3 install pydotplus
sudo pip3 install h5py
sudo pip3 install graphviz
sudo pip3 install recmetrics
sudo pip3 install openpyxl
sudo pip3 install xlrd
sudo pip3 install xlwt
sudo pip3 install tensorflow.io
sudo pip3 install Cython
sudo pip3 install ray
sudo pip3 install zoo
sudo pip3 install analytics-zoo
sudo pip3 install analytics-zoo[ray]
#sudo /usr/bin/pip-3.6 install -U imbalanced-learn


ANSWER

Answered 2022-Feb-05 at 00:17

I got conda working by modifying the script as below; the EMR Python versions were colliding with the conda version:

wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh  -O /home/hadoop/miniconda.sh \
    && /bin/bash ~/miniconda.sh -b -p $HOME/conda

echo -e '\n export PATH=$HOME/conda/bin:$PATH' >> $HOME/.bashrc && source $HOME/.bashrc


conda config --set always_yes yes --set changeps1 no
conda config -f --add channels conda-forge


conda create -n zoo python=3.7 # "zoo" is conda environment name
conda init bash
source activate zoo
conda install python 3.7.0 -c conda-forge orca
sudo /home/hadoop/conda/envs/zoo/bin/python3.7 -m pip install virtualenv

and setting the Zeppelin Python and PySpark parameters to:

"spark.pyspark.python": "/home/hadoop/conda/envs/zoo/bin/python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path": "/home/hadoop/conda/envs/zoo/bin/",
"zeppelin.pyspark.python": "/home/hadoop/conda/bin/python",
"zeppelin.python": "/home/hadoop/conda/bin/python"
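To confirm that Zeppelin's %pyspark interpreter really picks up the conda environment, a simple check (my own addition, not from the original answer) is to print the interpreter's executable from a notebook paragraph:

import sys

# should point at /home/hadoop/conda/envs/zoo/bin/python3 if the settings above took effect
print(sys.executable)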

Orca only supports TensorFlow up to 1.5, hence it was not working, as I am using TF2.

Source https://stackoverflow.com/questions/70901724

QUESTION

PySpark runs in YARN client mode but fails in cluster mode for "User did not initialize spark context!"

Asked 2022-Jan-19 at 21:28
  • standard dataproc image 2.0
  • Ubuntu 18.04 LTS
  • Hadoop 3.2
  • Spark 3.1

I am trying to run a very simple script on a Dataproc PySpark cluster:

testing_dep.py

import os
os.listdir('./')

I can run testing_dep.py in client mode (the default on Dataproc) just fine:

gcloud dataproc jobs submit pyspark ./testing_dep.py --cluster=pyspark-monsoon --region=us-central1

But, when I try to run the same job in cluster mode I get error:

gcloud dataproc jobs submit pyspark ./testing_dep.py --cluster=pyspark-monsoon --region=us-central1 --properties=spark.submit.deployMode=cluster

error logs:

5Job [417443357bcd43f99ee3dc60f4e3bfea] submitted.
6Waiting for job output...
722/01/12 05:32:20 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at monsoon-testing-m/10.128.15.236:8032
822/01/12 05:32:20 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at monsoon-testing-m/10.128.15.236:10200
922/01/12 05:32:22 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
1022/01/12 05:32:22 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
1122/01/12 05:32:24 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1641965080466_0001
1222/01/12 05:32:42 ERROR org.apache.spark.deploy.yarn.Client: Application diagnostics message: Application application_1641965080466_0001 failed 2 times due to AM Container for appattempt_1641965080466_0001_000002 exited with  exitCode: 13
13Failing this attempt.Diagnostics: [2022-01-12 05:32:42.154]Exception from container-launch.
14Container id: container_1641965080466_0001_02_000001
15Exit code: 13
16
17[2022-01-12 05:32:42.203]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
18Last 4096 bytes of prelaunch.err :
19Last 4096 bytes of stderr :
2022/01/12 05:32:40 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: Uncaught exception: 
21java.lang.IllegalStateException: User did not initialize spark context!
22    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:520)
23    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:268)
24    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:899)
25    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:898)
26    at java.security.AccessController.doPrivileged(Native Method)
27    at javax.security.auth.Subject.doAs(Subject.java:422)
28    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
29    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:898)
30    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
31
32
33[2022-01-12 05:32:42.203]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
34Last 4096 bytes of prelaunch.err :
35Last 4096 bytes of stderr :
3622/01/12 05:32:40 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: Uncaught exception: 
37java.lang.IllegalStateException: User did not initialize spark context!
38    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:520)
39    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:268)
40    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:899)
41    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:898)
42    at java.security.AccessController.doPrivileged(Native Method)
43    at javax.security.auth.Subject.doAs(Subject.java:422)
44    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
45    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:898)
46    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
47
48
49For more detailed output, check the application tracking page: http://monsoon-testing-m:8188/applicationhistory/app/application_1641965080466_0001 Then click on links to logs of each attempt.
50. Failing the application.
51Exception in thread "main" org.apache.spark.SparkException: Application application_1641965080466_0001 finished with failed status
52    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1242)
53    at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1634)
54    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
55    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
56    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
57    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
58    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
59    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
60    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
61ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [417443357bcd43f99ee3dc60f4e3bfea] failed with error:
62Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
63https://console.cloud.google.com/dataproc/jobs/417443357bcd43f99ee3dc60f4e3bfea?project=monsoon-credittech&region=us-central1
64gcloud dataproc jobs wait '417443357bcd43f99ee3dc60f4e3bfea' --region 'us-central1' --project 'monsoon-credittech'
65https://console.cloud.google.com/storage/browser/monsoon-credittech.appspot.com/google-cloud-dataproc-metainfo/64632294-3e9b-4c55-af8a-075fc7d6f412/jobs/417443357bcd43f99ee3dc60f4e3bfea/
66gs://monsoon-credittech.appspot.com/google-cloud-dataproc-metainfo/64632294-3e9b-4c55-af8a-075fc7d6f412/jobs/417443357bcd43f99ee3dc60f4e3bfea/driveroutput
67
68
69
70

Can you please help me understand what I am doing wrong and why this code is failing?

ANSWER

Answered 2022-Jan-19 at 21:26

The error is expected when running Spark in YARN cluster mode while the job doesn't create a Spark context. See the source code of ApplicationMaster.scala.

To avoid this error, you need to create a SparkContext or SparkSession, e.g.:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
                    .appName('MySparkApp') \
                    .getOrCreate()

Client mode doesn't go through the same code path and doesn't have a similar check.
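Putting the two points together, a cluster-mode-safe version of testing_dep.py would look roughly like this (my sketch, not the exact code from the answer):

import os

from pyspark.sql import SparkSession

# creating the session up front is what satisfies the ApplicationMaster check in cluster mode
spark = SparkSession.builder.appName('testing_dep').getOrCreate()

print(os.listdir('./'))

spark.stop()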

Source https://stackoverflow.com/questions/70668449

QUESTION

Where to find spark log in dataproc when running job on cluster mode

Asked 2022-Jan-18 at 19:36

I am running the following code as a job in Dataproc. I could not find the logs in the console when running in 'cluster' mode.

import sys
import time
from datetime import datetime

from pyspark.sql import SparkSession

start_time = datetime.utcnow()

spark = SparkSession.builder.appName("check_confs").getOrCreate()

all_conf = spark.sparkContext.getConf().getAll()
print("\n\n=====\nExecuting at {}".format(datetime.utcnow()))
print(all_conf)
print("\n\n======================\n\n\n")
incoming_args = sys.argv
if len(incoming_args) > 1:
    sleep_time = int(incoming_args[1])
    print("Sleep time is {} seconds".format(sleep_time))
    if sleep_time > 0:
        time.sleep(sleep_time)

end_time = datetime.utcnow()
time_taken = (end_time - start_time).total_seconds()
print("Script execution completed in {} seconds".format(time_taken))

If I trigger the job with the deployMode property set to cluster, I cannot see the corresponding logs, but if the job is triggered in the default mode, which is client mode, I am able to see them. Below is the dictionary used for triggering the job.

"spark.submit.deployMode": "cluster"

{
        'placement': {
            'cluster_name': dataproc_cluster
        },
        'pyspark_job': {
            'main_python_file_uri': "gs://" + compute_storage + "/" + job_file,
            'args': trigger_params,
            "properties": {
                "spark.submit.deployMode": "cluster",
                "spark.executor.memory": "3155m",
                "spark.scheduler.mode": "FAIR",
            }
        }
    }

Logs visible in the console when the job runs in client mode:

21/12/07 19:11:27 INFO org.sparkproject.jetty.util.log: Logging initialized @3350ms to org.sparkproject.jetty.util.log.Slf4jLog
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.Server: jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_292-b10
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.Server: Started @3467ms
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.AbstractConnector: Started ServerConnector@18528bea{HTTP/1.1, (http/1.1)}{0.0.0.0:40389}
21/12/07 19:11:28 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8032
21/12/07 19:11:28 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ******-m/0.0.0.5:10200
21/12/07 19:11:29 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/12/07 19:11:29 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/12/07 19:11:30 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1638554180947_0014
21/12/07 19:11:31 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8030
21/12/07 19:11:33 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.


=====
Executing at 2021-12-07 19:11:35.100277
[....... ('spark.yarn.historyServer.address', '****-m:18080'), ('spark.ui.proxyBase', '/proxy/application_1638554180947_0014'), ('spark.driver.appUIAddress', 'http://***-m.c.***-123456.internal:40389'), ('spark.sql.cbo.enabled', 'true')]


======================


Sleep time is 1 seconds
Script execution completed in 9.411261 seconds
21/12/07 19:11:36 INFO org.sparkproject.jetty.server.AbstractConnector: Stopped Spark@18528bea{HTTP/1.1, (http/1.1)}{0.0.0.0:0}

When running in cluster mode, the driver output does not appear in the console; only the following is shown:

21/12/07 19:09:04 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8032
21/12/07 19:09:04 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ******-m/0.0.0.5:8032
21/12/07 19:09:05 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/12/07 19:09:05 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/12/07 19:09:06 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1638554180947_0013

ANSWER

Answered 2021-Dec-15 at 17:30

When running jobs in cluster mode, the driver logs are available in Cloud Logging under yarn-userlogs. See the doc:

By default, Dataproc runs Spark jobs in client mode and streams the driver output for viewing as explained below. However, if the user creates the Dataproc cluster by setting cluster properties to --properties spark:spark.submit.deployMode=cluster or submits the job in cluster mode by setting job properties to --properties spark.submit.deployMode=cluster, driver output is listed in YARN userlogs, which can be accessed in Logging.
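As a rough illustration (not part of the original answer), those YARN userlogs can also be read programmatically with the Cloud Logging Python client. The sketch below is hedged: the project id, cluster name, and the exact log filter (resource type cloud_dataproc_cluster, log name yarn-userlogs) are assumptions to adapt to your environment.

# Hedged sketch: reading cluster-mode driver output back from Cloud Logging.
# Project id, cluster name, and the filter string are assumptions/placeholders.
from google.cloud import logging as cloud_logging

project_id = "my-project"      # placeholder
cluster_name = "my-cluster"    # placeholder

client = cloud_logging.Client(project=project_id)
log_filter = (
    'resource.type="cloud_dataproc_cluster" '
    f'AND resource.labels.cluster_name="{cluster_name}" '
    f'AND log_name="projects/{project_id}/logs/yarn-userlogs"'
)

# Print the most recent YARN userlog entries, which include the driver's
# stdout when the job was submitted with spark.submit.deployMode=cluster.
for entry in client.list_entries(
    filter_=log_filter,
    order_by=cloud_logging.DESCENDING,
    max_results=50,
):
    print(entry.timestamp, entry.payload)

The same entries can be browsed interactively in the Cloud Logging console by applying an equivalent filter.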

Source https://stackoverflow.com/questions/70266214

Community Discussions contain sources that include Stack Exchange Network

Tutorials and Learning Resources in Hadoop

Tutorials and Learning Resources are not available at this moment for Hadoop
