Popular New Releases in Hadoop
xgboost
Release candidate of version 1.6.0
luigi
3.0.3
alluxio
Alluxio v2.7.4
hazelcast
v4.1.9
hbase
Apache HBase 2.4.11 is now available for download
Popular Libraries in Hadoop
by apache scala
32507 Apache-2.0
Apache Spark - A unified analytics engine for large-scale data processing
by dmlc c++
22464 Apache-2.0
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
by apache java
21667 Apache-2.0
Mirror of Apache Kafka
by donnemartin python
21519 NOASSERTION
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
by apache java
18609 Apache-2.0
Apache Flink
by spotify python
14716 Apache-2.0
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
by prestodb java
13394 Apache-2.0
The official home of the Presto distributed SQL query engine for big data
by apache scala
12509 Apache-2.0
PredictionIO, a machine learning server for developers and ML engineers.
by apache java
12457 Apache-2.0
Apache Hadoop
Trending New libraries in Hadoop
by linkedin html
4921 NOASSERTION
At LinkedIn, we are using this curriculum for onboarding our entry-level talents into the SRE role.
by geekyouth scala
1137 GPL-3.0
Shenzhen Metro big data passenger flow analysis system 🚇🚄🌟
by open-metadata typescript
901 Apache-2.0
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
by confluentinc shell
459
docker-compose.yml files for cp-all-in-one , cp-all-in-one-community, cp-all-in-one-cloud
by TurboWay python
447 MIT
A general-purpose distributed crawler framework based on scrapy-redis
by KOBA789 rust
389 MIT
A small RDBMS implementation for learning how an RDBMS works
by microsoft scala
336 Apache-2.0
An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
by WeBankFinTech go
317 Apache-2.0
Prophecis is a one-stop cloud native machine learning platform.
by housepower go
277 Apache-2.0
A tool used to manage and monitor ClickHouse databases
Top Authors in Hadoop
1 · 64 Libraries · 148879
2 · 15 Libraries · 131
3 · 13 Libraries · 785
4 · 10 Libraries · 8076
5 · 9 Libraries · 1005
6 · 8 Libraries · 15248
7 · 8 Libraries · 13699
8 · 7 Libraries · 421
9 · 7 Libraries · 117
10 · 7 Libraries · 114
Trending Kits in Hadoop
No Trending Kits are available at this moment for Hadoop
Trending Discussions on Hadoop
spark-shell throws java.lang.reflect.InvocationTargetException on running
spark-shell exception org.apache.spark.SparkException: Exception thrown in awaitResult
determine written object paths with Pyspark 3.2.1 + hadoop 3.3.2
How to read a csv file from s3 bucket using pyspark
Hadoop to SQL through SSIS Package : Data incorrect format
How to run spark 3.2.0 on google dataproc?
AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>
Cannot find conda info. Please verify your conda installation on EMR
PySpark runs in YARN client mode but fails in cluster mode for "User did not initialize spark context!"
Where to find spark log in dataproc when running job on cluster mode
QUESTION
spark-shell throws java.lang.reflect.InvocationTargetException on running
Asked 2022-Apr-01 at 19:53
When I execute run-example SparkPi, for example, it works perfectly, but when I run spark-shell, it throws these exceptions:
1WARNING: An illegal reflective access operation has occurred
2WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/big_data/spark-3.2.0-bin-hadoop3.2-scala2.13/jars/spark-unsafe_2.13-3.2.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
3WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
4WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
5WARNING: All illegal access operations will be denied in a future release
6Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
7Setting default log level to "WARN".
8To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
9Welcome to
10 ____ __
11 / __/__ ___ _____/ /__
12 _\ \/ _ \/ _ `/ __/ '_/
13 /___/ .__/\_,_/_/ /_/\_\ version 3.2.0
14 /_/
15
16Using Scala version 2.13.5 (OpenJDK 64-Bit Server VM, Java 11.0.9.1)
17Type in expressions to have them evaluated.
18Type :help for more information.
1921/12/11 19:28:36 ERROR SparkContext: Error initializing SparkContext.
20java.lang.reflect.InvocationTargetException
21 at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
22 at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
23 at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
24 at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
25 at org.apache.spark.executor.Executor.addReplClassLoaderIfNeeded(Executor.scala:909)
26 at org.apache.spark.executor.Executor.<init>(Executor.scala:160)
27 at org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:64)
28 at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:132)
29 at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
30 at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
31 at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
32 at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
33 at scala.Option.getOrElse(Option.scala:201)
34 at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
35 at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
36 at $line3.$read$$iw.<init>(<console>:5)
37 at $line3.$read.<init>(<console>:4)
38 at $line3.$read$.<clinit>(<console>)
39 at $line3.$eval$.$print$lzycompute(<synthetic>:6)
40 at $line3.$eval$.$print(<synthetic>:5)
41 at $line3.$eval.$print(<synthetic>)
42 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
43 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
44 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
45 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
46 at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
47 at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
48 at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
49 at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
50 at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
51 at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
52 at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
53 at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
54 at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
55 at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
56 at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
57 at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
58 at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
59 at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
60 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
61 at scala.collection.immutable.List.foreach(List.scala:333)
62 at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
63 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
64 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
65 at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
66 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
67 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
68 at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
69 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
70 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
71 at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
72 at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
73 at org.apache.spark.repl.Main$.doMain(Main.scala:84)
74 at org.apache.spark.repl.Main$.main(Main.scala:59)
75 at org.apache.spark.repl.Main.main(Main.scala)
76 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
77 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
78 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
79 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
80 at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
81 at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
82 at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
83 at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
84 at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
85 at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
86 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
87 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
88Caused by: java.net.URISyntaxException: Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes
89 at java.base/java.net.URI$Parser.fail(URI.java:2913)
90 at java.base/java.net.URI$Parser.checkChars(URI.java:3084)
91 at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3166)
92 at java.base/java.net.URI$Parser.parse(URI.java:3114)
93 at java.base/java.net.URI.<init>(URI.java:600)
94 at org.apache.spark.repl.ExecutorClassLoader.<init>(ExecutorClassLoader.scala:57)
95 ... 67 more
9621/12/11 19:28:36 ERROR Utils: Uncaught exception in thread main
97java.lang.NullPointerException
98 at org.apache.spark.scheduler.local.LocalSchedulerBackend.org$apache$spark$scheduler$local$LocalSchedulerBackend$$stop(LocalSchedulerBackend.scala:173)
99 at org.apache.spark.scheduler.local.LocalSchedulerBackend.stop(LocalSchedulerBackend.scala:144)
100 at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:927)
101 at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2516)
102 at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2086)
103 at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1442)
104 at org.apache.spark.SparkContext.stop(SparkContext.scala:2086)
105 at org.apache.spark.SparkContext.<init>(SparkContext.scala:677)
106 at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
107 at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
108 at scala.Option.getOrElse(Option.scala:201)
109 at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
110 at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
111 at $line3.$read$$iw.<init>(<console>:5)
112 at $line3.$read.<init>(<console>:4)
113 at $line3.$read$.<clinit>(<console>)
114 at $line3.$eval$.$print$lzycompute(<synthetic>:6)
115 at $line3.$eval$.$print(<synthetic>:5)
116 at $line3.$eval.$print(<synthetic>)
117 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
118 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
119 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
120 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
121 at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
122 at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
123 at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
124 at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
125 at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
126 at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
127 at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
128 at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
129 at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
130 at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
131 at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
132 at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
133 at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
134 at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
135 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
136 at scala.collection.immutable.List.foreach(List.scala:333)
137 at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
138 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
139 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
140 at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
141 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
142 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
143 at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
144 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
145 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
146 at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
147 at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
148 at org.apache.spark.repl.Main$.doMain(Main.scala:84)
149 at org.apache.spark.repl.Main$.main(Main.scala:59)
150 at org.apache.spark.repl.Main.main(Main.scala)
151 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
152 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
153 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
154 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
155 at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
156 at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
157 at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
158 at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
159 at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
160 at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
161 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
162 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16321/12/11 19:28:36 WARN MetricsSystem: Stopping a MetricsSystem that is not running
16421/12/11 19:28:36 ERROR Main: Failed to initialize Spark session.
165java.lang.reflect.InvocationTargetException
166 at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
167 at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
168 at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
169 at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
170 at org.apache.spark.executor.Executor.addReplClassLoaderIfNeeded(Executor.scala:909)
171 at org.apache.spark.executor.Executor.<init>(Executor.scala:160)
172 at org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:64)
173 at org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:132)
174 at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
175 at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
176 at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
177 at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
178 at scala.Option.getOrElse(Option.scala:201)
179 at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
180 at org.apache.spark.repl.Main$.createSparkSession(Main.scala:114)
181 at $line3.$read$$iw.<init>(<console>:5)
182 at $line3.$read.<init>(<console>:4)
183 at $line3.$read$.<clinit>(<console>)
184 at $line3.$eval$.$print$lzycompute(<synthetic>:6)
185 at $line3.$eval$.$print(<synthetic>:5)
186 at $line3.$eval.$print(<synthetic>)
187 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
188 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
189 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
190 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
191 at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:670)
192 at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1006)
193 at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$1(IMain.scala:506)
194 at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36)
195 at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116)
196 at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:43)
197 at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:505)
198 at scala.tools.nsc.interpreter.IMain.$anonfun$doInterpret$3(IMain.scala:519)
199 at scala.tools.nsc.interpreter.IMain.doInterpret(IMain.scala:519)
200 at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:503)
201 at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:501)
202 at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216)
203 at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
204 at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216)
205 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$interpretPreamble$1(ILoop.scala:924)
206 at scala.collection.immutable.List.foreach(List.scala:333)
207 at scala.tools.nsc.interpreter.shell.ILoop.interpretPreamble(ILoop.scala:924)
208 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$3(ILoop.scala:963)
209 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
210 at scala.tools.nsc.interpreter.shell.ILoop.echoOff(ILoop.scala:90)
211 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$2(ILoop.scala:963)
212 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
213 at scala.tools.nsc.interpreter.IMain.withSuppressedSettings(IMain.scala:1406)
214 at scala.tools.nsc.interpreter.shell.ILoop.$anonfun$run$1(ILoop.scala:954)
215 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
216 at scala.tools.nsc.interpreter.shell.ReplReporterImpl.withoutPrintingResults(Reporter.scala:64)
217 at scala.tools.nsc.interpreter.shell.ILoop.run(ILoop.scala:954)
218 at org.apache.spark.repl.Main$.doMain(Main.scala:84)
219 at org.apache.spark.repl.Main$.main(Main.scala:59)
220 at org.apache.spark.repl.Main.main(Main.scala)
221 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
222 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
223 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
224 at java.base/java.lang.reflect.Method.invoke(Method.java:566)
225 at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
226 at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
227 at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
228 at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
229 at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
230 at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
231 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
232 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
233Caused by: java.net.URISyntaxException: Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes
234 at java.base/java.net.URI$Parser.fail(URI.java:2913)
235 at java.base/java.net.URI$Parser.checkChars(URI.java:3084)
236 at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3166)
237 at java.base/java.net.URI$Parser.parse(URI.java:3114)
238 at java.base/java.net.URI.<init>(URI.java:600)
239 at org.apache.spark.repl.ExecutorClassLoader.<init>(ExecutorClassLoader.scala:57)
240 ... 67 more
24121/12/11 19:28:36 ERROR Utils: Uncaught exception in thread shutdown-hook-0
242java.lang.ExceptionInInitializerError
243 at org.apache.spark.executor.Executor.stop(Executor.scala:333)
244 at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
245 at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
246 at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
247 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
248 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
249 at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
250 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
251 at scala.util.Try$.apply(Try.scala:210)
252 at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
253 at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
254 at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
255 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
256 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
257 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
258 at java.base/java.lang.Thread.run(Thread.java:829)
259Caused by: java.lang.NullPointerException
260 at org.apache.spark.shuffle.ShuffleBlockPusher$.<clinit>(ShuffleBlockPusher.scala:465)
261 ... 16 more
26221/12/11 19:28:36 WARN ShutdownHookManager: ShutdownHook '' failed, java.util.concurrent.ExecutionException: java.lang.ExceptionInInitializerError
263java.util.concurrent.ExecutionException: java.lang.ExceptionInInitializerError
264 at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
265 at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
266 at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
267 at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
268Caused by: java.lang.ExceptionInInitializerError
269 at org.apache.spark.executor.Executor.stop(Executor.scala:333)
270 at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
271 at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
272 at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
273 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
274 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
275 at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
276 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
277 at scala.util.Try$.apply(Try.scala:210)
278 at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
279 at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
280 at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
281 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
282 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
283 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
284 at java.base/java.lang.Thread.run(Thread.java:829)
285Caused by: java.lang.NullPointerException
286 at org.apache.spark.shuffle.ShuffleBlockPusher$.<clinit>(ShuffleBlockPusher.scala:465)
287 ... 16 more
288
As far as I can see, it is caused by "Illegal character in path at index 42: spark://DESKTOP-JO73CF4.mshome.net:2103/C:\classes", but I don't understand what that means exactly or how to deal with it.
How can I solve this problem?
I use Spark 3.2.0 pre-built for Apache Hadoop 3.3 and later (Scala 2.13). The JAVA_HOME, HADOOP_HOME, and SPARK_HOME path variables are set.
ANSWER
Answered 2022-Jan-07 at 15:11
I faced the same problem; I think Spark 3.2 itself is the problem. After switching to Spark 3.1.2, it works fine.
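If you do downgrade, a quick sanity check from PySpark (not part of the original answer, just a hedged way to confirm which build actually gets picked up after changing SPARK_HOME) could look like this:

import pyspark
from pyspark.sql import SparkSession

# Confirm the Python-side package and the JVM-side Spark build agree after the downgrade.
print(pyspark.__version__)    # e.g. '3.1.2'

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)          # should report the same version
spark.stop()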
QUESTION
spark-shell exception org.apache.spark.SparkException: Exception thrown in awaitResult
Asked 2022-Mar-23 at 09:29
I am facing the error below while starting spark-shell with the YARN master. The shell works with the Spark local master.
1admin@XXXXXX:~$ spark-shell --master yarn 21/11/03 15:51:51 WARN Utils: Your hostname, XXXXXX resolves to a loopback address:
2127.0.1.1; using 192.168.29.57 instead (on interface wifi0) 21/11/03 15:51:51 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 21/11/03 15:52:01 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. Spark context Web UI available at http://XX.XX.XX.XX:4040 Spark context available as 'sc' (master = yarn, app id = application_1635934709971_0001). Spark session available as 'spark'. Welcome to
3 ____ __
4 / __/__ ___ _____/ /__
5 _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.5
6 /_/
7
8Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java
91.8.0_301) Type in expressions to have them evaluated. Type :help for more information.
10
11scala>
12
13scala> 21/11/03 15:52:35 ERROR YarnClientSchedulerBackend: YARN application has exited unexpectedly with state FAILED! Check the YARN application logs for more details. 21/11/03 15:52:35 ERROR YarnClientSchedulerBackend: Diagnostics message: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult:
14 at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
15 at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
16 at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
17 at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
18 at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:515)
19 at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:307)
20 at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
21 at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
22 at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
23 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:780)
24 at java.security.AccessController.doPrivileged(Native Method)
25 at javax.security.auth.Subject.doAs(Subject.java:422)
26 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
27 at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:779)
28 at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
29 at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:804)
30 at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:834)
31 at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) Caused by: java.io.IOException: Failed to connect to /192.168.29.57:33333
32 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
33 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
34 at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
35 at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
36 at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
37 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
38 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
39 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
40 at java.lang.Thread.run(Thread.java:748) Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.29.57:33333 Caused by: java.net.ConnectException: Connection refused
41 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
42 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715)
43 at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
44 at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
45 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:688)
46 at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
47 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
48 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
49 at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
50 at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
51 at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
52 at java.lang.Thread.run(Thread.java:748)
53
5421/11/03 15:52:35 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
55
Below is spark-defaults.conf
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 512m
spark.eventLog.enabled true
spark.eventLog.dir file:////home/admin/spark_event_temp
spark.history.fs.logDirectory hdfs://localhost:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.sql.warehouse.dir=file:////home/admin/spark_warehouse
spark.shuffle.service.port 7337
spark.ui.port 4040
spark.blockManager.port 31111
spark.driver.blockManager.port 32222
spark.driver.port 33333
Spark version: spark-2.4.5-bin-hadoop2.7
Hadoop version: hadoop-2.8.5
I can provide more information if needed. I have configured everything on the local machine.
ANSWER
Answered 2022-Mar-23 at 09:29
Adding these properties in spark-env.sh fixed the issue for me:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/mnt/d/soft/hadoop-2.8.5
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=$SPARK_HOME
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop/
export SPARK_MASTER_HOST=127.0.0.1
export SPARK_LOCAL_IP=127.0.0.1
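For completeness, the same binding fix can be approximated when creating a session from PySpark instead of spark-shell. This is only a sketch under the assumption that the loopback binding is what matters here; the addresses and port simply mirror the configuration above:

import os
from pyspark.sql import SparkSession

# Mirror the SPARK_LOCAL_IP export from spark-env.sh for this process only.
os.environ["SPARK_LOCAL_IP"] = "127.0.0.1"

spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.driver.bindAddress", "127.0.0.1")  # bind the driver to loopback
    .config("spark.driver.port", "33333")             # same fixed port as in spark-defaults.conf
    .getOrCreate()
)
print(spark.sparkContext.master)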
QUESTION
determine written object paths with Pyspark 3.2.1 + hadoop 3.3.2
Asked 2022-Mar-21 at 11:50
When writing dataframes to S3 using the s3a connector, there seems to be no official way of determining the object paths on S3 that were written in the process. What I am trying to achieve is simply determining which objects have been written when writing to S3 (using pyspark 3.2.1 with hadoop 3.3.2 and the directory committer).
The reason this might be useful:
- partitionBy might add a dynamic number of new paths
- Spark creates its own "part..." parquet files with cryptic names, and their number depends on the partitions at write time
With pyspark 3.1.2 and Hadoop 3.2.0 it used to be possible to use the not officially supported "_SUCCESS" file, which was written at the path before the first partitioning on S3 and contained all the paths of all written files. Now, however, the number of paths seems to be limited to 100, so this is not an option anymore.
Is there really no official, reasonable way of achieving this task?
ANSWER
Answered 2022-Mar-21 at 11:50
"Now however, the number of paths seems to be limited to 100 and this is not an option anymore."
We had to cut that in HADOOP-16570... one of the scale problems which surfaced during terasorting at 10-100 TB. The time to write the _SUCCESS file started to slow down job commit times. It was only ever intended for testing. Sorry.
It is just a constant in the source tree. If you were to provide a patch to make it configurable, I'll be happy to review and merge, provided you follow the "say which AWS endpoint you ran all the tests against or we ignore your patch" policy.
I don't know where else this stuff is collected. The Spark driver is told the number of files and their total size from each task commit, but it isn't given the list of paths by the tasks, not AFAIK.
"Spark creates its own "part..." parquet files with cryptic names, and their number depends on the partitions when writing."
The part-0001- bit of the filename comes from the task ID; the bit afterwards is a UUID created to ensure every filename is unique, see SPARK-8406 (Adding UUID to output file name to avoid accidental overwriting). You can probably turn that off.
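As a post-hoc workaround (not an official API, and only a sketch that assumes an s3a output prefix you control), you can list the output path after the write through Hadoop's FileSystem API exposed via py4j and treat the result as the set of written objects:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def list_written_files(output_path):
    # Recursively list files under the output prefix; the helper name and paths are illustrative.
    jvm = spark.sparkContext._jvm
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    path = jvm.org.apache.hadoop.fs.Path(output_path)
    fs = path.getFileSystem(conf)
    it = fs.listFiles(path, True)  # True = recursive
    files = []
    while it.hasNext():
        status = it.next()
        name = status.getPath().toString()
        if not name.endswith("_SUCCESS"):
            files.append(name)
    return files

# Example usage after a partitioned write:
# df.write.partitionBy("date").parquet("s3a://bucket/output/")
# print(list_written_files("s3a://bucket/output/"))

Note that this simply lists whatever is under the prefix afterwards, so it races with any concurrent writer to the same location.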
QUESTION
How to read a csv file from s3 bucket using pyspark
Asked 2022-Mar-16 at 22:53
I'm using Apache Spark 3.1.0 with Python 3.9.6. I'm trying to read a CSV file from an AWS S3 bucket, something like this:
spark = SparkSession.builder.getOrCreate()
file = "s3://bucket/file.csv"

c = spark.read\
    .csv(file)\
    .count()

print(c)
But I'm getting the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o26.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
I understand that I need to add special libraries, but I didn't find any definite information about which ones and which versions. I've tried adding something like this to my code, but I'm still getting the same error:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
How can I fix this?
ANSWER
Answered 2021-Aug-25 at 11:11
You need to use hadoop-aws version 3.2.0 for Spark 3. Specifying the hadoop-aws library in --packages is enough to read files from S3.
--packages org.apache.hadoop:hadoop-aws:3.2.0
You need to set the configurations below.
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<access_key>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret_key>")
After that you can read the CSV file:
spark.read.csv("s3a://bucket/file.csv")
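Putting the pieces together, a minimal end-to-end sketch could look like the following (assuming Spark 3.x, with the hadoop-aws version matched to the Hadoop build bundled with your Spark; bucket name and credentials are placeholders):

import os
from pyspark.sql import SparkSession

# Pull in the S3A connector before the JVM starts; adjust the version to your Hadoop build.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.hadoop:hadoop-aws:3.2.0 pyspark-shell'

spark = SparkSession.builder.getOrCreate()

# Static credentials for s3a (placeholders; an instance profile or a credential
# provider chain is preferable in practice).
hconf = spark._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "<access_key>")
hconf.set("fs.s3a.secret.key", "<secret_key>")

# Note the s3a:// scheme rather than s3://.
df = spark.read.csv("s3a://bucket/file.csv")
print(df.count())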
QUESTION
Hadoop to SQL through SSIS Package : Data incorrect format
Asked 2022-Mar-13 at 20:05
I am using an ODBC source connected to the Hadoop system and read the column PONum with the value 4400023488, of datatype Text_Stream [DT_TEXT]. The data is converted into a string [DT_WSTR] using a data conversion transformation and then inserted into SQL Server using an OLE DB Destination (the destination column's type is a Unicode string, DT_WSTR).
I am able to insert the value into the SQL Server table, but in an incorrect format: 㐴〰㌵㠵㔹. The expected value is 4400023488.
Any suggestions?
ANSWER
Answered 2022-Mar-13 at 20:04
I have two suggestions:
- Instead of using a data conversion transformation, use a derived column that converts the DT_TEXT value to DT_STR before converting it to Unicode:
(DT_WSTR, 4000)(DT_STR, 4000, 1252)[ColumnName]
Make sure that you replace 1252 with the appropriate encoding.
Also, you can use a script component: SSIS: Conversion text stream DT_TEXT to DT_WSTR
- Use the Hadoop SSIS connection manager and HDFS source instead of using ODBC.
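For context on why the inserted value comes out as CJK characters: this is the classic symptom of single-byte text being reinterpreted as UTF-16, where each pair of ASCII digits collapses into a single code point. A small illustration of the effect (the exact characters depend on the input digits, so they differ from the sample shown above):

# ASCII digits re-decoded as UTF-16LE collapse pairwise into CJK characters.
value = "4400023488"
mangled = value.encode("ascii").decode("utf-16-le")
print(mangled)  # '㐴〰㈰㐳㠸' for this input

This is why the first suggestion converts the text stream to DT_STR with an explicit code page before widening it to Unicode.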
QUESTION
How to run spark 3.2.0 on google dataproc?
Asked 2022-Mar-10 at 11:46
Currently, Google Dataproc does not have Spark 3.2.0 as an image; the latest available is 3.1.2. I want to use the pandas-on-PySpark functionality that Spark released with 3.2.0.
I am doing the following steps to use Spark 3.2.0:
- Created an environment 'pyspark' locally with pyspark 3.2.0 in it
- Exported the environment yaml with conda env export > environment.yaml
- Created a dataproc cluster with this environment.yaml. The cluster gets created correctly and the environment is available on the master and all the workers
- I then change the environment variables: export SPARK_HOME=/opt/conda/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark (to point to pyspark 3.2.0), export SPARK_CONF_DIR=/usr/lib/spark/conf (to use dataproc's config file), and export PYSPARK_PYTHON=/opt/conda/miniconda3/envs/pyspark/bin/python (to make the environment packages available)
Now if I try to run the pyspark shell I get:
121/12/07 01:25:16 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener AppStatusListener threw an exception
2java.lang.NumberFormatException: For input string: "null"
3 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
4 at java.lang.Integer.parseInt(Integer.java:580)
5 at java.lang.Integer.parseInt(Integer.java:615)
6 at scala.collection.immutable.StringLike.toInt(StringLike.scala:304)
7 at scala.collection.immutable.StringLike.toInt$(StringLike.scala:304)
8 at scala.collection.immutable.StringOps.toInt(StringOps.scala:33)
9 at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:1126)
10 at org.apache.spark.status.ProcessSummaryWrapper.<init>(storeTypes.scala:527)
11 at org.apache.spark.status.LiveMiscellaneousProcess.doUpdate(LiveEntity.scala:924)
12 at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50)
13 at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1213)
14 at org.apache.spark.status.AppStatusListener.onMiscellaneousProcessAdded(AppStatusListener.scala:1427)
15 at org.apache.spark.status.AppStatusListener.onOtherEvent(AppStatusListener.scala:113)
16 at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
17 at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
18 at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
19 at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
20 at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
21 at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
22 at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
23 at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
24 at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
25 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
26 at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
27 at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
28 at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1404)
29 at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
30
However, the shell does start even after this, but it does not execute code and throws exceptions. I tried running:
set(sc.parallelize(range(10),10).map(lambda x: socket.gethostname()).collect())
but I am getting:
121/12/07 01:25:16 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener AppStatusListener threw an exception
2java.lang.NumberFormatException: For input string: "null"
3 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
4 at java.lang.Integer.parseInt(Integer.java:580)
5 at java.lang.Integer.parseInt(Integer.java:615)
6 at scala.collection.immutable.StringLike.toInt(StringLike.scala:304)
7 at scala.collection.immutable.StringLike.toInt$(StringLike.scala:304)
8 at scala.collection.immutable.StringOps.toInt(StringOps.scala:33)
9 at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:1126)
10 at org.apache.spark.status.ProcessSummaryWrapper.<init>(storeTypes.scala:527)
11 at org.apache.spark.status.LiveMiscellaneousProcess.doUpdate(LiveEntity.scala:924)
12 at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50)
13 at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1213)
14 at org.apache.spark.status.AppStatusListener.onMiscellaneousProcessAdded(AppStatusListener.scala:1427)
15 at org.apache.spark.status.AppStatusListener.onOtherEvent(AppStatusListener.scala:113)
16 at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
17 at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
18 at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
19 at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
20 at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
21 at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
22 at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
23 at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
24 at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
25 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
26 at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
27 at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
28 at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1404)
29 at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
3021/12/07 01:32:15 WARN org.apache.spark.deploy.yarn.YarnAllocator: Container from a bad node: container_1638782400702_0003_01_000001 on host: monsoon-test1-w-2.us-central1-c.c.monsoon-credittech.internal. Exit status: 1. Diagnostics: [2021-12-07
3101:32:13.672]Exception from container-launch.
32Container id: container_1638782400702_0003_01_000001
33Exit code: 1
34[2021-12-07 01:32:13.717]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
35Last 4096 bytes of prelaunch.err :
36Last 4096 bytes of stderr :
37ltChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
38 at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
39 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
40 at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
41 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
42 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
43 at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
44 at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
45 at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
46 at java.lang.Thread.run(Thread.java:748)
4721/12/07 01:31:43 ERROR org.apache.spark.executor.YarnCoarseGrainedExecutorBackend: Executor self-exiting due to : Driver monsoon-test1-m.us-central1-c.c.monsoon-credittech.internal:44367 disassociated! Shutting down.
4821/12/07 01:32:13 WARN org.apache.hadoop.util.ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
49java.util.concurrent.TimeoutException
50 at java.util.concurrent.FutureTask.get(FutureTask.java:205)
51 at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
52 at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
5321/12/07 01:32:13 ERROR org.apache.spark.util.Utils: Uncaught exception in thread shutdown-hook-0
54java.lang.InterruptedException
55 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
56 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
57 at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
58 at java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination(Executors.java:675)
59 at org.apache.spark.rpc.netty.MessageLoop.stop(MessageLoop.scala:60)
60 at org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1(Dispatcher.scala:197)
61 at org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1$adapted(Dispatcher.scala:194)
62 at scala.collection.Iterator.foreach(Iterator.scala:943)
63 at scala.collection.Iterator.foreach$(Iterator.scala:943)
64 at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
65 at scala.collection.IterableLike.foreach(IterableLike.scala:74)
66 at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
67 at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
68 at org.apache.spark.rpc.netty.Dispatcher.stop(Dispatcher.scala:194)
69 at org.apache.spark.rpc.netty.NettyRpcEnv.cleanup(NettyRpcEnv.scala:331)
70 at org.apache.spark.rpc.netty.NettyRpcEnv.shutdown(NettyRpcEnv.scala:309)
71 at org.apache.spark.SparkEnv.stop(SparkEnv.scala:96)
72 at org.apache.spark.executor.Executor.stop(Executor.scala:335)
73 at org.apache.spark.executor.Executor.$anonfun$new$2(Executor.scala:76)
74 at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
75 at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
76 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
77 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
78 at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
79 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
80 at scala.util.Try$.apply(Try.scala:213)
81 at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
82 at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
83 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
84 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
85 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
86 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
87 at java.lang.Thread.run(Thread.java:748)
88
And the same error repeats multiple times before coming to a stop.
What am I doing wrong, and how can I use Spark 3.2.0 on Google Dataproc?
ANSWER
Answered 2022-Jan-15 at 07:17
One can achieve this by:
- Create a Dataproc cluster with an environment (your_sample_env) that contains pyspark 3.2 as a package.
- Modify /usr/lib/spark/conf/spark-env.sh by adding
SPARK_HOME="/opt/conda/miniconda3/envs/your_sample_env/lib/python/site-packages/pyspark"
SPARK_CONF="/usr/lib/spark/conf"
at its end.
- Modify /usr/lib/spark/conf/spark-defaults.conf by commenting out the following configurations:
spark.yarn.jars=local:/usr/lib/spark/jars/*
spark.yarn.unmanagedAM.enabled=true
Now, your Spark jobs will use pyspark 3.2.
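As a quick sanity check (a minimal sketch, assuming the spark-env.sh and spark-defaults.conf edits above are in place and the cluster has the your_sample_env conda environment; the app name is made up), you can submit a small job and confirm which pyspark build the driver picked up:

# Minimal check script; submit it with gcloud dataproc jobs submit pyspark.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("version_check").getOrCreate()
print("pyspark version:", pyspark.__version__)   # expected to report 3.2.x after the change
print("spark version:", spark.version)
spark.stop()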
QUESTION
AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>
Asked 2022-Feb-25 at 13:18
I was using PySpark on AWS EMR (4 r5.xlarge as 4 workers, each with one executor and 4 cores), and I got AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>. Below is a snippet of the code that threw this error:
# Imports assumed by this snippet (SearchEngine presumably comes from the uszipcode package;
# an existing SparkSession `spark` and input DataFrame `df` are also assumed):
import sqlite3
import numpy as np
import pandas as pd
from pyspark.sql.functions import udf, col
from uszipcode import SearchEngine

search = SearchEngine(db_file_dir="/tmp/db")
conn = sqlite3.connect("/tmp/db/simple_db.sqlite")
pdf_ = pd.read_sql_query('''select zipcode, lat, lng,
                            bounds_west, bounds_east, bounds_north, bounds_south from
                            simple_zipcode''', conn)
# Broadcast the zipcode lookup table so every executor can unpickle it locally
brd_pdf = spark.sparkContext.broadcast(pdf_)
conn.close()


@udf('string')
def get_zip_b(lat, lng):
    # Unpickling the broadcast pandas DataFrame happens here, inside the executor's Python worker
    pdf = brd_pdf.value
    out = pdf[(np.array(pdf["bounds_north"]) >= lat) &
              (np.array(pdf["bounds_south"]) <= lat) &
              (np.array(pdf['bounds_west']) <= lng) &
              (np.array(pdf['bounds_east']) >= lng)]
    if len(out):
        min_index = np.argmin((np.array(out["lat"]) - lat)**2 + (np.array(out["lng"]) - lng)**2)
        zip_ = str(out["zipcode"].iloc[min_index])
    else:
        zip_ = 'bad'
    return zip_

df = df.withColumn('zipcode', get_zip_b(col("latitude"), col("longitude")))
Below is the traceback, where line 102, in get_zip_b refers to pdf = brd_pdf.value:
21/08/02 06:18:19 WARN TaskSetManager: Lost task 12.0 in stage 7.0 (TID 1814, ip-10-22-17-94.pclc0.merkle.local, executor 6): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/util.py", line 121, in wrapper
    return f(*args, **kwargs)
  File "/mnt/var/lib/hadoop/steps/s-1IBFS0SYWA19Z/Mobile_ID_process_center.py", line 102, in get_zip_b
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 146, in value
    self._value = self.load_from_path(self._path)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 123, in load_from_path
    return self.load(f)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1627867699893_0001/container_1627867699893_0001_01_000009/pyspark.zip/pyspark/broadcast.py", line 129, in load
    return pickle.load(file)
AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks' from '/mnt/miniconda/lib/python3.9/site-packages/pandas/core/internals/blocks.py'>
Some observations and thought process:
1. After doing some searching online, the AttributeError in PySpark seems to be caused by mismatched pandas versions between the driver and the workers?
2. But I ran the same code on two different datasets: one worked without any errors and the other didn't, which seems very strange and non-deterministic, so the errors may not be caused by mismatched pandas versions. Otherwise, neither of the two datasets would have succeeded.
3. I then ran the same code on the successful dataset again, but this time with different Spark configurations: after setting spark.driver.memory from 2048M to 4192m, it threw the AttributeError.
4. In conclusion, I think the AttributeError has something to do with the driver. But I can't tell from the error message how they are related, or how to fix it: AttributeError: Can't get attribute 'new_block' on <module 'pandas.core.internals.blocks'>.
ANSWER
Answered 2021-Aug-26 at 14:53
I had the same error using pandas 1.3.2 on the server while 1.2 was on my client. Downgrading pandas to 1.2 solved the problem.
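A quick way to confirm such a mismatch (a minimal sketch, assuming an active SparkSession named spark, as in the question) is to compare the pandas version on the driver with the versions reported by the executors:

# Hedged diagnostic: the broadcast was pickled with the driver's pandas, so an executor
# reporting a different version is a likely source of the 'new_block' unpickling error.
import pandas as pd

def executor_pandas_version(_):
    import pandas
    yield pandas.__version__

print("driver pandas:", pd.__version__)
executor_versions = (spark.sparkContext
                     .parallelize(range(8), 4)
                     .mapPartitions(executor_pandas_version)
                     .distinct()
                     .collect())
print("executor pandas:", executor_versions)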
QUESTION
Cannot find conda info. Please verify your conda installation on EMR
Asked 2022-Feb-05 at 00:17
I am trying to install conda on EMR and below is my bootstrap script. It looks like conda is getting installed, but it is not getting added to the environment variable. When I manually update the $PATH variable on the EMR master node, it can identify conda. I want to use conda on Zeppelin.
I also tried adding the config below while launching my EMR instance; however, I still get the error mentioned below.
"classification": "spark-env",
"properties": {
    "conda": "/home/hadoop/conda/bin"
}
[hadoop@ip-172-30-5-150 ~]$ PATH=/home/hadoop/conda/bin:$PATH
[hadoop@ip-172-30-5-150 ~]$ conda
usage: conda [-h] [-V] command ...

conda is a tool for managing and deploying applications, environments and packages.
#!/usr/bin/env bash

# Install conda
wget https://repo.continuum.io/miniconda/Miniconda3-4.2.12-Linux-x86_64.sh -O /home/hadoop/miniconda.sh \
    && /bin/bash ~/miniconda.sh -b -p $HOME/conda

conda config --set always_yes yes --set changeps1 no
conda install conda=4.2.13
conda config -f --add channels conda-forge
rm ~/miniconda.sh
echo bootstrap_conda.sh completed. PATH now: $PATH
export PYSPARK_PYTHON="/home/hadoop/conda/bin/python3.5"

echo -e '\nexport PATH=$HOME/conda/bin:$PATH' >> $HOME/.bashrc && source $HOME/.bashrc

conda create -n zoo python=3.7 # "zoo" is conda environment name, you can use any name you like.
conda activate zoo
sudo pip3 install tensorflow
sudo pip3 install boto3
sudo pip3 install botocore
sudo pip3 install numpy
sudo pip3 install pandas
sudo pip3 install scipy
sudo pip3 install s3fs
sudo pip3 install matplotlib
sudo pip3 install -U tqdm
sudo pip3 install -U scikit-learn
sudo pip3 install -U scikit-multilearn
sudo pip3 install xlutils
sudo pip3 install natsort
sudo pip3 install pydot
sudo pip3 install python-pydot
sudo pip3 install python-pydot-ng
sudo pip3 install pydotplus
sudo pip3 install h5py
sudo pip3 install graphviz
sudo pip3 install recmetrics
sudo pip3 install openpyxl
sudo pip3 install xlrd
sudo pip3 install xlwt
sudo pip3 install tensorflow.io
sudo pip3 install Cython
sudo pip3 install ray
sudo pip3 install zoo
sudo pip3 install analytics-zoo
sudo pip3 install analytics-zoo[ray]
#sudo /usr/bin/pip-3.6 install -U imbalanced-learn
ANSWER
Answered 2022-Feb-05 at 00:17
I got conda working by modifying the script as below; the EMR Python versions were colliding with the conda version:
wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh -O /home/hadoop/miniconda.sh \
    && /bin/bash ~/miniconda.sh -b -p $HOME/conda

echo -e '\n export PATH=$HOME/conda/bin:$PATH' >> $HOME/.bashrc && source $HOME/.bashrc

conda config --set always_yes yes --set changeps1 no
conda config -f --add channels conda-forge

conda create -n zoo python=3.7 # "zoo" is conda environment name
conda init bash
source activate zoo
conda install python 3.7.0 -c conda-forge orca
sudo /home/hadoop/conda/envs/zoo/bin/python3.7 -m pip install virtualenv
and setting the Zeppelin Python and PySpark parameters to:
"spark.pyspark.python": "/home/hadoop/conda/envs/zoo/bin/python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path": "/home/hadoop/conda/envs/zoo/bin/",
"zeppelin.pyspark.python": "/home/hadoop/conda/bin/python",
"zeppelin.python": "/home/hadoop/conda/bin/python"
Orca only supports TF up to 1.5, hence it was not working as I am using TF2.
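To verify the setup end to end (a minimal sketch, assuming a Zeppelin %pyspark paragraph with the properties above, where sc is the SparkContext Zeppelin provides), you can check which Python binaries the driver and executors actually use:

# Hedged check: the driver should report the interpreter set via zeppelin.python and the
# executors the one set via spark.pyspark.python in the properties above.
import sys

def worker_python(_):
    import sys
    yield sys.executable

print("driver python:", sys.executable)
print("executor python:", sc.parallelize(range(4), 2)
                            .mapPartitions(worker_python)
                            .distinct()
                            .collect())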
QUESTION
PySpark runs in YARN client mode but fails in cluster mode for "User did not initialize spark context!"
Asked 2022-Jan-19 at 21:28
- standard dataproc image 2.0
- Ubuntu 18.04 LTS
- Hadoop 3.2
- Spark 3.1
I am testing running a very simple script on a Dataproc PySpark cluster:
testing_dep.py
import os
os.listdir('./')
I can run testing_dep.py in client mode (the default on Dataproc) just fine:
gcloud dataproc jobs submit pyspark ./testing_dep.py --cluster=pyspark-monsoon --region=us-central1
But when I try to run the same job in cluster mode, I get an error:
gcloud dataproc jobs submit pyspark ./testing_dep.py --cluster=pyspark-monsoon --region=us-central1 --properties=spark.submit.deployMode=cluster
error logs:
5Job [417443357bcd43f99ee3dc60f4e3bfea] submitted.
6Waiting for job output...
722/01/12 05:32:20 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at monsoon-testing-m/10.128.15.236:8032
822/01/12 05:32:20 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at monsoon-testing-m/10.128.15.236:10200
922/01/12 05:32:22 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
1022/01/12 05:32:22 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
1122/01/12 05:32:24 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1641965080466_0001
1222/01/12 05:32:42 ERROR org.apache.spark.deploy.yarn.Client: Application diagnostics message: Application application_1641965080466_0001 failed 2 times due to AM Container for appattempt_1641965080466_0001_000002 exited with exitCode: 13
13Failing this attempt.Diagnostics: [2022-01-12 05:32:42.154]Exception from container-launch.
14Container id: container_1641965080466_0001_02_000001
15Exit code: 13
16
17[2022-01-12 05:32:42.203]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
18Last 4096 bytes of prelaunch.err :
19Last 4096 bytes of stderr :
2022/01/12 05:32:40 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: Uncaught exception:
21java.lang.IllegalStateException: User did not initialize spark context!
22 at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:520)
23 at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:268)
24 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:899)
25 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:898)
26 at java.security.AccessController.doPrivileged(Native Method)
27 at javax.security.auth.Subject.doAs(Subject.java:422)
28 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
29 at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:898)
30 at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
31
32
33[2022-01-12 05:32:42.203]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
34Last 4096 bytes of prelaunch.err :
35Last 4096 bytes of stderr :
3622/01/12 05:32:40 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: Uncaught exception:
37java.lang.IllegalStateException: User did not initialize spark context!
38 at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:520)
39 at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:268)
40 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:899)
41 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:898)
42 at java.security.AccessController.doPrivileged(Native Method)
43 at javax.security.auth.Subject.doAs(Subject.java:422)
44 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
45 at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:898)
46 at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
47
48
49For more detailed output, check the application tracking page: http://monsoon-testing-m:8188/applicationhistory/app/application_1641965080466_0001 Then click on links to logs of each attempt.
50. Failing the application.
51Exception in thread "main" org.apache.spark.SparkException: Application application_1641965080466_0001 finished with failed status
52 at org.apache.spark.deploy.yarn.Client.run(Client.scala:1242)
53 at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1634)
54 at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
55 at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
56 at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
57 at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
58 at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
59 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
60 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
61ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [417443357bcd43f99ee3dc60f4e3bfea] failed with error:
62Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
63https://console.cloud.google.com/dataproc/jobs/417443357bcd43f99ee3dc60f4e3bfea?project=monsoon-credittech&region=us-central1
64gcloud dataproc jobs wait '417443357bcd43f99ee3dc60f4e3bfea' --region 'us-central1' --project 'monsoon-credittech'
65https://console.cloud.google.com/storage/browser/monsoon-credittech.appspot.com/google-cloud-dataproc-metainfo/64632294-3e9b-4c55-af8a-075fc7d6f412/jobs/417443357bcd43f99ee3dc60f4e3bfea/
66gs://monsoon-credittech.appspot.com/google-cloud-dataproc-metainfo/64632294-3e9b-4c55-af8a-075fc7d6f412/jobs/417443357bcd43f99ee3dc60f4e3bfea/driveroutput
67
68
69
70
Can you please help me understand what I am doing wrong and why this code is failing?
ANSWER
Answered 2022-Jan-19 at 21:26
The error is expected when running Spark in YARN cluster mode while the job doesn't create a Spark context. See the source code of ApplicationMaster.scala.
To avoid this error, you need to create a SparkContext or SparkSession, e.g.:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('MySparkApp') \
    .getOrCreate()
Client mode doesn't go through the same code path and doesn't have a similar check.
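For the script in the question, the fix amounts to initializing a session before doing any work (a minimal sketch of an adjusted testing_dep.py, assuming nothing beyond the original example):

# testing_dep.py, cluster-mode friendly: the SparkSession is created first, so the
# YARN ApplicationMaster sees an initialized Spark context and the exit code 13 goes away.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("testing_dep").getOrCreate()
print(os.listdir('./'))   # runs on the driver container managed by YARN
spark.stop()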
QUESTION
Where to find spark log in dataproc when running job on cluster mode
Asked 2022-Jan-18 at 19:36
I am running the following code as a job in Dataproc. I could not find the logs in the console while running in 'cluster' mode.
import sys
import time
from datetime import datetime

from pyspark.sql import SparkSession

start_time = datetime.utcnow()

spark = SparkSession.builder.appName("check_confs").getOrCreate()

all_conf = spark.sparkContext.getConf().getAll()
print("\n\n=====\nExecuting at {}".format(datetime.utcnow()))
print(all_conf)
print("\n\n======================\n\n\n")
incoming_args = sys.argv
if len(incoming_args) > 1:
    sleep_time = int(incoming_args[1])
    print("Sleep time is {} seconds".format(sleep_time))
    if sleep_time > 0:
        time.sleep(sleep_time)

end_time = datetime.utcnow()
time_taken = (end_time - start_time).total_seconds()
print("Script execution completed in {} seconds".format(time_taken))
If I trigger the job using the deployMode property as cluster ("spark.submit.deployMode": "cluster"), I could not see the corresponding logs.
But if the job is triggered in the default mode, which is client mode, I am able to see the respective logs.
I have given the dictionary used for triggering the job below.
{
    'placement': {
        'cluster_name': dataproc_cluster
    },
    'pyspark_job': {
        'main_python_file_uri': "gs://" + compute_storage + "/" + job_file,
        'args': trigger_params,
        "properties": {
            "spark.submit.deployMode": "cluster",
            "spark.executor.memory": "3155m",
            "spark.scheduler.mode": "FAIR",
        }
    }
}
3921/12/07 19:11:27 INFO org.sparkproject.jetty.util.log: Logging initialized @3350ms to org.sparkproject.jetty.util.log.Slf4jLog
4021/12/07 19:11:27 INFO org.sparkproject.jetty.server.Server: jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_292-b10
4121/12/07 19:11:27 INFO org.sparkproject.jetty.server.Server: Started @3467ms
4221/12/07 19:11:27 INFO org.sparkproject.jetty.server.AbstractConnector: Started ServerConnector@18528bea{HTTP/1.1, (http/1.1)}{0.0.0.0:40389}
4321/12/07 19:11:28 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8032
4421/12/07 19:11:28 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ******-m/0.0.0.5:10200
4521/12/07 19:11:29 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
4621/12/07 19:11:29 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
4721/12/07 19:11:30 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1638554180947_0014
4821/12/07 19:11:31 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8030
4921/12/07 19:11:33 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
50
51
52=====
53Executing at 2021-12-07 19:11:35.100277
54[....... ('spark.yarn.historyServer.address', '****-m:18080'), ('spark.ui.proxyBase', '/proxy/application_1638554180947_0014'), ('spark.driver.appUIAddress', 'http://***-m.c.***-123456.internal:40389'), ('spark.sql.cbo.enabled', 'true')]
55
56
57======================
58
59
60
61Sleep time is 1 seconds
62Script execution completed in 9.411261 seconds
6321/12/07 19:11:36 INFO org.sparkproject.jetty.server.AbstractConnector: Stopped Spark@18528bea{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
64
65
Logs not coming in the console while running in cluster mode:
6521/12/07 19:09:04 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8032
6621/12/07 19:09:04 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ******-m/0.0.0.5:8032
6721/12/07 19:09:05 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
6821/12/07 19:09:05 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
6921/12/07 19:09:06 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1638554180947_0013
70
ANSWER
Answered 2021-Dec-15 at 17:30
When running jobs in cluster mode, the driver logs are in Cloud Logging under yarn-userlogs. See the doc:
By default, Dataproc runs Spark jobs in client mode, and streams the driver output for viewing as explained below. However, if the user creates the Dataproc cluster by setting cluster properties to --properties spark:spark.submit.deployMode=cluster or submits the job in cluster mode by setting job properties to --properties spark.submit.deployMode=cluster, driver output is listed in YARN userlogs, which can be accessed in Logging.
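As an illustration only (a hedged sketch, not part of the original answer: it assumes the google-cloud-logging Python client library, uses placeholder project and cluster names, and the filter fields may need adjusting for your project), the yarn-userlogs entries of a cluster-mode job can also be read programmatically:

# Hypothetical example: replace the project id and cluster name with your own values.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="your-project-id")
log_filter = (
    'resource.type="cloud_dataproc_cluster" '
    'AND resource.labels.cluster_name="your-cluster-name" '
    'AND log_name:"yarn-userlogs"'
)
for entry in client.list_entries(filter_=log_filter, page_size=20):
    print(entry.payload)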
Community Discussions contain sources that include Stack Exchange Network
Tutorials and Learning Resources in Hadoop
Tutorials and Learning Resources are not available at this moment for Hadoop