Spark.NET | Wicked in your C# programs | GPU library
kandi X-RAY | Spark.NET Summary
Wicked in your C# programs
Community Discussions
Trending Discussions on Spark.NET
QUESTION
The goal is to have a Spark Streaming application that reads data from Kafka and uses Delta Lake to store the data. The delta table is partitioned at a fine granularity: the first partition is the organization_id (there are more than 5000 organizations) and the second partition is the date.
The application meets the expected latency, but it doesn't stay up for more than one day. The error is always about memory, as I'll show below.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006f8000000, 671088640, 0) failed; error='Cannot allocate memory' (errno=12)
There is no persistence, and memory usage is already high for the whole application.
What I've tried: Increasing memory and workers were the first things I tried, and the number of partitions was changed as well, from 4 to 16.
Script of Execution ...ANSWER
Answered 2021-Jun-08 at 11:11: Just upgraded to Delta.io 1.0.0 and it stopped happening.
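As a rough illustration of how quickly that partition scheme fans out (the 5000-organization figure is from the question; one date partition per day per organization is an assumption, since the question only says the second partition is the date):

```python
# Sanity check on the partition fan-out of an organization_id/date layout.
# 5000 organizations comes from the question; 365 days/year is an assumption.
organizations = 5000
days_per_year = 365

partitions_per_year = organizations * days_per_year
print(partitions_per_year)  # 1825000 — over 1.8M Delta partitions per year
```

A fan-out of this size means a very large number of small files and partition directories, which is worth keeping in mind when sizing memory for a long-running streaming job.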
QUESTION
I am trying to create EMR step functions where I want to specify my EMR cluster that is always running. All the examples I've come across online tell you how to create a cluster and then terminate it once the job is done.
My EMR step function is as follows:
...ANSWER
Answered 2021-May-23 at 22:50: The solution was to remove the $ from the cluster variable definition.
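For context: in AWS Step Functions, a parameter key suffixed with `.$` (e.g. `"ClusterId.$": "$.ClusterId"`) is resolved dynamically from the state input, whereas a key without the `$` takes a literal value. A hedged sketch of an "add step" state pointing at an always-running cluster (the state name, cluster ID, and step arguments below are placeholders, not from the question):

```json
{
  "SubmitSparkStep": {
    "Type": "Task",
    "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
    "Parameters": {
      "ClusterId": "j-EXAMPLECLUSTERID",
      "Step": {
        "Name": "MySparkJob",
        "HadoopJarStep": {
          "Jar": "command-runner.jar",
          "Args": ["spark-submit", "my_job.py"]
        }
      }
    },
    "End": true
  }
}
```

With the static `"ClusterId"` key, no cluster is created or terminated; the step is simply added to the existing cluster.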
QUESTION
I am getting this error when I do spark-submit: java.lang.IllegalArgumentException: Can not create a Path from an empty string. I am using Spark 2.4.7, Hadoop 3.3.0, the IntelliJ IDE, and JDK 8. First I was getting a class-not-found error, which I solved; now I am getting this one. Is it because of the dataset or something else? https://www.kaggle.com/datasnaek/youtube-new?select=INvideos.csv (link to dataset)
error:
...ANSWER
Answered 2021-May-04 at 06:03: It just seems that the output_dir variable contains an incorrect (empty) path:
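A minimal sketch of guarding against that case before the path reaches Spark (the variable name output_dir comes from the answer; the validation helper and the example value are my own):

```python
def require_path(path: str) -> str:
    """Fail fast with a clear message instead of letting Hadoop's
    'Can not create a Path from an empty string' surface later."""
    if not path or not path.strip():
        raise ValueError("output_dir is empty; set it to a valid path")
    return path

output_dir = "/tmp/youtube-output"   # illustrative value
checked = require_path(output_dir)   # would raise ValueError if output_dir were ""
```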
QUESTION
I'm using Spark 3.1 in Databricks (Databricks Runtime 8) with a very large cluster (25 workers with 112 GB of memory and 16 cores each) to replicate several SAP tables in Azure Data Lake Storage (ADLS gen2). To do this, a tool is writing the deltas of all these tables into an intermediate system (SQL Server), and then, if I have new data for a certain table, I execute a Databricks job to merge the new data with the existing data available in ADLS.
This process works fine for most of the tables, but some of them (the biggest ones) take a lot of time to be merged (I merge the data using the PK of each table), and the biggest one started failing a week ago (when a big delta of the table was generated). This is the trace of the error that I can see in the job:
Py4JJavaError: An error occurred while calling o233.sql. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:234) at com.databricks.sql.transaction.tahoe.files.TransactionalWriteEdge.$anonfun$writeFiles$5(TransactionalWriteEdge.scala:246) ...
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:428) at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.awaitShuffleMapStage$1(DeltaOptimizedWriterExec.scala:153) at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.getShuffleStats(DeltaOptimizedWriterExec.scala:158) at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.computeBins(DeltaOptimizedWriterExec.scala:106) at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.doExecute(DeltaOptimizedWriterExec.scala:174) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:180) ... 141 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 68 (execute at DeltaOptimizedWriterExec.scala:97) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Connection from /XXX.XX.XX.XX:4048 closed at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:769) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:684) ... at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection from /XXX.XX.XX.XX:4048 closed at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146) at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:117) at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:225) ... (io.netty channelInactive frames elided) ... at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ... 1 more
As the error is not descriptive, I have taken a look at each executor's log and I have seen the following message:
21/04/07 09:11:24 ERROR OneForOneBlockFetcher: Failed while starting block fetches java.io.IOException: Connection from /XXX.XX.XX.XX:4048 closed
And in the executor that seems to be unable to connect, I see the following error message:
21/04/06 09:30:46 ERROR SparkThreadLocalCapturingRunnable: Exception in thread Task reaper-7 org.apache.spark.SparkException: Killing executor JVM because killed task 5912 could not be stopped within 60000 ms. at org.apache.spark.executor.Executor$TaskReaper.run(Executor.scala:1119) at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.$anonfun$run$1(SparkThreadLocalForwardingThreadPoolExecutor.scala:104) ... at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.lang.Thread.run(
I have tried increasing the default shuffle parallelism (from 200 to 1200, as is suggested here: Spark application kills executor), and the job runs longer, but it fails again.
I have tried to monitor the Spark UI while the job is in execution, but the problem is the same: some stages are failing because an executor is unreachable after a task has failed more than X times.
The big delta that I mentioned above has roughly 4-5 billion rows, and the big dump that I want to merge has roughly 100 million rows. The table is not partitioned (yet), so the process is very work-intensive. What is failing is the merge part, not the process of copying the data from SQL Server to ADLS, so the merge is being done once the data to be merged is already in Parquet format.
Any idea what is happening, or what I can do in order to finish this merge?
Thanks in advance.
...ANSWER
Answered 2021-Apr-12 at 07:56: Finally, I reviewed the cluster and changed the spark.sql.shuffle.partitions property to 1600 in the code of the job I wanted to execute with this configuration (instead of changing it directly on the cluster). My cluster has 400 cores, so I chose a multiple (1600) of that number.
After that, the execution finished in two hours. I came to this conclusion because, in my logs and the Spark UI, I observed a lot of disk spilling, so I concluded that the partitions weren't fitting in the worker nodes.
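The sizing rule this answer describes can be sketched as follows (the 400 cores and the target of 1600 partitions are the figures from the answer; the multiplier of 4 is inferred from them):

```python
def shuffle_partitions(total_cores: int, multiplier: int = 4) -> int:
    """Pick spark.sql.shuffle.partitions as a multiple of the cluster's
    total cores, so each wave of shuffle tasks can occupy every core."""
    return total_cores * multiplier

print(shuffle_partitions(400))  # 1600, the value used in the answer
```

The value would then be set via spark.conf.set("spark.sql.shuffle.partitions", ...) in the job's code rather than in the cluster configuration, as the answer did.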
QUESTION
I'm trying to upgrade from Spark 3.0.1 to 3.1.1.
I am running PySpark 3.1.1 in client mode on Jupyter notebook.
The following ran on 3.0.1 but fails after upgrading spark:
...ANSWER
Answered 2021-Mar-07 at 21:58

QUESTION
I'm trying to follow this guide:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
But I don't understand why it is, most of the time, not writing data to the console, and why it's spamming execution-thread logging.
Do I need to configure something?
This is my code:
ANSWER
Answered 2021-Feb-16 at 19:48: You are getting logger information because you have left the default logging level at INFO. Set the logging level to WARN with spark.sparkContext.setLogLevel("WARN").
QUESTION
I am trying to run a simple spark job on a kubernetes cluster. I deployed a pod that starts a pyspark shell and in that shell I am changing the spark configuration as specified below:
...ANSWER
Answered 2021-Feb-01 at 09:42: I don't have much experience with PySpark, but I once set up Java Spark to run on a Kubernetes cluster in client mode, like you are trying now, and I believe the configuration should mostly be the same.
First of all, you should check whether the headless service is working as expected:
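A sketch of such a check (the service name, namespace, and pod name below are placeholders, not taken from the question): confirm the service exists and is headless, then confirm it resolves via DNS from inside the cluster.

```shell
# Confirm the service exists and is headless (ClusterIP shows as "None")
kubectl get svc spark-driver-headless -n spark

# From any pod in the namespace, confirm DNS resolution of the service,
# since Spark executors must be able to reach the driver by this name
kubectl exec -it some-pod -n spark -- \
  nslookup spark-driver-headless.spark.svc.cluster.local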
QUESTION
I have a dataset that is around 190GB that was partitioned into 1000 partitions.
My EMR cluster allows a maximum of 10 r5a.2xlarge TASK nodes and 2 CORE nodes, each node having 64 GB of memory and 128 GB of EBS storage.
In my Spark job execution, I have set:
- executor-cores 5
- driver-cores 5
- executor-memory 40g
- driver-memory 50g
- spark.yarn.executor.memoryOverhead=10g
- spark.sql.shuffle.partitions=500
- spark.dynamicAllocation.enabled=true
But my job keeps failing with errors like
...ANSWER
Answered 2021-Jan-24 at 05:48: You can try the below steps:
- Memory overhead should be 10% of the executor memory or 384 MB, whichever is larger. Don't increase it to an arbitrary value.
- Remove the driver-cores setting.
- If you have 10 nodes, then specify the number of executors explicitly. Calculate it in such a way that you leave some room for YARN and background processes. You can also try adding 1 or 2 more cores.
- Run it in cluster mode, and whatever number you assign to executors, add 1 to it, since 1 executor will be treated as the driver executor in cluster mode.
- Also review the code you wrote to submit and process that 190GB file, and find ways to optimize it. Look for collect methods, unnecessary joins, or coalesce / repartition calls, and find alternatives where they aren't needed.
- Use the persist (memory and disk only) option for the data frames that you use frequently in the code.
- The last thing I tried is to execute the steps manually in the spark-shell on EMR; that way you will find out which part of the code takes the most time to run.
You can also refer to this official blog for some of the tips.
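The memory-overhead rule mentioned in the answer above (Spark documents the default YARN memory overhead as the larger of 10% of executor memory and 384 MB) can be sketched as:

```python
def default_memory_overhead_mb(executor_memory_mb: int) -> int:
    """Spark's default YARN memory overhead:
    max(10% of executor memory, 384 MB)."""
    return max(int(executor_memory_mb * 0.10), 384)

print(default_memory_overhead_mb(40 * 1024))  # 4096 MB for the 40g executors above
print(default_memory_overhead_mb(2 * 1024))   # 384 MB — the floor applies for small executors
```

Under this rule, the 10g overhead set in the question is well above the default for 40g executors, which is the point of the answer's first suggestion.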
QUESTION
I give the spark2-submit command as:
...ANSWER
Answered 2021-Jan-04 at 13:49: The options that the SparkSubmitOperator in Airflow requires can be sent in a dictionary. Keep in mind that the keys in the dictionary should be the same as the parameter names of the function.
Create the following two dictionaries:
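A hedged sketch of what the two dictionaries could look like (the task id, application path, and all values are placeholders; the keyword names follow Airflow's SparkSubmitOperator parameters, so verify them against your Airflow version):

```python
# Keys must match SparkSubmitOperator's parameter names exactly,
# because the dictionary is unpacked with ** into the constructor.
operator_kwargs = {
    "task_id": "submit_spark_job",      # placeholder task id
    "application": "/path/to/job.py",   # placeholder application path
    "conn_id": "spark_default",
    "executor_cores": 4,
    "executor_memory": "8g",
    "driver_memory": "4g",
}

# Arbitrary spark.* settings go in a second dictionary,
# passed as the operator's `conf` parameter.
spark_conf = {
    "spark.sql.shuffle.partitions": "200",
    "spark.dynamicAllocation.enabled": "true",
}

# Usage (requires Airflow installed; shown as a comment to keep this self-contained):
# SparkSubmitOperator(**operator_kwargs, conf=spark_conf)
```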
QUESTION
I've installed Spark and its components locally and I'm able to execute PySpark code in Jupyter, iPython, and via spark-submit; however, I receive the following WARNINGs:
...ANSWER
Answered 2020-Dec-27 at 08:14: Install Java 8 instead of Java 11, which is known to give this sort of warning with Spark.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported