Spark.NET | Wicked in your C# programs | GPU library
kandi X-RAY | Spark.NET Summary
Wicked in your C# programs
Community Discussions
Trending Discussions on Spark.NET
QUESTION
The goal is to have a Spark Streaming application that reads data from Kafka and uses Delta Lake to store the data. The delta table is partitioned at a fine granularity: the first partition is the organization_id (there are more than 5000 organizations) and the second partition is the date.
The application meets the expected latency, but it doesn't stay up for more than one day. The error is always about memory, as I'll show below.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006f8000000, 671088640, 0) failed; error='Cannot allocate memory' (errno=12)
There is no persistence, and memory usage is already high for the whole application.
What I've tried: Increasing memory and workers were the first things I tried, and the number of partitions was changed as well, from 4 to 16.
Script of Execution ...ANSWER
Answered 2021-Jun-08 at 11:11: Just upgraded to Delta.io 1.0.0 and it stopped happening.
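As a rough illustration of how quickly that partition scheme fans out (the 5000-organization figure is from the question; one date partition per day per organization is an assumption, since the question only says the second partition is the date):

```python
# Sanity check on the partition fan-out of an organization_id/date layout.
# 5000 organizations comes from the question; 365 days/year is an assumption.
organizations = 5000
days_per_year = 365

partitions_per_year = organizations * days_per_year
print(partitions_per_year)  # 1825000 — over 1.8M Delta partitions per year
```

A fan-out of this size means a very large number of small files and partition directories, which is worth keeping in mind when sizing memory for a long-running streaming job.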
QUESTION
I am trying to create EMR step functions where I want to specify my EMR cluster that is always running. All the examples I've come across online tell you how to create a cluster and then terminate it once the job is done.
My EMR step function is as follows:
...ANSWER
Answered 2021-May-23 at 22:50: The solution was to remove the $ from the cluster variable definition.
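For context: in AWS Step Functions, a parameter key suffixed with `.$` (e.g. `"ClusterId.$": "$.ClusterId"`) is resolved dynamically from the state input, whereas a key without the `$` takes a literal value. A hedged sketch of an "add step" state pointing at an always-running cluster (the state name, cluster ID, and step arguments below are placeholders, not from the question):

```json
{
  "SubmitSparkStep": {
    "Type": "Task",
    "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
    "Parameters": {
      "ClusterId": "j-EXAMPLECLUSTERID",
      "Step": {
        "Name": "MySparkJob",
        "HadoopJarStep": {
          "Jar": "command-runner.jar",
          "Args": ["spark-submit", "my_job.py"]
        }
      }
    },
    "End": true
  }
}
```

With the static `"ClusterId"` key, no cluster is created or terminated; the step is simply added to the existing cluster.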
QUESTION
I am getting this error when I do spark-submit: java.lang.IllegalArgumentException: Can not create a Path from an empty string. I am using Spark 2.4.7, Hadoop 3.3.0, the IntelliJ IDE, and JDK 8. First I was getting a class-not-found error, which I solved; now I am getting this one. Is it because of the dataset or something else? https://www.kaggle.com/datasnaek/youtube-new?select=INvideos.csv (link to dataset)
error:
...ANSWER
Answered 2021-May-04 at 06:03: It just seems that the output_dir variable contains an incorrect (empty) path:
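A minimal sketch of guarding against that case before the path reaches Spark (the variable name output_dir comes from the answer; the validation helper and the example value are my own):

```python
def require_path(path: str) -> str:
    """Fail fast with a clear message instead of letting Hadoop's
    'Can not create a Path from an empty string' surface later."""
    if not path or not path.strip():
        raise ValueError("output_dir is empty; set it to a valid path")
    return path

output_dir = "/tmp/youtube-output"   # illustrative value
checked = require_path(output_dir)   # would raise ValueError if output_dir were ""
```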
QUESTION
I'm using Spark 3.1 in Databricks (Databricks Runtime 8) with a very large cluster (25 workers with 112 GB of memory and 16 cores each) to replicate several SAP tables in Azure Data Lake Storage (ADLS gen2). To do this, a tool is writing the deltas of all these tables into an intermediate system (SQL Server), and then, if I have new data for a certain table, I execute a Databricks job to merge the new data with the existing data available in ADLS.
This process works fine for most of the tables, but some of them (the biggest ones) take a lot of time to be merged (I merge the data using the PK of each table), and the biggest one started failing a week ago (when a big delta of the table was generated). This is the trace of the error that I can see in the job:
Py4JJavaError: An error occurred while calling o233.sql. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:234) at com.databricks.sql.transaction.tahoe.files.TransactionalWriteEdge.$anonfun$writeFiles$5(TransactionalWriteEdge.scala:246) ...
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:428) at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.awaitShuffleMapStage$1(DeltaOptimizedWriterExec.scala:153) at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.getShuffleStats(DeltaOptimizedWriterExec.scala:158) at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.computeBins(DeltaOptimizedWriterExec.scala:106) at com.databricks.sql.transaction.tahoe.perf.DeltaOptimizedWriterExec.doExecute(DeltaOptimizedWriterExec.scala:174) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:180) ... 141 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 68 (execute at DeltaOptimizedWriterExec.scala:97) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Connection from /XXX.XX.XX.XX:4048 closed at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:769) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:684) ... at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection from /XXX.XX.XX.XX:4048 closed at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146) at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:117) at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:225) ... (io.netty channelInactive frames elided) ... at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ... 1 more
As the error is not descriptive, I have taken a look at each executor's log and I have seen the following message:
21/04/07 09:11:24 ERROR OneForOneBlockFetcher: Failed while starting block fetches java.io.IOException: Connection from /XXX.XX.XX.XX:4048 closed
And in the executor that seems to be unable to connect, I see the following error message:
21/04/06 09:30:46 ERROR SparkThreadLocalCapturingRunnable: Exception in thread Task reaper-7 org.apache.spark.SparkException: Killing executor JVM because killed task 5912 could not be stopped within 60000 ms. at org.apache.spark.executor.Executor$TaskReaper.run(Executor.scala:1119) at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.$anonfun$run$1(SparkThreadLocalForwardingThreadPoolExecutor.scala:104) ... at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.lang.Thread.run(
I have tried increasing the default shuffle parallelism (from 200 to 1200, as is suggested here: Spark application kills executor), and the job runs longer, but it fails again.
I have tried to monitor the Spark UI while the job is in execution, but the problem is the same: some stages are failing because an executor is unreachable after a task has failed more than X times.
The big delta that I mentioned above has roughly 4-5 billion rows, and the big dump that I want to merge has roughly 100 million rows. The table is not partitioned (yet), so the process is very work-intensive. What is failing is the merge part, not the process of copying the data from SQL Server to ADLS, so the merge is being done once the data to be merged is already in Parquet format.
Any idea what is happening, or what I can do in order to finish this merge?
Thanks in advance.
...ANSWER
Answered 2021-Apr-12 at 07:56: Finally, I reviewed the cluster and changed the spark.sql.shuffle.partitions property to 1600 in the code of the job I wanted to execute with this configuration (instead of changing it directly on the cluster). My cluster has 400 cores, so I chose a multiple (1600) of that number.
After that, the execution finished in two hours. I came to this conclusion because, in my logs and the Spark UI, I observed a lot of disk spilling, so I concluded that the partitions weren't fitting in the worker nodes.
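The sizing rule this answer describes can be sketched as follows (the 400 cores and the target of 1600 partitions are the figures from the answer; the multiplier of 4 is inferred from them):

```python
def shuffle_partitions(total_cores: int, multiplier: int = 4) -> int:
    """Pick spark.sql.shuffle.partitions as a multiple of the cluster's
    total cores, so each wave of shuffle tasks can occupy every core."""
    return total_cores * multiplier

print(shuffle_partitions(400))  # 1600, the value used in the answer
```

The value would then be set via spark.conf.set("spark.sql.shuffle.partitions", ...) in the job's code rather than in the cluster configuration, as the answer did.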
QUESTION
I'm trying to upgrade from Spark 3.0.1 to 3.1.1.
I am running PySpark 3.1.1 in client mode on Jupyter notebook.
The following ran on 3.0.1 but fails after upgrading spark:
...ANSWER
Answered 2021-Mar-07 at 21:58

QUESTION
I'm trying to follow this guide:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
But I don't understand why it is, most of the time, not writing data to the console, and why it's spamming execution-thread logging.
Do I need to configure something?
This is my code:
ANSWER
Answered 2021-Feb-16 at 19:48: You are getting logger information because you have left the default logging level at INFO. Set the logging level to WARN with spark.sparkContext.setLogLevel("WARN").
QUESTION
I am trying to run a simple spark job on a kubernetes cluster. I deployed a pod that starts a pyspark shell and in that shell I am changing the spark configuration as specified below:
...ANSWER
Answered 2021-Feb-01 at 09:42: I don't have much experience with PySpark, but I once set up Java Spark to run on a Kubernetes cluster in client mode, like you are trying now, and I believe the configuration should mostly be the same.
First of all, you should check whether the headless service is working as expected:
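A sketch of such a check (the service name, namespace, and pod name below are placeholders, not taken from the question): confirm the service exists and is headless, then confirm it resolves via DNS from inside the cluster.

```shell
# Confirm the service exists and is headless (ClusterIP shows as "None")
kubectl get svc spark-driver-headless -n spark

# From any pod in the namespace, confirm DNS resolution of the service,
# since Spark executors must be able to reach the driver by this name
kubectl exec -it some-pod -n spark -- \
  nslookup spark-driver-headless.spark.svc.cluster.local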
QUESTION
I have a dataset that is around 190GB that was partitioned into 1000 partitions.
My EMR cluster allows a maximum of 10 r5a.2xlarge TASK nodes and 2 CORE nodes, each node having 64 GB of memory and 128 GB of EBS storage.
In my Spark job execution, I have set:
- executor-cores 5
- driver-cores 5
- executor-memory 40g
- driver-memory 50g
- spark.yarn.executor.memoryOverhead=10g
- spark.sql.shuffle.partitions=500
- spark.dynamicAllocation.enabled=true
But my job keeps failing with errors like
...ANSWER
Answered 2021-Jan-24 at 05:48: You can try the below steps:
- Memory overhead should be 10% of the executor memory or 384 MB, whichever is larger. Don't increase it to an arbitrary value.
- Remove the driver-cores setting.
- If you have 10 nodes, then specify the number of executors explicitly. Calculate it in such a way that you leave some room for YARN and background processes. You can also try adding 1 or 2 more cores.
- Run it in cluster mode, and whatever number you assign to executors, add 1 to it, since 1 executor will be treated as the driver executor in cluster mode.
- Also review the code you wrote to submit and process that 190GB file, and find ways to optimize it. Look for collect methods, unnecessary joins, or coalesce / repartition calls, and find alternatives where they aren't needed.
- Use the persist (memory and disk only) option for the data frames that you use frequently in the code.
- The last thing I tried is to execute the steps manually in the spark-shell on EMR; that way you will find out which part of the code takes the most time to run.
You can also refer to this official blog for some of the tips.
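The memory-overhead rule mentioned in the answer above (Spark documents the default YARN memory overhead as the larger of 10% of executor memory and 384 MB) can be sketched as:

```python
def default_memory_overhead_mb(executor_memory_mb: int) -> int:
    """Spark's default YARN memory overhead:
    max(10% of executor memory, 384 MB)."""
    return max(int(executor_memory_mb * 0.10), 384)

print(default_memory_overhead_mb(40 * 1024))  # 4096 MB for the 40g executors above
print(default_memory_overhead_mb(2 * 1024))   # 384 MB — the floor applies for small executors
```

Under this rule, the 10g overhead set in the question is well above the default for 40g executors, which is the point of the answer's first suggestion.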
QUESTION
I give the spark2-submit command as:
...ANSWER
Answered 2021-Jan-04 at 13:49: The options that the SparkSubmitOperator in Airflow requires can be sent in a dictionary. Keep in mind that the keys in the dictionary should be the same as the parameter names of the function.
Create the following two dictionaries:
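A hedged sketch of what the two dictionaries could look like (the task id, application path, and all values are placeholders; the keyword names follow Airflow's SparkSubmitOperator parameters, so verify them against your Airflow version):

```python
# Keys must match SparkSubmitOperator's parameter names exactly,
# because the dictionary is unpacked with ** into the constructor.
operator_kwargs = {
    "task_id": "submit_spark_job",      # placeholder task id
    "application": "/path/to/job.py",   # placeholder application path
    "conn_id": "spark_default",
    "executor_cores": 4,
    "executor_memory": "8g",
    "driver_memory": "4g",
}

# Arbitrary spark.* settings go in a second dictionary,
# passed as the operator's `conf` parameter.
spark_conf = {
    "spark.sql.shuffle.partitions": "200",
    "spark.dynamicAllocation.enabled": "true",
}

# Usage (requires Airflow installed; shown as a comment to keep this self-contained):
# SparkSubmitOperator(**operator_kwargs, conf=spark_conf)
```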
QUESTION
I've installed Spark and its components locally and I'm able to execute PySpark code in Jupyter, iPython, and via spark-submit; however, I receive the following WARNINGs:
...ANSWER
Answered 2020-Dec-27 at 08:14: Install Java 8 instead of Java 11, which is known to give this sort of warning with Spark.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported