spark-structured-streaming | Spark structured streaming with Kafka data source | Pub Sub library
kandi X-RAY | spark-structured-streaming Summary
Inside the setup directory, run docker-compose up -d to launch instances of Zookeeper, Kafka and Cassandra. Wait a few seconds and then run docker ps to make sure all three services are running. Then run pip install -r requirements.txt. main.py generates some random data and publishes it to a topic in Kafka. Run the Spark app using sbt clean compile run in a console; this app listens on a topic (check Main.scala) and writes the data to Cassandra. Run main.py again to write some test data to a Kafka topic, and finally check that the data has been written to Cassandra.
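The contents of Main.scala aren't reproduced on this page; as a rough, illustrative sketch (the topic, keyspace, table and checkpoint names are placeholders, not taken from the repository), a Kafka-to-Cassandra app of this shape could look like:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-structured-streaming")
      .config("spark.cassandra.connection.host", "localhost")
      .getOrCreate()

    // Read the raw records published to Kafka by main.py.
    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test-topic")                  // placeholder topic name
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // Write each micro-batch to Cassandra.
    val writeBatch: (DataFrame, Long) => Unit = (batch, _) =>
      batch.write
        .format("org.apache.spark.sql.cassandra")
        .option("keyspace", "my_keyspace")                // placeholder keyspace
        .option("table", "my_table")                      // placeholder table
        .mode("append")
        .save()

    kafkaStream.writeStream
      .foreachBatch(writeBatch)
      .option("checkpointLocation", "/tmp/checkpoints/kafka-to-cassandra")
      .start()
      .awaitTermination()
  }
}
```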
Community Discussions
Trending Discussions on spark-structured-streaming
QUESTION
Can someone let me know if it's possible to do Spark Structured Streaming from a JDBC source, e.g. a SQL DB or any RDBMS?
I have looked at a few similar questions on SO, e.g
jdbc source and spark structured streaming
However, I would like to know if it's officially supported by Apache Spark.
If there is any sample code that would be helpful.
Thanks
...ANSWER
Answered 2022-Feb-26 at 07:43
No, there is no such built-in support in Spark Structured Streaming. The main reason is that most databases don't provide a unified interface for obtaining the changes.
It's possible to get changes from some databases using archive logs, write-ahead logs, etc., but it's database-specific. For many databases the popular choice is Debezium, which can read such logs and push the list of changes into Kafka (or something similar), from which they can be consumed by Spark.
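As a hedged illustration of that last point, a Spark job can subscribe to a Debezium change-event topic like any other Kafka topic (the bootstrap servers, topic name and checkpoint path below are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cdc-consumer").getOrCreate()

// Debezium publishes one change-event topic per table; the topic name below is a placeholder.
val changes = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "dbserver1.inventory.customers")
  .option("startingOffsets", "earliest")
  .load()
  // Debezium change events arrive as JSON (or Avro) in the Kafka value; parse downstream as needed.
  .selectExpr("CAST(value AS STRING) AS json")

changes.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/cdc")
  .start()
  .awaitTermination()
```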
QUESTION
I am using the Spark RateStreamSource to generate massive amounts of data per second for a performance test.
To check that I actually get the amount of concurrency I want, I have set the rowPerSecond option to a high number (10000).
...ANSWER
Answered 2021-Dec-13 at 13:54
You have a typo in the option - it should be rowsPerSecond.
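A minimal sketch of the rate source with the correctly spelled option (the rate and partition count are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rate-perf-test").getOrCreate()

// The built-in rate source emits (timestamp, value) rows at the requested rate.
val rateStream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10000")   // note the plural: rowsPerSecond, not rowPerSecond
  .option("numPartitions", "8")       // illustrative parallelism
  .load()

rateStream.writeStream
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()
```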
QUESTION
I am new to Spark and have a setup where I want to read in two streams of data, each from Kafka topics, using Spark structured streaming 2.4. I then want to join these two streams, likely with a very large window of time.
...ANSWER
Answered 2021-Mar-21 at 11:01
You need to use checkpointing on the writeStream - it will track offsets for all sources that are used for your operations and store them in the checkpoint directory, so when you restart the application, it will read the offsets for all sources and continue from them. The offset that you specify in readStream is just for the case when you don't have a checkpoint directory yet - after the first query it will be filled with real offsets, and the value specified in the options won't be used (until you remove the checkpoint directory).
Read the Structured Streaming documentation to understand how it works.
P.S. The checkpoint that you're using in the last example is another thing - it is not for Structured Streaming.
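A sketch of what that looks like for two Kafka sources, using placeholder topic names, watermarks and paths (this is not the asker's code, just an assumed shape):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("two-stream-join").getOrCreate()

// Helper that reads one Kafka topic and prefixes the column names so the two
// sides of the join don't clash.
def kafkaStream(topic: String, prefix: String) =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", topic)
    // Only consulted on the very first run; afterwards offsets come from the checkpoint.
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr(
      s"CAST(key AS STRING) AS ${prefix}_id",
      s"CAST(value AS STRING) AS ${prefix}_payload",
      s"timestamp AS ${prefix}_ts")

val left  = kafkaStream("topic-a", "a").withWatermark("a_ts", "2 hours")
val right = kafkaStream("topic-b", "b").withWatermark("b_ts", "2 hours")

// Stream-stream inner join on the key, bounded in time so old state can be cleaned up.
val joined = left.join(
  right,
  expr("a_id = b_id AND b_ts BETWEEN a_ts - INTERVAL 1 HOUR AND a_ts + INTERVAL 1 HOUR"))

joined.writeStream
  .format("parquet")
  .option("path", "/tmp/joined-output")
  // The checkpoint records offsets for BOTH Kafka sources plus the join state,
  // so a restart resumes exactly where the query left off.
  .option("checkpointLocation", "/tmp/checkpoints/two-stream-join")
  .start()
  .awaitTermination()
```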
QUESTION
Based on the introduction in Spark 3.0 (https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html), it should be possible to set "kafka.group.id" to track the offsets. For our use case, I want to avoid potential data loss if the streaming Spark job fails and restarts. Based on my previous questions, I have a feeling that kafka.group.id in Spark 3.0 is something that will help:
How to specify the group id of kafka consumer for spark structured streaming?
How to ensure no data loss for kafka data ingestion through Spark Structured Streaming?
However, I tried the settings in spark 3.0 as below.
...ANSWER
Answered 2020-Sep-25 at 11:18
According to the Spark Structured Streaming + Kafka Integration Guide, Spark itself keeps track of the offsets and no offsets are committed back to Kafka. That means if your Spark Streaming job fails and you restart it, all necessary information on the offsets is stored in Spark's checkpointing files.
Even if you set the ConsumerGroup name with kafka.group.id, your application will still not commit the messages back to Kafka. The information on the next offset to read is only available in the checkpointing files of your Spark application.
If you stop and restart your application without a re-deployment and ensure that you do not delete old checkpoint files, your application will continue reading from where it left off.
In the Spark Structured Streaming documentation on Recovering from Failures with Checkpointing it is written that:
"In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) [...]"
This can be achieved by setting the following option in your writeStream query (it is not sufficient to set the checkpoint directory in your SparkContext configurations):
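For illustration, a sketch with assumed topic, group id and paths; the key point is the checkpointLocation option on the writeStream:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-exactly-once-read").getOrCreate()

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")                 // placeholder topic
  // Optional in Spark 3.0+: a fixed consumer group id (e.g. for ACL purposes).
  // Offsets are still NOT committed to this group; Spark tracks them itself.
  .option("kafka.group.id", "my-spark-consumer") // placeholder group id
  .load()

input.writeStream
  .format("parquet")
  .option("path", "/data/events")                // placeholder output path
  // This option is what enables recovery; setting a checkpoint directory on the
  // SparkContext alone is not sufficient.
  .option("checkpointLocation", "/data/checkpoints/events")
  .start()
  .awaitTermination()
```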
QUESTION
I am trying to run a Spark application in Scala to connect to ActiveMQ. I am using Bahir for this purpose: format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider"). When I use Bahir 2.2 in my build.sbt the application runs fine, but on changing it to Bahir 3.0 or Bahir 4.0 the application does not start and gives an error:
ANSWER
Answered 2020-Dec-09 at 18:38
Okay, so it seems to be some kind of compatibility issue between Spark 2.4 and Bahir 2.4. I fixed it by rolling back both of them to version 2.3.
Here is my build.sbt
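The original build.sbt is not reproduced on this page; a minimal sketch along these lines (artifact names and exact 2.3.x versions are assumptions) pins Spark and Bahir to matching 2.3 releases:

```scala
// build.sbt (illustrative; versions assumed)
name := "activemq-structured-streaming"
scalaVersion := "2.11.12"

val sparkVersion = "2.3.4"
val bahirVersion = "2.3.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"                % sparkVersion,
  // Bahir's MQTT source for Structured Streaming, matching the Spark minor version
  "org.apache.bahir" %% "spark-sql-streaming-mqtt" % bahirVersion
)
```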
QUESTION
I am trying very hard to understand the timeout setup when using mapGroupsWithState for Spark Structured Streaming.
The link below has a very detailed specification, but I am not sure I understood it properly, especially the GroupState.setTimeoutTimestamp() option - meaning, when setting up the state expiry, how it relates to the event time.
https://spark.apache.org/docs/3.0.0-preview/api/scala/org/apache/spark/sql/streaming/GroupState.html
I copied them out here:
...ANSWER
Answered 2020-Oct-11 at 18:46
What is this timestamp in the sentence "and the timeout would occur when the watermark advances beyond the set timestamp"?
This is the timestamp you set by GroupState.setTimeoutTimestamp().
Is it an absolute time or is it a relative time duration to the current event time in the state?
This is a relative time (not a duration) based on the current batch window.
Say I have some data state (column timestamp=2020-08-02 22:02:00); when will it expire, by setting what value in what settings?
Let's assume your sink query has a defined processing trigger (set by trigger()) of 5 minutes. Also, let us assume that you have used a watermark before applying the groupByKey and the mapGroupsWithState. I understand you want to use timeouts based on event times (as opposed to processing times), so your query will be like:
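The asker's actual query isn't shown here; a self-contained sketch of an event-time timeout with mapGroupsWithState (the schema, state type, durations and input source are illustrative) might look like:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}

case class Event(key: String, timestamp: Timestamp, value: Long)
case class KeyState(count: Long)
case class KeyResult(key: String, count: Long, expired: Boolean)

val spark = SparkSession.builder().appName("event-time-timeout").getOrCreate()
import spark.implicits._

// Illustrative input: the built-in rate source mapped onto a small case class.
val events: Dataset[Event] = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()
  .selectExpr("CAST(value % 10 AS STRING) AS key", "timestamp", "value")
  .as[Event]

val results = events
  // The watermark drives event-time timeouts: a key's state can only time out
  // once the watermark passes the timestamp set via setTimeoutTimestamp below.
  .withWatermark("timestamp", "10 minutes")
  .groupByKey(_.key)
  .mapGroupsWithState[KeyState, KeyResult](GroupStateTimeout.EventTimeTimeout) {
    (key: String, rows: Iterator[Event], state: GroupState[KeyState]) =>
      if (state.hasTimedOut) {
        // The watermark has passed the timeout timestamp: emit and drop the state.
        val result = KeyResult(key, state.get.count, expired = true)
        state.remove()
        result
      } else {
        val batch = rows.toSeq
        val newCount = state.getOption.map(_.count).getOrElse(0L) + batch.size
        state.update(KeyState(newCount))
        // Expire this key one hour (illustrative) after the latest event time seen,
        // once the watermark advances beyond that point.
        state.setTimeoutTimestamp(batch.map(_.timestamp.getTime).max, "1 hour")
        KeyResult(key, newCount, expired = false)
      }
  }

results.writeStream
  .outputMode(OutputMode.Update())
  .format("console")
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .option("checkpointLocation", "/tmp/checkpoints/map-groups")
  .start()
  .awaitTermination()
```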
QUESTION
I'm running a 1-node cluster of Kafka, Spark and Cassandra. All locally on the same machine.
From a simple Python script I'm streaming some dummy data every 5 seconds into a Kafka topic. Then, using Spark Structured Streaming, I'm reading this data stream (one row at a time) into a PySpark DataFrame with startingOffsets = latest. Finally, I'm trying to append this row to an already existing Cassandra table.
I've been following (How to write streaming Dataset to Cassandra?) and (Cassandra Sink for PySpark Structured Streaming from Kafka topic).
One row of data is being successfully written into the Cassandra table but my problem is it's being overwritten every time rather than appended to the end of the table. What might I be doing wrong?
Here's my code:
CQL DDL for creating the kafkaspark keyspace followed by the randintstream table in Cassandra:
ANSWER
Answered 2020-Oct-21 at 14:08
If the row is always rewritten in Cassandra, then you may have an incorrect primary key in the table - you need to make sure that every row has a unique primary key. If you're creating the Cassandra table from Spark, then by default it just takes the first column as the partition key, and that alone may not be unique.
Update after schema was provided:
Yes, that's the case I was referring to - you have a primary key of (partition, topic), but every row from a specific partition that you read from that topic will have the same value for the primary key, so it will overwrite previous versions. You need to make your primary key unique - for example, add the offset or timestamp columns to the primary key (although timestamp may not be unique if you have data produced within the same millisecond).
P.S. Also, in connector 3.0.0 you don't need foreachBatch:
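As a hedged sketch of that (column names are assumptions and must match the Cassandra table; the keyspace and table names come from the question above), with Spark Cassandra Connector 3.0.0+ the streaming DataFrame can be written directly:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-to-cassandra")
  .config("spark.cassandra.connection.host", "localhost")
  .getOrCreate()

val rows = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "my-topic")                       // placeholder topic
  .option("startingOffsets", "latest")
  .load()
  // Column names here are assumptions; they must match the Cassandra table's columns.
  .selectExpr("CAST(value AS INT) AS randint", "partition", "topic", "offset")

// With Spark Cassandra Connector 3.0.0+ a streaming DataFrame can be written
// directly, without wrapping the write in foreachBatch.
rows.writeStream
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "kafkaspark")
  .option("table", "randintstream")
  .option("checkpointLocation", "/tmp/checkpoints/randintstream")
  .outputMode("append")
  .start()
  .awaitTermination()
```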
QUESTION
(End goal) Before trying out whether I could eventually read Avro data, using a Spark stream, out of the Confluent Platform as described here: Integrating Spark Structured Streaming with the Confluent Schema Registry
I'd like to verify whether I could use the command below to read them:
...ANSWER
Answered 2020-Sep-10 at 20:11
If you are getting Unknown Magic Byte with the consumer, then the producer didn't use the Confluent AvroSerializer, and might have pushed Avro data that doesn't use the Schema Registry.
Without seeing the Producer code or consuming and inspecting the data in binary format, it is difficult to know which is the case.
The message was produced using confluent connect file-pulse.
Did you use value.converter with the AvroConverter class?
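For context, a sketch of reading Confluent-framed Avro from Kafka with Spark 3.x (requires the spark-avro module; the topic and schema string are placeholders - in practice the schema comes from the Schema Registry). Records written by the Confluent AvroSerializer/AvroConverter carry a 5-byte header (magic byte plus schema id) that must be stripped before from_avro; a raw "Unknown Magic Byte" error on a Confluent consumer usually means that header is missing entirely.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("confluent-avro-read").getOrCreate()

// Placeholder Avro schema; in practice this is fetched from the Schema Registry.
val avroSchema =
  """{"type":"record","name":"Record","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"name","type":"string"}
    |]}""".stripMargin

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "file-pulse-topic")   // placeholder topic
  .load()

// Strip the 5-byte Confluent wire-format header (1 magic byte + 4-byte schema id)
// before handing the remaining bytes to from_avro.
val decoded = raw
  .select(expr("substring(value, 6, length(value) - 5)").as("avro_payload"))
  .select(from_avro(col("avro_payload"), avroSchema).as("record"))
  .select("record.*")

decoded.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/avro")
  .start()
  .awaitTermination()
```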
QUESTION
What are possible ways to successfully process data with PySpark 3.0.0 from a pure pip installation, or at least to load data without downgrading the version of Spark?
When I attempted to load parquet and csv datasets, I got the exception shown under Exception Message below. The initialization of the Spark session is fine, yet when I tried to load the datasets, it went wrong.
- Java: openjdk 11
- Python: 3.8.5
- Mode: local mode
- Operating System: Ubuntu 16.04.6 LTS
- Notes:
  - I executed python3.8 -m pip install pyspark to install Spark.
  - When I looked up the jar spark-sql_2.12-3.0.0.jar (which is under the Python site-package path, i.e., ~/.local/lib/python3.8/site-packages/pyspark/jars in my case), there is no v2 under spark.sql.sources; the most similar thing I found is an interface called DataSourceRegister under the same package.
  - The most similar question I found on Stack Overflow is PySpark structured Streaming + Kafka Error (Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.StreamWriteSupport), where downgrading the Spark version is recommended throughout the information on that page.
ANSWER
Answered 2020-Jul-30 at 10:00
Currently, I have found a way out for manipulating data via the Python function APIs for Spark.
Workaround 1
QUESTION
I am attempting to read avro data from Kafka using Spark Streaming but I receive the following error message:
...ANSWER
Answered 2020-Jul-31 at 15:58
Figured it out - the problem was not, as I had thought, with the Spark-Kafka integration directly, but with the checkpoint information inside the HDFS filesystem. Deleting and recreating the checkpoint folder in HDFS solved it for me.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported