spark-structured-streaming | Spark structured streaming with Kafka data source | Pub Sub library
kandi X-RAY | spark-structured-streaming Summary
Inside the setup directory, run docker-compose up -d to launch instances of Zookeeper, Kafka and Cassandra. Wait a few seconds and then run docker ps to make sure all three services are running. Then run pip install -r requirements.txt. main.py generates some random data and publishes it to a topic in Kafka. Run the Spark app using sbt clean compile run in a console; this app listens on a topic (check Main.scala) and writes the data to Cassandra. Run main.py again to write some test data to a Kafka topic, and finally check that the data has been written to Cassandra.
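The contents of Main.scala aren't reproduced on this page; as a rough, illustrative sketch (the topic, keyspace, table and checkpoint names are placeholders, not taken from the repository), a Kafka-to-Cassandra app of this shape could look like:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-structured-streaming")
      .config("spark.cassandra.connection.host", "localhost")
      .getOrCreate()

    // Read the raw records published to Kafka by main.py.
    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test-topic")                  // placeholder topic name
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // Write each micro-batch to Cassandra.
    val writeBatch: (DataFrame, Long) => Unit = (batch, _) =>
      batch.write
        .format("org.apache.spark.sql.cassandra")
        .option("keyspace", "my_keyspace")                // placeholder keyspace
        .option("table", "my_table")                      // placeholder table
        .mode("append")
        .save()

    kafkaStream.writeStream
      .foreachBatch(writeBatch)
      .option("checkpointLocation", "/tmp/checkpoints/kafka-to-cassandra")
      .start()
      .awaitTermination()
  }
}
```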
Community Discussions
Trending Discussions on spark-structured-streaming
QUESTION
Can someone let me know if it's possible to do Spark Structured Streaming from a JDBC source, e.g. a SQL DB or any RDBMS?
I have looked at a few similar questions on SO, e.g
jdbc source and spark structured streaming
However, I would like to know if it's officially supported by Apache Spark.
If there is any sample code that would be helpful.
Thanks
...ANSWER
Answered 2022-Feb-26 at 07:43
No, there is no such built-in support in Spark Structured Streaming. The main reason is that most databases don't provide a unified interface for obtaining the changes.
It's possible to get changes from some databases using archive logs, write-ahead logs, etc., but it's database-specific. For many databases the popular choice is Debezium, which can read such logs and push the list of changes into Kafka (or something similar), from which they can be consumed by Spark.
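As a hedged illustration of that last point, a Spark job can subscribe to a Debezium change-event topic like any other Kafka topic (the bootstrap servers, topic name and checkpoint path below are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cdc-consumer").getOrCreate()

// Debezium publishes one change-event topic per table; the topic name below is a placeholder.
val changes = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "dbserver1.inventory.customers")
  .option("startingOffsets", "earliest")
  .load()
  // Debezium change events arrive as JSON (or Avro) in the Kafka value; parse downstream as needed.
  .selectExpr("CAST(value AS STRING) AS json")

changes.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/cdc")
  .start()
  .awaitTermination()
```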
QUESTION
I am using the Spark RateStreamSource to generate massive amounts of data per second for a performance test.
To check that I actually get the amount of concurrency I want, I have set the rowPerSecond option to a high number (10000).
...ANSWER
Answered 2021-Dec-13 at 13:54
You have a typo in the option - it should be rowsPerSecond.
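A minimal sketch of the rate source with the correctly spelled option (the rate and partition count are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rate-perf-test").getOrCreate()

// The built-in rate source emits (timestamp, value) rows at the requested rate.
val rateStream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10000")   // note the plural: rowsPerSecond, not rowPerSecond
  .option("numPartitions", "8")       // illustrative parallelism
  .load()

rateStream.writeStream
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()
```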
QUESTION
I am new to Spark and have a setup where I want to read in two streams of data, each from Kafka topics, using Spark structured streaming 2.4. I then want to join these two streams, likely with a very large window of time.
...ANSWER
Answered 2021-Mar-21 at 11:01
You need to use checkpointing on the writeStream - it will track offsets for all sources that are used for your operations and store them in the checkpoint directory, so when you restart the application, it will read the offsets for all sources and continue from them. The offset that you specify in readStream is just for the case when you don't have a checkpoint directory yet - after the first query it will be filled with real offsets, and the value specified in the options won't be used (until you remove the checkpoint directory).
Read the Structured Streaming documentation to understand how it works.
P.S. The checkpoint that you're using in the last example is another thing - it is not for Structured Streaming.
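A sketch of what that looks like for two Kafka sources, using placeholder topic names, watermarks and paths (this is not the asker's code, just an assumed shape):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("two-stream-join").getOrCreate()

// Helper that reads one Kafka topic and prefixes the column names so the two
// sides of the join don't clash.
def kafkaStream(topic: String, prefix: String) =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", topic)
    // Only consulted on the very first run; afterwards offsets come from the checkpoint.
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr(
      s"CAST(key AS STRING) AS ${prefix}_id",
      s"CAST(value AS STRING) AS ${prefix}_payload",
      s"timestamp AS ${prefix}_ts")

val left  = kafkaStream("topic-a", "a").withWatermark("a_ts", "2 hours")
val right = kafkaStream("topic-b", "b").withWatermark("b_ts", "2 hours")

// Stream-stream inner join on the key, bounded in time so old state can be cleaned up.
val joined = left.join(
  right,
  expr("a_id = b_id AND b_ts BETWEEN a_ts - INTERVAL 1 HOUR AND a_ts + INTERVAL 1 HOUR"))

joined.writeStream
  .format("parquet")
  .option("path", "/tmp/joined-output")
  // The checkpoint records offsets for BOTH Kafka sources plus the join state,
  // so a restart resumes exactly where the query left off.
  .option("checkpointLocation", "/tmp/checkpoints/two-stream-join")
  .start()
  .awaitTermination()
```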
QUESTION
Based on the introduction in Spark 3.0 (https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html), it should be possible to set "kafka.group.id" to track the offsets. For our use case, I want to avoid potential data loss if the streaming Spark job fails and restarts. Based on my previous questions, I have a feeling that kafka.group.id in Spark 3.0 is something that will help:
How to specify the group id of kafka consumer for spark structured streaming?
How to ensure no data loss for kafka data ingestion through Spark Structured Streaming?
However, I tried the settings in spark 3.0 as below.
...ANSWER
Answered 2020-Sep-25 at 11:18
According to the Spark Structured Streaming + Kafka Integration Guide, Spark itself keeps track of the offsets and no offsets are committed back to Kafka. That means if your Spark Streaming job fails and you restart it, all necessary information on the offsets is stored in Spark's checkpointing files.
Even if you set the ConsumerGroup name with kafka.group.id, your application will still not commit the messages back to Kafka. The information on the next offset to read is only available in the checkpointing files of your Spark application.
If you stop and restart your application without a re-deployment and ensure that you do not delete old checkpoint files, your application will continue reading from where it left off.
In the Spark Structured Streaming documentation on Recovering from Failures with Checkpointing it is written that:
"In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) [...]"
This can be achieved by setting the following option in your writeStream query (it is not sufficient to set the checkpoint directory in your SparkContext configurations):
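For illustration, a sketch with assumed topic, group id and paths; the key point is the checkpointLocation option on the writeStream:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-exactly-once-read").getOrCreate()

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")                 // placeholder topic
  // Optional in Spark 3.0+: a fixed consumer group id (e.g. for ACL purposes).
  // Offsets are still NOT committed to this group; Spark tracks them itself.
  .option("kafka.group.id", "my-spark-consumer") // placeholder group id
  .load()

input.writeStream
  .format("parquet")
  .option("path", "/data/events")                // placeholder output path
  // This option is what enables recovery; setting a checkpoint directory on the
  // SparkContext alone is not sufficient.
  .option("checkpointLocation", "/data/checkpoints/events")
  .start()
  .awaitTermination()
```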
QUESTION
I am trying to run a Spark application in Scala to connect to ActiveMQ. I am using Bahir for this purpose: format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider"). When I use Bahir 2.2 in my build.sbt the application runs fine, but on changing it to Bahir 3.0 or Bahir 4.0 the application does not start and gives an error:
ANSWER
Answered 2020-Dec-09 at 18:38
Okay, so it seems to be some kind of compatibility issue between Spark 2.4 and Bahir 2.4. I fixed it by rolling back both of them to version 2.3.
Here is my build.sbt
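The original build.sbt is not reproduced on this page; a minimal sketch along these lines (artifact names and exact 2.3.x versions are assumptions) pins Spark and Bahir to matching 2.3 releases:

```scala
// build.sbt (illustrative; versions assumed)
name := "activemq-structured-streaming"
scalaVersion := "2.11.12"

val sparkVersion = "2.3.4"
val bahirVersion = "2.3.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"                % sparkVersion,
  // Bahir's MQTT source for Structured Streaming, matching the Spark minor version
  "org.apache.bahir" %% "spark-sql-streaming-mqtt" % bahirVersion
)
```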
QUESTION
I am trying very hard to understand the timeout setup when using mapGroupsWithState for Spark Structured Streaming.
The link below has a very detailed specification, but I am not sure I understood it properly, especially the GroupState.setTimeoutTimestamp() option - meaning, when setting up the state expiry, how it relates to the event time.
https://spark.apache.org/docs/3.0.0-preview/api/scala/org/apache/spark/sql/streaming/GroupState.html
I copied them out here:
...ANSWER
Answered 2020-Oct-11 at 18:46
What is this timestamp in the sentence "and the timeout would occur when the watermark advances beyond the set timestamp"?
This is the timestamp you set by GroupState.setTimeoutTimestamp().
Is it an absolute time or is it a relative time duration to the current event time in the state?
This is a relative time (not a duration) based on the current batch window.
Say I have some data state (column timestamp=2020-08-02 22:02:00); when will it expire, by setting what value in what settings?
Let's assume your sink query has a defined processing trigger (set by trigger()) of 5 minutes. Also, let us assume that you have used a watermark before applying the groupByKey and the mapGroupsWithState. I understand you want to use timeouts based on event times (as opposed to processing times), so your query will be like:
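The asker's actual query isn't shown here; a self-contained sketch of an event-time timeout with mapGroupsWithState (the schema, state type, durations and input source are illustrative) might look like:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}

case class Event(key: String, timestamp: Timestamp, value: Long)
case class KeyState(count: Long)
case class KeyResult(key: String, count: Long, expired: Boolean)

val spark = SparkSession.builder().appName("event-time-timeout").getOrCreate()
import spark.implicits._

// Illustrative input: the built-in rate source mapped onto a small case class.
val events: Dataset[Event] = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()
  .selectExpr("CAST(value % 10 AS STRING) AS key", "timestamp", "value")
  .as[Event]

val results = events
  // The watermark drives event-time timeouts: a key's state can only time out
  // once the watermark passes the timestamp set via setTimeoutTimestamp below.
  .withWatermark("timestamp", "10 minutes")
  .groupByKey(_.key)
  .mapGroupsWithState[KeyState, KeyResult](GroupStateTimeout.EventTimeTimeout) {
    (key: String, rows: Iterator[Event], state: GroupState[KeyState]) =>
      if (state.hasTimedOut) {
        // The watermark has passed the timeout timestamp: emit and drop the state.
        val result = KeyResult(key, state.get.count, expired = true)
        state.remove()
        result
      } else {
        val batch = rows.toSeq
        val newCount = state.getOption.map(_.count).getOrElse(0L) + batch.size
        state.update(KeyState(newCount))
        // Expire this key one hour (illustrative) after the latest event time seen,
        // once the watermark advances beyond that point.
        state.setTimeoutTimestamp(batch.map(_.timestamp.getTime).max, "1 hour")
        KeyResult(key, newCount, expired = false)
      }
  }

results.writeStream
  .outputMode(OutputMode.Update())
  .format("console")
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .option("checkpointLocation", "/tmp/checkpoints/map-groups")
  .start()
  .awaitTermination()
```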
QUESTION
I'm running a 1-node cluster of Kafka, Spark and Cassandra. All locally on the same machine.
From a simple Python script I'm streaming some dummy data every 5 seconds into a Kafka topic. Then, using Spark Structured Streaming, I'm reading this data stream (one row at a time) into a PySpark DataFrame with startingOffsets = latest. Finally, I'm trying to append this row to an already existing Cassandra table.
I've been following (How to write streaming Dataset to Cassandra?) and (Cassandra Sink for PySpark Structured Streaming from Kafka topic).
One row of data is being successfully written into the Cassandra table but my problem is it's being overwritten every time rather than appended to the end of the table. What might I be doing wrong?
Here's my code:
CQL DDL for creating the kafkaspark keyspace followed by the randintstream table in Cassandra:
ANSWER
Answered 2020-Oct-21 at 14:08
If the row is always rewritten in Cassandra, then you may have an incorrect primary key in the table - you need to make sure that every row has a unique primary key. If you're creating the Cassandra table from Spark, then by default it just takes the first column as the partition key, and that alone may not be unique.
Update after schema was provided:
Yes, that's the case I was referring to - you have a primary key of (partition, topic), but every row from a specific partition that you read from that topic will have the same value for the primary key, so it will overwrite previous versions. You need to make your primary key unique - for example, add the offset or timestamp columns to the primary key (although timestamp may not be unique if you have data produced within the same millisecond).
P.S. Also, in connector 3.0.0 you don't need foreachBatch:
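As a hedged sketch of that (column names are assumptions and must match the Cassandra table; the keyspace and table names come from the question above), with Spark Cassandra Connector 3.0.0+ the streaming DataFrame can be written directly:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-to-cassandra")
  .config("spark.cassandra.connection.host", "localhost")
  .getOrCreate()

val rows = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "my-topic")                       // placeholder topic
  .option("startingOffsets", "latest")
  .load()
  // Column names here are assumptions; they must match the Cassandra table's columns.
  .selectExpr("CAST(value AS INT) AS randint", "partition", "topic", "offset")

// With Spark Cassandra Connector 3.0.0+ a streaming DataFrame can be written
// directly, without wrapping the write in foreachBatch.
rows.writeStream
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "kafkaspark")
  .option("table", "randintstream")
  .option("checkpointLocation", "/tmp/checkpoints/randintstream")
  .outputMode("append")
  .start()
  .awaitTermination()
```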
QUESTION
(End goal) Before trying out whether I could eventually read Avro data, using a Spark stream, out of the Confluent Platform as described here: Integrating Spark Structured Streaming with the Confluent Schema Registry
I'd like to verify whether I could use the command below to read them:
...ANSWER
Answered 2020-Sep-10 at 20:11
If you are getting Unknown Magic Byte with the consumer, then the producer didn't use the Confluent AvroSerializer, and might have pushed Avro data that doesn't use the Schema Registry.
Without seeing the Producer code or consuming and inspecting the data in binary format, it is difficult to know which is the case.
The message was produced using confluent connect file-pulse.
Did you use value.converter with the AvroConverter class?
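For context, a sketch of reading Confluent-framed Avro from Kafka with Spark 3.x (requires the spark-avro module; the topic and schema string are placeholders - in practice the schema comes from the Schema Registry). Records written by the Confluent AvroSerializer/AvroConverter carry a 5-byte header (magic byte plus schema id) that must be stripped before from_avro; a raw "Unknown Magic Byte" error on a Confluent consumer usually means that header is missing entirely.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("confluent-avro-read").getOrCreate()

// Placeholder Avro schema; in practice this is fetched from the Schema Registry.
val avroSchema =
  """{"type":"record","name":"Record","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"name","type":"string"}
    |]}""".stripMargin

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "file-pulse-topic")   // placeholder topic
  .load()

// Strip the 5-byte Confluent wire-format header (1 magic byte + 4-byte schema id)
// before handing the remaining bytes to from_avro.
val decoded = raw
  .select(expr("substring(value, 6, length(value) - 5)").as("avro_payload"))
  .select(from_avro(col("avro_payload"), avroSchema).as("record"))
  .select("record.*")

decoded.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/avro")
  .start()
  .awaitTermination()
```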
QUESTION
What are possible ways to successfully process data with PySpark 3.0.0 from a pure pip installation, or at least to load data without downgrading the version of Spark?
When I attempted to load parquet and csv datasets, I got the exception shown under Exception Message below. The initialization of the Spark session is fine, yet when I tried to load the datasets, it went wrong.
- Java: openjdk 11
- Python: 3.8.5
- Mode: local mode
- Operating System: Ubuntu 16.04.6 LTS
- Notes:
  - I executed python3.8 -m pip install pyspark to install Spark.
  - When I looked up the jar spark-sql_2.12-3.0.0.jar (which is under the Python site-package path, i.e., ~/.local/lib/python3.8/site-packages/pyspark/jars in my case), there is no v2 under spark.sql.sources; the most similar thing I found is an interface called DataSourceRegister under the same package.
  - The most similar question I found on Stack Overflow is PySpark structured Streaming + Kafka Error (Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.StreamWriteSupport), where downgrading the Spark version is recommended throughout the information on that page.
ANSWER
Answered 2020-Jul-30 at 10:00
Currently, I have found a way out for manipulating data via the Python function APIs for Spark.
Workaround 1
QUESTION
I am attempting to read avro data from Kafka using Spark Streaming but I receive the following error message:
...ANSWER
Answered 2020-Jul-31 at 15:58
Figured it out - the problem was not, as I had thought, with the Spark-Kafka integration directly, but with the checkpoint information inside the HDFS filesystem. Deleting and recreating the checkpoint folder in HDFS solved it for me.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported