spark-kafka | Low level integration of Spark and Kafka
kandi X-RAY | spark-kafka Summary
Low level integration of Spark and Kafka
spark-kafka Key Features
spark-kafka Examples and Code Snippets
Community Discussions
Trending Discussions on spark-kafka
QUESTION
I am attempting to read from a Kafka topic, aggregate some data over a tumbling window and write that to a sink (I've been trying both with Kafka and console).
The problems I'm seeing are
- a long delay between sending data and receiving aggregate records for a window on the sink (minutes after the expected triggers should fire)
- duplicate records from previous window aggregations appearing in subsequent windows
Why is the delay so long, and what can I do to reduce it?
Why are duplicate records from previous windows showing up and how can I remove them?
The delays seem to be especially bad as the window gets shorter - it was 3+ minutes when I had the window duration set to 10 seconds, around 2 minutes when the window duration was set to 60 seconds.
With the shortest window times I'm also seeing the records getting "bunched up" so that when records are received by the sink I receive those for several windows at a time.
On the duplicate aggregate records: I do have the output mode set to complete, but my understanding is that records should only be repeated within the current window if the trigger fires multiple times within it, which mine shouldn't be doing.
I have a processing trigger set up matching the window time and a watermark threshold of 10% (1 or 6 seconds) and I know the stream itself works fine if I remove the tumbling window.
I get why Spark might not be able to hit a certain trigger frequency, but I'd think 10 and certainly 60 seconds would be more than enough time to process the very limited amount of data I am testing with.
An example of sending data with a 60 second tumbling window and processing time trigger
- send 6 payloads
- wait a minute
- send 1 payload
- wait a while
- send 3 payloads
(CreateTime is coming from kafka-console-consumer with --property print.timestamp=true). These arrive a couple of minutes after I would expect the trigger to fire based on the CreateTime timestamp and window.
...ANSWER
Answered 2022-Feb-09 at 14:49
For the long delays, the cause is likely insufficient resources to process the messages, as the warning message suggests. You can check the Spark UI to understand why: it could be data skew between partitions, or the job may need more memory or cores.
For the duplicate records, you may want to try the update or append output mode instead. Complete mode means the whole Result Table is written to the sink after every trigger, which is why you see duplicates. See https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
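For reference, a minimal Scala sketch of a 60-second tumbling-window aggregation written in update mode (this is not the asker's code; the broker address, topic name and column handling are assumptions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("windowed-agg").getOrCreate()
import spark.implicits._

// Read the raw stream; the broker address and topic name are placeholders.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// Use the Kafka record timestamp as the event time for this example.
val counts = input
  .select($"timestamp", $"value".cast("string").as("payload"))
  .withWatermark("timestamp", "6 seconds")
  .groupBy(window($"timestamp", "60 seconds"))
  .count()

// "update" emits only rows that changed since the last trigger, so windows that
// were already finalized are not re-emitted the way they are with "complete".
val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .option("truncate", "false")
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .start()

query.awaitTermination()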
QUESTION
I am fairly new to Spark and am still learning. One of the more difficult concepts I have come across is checkpointing and how Spark uses it to recover from failures. I am doing batch reads from Kafka using Structured Streaming and writing them to S3 as Parquet files as follows:
...ANSWER
Answered 2021-Oct-15 at 09:04
I am not sure why batch mode for Spark Structured Streaming with Kafka still exists. If you wish to use it, then you must code your own offset management; see the guide, but it is badly explained there.
I would say Trigger.Once is a better fit for your use case: offset management is then provided by Spark, since it is not batch mode.
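As a rough illustration of that suggestion, a minimal Scala sketch of a Kafka-to-Parquet job using Trigger.Once; the broker address, topic name, S3 paths and the key/value casting are placeholders, not taken from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("kafka-to-s3").getOrCreate()

// Broker, topic, bucket and paths below are placeholders.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "my-topic")
  .option("startingOffsets", "earliest")
  .load()

// Trigger.Once processes everything that arrived since the last run and then stops.
// Offsets are tracked in the checkpoint directory, so no custom offset management is needed.
val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/output/")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
  .trigger(Trigger.Once())
  .start()

query.awaitTermination()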
QUESTION
I have a long-running PySpark Structured Streaming job, which reads a Kafka topic, does some processing and writes the result back to another Kafka topic. Our Kafka server runs on another cluster.
It's running fine, but every few hours it freezes, even though in the web UI the YARN application still has the status "running". After inspecting the logs, it seems to be due to a transient connectivity problem with the Kafka source. Indeed, all tasks of the problematic micro-batch have completed correctly, except one which shows:
...ANSWER
Answered 2021-Aug-24 at 08:26
I haven't found a solution that does it with YARN, but here is a workaround using a monitoring loop in the PySpark driver. The loop checks the query status regularly and fails the streaming app if the status hasn't been updated for 10 minutes.
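The answer's code is not reproduced on this page, so the following is a hedged Scala sketch of the same idea (the PySpark StreamingQuery API is analogous). `query` is assumed to be the StreamingQuery returned by writeStream.start(), and the 10-minute threshold and 60-second polling interval are illustrative choices:

import java.time.{Duration, Instant}

// `query` is assumed to be the StreamingQuery returned by writeStream.start().
val staleAfter = Duration.ofMinutes(10)

while (query.isActive) {
  val progress = query.lastProgress            // null until the first micro-batch completes
  if (progress != null) {
    val lastBatch = Instant.parse(progress.timestamp)
    if (Duration.between(lastBatch, Instant.now()).compareTo(staleAfter) > 0) {
      // No progress for 10 minutes: stop the query and fail the driver so YARN can restart the app.
      query.stop()
      throw new RuntimeException("Streaming query made no progress for 10 minutes")
    }
  }
  Thread.sleep(60 * 1000)                      // poll once a minute
}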
QUESTION
I am trying to execute a simple Spark Structured Streaming application which for now does not do much except for pulling from a local Kafka cluster and writing to the local file system. The code looks as follows:
...ANSWER
Answered 2021-Feb-25 at 08:15
As it turns out, this behaviour of seeking and resetting is perfectly normal when one reads the topic not from the beginning but from the latest offset. The pipeline then only reads new data that is sent to the Kafka topic while it is running, and since no new data was sent, you get the endless loop of seeking (for new data) and resetting (to the latest offset).
Bottom line: just read from the beginning, or send new data, and the problem is solved.
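For illustration, a minimal sketch of reading the topic from the beginning (an existing SparkSession named `spark` is assumed; the broker address and topic name are placeholders):

// Reading the topic from the beginning avoids the seek/reset loop when no new records arrive;
// by default the Kafka source for streaming queries starts at "latest".
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test-topic")
  .option("startingOffsets", "earliest")
  .load()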
QUESTION
I plan to extract data from Kafka (self-signed certificate).
My consumer is the following
...ANSWER
Answered 2021-Feb-06 at 01:14
I appended another option that tells Spark to communicate with the Kafka broker over SSL.
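The answer's exact configuration is not shown on this page; the sketch below illustrates the usual SSL options for the Structured Streaming Kafka source (options prefixed with kafka. are passed through to the underlying consumer). The broker address, topic, truststore path and password are placeholders, and an existing SparkSession named `spark` is assumed:

// Options prefixed with "kafka." are handed straight to the underlying Kafka consumer.
// The truststore should contain the broker's self-signed certificate.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9093")
  .option("subscribe", "secure-topic")
  .option("kafka.security.protocol", "SSL")
  .option("kafka.ssl.truststore.location", "/path/to/truststore.jks")
  .option("kafka.ssl.truststore.password", "changeit")
  .load()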
QUESTION
I have a Spark Structured Streaming application consuming from Kafka, and for this application I would like to monitor the consumer lag. I'm using the command below to check consumer lag. However, I don't get the CURRENT-OFFSET and hence LAG is blank too. Is this expected? It works for other Python-based consumers.
Command
...ANSWER
Answered 2021-Jan-22 at 19:18
"However I don't get the CURRENT-OFFSET and hence LAG is blank too. Is this expected?"
Yes, this is the expected behavior, as Spark Structured Streaming applications do not commit any offsets back to Kafka. Therefore, the current offset and the lag of this consumer group are not stored in Kafka, and you see exactly the output of the consumer-groups tool that you have shown.
I have written a more comprehensive answer on Consumer Group and how Spark Structured Streaming applications manage Kafka offsets here.
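As one way to observe progress without Kafka-committed offsets (not part of the original answer, just an illustrative alternative), a StreamingQueryListener can report what the query itself has consumed; all names below are assumptions:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Register a listener on an existing SparkSession `spark`; every completed micro-batch
// reports, per source, the offsets it finished at and the number of rows it read.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    event.progress.sources.foreach { source =>
      println(s"source: ${source.description}, end offsets: ${source.endOffset}, rows: ${source.numInputRows}")
    }
  }
})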
QUESTION
I am able to run my program in standalone mode. But when I try to run it on Dataproc in cluster mode, I get the following error. Please help. My build.sbt:
...ANSWER
Answered 2020-Jul-17 at 20:21
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.internal.connector.SimpleTableProvider
org.apache.spark.sql.internal.connector.SimpleTableProvider was added in v3.0.0-rc1, so you're using spark-submit from Spark 3.0.0 (I guess).
I only now noticed that you use --master yarn and that the exception is thrown at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:686).
I know nothing about Dataproc, but you should review the configuration of YARN / Dataproc and make sure they don't use Spark 3, perhaps.
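As an illustration only (the asker's actual build.sbt is not shown here), a sketch of a build.sbt that keeps the application's Spark version aligned with whatever the cluster runs; every version below is a placeholder:

// build.sbt -- keep these in line with the Spark version the Dataproc/YARN cluster actually runs,
// otherwise classes such as SimpleTableProvider can be missing (or unexpectedly present) at runtime.
val sparkVersion = "3.0.1"        // placeholder: set this to the cluster's Spark version
scalaVersion := "2.12.12"         // placeholder: must match the Scala version Spark was built with

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
)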
QUESTION
I'm using Structured Streaming to read from the Kafka topic, using Spark 2.4 and Scala 2.12.
I'm using a checkpoint to make my query fault-tolerant.
However, every time I start the query it jumps to the current offset without reading the existing data that arrived before it connected to the topic.
Is there a config for the Kafka stream I'm missing?
READ:
...ANSWER
Answered 2020-Jul-11 at 15:22So annonying... i mis-spelled the option startingOffset
the correct way to spell it is:
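For context, a sketch rather than the answer's original snippet (broker and topic are placeholders and an existing SparkSession `spark` is assumed):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "my-topic")
  .option("startingOffsets", "earliest")   // note the trailing "s": unknown options are ignored,
  .load()                                  // so a misspelling silently falls back to the default "latest"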
QUESTION
There are some examples of how to create MQTT sources [1] [2] for Spark Streaming. However, I want to create an MQTT sink where I can publish the results instead of using the print() method. I tried to create an MqttSink but I am getting an "object not serializable" error. I am now basing the code on this blog, but I cannot find the send method that I created on the MqttSink object.
ANSWER
Answered 2020-Jun-18 at 16:39
This is a working example based on the blog entry "Spark and Kafka integration patterns".
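The answer's full code is not reproduced on this page. Below is a hedged Scala sketch of the lazy-initialization pattern that blog describes, applied to MQTT: the sink only captures a factory function, so it serializes cleanly, and the non-serializable client is created lazily on each executor. The use of the Eclipse Paho client (org.eclipse.paho.client.mqttv3) and every name here are assumptions, not the answer's exact code:

import org.eclipse.paho.client.mqttv3.{MqttClient, MqttMessage}
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

// The sink only holds a factory function, so it serializes cleanly; the MQTT client itself
// is created lazily, once per executor, the first time send() is called there.
class MqttSink(createClient: () => MqttClient) extends Serializable {
  lazy val client: MqttClient = createClient()
  def send(topic: String, payload: String): Unit =
    client.publish(topic, new MqttMessage(payload.getBytes("UTF-8")))
}

object MqttSink {
  def apply(brokerUrl: String): MqttSink = {
    val createClient = () => {
      val client = new MqttClient(brokerUrl, MqttClient.generateClientId(), new MemoryPersistence())
      client.connect()
      sys.addShutdownHook { if (client.isConnected) client.disconnect() }  // tidy up when the executor JVM exits
      client
    }
    new MqttSink(createClient)
  }
}

// Usage sketch: build the sink once on the driver, broadcast it, and call send() on the executors, e.g.
// val mqttSink = ssc.sparkContext.broadcast(MqttSink("tcp://localhost:1883"))
// stream.foreachRDD(rdd => rdd.foreach(record => mqttSink.value.send("results", record.toString)))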
QUESTION
I have data like the below in one of the topics I created, named "sampleTopic".
ANSWER
Answered 2020-May-30 at 17:03
The spark-sql-kafka jar is missing; it contains the implementation of the 'kafka' data source.
You can add the jar using a config option, or build a fat jar that includes the spark-sql-kafka jar. Please use the version of the jar that matches your Spark version.
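For illustration, the fat-jar route adds the dependency in build.sbt as below (the version is a placeholder and must match your Spark and Scala versions); alternatively the same artifact can be pulled in at submit time with spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1.

// build.sbt -- bundle the Kafka data source into the fat jar so format("kafka") resolves at runtime.
// The version is a placeholder; keep it in line with your Spark and Scala versions.
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.0.1"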
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install spark-kafka