spark-structured-streaming | Spark structured streaming with Kafka data source | Pub Sub library

by ansrivas | Scala | Version: Current | License: MIT

kandi X-RAY | spark-structured-streaming Summary

spark-structured-streaming is a Scala library typically used in Messaging, Pub Sub, Docker, and Kafka applications. It has no reported bugs or vulnerabilities, a permissive license, and low support. You can download it from GitHub.

1. Inside the setup directory, run docker-compose up -d to launch instances of Zookeeper, Kafka, and Cassandra. Wait a few seconds, then run docker ps to make sure all three services are running.
2. Run pip install -r requirements.txt. main.py generates some random data and publishes it to a Kafka topic.
3. Run the Spark app with sbt clean compile run in a console. The app listens on a topic (see Main.scala) and writes the messages to Cassandra.
4. Run main.py again to write some test data to the Kafka topic.
5. Finally, check that the data has been written to Cassandra.
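For orientation, a minimal sketch of the Kafka-to-Cassandra pipeline the app implements (see Main.scala for the actual code; the topic, keyspace, table, host, and checkpoint path below are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-to-cassandra")
  .config("spark.cassandra.connection.host", "localhost")  // assumption
  .getOrCreate()

// Read the messages that main.py published to Kafka.
val messages = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")     // assumption
  .option("subscribe", "test-topic")                       // assumption: see Main.scala for the real topic
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// Write the stream to Cassandra (this sketch uses the Spark Cassandra Connector's
// streaming sink, available in connector 2.5+; the repo's own sink may differ).
messages.writeStream
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "demo")                              // assumption
  .option("table", "messages")                             // assumption
  .option("checkpointLocation", "/tmp/kafka-cassandra-checkpoint")
  .outputMode("append")
  .start()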

            kandi-support Support

              spark-structured-streaming has a low active ecosystem.
It has 66 stars, 34 forks, and 6 watchers.
              It had no major release in the last 6 months.
              There are 0 open issues and 5 have been closed. On average issues are closed in 8 days. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of spark-structured-streaming is current.

            kandi-Quality Quality

              spark-structured-streaming has 0 bugs and 2 code smells.

            kandi-Security Security

              spark-structured-streaming has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              spark-structured-streaming code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            kandi-License License

              spark-structured-streaming is licensed under the MIT License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              spark-structured-streaming releases are not available. You will need to build from source code and install.
              It has 135 lines of code, 10 functions and 4 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.


            spark-structured-streaming Key Features

            No Key Features are available at this moment for spark-structured-streaming.

            spark-structured-streaming Examples and Code Snippets

            No Code Snippets are available at this moment for spark-structured-streaming.

            Community Discussions

            QUESTION

            Spark structured streaming from JDBC source
            Asked 2022-Feb-26 at 12:10

Can someone let me know if it's possible to do Spark structured streaming from a JDBC source? E.g., SQL DB or any RDBMS.

I have looked at a few similar questions on SO, e.g.:

            Spark streaming jdbc read the stream as and when data comes - Data source jdbc does not support streamed reading

            jdbc source and spark structured streaming

However, I would like to know if it's officially supported by Apache Spark?

If there is any sample code, that would be helpful.

            Thanks

            ...

            ANSWER

            Answered 2022-Feb-26 at 07:43

No, there is no such built-in support in Spark Structured Streaming. The main reason is that most databases don't provide a unified interface for obtaining changes.

It's possible to get changes from some databases using archive logs, write-ahead logs, etc., but it's database-specific. For many databases the popular choice is Debezium, which can read such logs and push the list of changes into Kafka (or something similar), from which it can be consumed by Spark.
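A minimal sketch of the Spark side of that pattern, assuming Debezium is already publishing change events to a Kafka topic (the topic name, broker address, and sink paths below are illustrative, not taken from the answer):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cdc-from-debezium").getOrCreate()

// Debezium has already pushed the change log into Kafka;
// Spark just consumes it as a regular Kafka topic.
val changes = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")   // assumption
  .option("subscribe", "inventory.customers")            // assumption: a Debezium change topic
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")           // Debezium payload as JSON text

changes.writeStream
  .format("parquet")
  .option("path", "/tmp/cdc-out")                        // illustrative sink
  .option("checkpointLocation", "/tmp/cdc-checkpoint")
  .start()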

            Source https://stackoverflow.com/questions/71273386

            QUESTION

Spark streaming rate source generates rows too slowly
            Asked 2021-Dec-13 at 13:54

            I am using Spark RateStreamSource to generate massive data per second for a performance test.

To test that I actually get the amount of concurrency I want, I have set the rowPerSecond option to a high number, 10000:

            ...

            ANSWER

            Answered 2021-Dec-13 at 13:54

            You have a typo in the option - it should be rowsPerSecond.
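For reference, a minimal rate-source sketch with the correctly spelled option (the rate and sink are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rate-test").getOrCreate()

// Built-in rate source; note the plural "rowsPerSecond".
val rates = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10000")
  .load()                               // columns: timestamp, value

rates.writeStream
  .format("console")
  .start()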

            Source https://stackoverflow.com/questions/70252887

            QUESTION

            How to properly save Kafka offset checkpoints for application restart after a join in Spark SQL
            Asked 2021-Mar-21 at 11:01

I am new to Spark and have a setup where I want to read in two streams of data, each from a Kafka topic, using Spark Structured Streaming 2.4. I then want to join these two streams, likely with a very large time window.

            ...

            ANSWER

            Answered 2021-Mar-21 at 11:01

You need to use checkpointing on the writeStream - it will track offsets for all sources that are used for your operations and store them in the checkpoint directory, so when you restart the application, it will read the offsets for all sources and continue from them. The offset that you specify in readStream is just for the case when you don't have a checkpoint directory yet - after the first query it will be filled with real offsets, and the value specified in the options won't be used (until you remove the checkpoint directory).

            Read the Structured Streaming documentation to understand how it works.

P.S. The checkpoint that you're using in the last example is a different thing - it is not for Structured Streaming.
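A minimal sketch of the checkpointed writeStream described above, applied to a two-topic join (topic names, broker address, watermark columns, and paths are assumptions, not from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("two-topic-join").getOrCreate()

def kafkaStream(topic: String) = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // assumption
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")                // only used while no checkpoint exists
  .load()
  .selectExpr("CAST(key AS STRING) AS id", "timestamp")

val left = kafkaStream("topicA").withWatermark("timestamp", "1 hour")
val right = kafkaStream("topicB")
  .withColumnRenamed("id", "rightId")
  .withColumnRenamed("timestamp", "rightTime")
  .withWatermark("rightTime", "1 hour")

// Stream-stream join with an explicit time constraint so state can be cleaned up.
val joined = left.join(
  right,
  expr("id = rightId AND rightTime BETWEEN timestamp AND timestamp + INTERVAL 1 HOUR"))

joined.writeStream
  .format("parquet")
  .option("path", "/tmp/join-out")                      // illustrative sink
  .option("checkpointLocation", "/tmp/join-checkpoint") // offsets for both sources live here
  .start()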

            Source https://stackoverflow.com/questions/66724579

            QUESTION

            How to use kafka.group.id and checkpoints in spark 3.0 structured streaming to continue to read from Kafka where it left off after restart?
            Asked 2020-Dec-22 at 07:59

Based on the introduction in Spark 3.0, https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html, it should be possible to set "kafka.group.id" to track the offsets. For our use case, I want to avoid potential data loss if the streaming Spark job fails and restarts. Based on my previous questions, I have a feeling that kafka.group.id in Spark 3.0 is something that will help.

            How to specify the group id of kafka consumer for spark structured streaming?

            How to ensure no data loss for kafka data ingestion through Spark Structured Streaming?

            However, I tried the settings in spark 3.0 as below.

            ...

            ANSWER

            Answered 2020-Sep-25 at 11:18

According to the Spark Structured Streaming + Kafka Integration Guide, Spark itself keeps track of the offsets and no offsets are committed back to Kafka. That means if your Spark Streaming job fails and you restart it, all necessary information on the offsets is stored in Spark's checkpointing files.

            Even if you set the ConsumerGroup name with kafka.group.id, your application will still not commit the messages back to Kafka. The information on the next offset to read is only available in the checkpointing files of your Spark application.

            If you stop and restart your application without a re-deployment and ensure that you do not delete old checkpoint files, your application will continue reading from where it left off.

            In the Spark Structured Streaming documentation on Recovering from Failures with Checkpointing it is written that:

            "In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) [...]"

            This can be achieved by setting the following option in your writeStream query (it is not sufficient to set the checkpoint directory in your SparkContext configurations):
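The snippet that followed is elided on this page; as a hedged sketch, the option being referred to is checkpointLocation on the writeStream (topic, broker, group id, and path below are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-offsets").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // assumption
  .option("subscribe", "some-topic")                    // assumption
  .option("kafka.group.id", "my-consumer-group")        // optional in Spark 3.0; not used for offset tracking
  .load()

// The option the answer refers to: checkpointLocation set on the writeStream query.
df.writeStream
  .format("console")
  .option("checkpointLocation", "/path/to/checkpoint/dir")
  .start()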

            Source https://stackoverflow.com/questions/64003405

            QUESTION

            Unable to start Spark application with Bahir
            Asked 2020-Dec-09 at 18:38

I am trying to run a Spark application in Scala to connect to ActiveMQ. I am using Bahir for this purpose: format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider"). When I use Bahir 2.2 in my build.sbt the application runs fine, but on changing it to Bahir 3.0 or Bahir 4.0 the application does not start and gives an error:

            ...

            ANSWER

            Answered 2020-Dec-09 at 18:38

Okay, so it seems to be some kind of compatibility issue between Spark 2.4 and Bahir 2.4. I fixed it by rolling back both of them to version 2.3.

            Here is my build.sbt
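The original build.sbt is elided on this page; below is a minimal sketch of a matching configuration, assuming Scala 2.11 and the 2.3.x line for both Spark and the Bahir MQTT module (the exact patch versions and project name are illustrative, not the author's original file):

// build.sbt (sketch, not the author's original file)
name := "activemq-streaming-app"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"                % "2.3.2",
  "org.apache.bahir" %% "spark-sql-streaming-mqtt" % "2.3.2"
)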

            Source https://stackoverflow.com/questions/65219283

            QUESTION

            spark streaming understanding timeout setup in mapGroupsWithState
            Asked 2020-Nov-30 at 10:06

I am trying very hard to understand the timeout setup when using mapGroupsWithState for Spark Structured Streaming.

The link below has a very detailed specification, but I am not sure I understood it properly, especially the GroupState.setTimeoutTimestamp() option, i.e., setting up the state expiry to be related to the event time. https://spark.apache.org/docs/3.0.0-preview/api/scala/org/apache/spark/sql/streaming/GroupState.html

            I copied them out here:

            ...

            ANSWER

            Answered 2020-Oct-11 at 18:46

What is this timestamp in the sentence "the timeout would occur when the watermark advances beyond the set timestamp"?

            This is the timestamp you set by GroupState.setTimeoutTimestamp().

            is it an absolute time or is it a relative time duration to the current event time in the state?

            This is a relative time (not duration) based on the current batch window.

Say I have some data state (column timestamp = 2020-08-02 22:02:00); when will it expire, by setting what value in which settings?

Let's assume your sink query has a defined processing trigger (set by trigger()) of 5 minutes. Also, let us assume that you have used a watermark before applying groupByKey and mapGroupsWithState. I understand you want to use timeouts based on event times (as opposed to processing times), so your query will be like:
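The query that followed is elided on this page; a minimal sketch along those lines, assuming an event-time timeout of 30 minutes, a 5-minute processing trigger, and illustrative input columns (key, timestamp) that are not taken from the question:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}

case class Event(key: String, timestamp: Timestamp)
case class KeyCount(key: String, count: Long)

val spark = SparkSession.builder().appName("event-time-timeout").getOrCreate()
import spark.implicits._

// Illustrative input; in practice this would come from Kafka or another source.
val events = spark.readStream
  .format("rate").option("rowsPerSecond", "10").load()
  .selectExpr("CAST(value % 10 AS STRING) AS key", "timestamp")
  .as[Event]

val counted = events
  .withWatermark("timestamp", "10 minutes")                // the watermark drives event-time timeouts
  .groupByKey(_.key)
  .mapGroupsWithState[Long, KeyCount](GroupStateTimeout.EventTimeTimeout) {
    (key: String, rows: Iterator[Event], state: GroupState[Long]) =>
      if (state.hasTimedOut) {
        val expired = KeyCount(key, state.get)
        state.remove()                                      // watermark passed the timestamp set below
        expired
      } else {
        val newCount = state.getOption.getOrElse(0L) + rows.size
        state.update(newCount)
        // Time this key out once the watermark advances 30 minutes past its current value.
        state.setTimeoutTimestamp(state.getCurrentWatermarkMs() + 30 * 60 * 1000L)
        KeyCount(key, newCount)
      }
  }

counted.writeStream
  .outputMode(OutputMode.Update())
  .format("console")
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .option("checkpointLocation", "/tmp/mgws-checkpoint")     // illustrative
  .start()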

            Source https://stackoverflow.com/questions/63917648

            QUESTION

            Writing Spark streaming PySpark dataframe to Cassandra overwrites table instead of appending
            Asked 2020-Oct-21 at 19:42

            I'm running a 1-node cluster of Kafka, Spark and Cassandra. All locally on the same machine.

            From a simple Python script I'm streaming some dummy data every 5 seconds into a Kafka topic. Then using Spark structured streaming, I'm reading this data stream (one row at a time) into a PySpark DataFrame with startingOffset = latest. Finally, I'm trying to append this row to an already existing Cassandra table.

            I've been following (How to write streaming Dataset to Cassandra?) and (Cassandra Sink for PySpark Structured Streaming from Kafka topic).

            One row of data is being successfully written into the Cassandra table but my problem is it's being overwritten every time rather than appended to the end of the table. What might I be doing wrong?

            Here's my code:

            CQL DDL for creating kafkaspark keyspace followed by randintstream table in Cassandra:

            ...

            ANSWER

            Answered 2020-Oct-21 at 14:08

If the row is always rewritten in Cassandra, then you may have an incorrect primary key in the table - you need to make sure that every row has a unique primary key. If you're creating the Cassandra table from Spark, then by default it just takes the first column as the partition key, and it alone may not be unique.

            Update after schema was provided:

Yes, that's the case I was referring to - you have a primary key of (partition, topic), but every row from a specific partition that you read from that topic will have the same value for the primary key, so it will overwrite previous versions. You need to make your primary key unique - for example, add the offset or timestamp columns to the primary key (although a timestamp may not be unique if you have data produced within the same millisecond).

            P.S. Also, in connector 3.0.0 you don't need foreachBatch:
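The snippet that followed is elided here; as a hedged sketch (shown in Scala rather than the asker's PySpark), the Spark Cassandra Connector 3.0 can write a streaming DataFrame directly, assuming df is the streaming DataFrame read from Kafka and using the keyspace and table names from the question (the checkpoint path is illustrative):

// Requires the spark-cassandra-connector 3.0.x package and
// spark.cassandra.connection.host set in the Spark configuration.
val query = df.writeStream
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "kafkaspark")
  .option("table", "randintstream")
  .option("checkpointLocation", "/tmp/cassandra-checkpoint")  // illustrative
  .outputMode("append")
  .start()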

            Source https://stackoverflow.com/questions/64463238

            QUESTION

            unable to read avro message via kafka-avro-console-consumer (end goal read it via spark streaming)
            Asked 2020-Sep-11 at 12:36

(End goal) Before trying out whether I could eventually read Avro data, using Spark Streaming, out of the Confluent Platform as some described here: Integrating Spark Structured Streaming with the Confluent Schema Registry

I'd like to verify whether I could use the command below to read them:

            ...

            ANSWER

            Answered 2020-Sep-10 at 20:11

            If you are getting Unknown Magic Byte with the consumer, then the producer didn't use the Confluent AvroSerializer, and might have pushed Avro data that doesn't use the Schema Registry.

            Without seeing the Producer code or consuming and inspecting the data in binary format, it is difficult to know which is the case.

            The message was produced using confluent connect file-pulse

            Did you use value.converter with the AvroConverter class?

            Source https://stackoverflow.com/questions/63828704

            QUESTION

            java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.DataSourceV2 for Spark 3.0.0
            Asked 2020-Sep-01 at 20:59
            Brief

What are possible paths that would let me process data with PySpark 3.0.0 from a pure pip installation (or at least load data) without downgrading the version of Spark?

When I attempted to load Parquet and CSV datasets, I got the exception shown under Exception Message below. The initialization of the Spark session is fine, yet when I want to load datasets, it goes wrong.

            Some Information
            • Java: openjdk 11
            • Python: 3.8.5
            • Mode: local mode
            • Operating System: Ubuntu 16.04.6 LTS
            • Notes:
              1. I executed python3.8 -m pip install pyspark to install Spark.
2. When I looked inside the jar spark-sql_2.12-3.0.0.jar (which is under the Python site-packages path, i.e., ~/.local/lib/python3.8/site-packages/pyspark/jars in my case), there is no v2 package under spark.sql.sources; the most similar thing I found is an interface called DataSourceRegister under the same package.
3. The most similar question I found on Stack Overflow is PySpark structured Streaming + Kafka Error (Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.StreamWriteSupport), where downgrading the Spark version is recommended throughout that page.
            Exception Message ...

            ANSWER

            Answered 2020-Jul-30 at 10:00

Currently, I have found a way out for manipulating data via the Python function APIs for Spark.

Workaround 1

            Source https://stackoverflow.com/questions/63149228

            QUESTION

            How to resolve current committed offsets differing from current available offsets?
            Asked 2020-Jul-31 at 15:58

            I am attempting to read avro data from Kafka using Spark Streaming but I receive the following error message:

            ...

            ANSWER

            Answered 2020-Jul-31 at 15:58

Figured it out - the problem was not, as I had thought, with the Spark-Kafka integration directly, but with the checkpoint information inside the HDFS filesystem. Deleting and recreating the checkpoint folder in HDFS solved it for me.

            Source https://stackoverflow.com/questions/63191950

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install spark-structured-streaming

            You can download it from GitHub.

            Support

For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask questions on the Stack Overflow community page.
            CLONE
          • HTTPS

            https://github.com/ansrivas/spark-structured-streaming.git

          • CLI

            gh repo clone ansrivas/spark-structured-streaming

          • sshUrl

            git@github.com:ansrivas/spark-structured-streaming.git


            Consider Popular Pub Sub Libraries

• EventBus by greenrobot
• kafka by apache
• celery by celery
• rocketmq by apache
• pulsar by apache

            Try Top Libraries by ansrivas

• angular2-flask by ansrivas (JavaScript)
• fiberprometheus by ansrivas (Go)
• keras-rest-server by ansrivas (Python)
• GNG by ansrivas (Python)
• pylogging by ansrivas (Python)