spark-bigquery-connector | BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery

by GoogleCloudDataproc | Java | Version: 0.31.1 | License: Apache-2.0

kandi X-RAY | spark-bigquery-connector Summary

spark-bigquery-connector is a Java library typically used in Big Data and Spark applications. It has no reported bugs or vulnerabilities, a build file is available, it carries a permissive license, and it has low support. You can download it from GitHub.

The connector supports reading Google BigQuery tables into Spark's DataFrames, and writing DataFrames back into BigQuery. This is done by using the Spark SQL Data Source API to communicate with BigQuery.
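For orientation, here is a minimal PySpark sketch of both directions, assuming the connector jar is already on the classpath (for example via --jars or spark.jars.packages); the project, dataset, table, and bucket names are placeholders:

```python
from pyspark.sql import SparkSession

# Assumes the spark-bigquery-connector is already on the classpath;
# all names below are placeholders.
spark = SparkSession.builder.appName("bq-example").getOrCreate()

# Read a BigQuery table into a DataFrame through the Data Source API.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")
    .load()
)

# Write a DataFrame back to BigQuery; the indirect write path stages the
# data in the GCS bucket given by temporaryGcsBucket.
(
    df.write.format("bigquery")
    .option("table", "my-project.my_dataset.my_output_table")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save()
)
```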

Support

spark-bigquery-connector has a low-activity ecosystem.
It has 298 stars, 177 forks, and 42 watchers.
It has had no major release in the last 12 months.
There are 47 open issues and 356 closed issues; on average, issues are closed in 408 days. There are 7 open pull requests and 0 closed pull requests.
It has a neutral sentiment in the developer community.
The latest version of spark-bigquery-connector is 0.31.1.

Quality

              spark-bigquery-connector has 0 bugs and 0 code smells.

Security

              spark-bigquery-connector has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              spark-bigquery-connector code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

License

              spark-bigquery-connector is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

              spark-bigquery-connector releases are available to install and integrate.
              Build file is available. You can build the component from source.
              Installation instructions are not available. Examples and code snippets are available.
              spark-bigquery-connector saves you 3850 person hours of effort in developing the same functionality from scratch.
              It has 22790 lines of code, 1339 functions and 228 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

kandi has reviewed spark-bigquery-connector and discovered the following top functions. This is intended to give you an instant insight into the functionality spark-bigquery-connector implements and to help you decide whether it suits your requirements.
            • Creates a read session
            • Gets the actual table
            • Checks if the input table is a view
            • Creates a table with a materialization project
• Takes a list of URIs and returns a list of URIs optimized for performance
• Given a list of URIs and a pattern, returns a Multimap of the URIs
            • Load data into a table
            • Create and wait for the given job
            • Entry point for the Maven wrapper
            • Downloads a file from the given URL string
            • Tries to consume readers
            • Returns next read result
            • Converts to a BigQueryClient
            • Adds a single row to the buffer
            • Converts SparkRows to protoRows
            • Provide the implementation of the BigQuery client
            • Compares two BigQueryFactory instances
            • Closes the worker thread
            • Read bytes from the stream
            • Stops the reader
            • Verifies that the object is correct
            • Read more items from the iterator
            • Returns the standard data type for the given field
            • Advances to the next batch
            • Gets the read table
            • Commits the output stream

            spark-bigquery-connector Key Features

            No Key Features are available at this moment for spark-bigquery-connector.

            spark-bigquery-connector Examples and Code Snippets

            No Code Snippets are available at this moment for spark-bigquery-connector.

            Community Discussions

            QUESTION

            Bigquery as metastore for Dataproc
            Asked 2022-Apr-01 at 04:00

We are trying to migrate a PySpark script, which creates and drops tables in Hive as part of its data transformations, from on-premise to the GCP platform.

Hive is replaced by BigQuery. In this case, the Hive reads and writes are converted to BigQuery reads and writes using the spark-bigquery-connector.

However, the problem lies with creating and dropping BigQuery tables via Spark SQL, as Spark SQL will by default run the create and drop queries against Hive (backed by the Hive metastore), not against BigQuery.

I wanted to check whether there is a plan to incorporate DDL statement support into the spark-bigquery-connector as well.

Also, from an architecture perspective, is it possible to base the Spark SQL metastore on BigQuery, so that any create or drop statement can be run against BigQuery from Spark?

            ...

            ANSWER

            Answered 2022-Apr-01 at 04:00

I don't think Spark SQL will support BigQuery as a metastore, nor will the BQ connector support BQ DDL. On Dataproc, Dataproc Metastore (DPMS) is the recommended solution for the Hive and Spark SQL metastore.

In particular, for on-prem to Dataproc migration, it is more straightforward to migrate to DPMS; see this doc.
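While DDL cannot be routed to BigQuery, plain reads and writes can still go through the connector. Below is a hedged sketch (placeholder names, and assuming the connector is allowed to create or overwrite the target table on write) of expressing a Hive-style CREATE TABLE ... AS SELECT as a DataFrame write instead of a Spark SQL DDL statement:

```python
from pyspark.sql import SparkSession

# Placeholder names throughout.
spark = SparkSession.builder.appName("bq-ctas-workaround").getOrCreate()

source = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.source_table")
    .load()
)

derived = source.where("event_date >= '2022-01-01'").select("id", "event_date")

# The write creates/overwrites the target table, so no CREATE or DROP has to
# go through the Hive metastore.
(
    derived.write.format("bigquery")
    .option("table", "my-project.my_dataset.derived_table")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save()
)
```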

            Source https://stackoverflow.com/questions/71676161

            QUESTION

            Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated while reading from bigquery in Jupyter lab
            Asked 2022-Feb-19 at 21:47

I have followed this post: pyspark error reading bigquery: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class

and followed the resolution provided, but I am still getting the same error. Please help.

I am trying to run this using JupyterLab created on a Dataproc cluster in GCP.

I am using the Python 3 kernel (not PySpark) so that I can configure the SparkSession in the notebook and include the spark-bigquery-connector required to use the BigQuery Storage API.

            ...

            ANSWER

            Answered 2021-Dec-16 at 17:59

            Please switch to gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar. The number after the _ is the Scala binary version.
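For context, here is a minimal sketch of wiring that jar into a SparkSession from the Python 3 kernel; the _2.11 suffix is an assumption that must match the Scala version of the cluster's Spark build, and the sample table is BigQuery's public Shakespeare dataset:

```python
from pyspark.sql import SparkSession

# The _2.11 / _2.12 suffix must match the Scala version Spark was built with
# on the Dataproc image.
spark = (
    SparkSession.builder
    .appName("bq-notebook")
    .config("spark.jars", "gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar")
    .getOrCreate()
)

df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.shakespeare")
    .load()
)
df.show(5)
```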

            Source https://stackoverflow.com/questions/70379400

            QUESTION

            Dataproc Cluster creation is failing with PIP error "Could not build wheels"
            Asked 2022-Jan-24 at 13:04

We used to spin up a cluster with the configuration below. It ran fine until last week, but it is now failing with the error: ERROR: Failed cleaning build dir for libcst Failed to build libcst ERROR: Could not build wheels for libcst which use PEP 517 and cannot be installed directly

            ...

            ANSWER

            Answered 2022-Jan-19 at 21:50

It seems you need to upgrade pip; see this question.

But there can be multiple pips in a Dataproc cluster, so you need to choose the right one.

1. For init actions, at cluster creation time, /opt/conda/default is a symbolic link to either /opt/conda/miniconda3 or /opt/conda/anaconda, depending on which Conda env you choose; the default is Miniconda3, but in your case it is Anaconda. So you can run either /opt/conda/default/bin/pip install --upgrade pip or /opt/conda/anaconda/bin/pip install --upgrade pip.

            2. For custom images, at image creation time, you want to use the explicit full path, /opt/conda/anaconda/bin/pip install --upgrade pip for Anaconda, or /opt/conda/miniconda3/bin/pip install --upgrade pip for Miniconda3.

            So, you can simply use /opt/conda/anaconda/bin/pip install --upgrade pip for both init actions and custom images.

            Source https://stackoverflow.com/questions/70743642

            QUESTION

            Facing Issue with DataprocCreateClusterOperator (Airflow 2.0)
            Asked 2022-Jan-04 at 22:26

I'm trying to migrate from Airflow 1.10 to Airflow 2, which renames some operators, including DataprocClusterCreateOperator. Here is an extract of the code.

            ...

            ANSWER

            Answered 2022-Jan-04 at 22:26

It seems that in this version the type of the metadata parameter is no longer a dict. From the docs:

            metadata (Sequence[Tuple[str, str]]) -- Additional metadata that is provided to the method.

            Try with:
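The original snippet is not shown here; the following is only a hedged illustration of the type change the answer describes, with placeholder project, region, and cluster settings:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
)

# Placeholder cluster definition; real DAGs would add machine types,
# init actions, and so on.
CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
}

with DAG(
    dag_id="dataproc_create_cluster_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        project_id="my-project",        # placeholder
        region="us-central1",           # placeholder
        cluster_name="my-cluster",      # placeholder
        cluster_config=CLUSTER_CONFIG,
        # The fix: metadata is a sequence of (key, value) tuples, not a dict.
        metadata=[("key", "value")],
    )
```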

            Source https://stackoverflow.com/questions/70423687

            QUESTION

            Facing Issue in passing metadata field with DataprocCreateClusterOperator (Airflow 2.0)
            Asked 2021-Dec-22 at 20:29

I am facing some issues while installing packages in the Dataproc cluster using DataprocCreateClusterOperator. I am trying to upgrade to Airflow 2.0.

            Error Message:

            ...

            ANSWER

            Answered 2021-Dec-22 at 20:29

The following DAG works as expected after changing:

• the cluster name (cluster_name -> cluster-name).
• the path for the scripts.
• the DAG definition.

            Source https://stackoverflow.com/questions/70434588

            QUESTION

            How to include DeltaLake Files from GCS to BigQuery
            Asked 2021-Dec-16 at 10:28

Is there a library/connector available to import Delta Lake files stored in Google Cloud Storage (GCS) directly into BigQuery?

I have managed to write BigQuery tables using a Spark DataFrame as an intermediary, but I can't find any direct connector or BigQuery library that does this without transitioning through Spark DataFrames.

Update 1: I tried using the official spark-bigquery-connector, but the documentation is lacking on how to point to a specific project in BigQuery, so I couldn't go further than loading the Delta Lake files from GCS into a DataFrame.

Update 2: Using Javier's comment, I managed to write to BQ, but this solution isn't optimized, and however much I optimize the Spark job, it won't be as direct as using a Google BigQuery library that does it under the hood.

Update 3 and temporary solution: Not finding any direct solution, I ended up using the spark-bigquery-connector to ingest the Delta files as follows:

            ...

            ANSWER

            Answered 2021-Nov-15 at 22:38

There is no way to ingest a Delta Lake file in GCS into BigQuery without going through some intermediary.

You could set up a GCE VM that downloads the Delta Lake file from GCS, reads it using the Delta Standalone connector, and then writes to BigQuery (either via the streaming API or by writing to a supported format like Parquet and importing).

However, this is essentially doing manually the same thing that Spark would be doing.
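For comparison, here is a hedged sketch of that Spark-based route (the one Update 3 in the question settles on), assuming both the Delta Lake package and this connector are on the classpath; bucket, path, dataset, and table names are placeholders:

```python
from pyspark.sql import SparkSession

# Assumes delta-core and the spark-bigquery-connector are on the classpath;
# all paths and names are placeholders.
spark = SparkSession.builder.appName("delta-to-bq").getOrCreate()

# Read the Delta Lake table from GCS into a DataFrame ...
delta_df = spark.read.format("delta").load("gs://my-bucket/path/to/delta-table")

# ... and hand it to BigQuery through the connector.
(
    delta_df.write.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save()
)
```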

            Source https://stackoverflow.com/questions/69738942

            QUESTION

            How to add bigquery-connector to an existing cluster on dataproc
            Asked 2021-Dec-14 at 12:14

I've just started to use Dataproc for doing machine learning on big data in BigQuery. When I try to run this code:

            ...

            ANSWER

            Answered 2021-Dec-14 at 12:14

While creating a cluster, I opened the GCP console and typed this script:

            Source https://stackoverflow.com/questions/70107737

            QUESTION

            creating dataproc cluster with multiple jars
            Asked 2021-Nov-27 at 22:40

I am trying to create a Dataproc cluster that will connect Dataproc to Pub/Sub. I need to add multiple jars at cluster creation time in the spark.jars flag.

            ...

            ANSWER

            Answered 2021-Nov-27 at 22:40

            The answer you linked is the correct way to do it: How can I include additional jars when starting a Google DataProc cluster to use with Jupyter notebooks?

If you also post the command you tried with the escaping syntax, along with the resulting error message, then others could more easily verify what you did wrong. It looks like you're specifying an additional Spark property (spark:spark.driver.memory=3000m) in addition to your list of jars, and tried to just space-separate it from your jars flag, which isn't allowed.

Per the linked result, you'd need to use the newly assigned separator character to separate the second Spark property:

            Source https://stackoverflow.com/questions/70139181

            QUESTION

            Pyspark is unable to find bigquery datasource
            Asked 2021-Nov-12 at 18:59

This is my PySpark configuration. I've followed the steps mentioned here and didn't create a SparkContext.

            ...

            ANSWER

            Answered 2021-Sep-20 at 16:43

My problem was with faulty jar versions. I am using Spark 3.1.2 and Hadoop 3.2; these were the Maven jars, with code, that worked for me.
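The snippet itself is not reproduced above; as a hedged sketch, a Spark 3.x / Scala 2.12 setup that pulls the connector via spark.jars.packages would look roughly like this (the version shown matches the 0.31.1 release noted at the top of this page and should be adjusted to your environment):

```python
from pyspark.sql import SparkSession

# For a Spark 3.x build against Scala 2.12, the _2.12 artifact is the one
# to pull in; adjust the version as needed.
spark = (
    SparkSession.builder
    .appName("bq-datasource")
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.31.1",
    )
    .getOrCreate()
)

df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.shakespeare")
    .load()
)
df.printSchema()
```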

            Source https://stackoverflow.com/questions/69253892

            QUESTION

            How does google Spark-BigQuery-Connector leverage BigQuery Storage API?
            Asked 2020-Mar-15 at 17:17

            According to https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery the connector uses BigQuery Storage API to read data using gRPC. However, I couldn't find any Storage API/gRPC usage in the source code here: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/tree/master/connector/src/main/scala

My questions are: 1. Could anyone show me the source code where the Storage API and gRPC calls are used? 2. Does Dataset<Row> df = session.read().format("bigquery").load() work through the GBQ Storage API? If not, how do I read from GBQ into Spark using the BigQuery Storage API?

            ...

            ANSWER

            Answered 2020-Mar-15 at 17:17
1. The Spark BigQuery connector uses only the BigQuery Storage API for reads; you can see it here, for example.

2. Yes, Dataset<Row> df = session.read().format("bigquery").load() works through the BigQuery Storage API.
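The connector also documents column pruning and predicate pushdown over the Storage API; here is a small sketch against a public sample table (exact pushdown behaviour depends on the connector version, and the connector is assumed to be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-pushdown").getOrCreate()

# The column selection and, where possible, the filter are pushed down to
# the Storage API read session, so less data is streamed to Spark.
df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.shakespeare")
    .load()
    .select("word", "word_count")
    .where("word_count > 100")
)
df.show(10)
```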

            Source https://stackoverflow.com/questions/60689102

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install spark-bigquery-connector

            You can download it from GitHub.
You can use spark-bigquery-connector like any standard Java library. Include the jar files in your classpath. You can use any IDE, and you can run and debug the spark-bigquery-connector component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org; for Gradle installation, please refer to gradle.org.

            Support

For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.
            CLONE
          • HTTPS

            https://github.com/GoogleCloudDataproc/spark-bigquery-connector.git

          • CLI

            gh repo clone GoogleCloudDataproc/spark-bigquery-connector

• SSH

            git@github.com:GoogleCloudDataproc/spark-bigquery-connector.git
