spark-bigquery-connector | BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables
kandi X-RAY | spark-bigquery-connector Summary
The connector supports reading Google BigQuery tables into Spark's DataFrames, and writing DataFrames back into BigQuery. This is done by using the Spark SQL Data Source API to communicate with BigQuery.
Top functions reviewed by kandi - BETA
- Creates a read session
- Gets the actual table
- Checks if the input table is a view
- Creates a table with a materialization project
- Takes a list of URIs and returns a list of URIs optimized for performance
- Given a list of URIs and a pattern, returns a Multimap of the URIs
- Load data into a table
- Create and wait for the given job
- Entry point for the Maven wrapper
- Downloads a file from the given URL string
- Tries to consume readers
- Returns next read result
- Converts to a BigQueryClient
- Adds a single row to the buffer
- Converts SparkRows to protoRows
- Provide the implementation of the BigQuery client
- Compares two BigQueryFactory instances
- Closes the worker thread
- Read bytes from the stream
- Stops the reader
- Verifies that the object is correct
- Read more items from the iterator
- Returns the standard data type for the given field
- Advances to the next batch
- Gets the read table
- Commits the output stream
spark-bigquery-connector Key Features
spark-bigquery-connector Examples and Code Snippets
Community Discussions
Trending Discussions on spark-bigquery-connector
QUESTION
We are trying to migrate a PySpark script, which creates and drops tables in Hive as part of its data transformations, from on-premise to the GCP platform.
Hive is replaced by BigQuery. In this case, the Hive reads and writes are converted to BigQuery reads and writes using the spark-bigquery-connector.
However, the problem lies with creating and dropping BigQuery tables via Spark SQL, as Spark SQL will by default run the create and drop queries against Hive, backed by the Hive metastore, not against BigQuery.
I wanted to check if there is a plan to incorporate DDL statement support as part of the spark-bigquery-connector.
Also, from an architecture perspective, is it possible to base the metastore for Spark SQL on BigQuery, so that any create or drop statement can be run on BigQuery from Spark?
...ANSWER
Answered 2022-Apr-01 at 04:00 I don't think Spark SQL will support BigQuery as a metastore, nor will the BQ connector support BQ DDL. On Dataproc, Dataproc Metastore (DPMS) is the recommended solution for the Hive and Spark SQL metastore.
In particular, for on-prem to Dataproc migration, it is more straightforward to migrate to DPMS; see this doc.
QUESTION
I have followed this post, pyspark error reading bigquery: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class,
and applied the resolution provided, but I am still getting the same error. Please help.
I am trying to run this using JupyterLab created on a Dataproc cluster in GCP.
I am using the Python 3 kernel (not PySpark) so that I can configure the SparkSession in the notebook and include the spark-bigquery-connector required to use the BigQuery Storage API.
...ANSWER
Answered 2021-Dec-16 at 17:59 Please switch to gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar. The number after the _ is the Scala binary version.
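For illustration, a minimal sketch of such a notebook configuration (the application name and sample table are placeholders; the jar suffix must match your cluster's Scala binary version, _2.11 here as suggested above):

```python
from pyspark.sql import SparkSession

# Minimal sketch for a Dataproc notebook using the Python 3 kernel.
# The jar suffix (_2.11) must match the Scala binary version of the Spark build.
spark = (
    SparkSession.builder
    .appName("bq-notebook")  # placeholder name
    .config("spark.jars", "gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar")
    .getOrCreate()
)

# Placeholder read against a public sample table to verify the connector loads.
df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.shakespeare")
    .load()
)
df.show(5)
```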
QUESTION
We used to spin up a cluster with the configuration below. It ran fine until last week but is now failing with the error: ERROR: Failed cleaning build dir for libcst Failed to build libcst ERROR: Could not build wheels for libcst which use PEP 517 and cannot be installed directly
ANSWER
Answered 2022-Jan-19 at 21:50 Seems you need to upgrade pip, see this question. But there can be multiple pips in a Dataproc cluster, you need to choose the right one.
For init actions, at cluster creation time, /opt/conda/default is a symbolic link to either /opt/conda/miniconda3 or /opt/conda/anaconda, depending on which Conda env you choose; the default is Miniconda3, but in your case it is Anaconda. So you can run either /opt/conda/default/bin/pip install --upgrade pip or /opt/conda/anaconda/bin/pip install --upgrade pip.
For custom images, at image creation time, you want to use the explicit full path: /opt/conda/anaconda/bin/pip install --upgrade pip for Anaconda, or /opt/conda/miniconda3/bin/pip install --upgrade pip for Miniconda3.
So, you can simply use /opt/conda/anaconda/bin/pip install --upgrade pip for both init actions and custom images.
QUESTION
I'm trying to migrate from Airflow 1.10 to Airflow 2, which renames some operators, including DataprocClusterCreateOperator. Here is an extract of the code.
ANSWER
Answered 2022-Jan-04 at 22:26 It seems that in this version the type of the metadata parameter is no longer dict. From the docs:
metadata (Sequence[Tuple[str, str]]) -- Additional metadata that is provided to the method.
Try with:
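A hedged sketch of the suggested change (the keys and values are placeholders, not the asker's actual metadata):

```python
# In the Airflow 2 operator, `metadata` is a Sequence[Tuple[str, str]] rather
# than a dict, so a mapping like {"key1": "value1", "key2": "value2"}
# (placeholder keys/values) is rewritten as a list of (key, value) tuples.
metadata = [
    ("key1", "value1"),
    ("key2", "value2"),
]
```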
QUESTION
I am facing some issues while installing packages in the Dataproc cluster using DataprocCreateClusterOperator.
I am trying to upgrade to Airflow 2.0.
Error Message:
...ANSWER
Answered 2021-Dec-22 at 20:29 The following DAG is working as expected; the changes were:
- the cluster name (cluster_name -> cluster-name)
- the path for the scripts
- the DAG definition
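A minimal sketch of such a DAG, assuming the pip-install init action is used to install the packages; the project, region, init action path, package list, and cluster name are placeholders rather than the asker's actual values:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

# All names, paths, and package lists below are placeholders.
CLUSTER_CONFIG = ClusterGenerator(
    project_id="my-project",
    zone="us-central1-a",
    master_machine_type="n1-standard-4",
    worker_machine_type="n1-standard-4",
    num_workers=2,
    # Assumed regional copy of the pip-install init action.
    init_actions_uris=[
        "gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh"
    ],
    metadata={"PIP_PACKAGES": "pandas pyarrow"},
).make()

with DAG(
    dag_id="dataproc_create_cluster_example",
    start_date=datetime(2021, 12, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id="my-project",
        region="us-central1",
        cluster_name="pyspark-cluster",  # hyphens, not underscores
        cluster_config=CLUSTER_CONFIG,
    )
```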
QUESTION
Is there a library/connector available to import Delta Lake files stored in Google Cloud Storage (GCS) directly into BigQuery?
I have managed to write BigQuery tables using a Spark DataFrame as an intermediary, but I can't find any direct connector or BigQuery library that does this without transitioning through Spark DataFrames.
Update 1: I tried using the official spark-bigquery-connector, but the documentation is lacking on how to point to a specific project in BigQuery, so I couldn't get further than loading the Delta Lake files from GCS into a DataFrame.
Update 2: Using Javier's comment, I managed to write to BQ, but this solution isn't optimized, and however much I optimize the Spark job, it won't be as direct as using a Google BigQuery library that does it under the hood.
Update 3 and Temporary Solution: Not finding any direct solution, I ended up using the spark-bigquery-connector to ingest the Delta files as follows:
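The original snippet is not reproduced here; a minimal sketch of that approach might look like the following, where the Delta path, BigQuery table, buckets, project, and artifact versions are all placeholders:

```python
from pyspark.sql import SparkSession

# Sketch only: read the Delta table from GCS into a DataFrame, then write it
# to BigQuery through the spark-bigquery-connector. All names, paths, and
# artifact versions are placeholders.
spark = (
    SparkSession.builder
    .appName("delta-to-bigquery")
    .config(
        "spark.jars.packages",
        "io.delta:delta-core_2.12:1.0.0,"
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2",
    )
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

df = spark.read.format("delta").load("gs://my-bucket/path/to/delta-table")

(
    df.write.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")
    .option("parentProject", "my-billing-project")   # targets a specific project
    .option("temporaryGcsBucket", "my-temp-bucket")  # staging bucket for the indirect write
    .mode("overwrite")
    .save()
)
```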
...ANSWER
Answered 2021-Nov-15 at 22:38 There is no way to ingest a Delta Lake file in GCS into BigQuery without going through some intermediary.
You could set up a GCE VM that downloads the Delta Lake file from GCS, reads it using the Delta Standalone connector, and then writes to BigQuery (either via the streaming API or by writing to a supported format like Parquet and importing).
However, this is essentially doing manually the same thing that Spark would be doing.
QUESTION
I've just started to use Dataproc for doing machine learning on big data in BigQuery. When I try to run this code:
...ANSWER
Answered 2021-Dec-14 at 12:14 While creating a cluster, I opened the GCP console and typed this script:
QUESTION
I am trying to create a Dataproc cluster that will connect Dataproc to Pub/Sub. I need to add multiple jars on cluster creation in the spark.jars flag.
...ANSWER
Answered 2021-Nov-27 at 22:40 The answer you linked is the correct way to do it: How can I include additional jars when starting a Google DataProc cluster to use with Jupyter notebooks?
If you also post the command you tried with the escaping syntax and the resulting error message, then others could more easily verify what you did wrong. It looks like you're specifying an additional Spark property, spark:spark.driver.memory=3000m, in addition to your list of jars, and tried to just space-separate it from your jars flag, which isn't allowed.
Per the linked result, you'd need to use the newly assigned separator character to separate the second spark property:
QUESTION
This is my PySpark configuration. I've followed the steps mentioned here and didn't create a SparkContext.
...ANSWER
Answered 2021-Sep-20 at 16:43 My problem was with faulty jar versions. I am using Spark 3.1.2 and Hadoop 3.2; these are the Maven jars, with code, that worked for me.
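The answer's actual coordinates are not shown above; a hedged sketch of such a configuration for Spark 3.1.2 (Scala 2.12) and Hadoop 3.2 might look like this, with the artifact versions as assumptions to be matched to your build:

```python
from pyspark.sql import SparkSession

# Sketch only: the artifact versions are assumptions and should be matched to
# the Spark (3.1.2, Scala 2.12) and Hadoop (3.2) build in use.
spark = (
    SparkSession.builder
    .appName("pyspark-bigquery")
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.22.2,"
        "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.2",
    )
    .config(
        "spark.hadoop.fs.gs.impl",
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    )
    .getOrCreate()
)
```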
QUESTION
According to https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery the connector uses BigQuery Storage API to read data using gRPC. However, I couldn't find any Storage API/gRPC usage in the source code here: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/tree/master/connector/src/main/scala
My questions are:
1. Could anyone show me where in the source code the Storage API and gRPC calls are used?
2. Does Dataset df = session.read().format("bigquery").load() work through the GBQ Storage API? If not, how can I read from GBQ into Spark using the BigQuery Storage API?
ANSWER
Answered 2020-Mar-15 at 17:17 The Spark BigQuery connector uses only the BigQuery Storage API for reads; you can see it here, for example.
Yes, Dataset df = session.read().format("bigquery").load() works through the BigQuery Storage API.
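For illustration, a minimal PySpark equivalent of that read (the public sample table is only an example, and the connector jar is assumed to already be on the classpath); the connector creates a Storage API read session and streams the rows back over gRPC rather than running a query job:

```python
from pyspark.sql import SparkSession

# Assumes the spark-bigquery-connector jar is already on the classpath
# (e.g. via spark.jars or spark.jars.packages).
spark = SparkSession.builder.appName("bq-storage-read").getOrCreate()

# Reading through the connector goes over the BigQuery Storage API:
# a read session is created and row batches are streamed back via gRPC.
df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.shakespeare")  # example table
    .load()
)
print(df.count())
```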
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install spark-bigquery-connector
You can use spark-bigquery-connector like any standard Java library. Please include the jar files in your classpath. You can also use any IDE, and you can run and debug the spark-bigquery-connector component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.