spark-bigquery-connector | BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables
kandi X-RAY | spark-bigquery-connector Summary
The connector supports reading Google BigQuery tables into Spark's DataFrames, and writing DataFrames back into BigQuery. This is done by using the Spark SQL Data Source API to communicate with BigQuery.
Top functions reviewed by kandi - BETA
- Creates a read session
- Gets the actual table
- Checks if the input table is a view
- Creates a table with a materialization project
- Takes a list of URIs and returns a list of URIs optimized for performance
- Given a list of URIs and a pattern, returns a Multimap of the URIs
- Load data into a table
- Create and wait for the given job
- Entry point for the Maven wrapper
- Downloads a file from the given URL string
- Tries to consume readers
- Returns next read result
- Converts to a BigQueryClient
- Adds a single row to the buffer
- Converts SparkRows to protoRows
- Provide the implementation of the BigQuery client
- Compares two BigQueryFactory instances
- Closes the worker thread
- Read bytes from the stream
- Stops the reader
- Verifies that the object is correct
- Read more items from the iterator
- Returns the standard data type for the given field
- Advances to the next batch
- Gets the read table
- Commits the output stream
spark-bigquery-connector Key Features
spark-bigquery-connector Examples and Code Snippets
Community Discussions
Trending Discussions on spark-bigquery-connector
QUESTION
We are trying to migrate a PySpark script, which creates and drops tables in Hive as part of its data transformations, from on-premise to the GCP platform.
Hive is replaced by BigQuery. In this case, the Hive reads and writes are converted to BigQuery reads and writes using the spark-bigquery-connector.
However, the problem lies with creating and dropping BigQuery tables via Spark SQL, as Spark SQL will by default run the create and drop queries against Hive, backed by the Hive metastore, not against BigQuery.
I wanted to check if there is a plan to incorporate DDL statement support as part of the spark-bigquery-connector.
Also, from an architecture perspective, is it possible to base the metastore for Spark SQL on BigQuery, so that any create or drop statement can be run on BigQuery from Spark?
...ANSWER
Answered 2022-Apr-01 at 04:00 I don't think Spark SQL will support BigQuery as a metastore, nor will the BQ connector support BQ DDL. On Dataproc, Dataproc Metastore (DPMS) is the recommended solution for the Hive and Spark SQL metastore.
In particular, for on-prem to Dataproc migration, it is more straightforward to migrate to DPMS; see this doc.
QUESTION
I have followed this post, pyspark error reading bigquery: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class,
and applied the resolution provided, but I am still getting the same error. Please help.
I am trying to run this using JupyterLab created on a Dataproc cluster in GCP.
I am using the Python 3 kernel (not PySpark) so that I can configure the SparkSession in the notebook and include the spark-bigquery-connector required to use the BigQuery Storage API.
...ANSWER
Answered 2021-Dec-16 at 17:59 Please switch to gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar. The number after the _ is the Scala binary version.
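For illustration, a minimal sketch of such a notebook configuration (the application name and sample table are placeholders; the jar suffix must match your cluster's Scala binary version, _2.11 here as suggested above):

```python
from pyspark.sql import SparkSession

# Minimal sketch for a Dataproc notebook using the Python 3 kernel.
# The jar suffix (_2.11) must match the Scala binary version of the Spark build.
spark = (
    SparkSession.builder
    .appName("bq-notebook")  # placeholder name
    .config("spark.jars", "gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar")
    .getOrCreate()
)

# Placeholder read against a public sample table to verify the connector loads.
df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.shakespeare")
    .load()
)
df.show(5)
```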
QUESTION
We used to spin up a cluster with the configuration below. It ran fine until last week but is now failing with the error: ERROR: Failed cleaning build dir for libcst Failed to build libcst ERROR: Could not build wheels for libcst which use PEP 517 and cannot be installed directly
ANSWER
Answered 2022-Jan-19 at 21:50 Seems you need to upgrade pip, see this question. But there can be multiple pips in a Dataproc cluster, you need to choose the right one.
For init actions, at cluster creation time, /opt/conda/default is a symbolic link to either /opt/conda/miniconda3 or /opt/conda/anaconda, depending on which Conda env you choose; the default is Miniconda3, but in your case it is Anaconda. So you can run either /opt/conda/default/bin/pip install --upgrade pip or /opt/conda/anaconda/bin/pip install --upgrade pip.
For custom images, at image creation time, you want to use the explicit full path: /opt/conda/anaconda/bin/pip install --upgrade pip for Anaconda, or /opt/conda/miniconda3/bin/pip install --upgrade pip for Miniconda3.
So, you can simply use /opt/conda/anaconda/bin/pip install --upgrade pip for both init actions and custom images.
QUESTION
I'm trying to migrate from Airflow 1.10 to Airflow 2, which renames some operators, including DataprocClusterCreateOperator. Here is an extract of the code.
ANSWER
Answered 2022-Jan-04 at 22:26 It seems that in this version the type of the metadata parameter is no longer dict. From the docs:
metadata (Sequence[Tuple[str, str]]) -- Additional metadata that is provided to the method.
Try with:
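A hedged sketch of the suggested change (the keys and values are placeholders, not the asker's actual metadata):

```python
# In the Airflow 2 operator, `metadata` is a Sequence[Tuple[str, str]] rather
# than a dict, so a mapping like {"key1": "value1", "key2": "value2"}
# (placeholder keys/values) is rewritten as a list of (key, value) tuples.
metadata = [
    ("key1", "value1"),
    ("key2", "value2"),
]
```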
QUESTION
I am facing some issues while installing packages in the Dataproc cluster using DataprocCreateClusterOperator.
I am trying to upgrade to Airflow 2.0.
Error Message:
...ANSWER
Answered 2021-Dec-22 at 20:29 The following DAG is working as expected; the changes were:
- the cluster name (cluster_name -> cluster-name)
- the path for the scripts
- the DAG definition
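A minimal sketch of such a DAG, assuming the pip-install init action is used to install the packages; the project, region, init action path, package list, and cluster name are placeholders rather than the asker's actual values:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

# All names, paths, and package lists below are placeholders.
CLUSTER_CONFIG = ClusterGenerator(
    project_id="my-project",
    zone="us-central1-a",
    master_machine_type="n1-standard-4",
    worker_machine_type="n1-standard-4",
    num_workers=2,
    # Assumed regional copy of the pip-install init action.
    init_actions_uris=[
        "gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh"
    ],
    metadata={"PIP_PACKAGES": "pandas pyarrow"},
).make()

with DAG(
    dag_id="dataproc_create_cluster_example",
    start_date=datetime(2021, 12, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id="my-project",
        region="us-central1",
        cluster_name="pyspark-cluster",  # hyphens, not underscores
        cluster_config=CLUSTER_CONFIG,
    )
```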
QUESTION
Is there a library/connector available to import Delta Lake files stored in Google Cloud Storage (GCS) directly into BigQuery?
I have managed to write BigQuery tables using a Spark DataFrame as an intermediary, but I can't find any direct connector or BigQuery library that does this without transitioning through Spark DataFrames.
Update 1: I tried using the official spark-bigquery-connector, but the documentation is lacking on how to point to a specific project in BigQuery, so I couldn't get further than loading the Delta Lake files from GCS into a DataFrame.
Update 2: Using Javier's comment, I managed to write to BQ, but this solution isn't optimized, and however much I optimize the Spark job, it won't be as direct as using a Google BigQuery library that does it under the hood.
Update 3 and Temporary Solution: Not finding any direct solution, I ended up using the spark-bigquery-connector to ingest the Delta files as follows:
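The original snippet is not reproduced here; a minimal sketch of that approach might look like the following, where the Delta path, BigQuery table, buckets, project, and artifact versions are all placeholders:

```python
from pyspark.sql import SparkSession

# Sketch only: read the Delta table from GCS into a DataFrame, then write it
# to BigQuery through the spark-bigquery-connector. All names, paths, and
# artifact versions are placeholders.
spark = (
    SparkSession.builder
    .appName("delta-to-bigquery")
    .config(
        "spark.jars.packages",
        "io.delta:delta-core_2.12:1.0.0,"
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2",
    )
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

df = spark.read.format("delta").load("gs://my-bucket/path/to/delta-table")

(
    df.write.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")
    .option("parentProject", "my-billing-project")   # targets a specific project
    .option("temporaryGcsBucket", "my-temp-bucket")  # staging bucket for the indirect write
    .mode("overwrite")
    .save()
)
```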
...ANSWER
Answered 2021-Nov-15 at 22:38 There is no way to ingest a Delta Lake file in GCS into BigQuery without going through some intermediary.
You could set up a GCE VM that downloads the Delta Lake file from GCS, reads it using the Delta Standalone connector, and then writes to BigQuery (either via the streaming API or by writing to a supported format like Parquet and importing).
However, this is essentially doing manually the same thing that Spark would be doing.
QUESTION
I've just started to use Dataproc for doing machine learning on big data in BigQuery. When I try to run this code:
...ANSWER
Answered 2021-Dec-14 at 12:14 While creating a cluster, I opened the GCP console and typed this script:
QUESTION
I am trying to create a Dataproc cluster that will connect Dataproc to Pub/Sub. I need to add multiple jars on cluster creation in the spark.jars flag.
...ANSWER
Answered 2021-Nov-27 at 22:40 The answer you linked is the correct way to do it: How can I include additional jars when starting a Google DataProc cluster to use with Jupyter notebooks?
If you also post the command you tried with the escaping syntax and the resulting error message, then others could more easily verify what you did wrong. It looks like you're specifying an additional Spark property, spark:spark.driver.memory=3000m, in addition to your list of jars, and tried to just space-separate it from your jars flag, which isn't allowed.
Per the linked result, you'd need to use the newly assigned separator character to separate the second spark property:
QUESTION
This is my PySpark configuration. I've followed the steps mentioned here and didn't create a SparkContext.
...ANSWER
Answered 2021-Sep-20 at 16:43 My problem was with faulty jar versions. I am using Spark 3.1.2 and Hadoop 3.2; these are the Maven jars, with code, that worked for me.
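The answer's actual coordinates are not shown above; a hedged sketch of such a configuration for Spark 3.1.2 (Scala 2.12) and Hadoop 3.2 might look like this, with the artifact versions as assumptions to be matched to your build:

```python
from pyspark.sql import SparkSession

# Sketch only: the artifact versions are assumptions and should be matched to
# the Spark (3.1.2, Scala 2.12) and Hadoop (3.2) build in use.
spark = (
    SparkSession.builder
    .appName("pyspark-bigquery")
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.22.2,"
        "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.2",
    )
    .config(
        "spark.hadoop.fs.gs.impl",
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    )
    .getOrCreate()
)
```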
QUESTION
According to https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery the connector uses BigQuery Storage API to read data using gRPC. However, I couldn't find any Storage API/gRPC usage in the source code here: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/tree/master/connector/src/main/scala
My questions are:
1. Could anyone show me where in the source code the Storage API and gRPC calls are used?
2. Does Dataset df = session.read().format("bigquery").load() work through the GBQ Storage API? If not, how can I read from GBQ into Spark using the BigQuery Storage API?
ANSWER
Answered 2020-Mar-15 at 17:17 The Spark BigQuery connector uses only the BigQuery Storage API for reads; you can see it here, for example.
Yes, Dataset df = session.read().format("bigquery").load() works through the BigQuery Storage API.
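For illustration, a minimal PySpark equivalent of that read (the public sample table is only an example, and the connector jar is assumed to already be on the classpath); the connector creates a Storage API read session and streams the rows back over gRPC rather than running a query job:

```python
from pyspark.sql import SparkSession

# Assumes the spark-bigquery-connector jar is already on the classpath
# (e.g. via spark.jars or spark.jars.packages).
spark = SparkSession.builder.appName("bq-storage-read").getOrCreate()

# Reading through the connector goes over the BigQuery Storage API:
# a read session is created and row batches are streamed back via gRPC.
df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.shakespeare")  # example table
    .load()
)
print(df.count())
```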
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install spark-bigquery-connector
You can use spark-bigquery-connector like any standard Java library. Please include the jar files in your classpath. You can also use any IDE, and you can run and debug the spark-bigquery-connector component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.