DataflowTemplate | Mercari Dataflow Template | GCP library
kandi X-RAY | DataflowTemplate Summary
The Mercari Dataflow Template allows you to run various pipelines without writing any code, simply by defining a configuration file. It is implemented as a FlexTemplate for Cloud Dataflow. Pipelines are assembled from the defined configuration file and executed as Cloud Dataflow jobs. See the Document for usage.
Top functions reviewed by kandi - BETA
- Accumulate change records from table
- Create key and change record key
- Convert an object to a value
- Returns true if the given schema is JSON
- Expand the input results
- Sets the parameters to the given buffer
- Gets a builder that allows to select fields
- Converts the GenericRecord to mutations
- Returns the value of a field
- Convert a DataChangeRecord to a DataChangeRow
- Sets the value of a field
- Converts the given struct to mutations
- Creates a Key from a GenericRecord
- Returns column type
- Creates SQL statement for insert operations
- Merge values into entity
- Gets the value of a row
- Expand results from input
- Upload a model to the server
- Convert a schema to a generic record
- Expand a list of results into a map
- Set the value of the given field
- Convert a DataChangeRecord to GenericRecord
- Converts the given entity to mutations
- Obtain the mutations from a row
Community Discussions
Trending Discussions on DataflowTemplate
QUESTION
TL;DR
Spring Cloud Data Flow does not allow multiple executions of the same Task, even though the documentation says that this is the default behavior. How can we allow SCDF to run multiple instances of the same task at the same time, using the Java DSL to launch tasks? To make things more interesting, launching the same task multiple times works fine when hitting the REST endpoints directly, for example with curl.
Background :
I have a Spring Cloud Data Flow Task that I have pre-registered in the Spring Cloud Data Flow UI Dashboard
...ANSWER
Answered 2021-May-12 at 16:57
In this case it looks like you are trying to recreate the task definition. You should only need to create the task definition once. From this definition you can launch multiple times. For example:
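The answer's example code is not included in this excerpt; a minimal sketch of the create-once, launch-many pattern with the SCDF Java DSL might look like the following. The server URL and task name are placeholders, and the exact builder methods should be checked against your spring-cloud-dataflow-rest-client version:

```java
import java.net.URI;
import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;
import org.springframework.cloud.dataflow.rest.client.dsl.task.Task;

public class LaunchTwice {
    public static void main(String[] args) {
        // Connect to the Data Flow server (URL is a placeholder).
        DataFlowTemplate dataFlow = new DataFlowTemplate(URI.create("http://localhost:9393"));

        // Create the task definition ONCE, from a pre-registered app.
        Task task = Task.builder(dataFlow)
                .name("my-task")          // placeholder definition name
                .definition("timestamp")  // pre-registered task app
                .description("created once, launched many times")
                .create();

        // Launch the same definition as many times as needed;
        // each call returns a distinct task execution id.
        long execution1 = task.launch();
        long execution2 = task.launch();
    }
}
```

Per the answer, the error in the question comes from recreating the definition (calling create with the same name) on every run, rather than launching the one existing definition repeatedly.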
QUESTION
Is there any Python template/script (existing or on the roadmap) for Dataflow/Beam to read from Pub/Sub and write to BigQuery? As per the GCP documentation, there is only a Java template.
Thanks !
...ANSWER
Answered 2021-Feb-21 at 13:57
You can find an example here: Pub/Sub to BigQuery sample with template:
An Apache Beam streaming pipeline example.
It reads JSON encoded messages from Pub/Sub, transforms the message data, and writes the results to BigQuery.
Here's another example that shows how to route invalid Pub/Sub messages to a different BigQuery table:
QUESTION
I have an IoT Pipeline in GCP that is structured like:
...ANSWER
Answered 2021-Jan-24 at 21:53
This error was caused by Pub/Sub resending, every 10 seconds, the messages that had not yet been acknowledged. This caused the total number of messages to grow rapidly, as the number of devices sending messages and the rate at which they sent them were already very high. So I increased this wait time (the acknowledgement deadline) to 30 seconds and the system calmed down. Now no large group of unacknowledged messages forms when I run the pipeline.
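The acknowledgement deadline is a setting on the Pub/Sub subscription; assuming a subscription named my-subscription (a placeholder), the change described above can be sketched as a single gcloud command:

```shell
# Give subscribers 30 seconds to acknowledge before Pub/Sub redelivers a message
gcloud pubsub subscriptions update my-subscription --ack-deadline=30
```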
QUESTION
My pipeline is IoTCore -> Pub/Sub -> Dataflow -> BigQuery. Initially the data I was getting was in JSON format and the pipeline was working properly. Now I need to shift to CSV, and the issue is that the Google-defined Dataflow template I was using takes JSON input instead of CSV. Is there an easy way of transferring CSV data from Pub/Sub to BigQuery through Dataflow? The template could probably be changed, but it is implemented in Java, which I have never used, so it would take a long time. I also considered implementing an entirely custom template in Python, but that would take too long. Here is a link to the template provided by Google: https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubSubToBigQuery.java
Sample: Currently my pub/sub messages are JSON and these work correctly
...ANSWER
Answered 2021-Jan-02 at 19:48
Very easy: do nothing! If you have a look at this line, you can see that the type of the messages used is the Pub/Sub message itself (as JSON), not your content as JSON.
So, to prevent any issues (when querying and inserting), write to another table and it should work nicely!
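If the payload does eventually need to be converted from CSV, the conversion step itself is small. This is a plain-Java sketch, independent of Beam, of turning one CSV line into an ordered field map (the header and field names are hypothetical) from which a BigQuery TableRow could then be built:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CsvToRow {
    /** Split a CSV line against a known header and build an ordered field map. */
    public static Map<String, String> convert(String header, String line) {
        String[] names = header.split(",");
        String[] values = line.split(",", -1); // -1 keeps trailing empty fields
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            row.put(names[i].trim(), i < values.length ? values[i].trim() : null);
        }
        return row;
    }

    public static void main(String[] args) {
        // Hypothetical IoT payload: device id, temperature, timestamp.
        Map<String, String> row = convert("device_id,temperature,ts",
                                          "sensor-1,21.5,2021-01-02T10:00:00Z");
        System.out.println(row);
    }
}
```

Note this naive split does not handle quoted fields containing commas; a real pipeline would use a CSV library for that.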
QUESTION
Can Google Dataflow CDC be used to copy the mysql DB tables for the very first time too or is it only used for change data going forward? https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/master/v2/cdc-parent#deploying-the-connector
...ANSWER
Answered 2020-Oct-07 at 17:50
The CDC solution you linked to includes the initial copy as part of its normal operation. When you first start it up, it will copy the current contents of the DB first, then continue to copy any updates.
QUESTION
Unable to install Apache Maven packages in the GCP console; please let me know if anyone has resolved the issue. I'm trying to create a Dataflow pipeline following this link: enter link description here
...ANSWER
Answered 2020-Sep-02 at 08:14
By now this is a bug well known to developers, confirmed on my side too: I am getting the same kafka-to-bigquery template compilation error around the DataStreamClient class. It seems the new PR for CacheUtils.java is going to appear soon; more info here.
QUESTION
I'm trying to execute the Dataflow template named PubSubToBigQuery.java at a VM instance (OS: "linux", version: "4.9.0-11-amd64", Distributor: Debian GNU/Linux 9.11 (stretch)) to take input messages from a Pub/Sub subscription and write them in a BigQuery table (without modifying the template for the moment). In order to do this I cloned the GitHub DataflowTemplates repo into my Cloud Shell in the $HOME/opt/ directory. Following the README document I've installed Java 8 and Maven 3:
- Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T16:41:47+00:00)
- Maven home: /opt/maven
- Java version: 1.8.0_232, vendor: Oracle Corporation
- Java home: /usr/lib/jvm/java-8-openjdk-amd64/jre
- Default locale: en_US, platform encoding: UTF-8
- OS name: "linux", version: "4.9.0-11-amd64", arch: "amd64", family: "unix"
After building the entire project, this is what I'm trying to execute from the command line to compile the code:
...ANSWER
Answered 2019-Dec-09 at 10:48
To clone, compile and run a Dataflow Template, it is necessary to enable all the required APIs in your GCP project:
- Dataflow
- Compute Engine
- Stackdriver Logging
- Cloud Storage
- Cloud Storage JSON
- BigQuery
- PubSub
To do that, you can click on this helper link:
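The APIs listed above can also be enabled from the command line. The service IDs below are my assumptions mapped from the product names; verify them with `gcloud services list --available`:

```shell
# Enable the APIs required to run a Dataflow Template
gcloud services enable \
    dataflow.googleapis.com \
    compute.googleapis.com \
    logging.googleapis.com \
    storage-component.googleapis.com \
    storage-api.googleapis.com \
    bigquery.googleapis.com \
    pubsub.googleapis.com
```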
QUESTION
Currently we are searching for the best way to convert raw data into a common structure for further analysis. Our data is JSON files; some files have more fields, some fewer, and some may contain arrays, but in general the structure is much the same.
I'm trying to build Apache Beam pipeline in Java for this purpose. All my pipelines are based on this template: TextIOToBigQuery.java
First approach is to load entire JSON as string into one column and then use JSON Functions in Standard SQL to transform into common structure. This is well described here: How to manage/handle schema changes while loading JSON file into BigQuery table
The second approach is to load the data into appropriate columns, so that it can be queried via standard SQL. This also requires knowing the schema. It is possible to detect the schema via the console, the UI and other tools (see Using schema auto-detection), but I didn't find anything about how this can be achieved from a Java Apache Beam pipeline.
I analyzed BigQueryIO, and it looks like it cannot work without a schema (with one exception: if the table is already created).
As I mentioned before, new files might bring new fields, so the schema should be updated accordingly.
Let's say I have three JSON files:
...ANSWER
Answered 2019-Nov-16 at 13:54
I did some tests where I simulate the typical auto-detect pattern: first I run through all the data to build a Map of all possible fields and their types (here I just considered String or Integer for simplicity). I use a stateful pipeline to keep track of the fields that have already been seen and save it as a PCollectionView. This way I can use .withSchemaFromView(), as the schema is unknown at pipeline construction. Note that this approach is only valid for batch jobs.
First, I create some dummy data without a strict schema where each row may or may not contain any of the fields:
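The run-through-all-the-data step described in this answer can be sketched independently of Beam. This minimal version (field names are hypothetical) accumulates a field-to-type map across records, widening a field's type to STRING when the same field appears with conflicting types:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaDetector {
    /** Merge one record's fields into the accumulated field -> type map. */
    public static void accumulate(Map<String, String> schema, Map<String, Object> record) {
        for (Map.Entry<String, Object> e : record.entrySet()) {
            String type = (e.getValue() instanceof Integer) ? "INTEGER" : "STRING";
            // Widen to STRING if the field was already seen with a different type.
            schema.merge(e.getKey(), type, (a, b) -> a.equals(b) ? a : "STRING");
        }
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        accumulate(schema, Map.of("id", 1, "name", "alice"));
        accumulate(schema, Map.of("id", "x-2", "city", "tokyo")); // "id" now conflicts
        System.out.println(schema); // "id" widened to STRING; "city" added
    }
}
```

In the answer's Beam version, this same merge would run inside the stateful step, and the resulting map would become the PCollectionView consumed by .withSchemaFromView().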
QUESTION
I'm exploring Apache Beam dataflow templates provided by GoogleCloudPlatform on Github.
In particular, I'm converting the PubSubToBigQuery template from Java into Kotlin.
By doing so, I get an Overload ambiguity resolution error in the MapElements.input(...).via(...) transform on line 274. The error message is:
ANSWER
Answered 2019-Aug-16 at 09:05
The reason is that the overload rules are slightly different between Java and Kotlin, which means that in Kotlin there are two matching overloads.
QUESTION
I am instantiating the PackageIdentifier class to pass it to the DataFlowTemplate.streamOperations().updateStream(..) method. I set the properties repositoryName and packageName, but I want to know whether packageVersion is a required property, because I can see that it works without it.
It is just that I had an exception, am not able to reproduce it again, and was wondering if packageVersion is the cause of this problem:
ANSWER
Answered 2019-Jun-25 at 18:39
The packageVersion is not required as long as a package with the desired name (in this case the "stream name") exists in the Skipper database.
See: Stream.java#L112-L114.
As for the error, it could be that you were using H2 instead of a persistent database for Skipper, and upon a restart, perhaps your client/test continued to attempt an upgrade on the transient database that doesn't have any footprint anymore.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install DataflowTemplate
You can use DataflowTemplate like any standard Java library. Please include the jar files in your classpath. You can also use any IDE, and you can run and debug the DataflowTemplate component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org; for Gradle installation, please refer to gradle.org.