Data-Pipeline | Data Pipeline is a tool to run data loading pipelines | Cloud Storage library
kandi X-RAY | Data-Pipeline Summary
Data Pipeline is a self-hosted Google App Engine sample application that enables its users to easily define and execute data flows across different Google Cloud Platform products. It is intended as a reference for connecting multiple cloud services together, and as a head start for building custom data processing solutions.
Top functions reviewed by kandi - BETA
- Run the S3 stage
- Reads an object from S3
- Convert an S3 path to a bucket and object name
- Calls the appropriate handler
- Run the worker
- Read data from source to destination
- Performs transform and writes rows
- Find the start of the source file after skip_rows
- Stat object
- Lints the configuration
- Normalizes a time stamp
- Convert a format string to a regular expression
- Returns the documentation for the given type
- Parse the pipeline configuration
- Lint the required fields
- Generate GCS resources
- Handle POST request
- Runs the pipeline
- Run the pipeline
- Run the query
- Run the engine
- Transforms source and sinks
- Start the pipeline
- Runs GCS
- Transforms the instance data
- Transforms a GCE file
Data-Pipeline Key Features
Data-Pipeline Examples and Code Snippets
def legacy_snapshot(path,
                    compression=None,
                    reader_path_prefix=None,
                    writer_path_prefix=None,
                    shard_size_bytes=None,
                    pending_snapshot_expiry_seconds=None):  # final default assumed; the snippet is truncated in the source
    ...
Community Discussions
Trending Discussions on Data-Pipeline
QUESTION
I want to store my Databricks connection information as an environment variable, as mentioned in the documentation.
I am also looking at the following: https://docs.databricks.com/dev-tools/data-pipelines.html
It says to set the login as: {"token": "abc", "host": "123"}
I'm not sure what to export. Does anyone have a clue? I have the token, etc., but what is the export statement?
ANSWER
Answered 2021-Apr-22 at 23:47
If you have already created the connection from the Airflow UI, open a terminal and enter this command: airflow connections get your_connection_id
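For illustration only (not part of the original answer), here is a minimal Python sketch of the environment-variable form Airflow reads for connections; the connection id, host, and token are placeholders, and the JSON form shown requires a reasonably recent Airflow version (older versions expect a URI string):

import json
import os

# Airflow resolves connections from environment variables named
# AIRFLOW_CONN_<CONN_ID> (upper-cased). Placeholder values only; the
# personal access token is commonly supplied via the password field.
conn = {
    "conn_type": "databricks",
    "host": "https://example.cloud.databricks.com",
    "password": "abc",
}
os.environ["AIRFLOW_CONN_DATABRICKS_DEFAULT"] = json.dumps(conn)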
QUESTION
I am getting the following issues when trying to get the Message-Driven Channel Adapter working with Spring-Kafka 2.3+. Does anyone have any example code which would help me?
1. org.springframework.kafka.listener.config.ContainerProperties does not actually exist.
2. org.springframework.kafka.listener.ContainerProperties does exist but produces the below issue when trying to run.
Description:
An attempt was made to call a method that does not exist. The attempt was made from the following location:
ANSWER
Answered 2021-Mar-25 at 15:02
Quoting the question: "5.4.5 ... This includes version Spring-Kafka 2.3.6"
No, it does not; the 5.4.x versions of spring-integration-kafka require 2.6.x; that method was added to the properties in 2.5.
See the project page for compatible versions.
https://spring.io/projects/spring-kafka
If you are using Spring Boot, it will bring in all of the right versions, and you should not specify versions at all in your pom.
For problem 3, it looks like you have a producer factory declared somewhere that is not compatible with the kafka template bean.
QUESTION
I just upgraded my Gradle from 3.2.1 to 6.8.1 and keep getting the error below. I checked the gradle -version output and all the dependencies look appropriate per the documentation. Below are the build.gradle file, the output of gradle -version, and the gradle.properties file. Thanks for your time.
ANSWER
Answered 2021-Mar-25 at 07:59
This is just a guess, but it feels like a caching issue. The org.gradle.api.internal.java.JavaLibrary type exists in Gradle 3.2.1 but doesn't seem to exist in Gradle 6.8.1 (I checked here).
So I'd suggest trying to clear some caches, maybe starting by removing the .gradle/ and build/ directories of your project(s). My hunch is that that will already help. If it doesn't, then try to remove (or temporarily rename) the caches/ directory in your Gradle user home (usually under ~/.gradle/), or even entirely remove/rename your Gradle user home.
Looking more closely at the stacktrace, we can see a mention of com.github.jengelman.gradle.plugins.shadow.ShadowPlugin. Is it possible that the com.github.johnrengelman.shadow plugin is applied to your actual build somewhere? At least it's not mentioned in the build.gradle of your question …
I could imagine that you're using an old version of that plugin which is not compatible with Gradle 6.8.1. For example, version 1.2.4 of the plugin depends on org.gradle.api.internal.java.JavaLibrary. If you're indeed using an old version of the plugin, then the solution would probably be to upgrade to a recent version.
If you're applying any other external plugins that are not mentioned in your question, then they could also be the culprits or result in similar issues.
QUESTION
I have a couple of projects that are using and updating the same data sources. I recently learned about dvc's data registries, which sound like a great way of versioning data across these different projects (e.g. scrapers, computational pipelines).
I have put all of the relevant data into data-registry and then imported the relevant files into the scraper project with:
ANSWER
Answered 2021-Mar-02 at 15:29
When you import (or add) something into your project, a .dvc file is created that lists that something (in this case the raw/ dir) as an "output".
DVC doesn't allow overlapping outputs among .dvc files or dvc.yaml stages, meaning that your "menu_items" stage shouldn't write to raw/, since it's already under the control of raw.dvc.
Can you make a separate directory for the pipeline outputs? E.g. use processed/menu_items/restaurant.jsonl
QUESTION
I am having an issue getting AWS Data Pipeline to run on an EC2 Instance via a Shell Command Activity.
I have been following the guide found here step by step: https://medium.com/@SarwatFatimaM/data-scientists-guide-setting-up-aws-datapipeline-for-running-python-etl-scripts-using-c6c8fa4de70d
The primary issue I am running into is that the pipeline hangs in the WAITING_FOR_RUNNER status.
I have confirmed that my Python script and .bat file (I had to change from .sh since I am using a Windows EC2 instance) run inside the desired EC2 instance. However, from what I can tell, the issue is a result of the warning I am receiving inside the Data Pipeline Architect:
ANSWER
Answered 2020-Dec-30 at 20:18
As per the AWS Data Pipeline documentation found below, the custom AMI must have Linux installed. This therefore cannot currently be done on a Windows EC2 instance and must be done on a Linux EC2 instance.
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-custom-ami.html
QUESTION
I'm using Google Cloud Platform's Data Fusion.
Assuming that the bucket's path is as follows:
test_buk/...
In the test_buk bucket there are four files:
20190901, 20190902
20191001, 20191002
Let's say there is a directory inside test_buk called dir.
I have a prefix-based bundle based on 201909 (e.g., 20190901, 20190902),
and also a prefix-based bundle based on 201910 (e.g., 20191001, 20191002).
I'd like to complete the data-pipeline for 201909 and 201910 bundles.
Here's what I've tried:
Running the data pipeline with the regex path filter gs://test_buk/dir//2019.
If the regex path filter is applied, no Input value is read, and likewise there is no Output value.
When I want to create a data pipeline with a specific directory in a bundle, how do I handle it in Data Fusion?
ANSWER
Answered 2020-Nov-17 at 12:14
If you use the raw path directly (gs://test_buk/dir/), you might be getting an error when escaping special characters in the regex. That might be the reason no input file that matches your filter gets into the pipeline.
I suggest instead that you use ".*" to match the initial part (given that you are also specifying the path, no additional files in other folders will match the filter).
Therefore, I would use the following expressions depending on the group of files you want to use (feel free to change the extension of the files):
path = gs://test_buk/dir/
regex path filter = .*201909.*\.csv or .*201910.*\.csv
If you would like to know more about the regex used, you can take a look at (1)
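As a quick illustration (not from the original answer), this Python sketch shows which of the object names from the question the suggested filter would match, assuming the files have a .csv extension as in the answer:

import re

# The suggested filter for the 201909 bundle; swap in 201910 for the other bundle.
pattern = re.compile(r".*201909.*\.csv")

names = ["dir/20190901.csv", "dir/20190902.csv", "dir/20191001.csv", "dir/20191002.csv"]
print([name for name in names if pattern.match(name)])  # ['dir/20190901.csv', 'dir/20190902.csv']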
QUESTION
I have configured an AWS Data Pipeline to export a DynamoDB table to an S3 bucket (using the template) in a different account. That export is working just fine, but I have some issues when I try to restore the backup into the new table in that second account (using the import template too).
My source of information for this task: https://aws.amazon.com/premiumsupport/knowledge-center/data-pipeline-account-access-dynamodb-s3/
I can see that the AWS Data Pipeline is restoring data to the new table (not sure if all the data is being restored); however, the execution has status CANCELED. The activity log shows this several times:
EMR job '@TableLoadActivity_2020-09-07T12:45:59_Attempt=1' with jobFlowId 'j-11620944P11II' is in status 'WAITING' and reason 'Cluster ready after last step completed.'. Step 'df-06812232H5PDR4VVK472_@TableLoadActivity_2020-09-07T12:45:59_Attempt=1' is in status 'RUNNING' with reason 'null'
Then the cancelled part: EMR job '@TableLoadActivity_2020-09-07T12:45:59_Attempt=1' with jobFlowId 'j-11620944P11II' is in status 'WAITING' and reason 'Cluster ready after last step completed.'. Step 'df-06812232H5PDR4VVK472_@TableLoadActivity_2020-09-07T12:45:59_Attempt=1' is in status 'CANCELLED' with reason 'Job terminated'
See the full log below (I only left a few lines with the error from point #2):
ANSWER
Answered 2020-Sep-29 at 11:48
Answering my own question:
The issue was that I set the Terminate After field, because I got the warning message in the image below which suggested doing that. I set Terminate After to 1 hour, and the reason I used that time is that the file to be imported was only 9.6 MB; how much time could it need to process such a small file? So the import process of that small file lasted about 5 hours.
Findings:
To improve the import time I increased the myDDBWriteThroughputRatio value from 0.25 to 0.95. At the beginning I didn't touch that parameter because it was the default value from the template, and the AWS documentation sometimes simplifies a lot of stuff that in many cases you have to discover by trial and error.
After changing that value the import lasted about an hour, which is way better than 5 hours but still slow, because we are talking about only 9.6 MB.
Then there was the log message is in status 'WAITING' and reason 'Cluster ready after last step completed.', which worried me a bit because I'm new to using this tool and didn't quite get the message. It simply means the following, as explained by someone from AWS:
If you see that the EMR cluster is in Waiting, Cluster ready after last step completed, it means that the cluster has executed the first request it has received and is waiting to execute the next request/activity on the cluster.
These are all my findings; hopefully they will help someone else.
QUESTION
I'm currently trying to learn more about deep learning/CNNs/Keras through what I thought would be quite a simple project of just training a CNN to detect a single specific sound. It's been a lot more of a headache than I expected.
I'm currently reading through this, ignoring the second section about GPU usage; the first part definitely seems like exactly what I need. But when I go to run the script (my script is pretty much lifted wholesale from the section in the link above that says "Putting the pieces together, you may end up with something like this:"), it gives me this error:
ANSWER
Answered 2020-Nov-03 at 19:43
The expression df.file_path denotes that you want to access the file_path column of your dataframe. It seems that your dataframe object does not contain this column. With df.head() you can check whether your dataframe object contains the needed fields.
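To make that check concrete, here is a small sketch (the CSV file name is a placeholder; the point is only to verify that the column the generator expects actually exists):

import pandas as pd

# Load whatever metadata table feeds the Keras data generator (placeholder file name).
df = pd.read_csv("metadata.csv")

print(df.head())  # inspect the first rows and the available columns
if "file_path" not in df.columns:
    raise KeyError(f"expected a 'file_path' column, found: {list(df.columns)}")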
QUESTION
When I run docker ps -a, I get a list of unknown containers with weird names.
ANSWER
Answered 2020-Oct-29 at 07:16
No need to worry. These are containers that you have run, though, probably whilst testing/debugging something.
They can safely be deleted with docker rm, or, to clear them all, docker rm $(docker ps -aq).
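If you prefer to script the cleanup, here is a rough sketch using the Docker SDK for Python (my own addition, not from the original answer; it assumes the docker package is installed and the daemon is reachable):

import docker

# Connect to the local Docker daemon and remove every stopped container,
# the scripted equivalent of running `docker rm` on each exited container.
client = docker.from_env()
for container in client.containers.list(all=True):
    if container.status == "exited":
        container.remove()
        print(f"removed {container.name}")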
QUESTION
I've used Google Cloud Platform's Data Fusion product in Developer and Enterprise mode.
In Developer mode, there was no Dataproc setting (master node, worker node).
In Enterprise mode, there was a Dataproc setting (master node, worker node).
What I'm curious about is the Enterprise mode case.
I was able to set values for the master node and worker node.
In detail:
ANSWER
Answered 2020-Oct-29 at 02:43
Dataproc allows users to create clusters, whereas the driver and executor settings in Cloud Data Fusion allow users to adjust how much of the cluster's resources a pipeline run will use.
As such, creating a Dataproc cluster with 3 workers and 1 master will create 4 VMs with the memory and CPUs specified in the Dataproc configuration, whereas setting the driver/executor CPUs and memory dictates how much of each master/worker VM's CPU and memory resources a data pipeline job running on the cluster will use.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install Data-Pipeline
Make an app at appengine.google.com (we use an app id of example for this document).
Enable billing.
Set up a Google Cloud Storage bucket (if you don't already have gsutil, install it; if you do have it, you might need to run gsutil config to set up the credentials; a scripted alternative is sketched after the verification paragraph below):
Go to Application Settings for your app on appengine.google.com
Copy the service account example@appspot.gserviceaccount.com
Click on the Google APIs Console Project Number
Add the service account under Permissions.
Click on APIs and Auth and turn on BigQuery, Google Cloud Storage and Google Cloud Storage JSON API.
Replace the application name in the .yaml files. So for example, if your app is called example.appspot.com:
Now publish your application:
You can now connect to your application and verify it: click the little cog and add your default bucket of gs://example (be sure to substitute your own bucket name here). You probably want to add a prefix (e.g. tmp/) to isolate any temporary objects used to move data between stages. Now create a new pipeline and upload the contents of app/static/examples/gcstobigquery.json. Run the pipeline. It should successfully run to completion. Go to BigQuery and view your dataset and table.
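As an aside, the bucket from the setup step above can also be created programmatically; here is a minimal sketch using the google-cloud-storage Python client (the project id and bucket name are placeholders for your own values, and the client library and credentials must be set up separately):

from google.cloud import storage

# Create the default bucket for the app. Bucket names must be globally unique,
# so replace "example" with your own app id / bucket name.
client = storage.Client(project="example")
bucket = client.create_bucket("example")
print("Created bucket:", bucket.name)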
As in the previous section, here we also assume gs://example for your bucket, and gce-example is a project that has enough quota for Google Compute Engine to host your Hadoop cluster. The quota size (instances and CPUs) depends on the Hadoop cluster size you will be using. We can use the same project as we did for BigQuery. As before, the following script can be copied and pasted into a shell as-is:
Go to Application Settings in the App Engine console and copy the value (it should be an email address) indicated in the Service Account Name field.
Go to the Cloud Console of the project for which Google Compute Engine will be used.
Go to the Permissions page, and click the red ADD MEMBER button on the top.
Paste the value from step #1 as the email address. Make sure the account has "can edit" permission. Click the Add button to save the change.