dataflow | experimental self-hosted Observable notebook editor
kandi X-RAY | dataflow Summary
A self-hosted Observable notebook editor, with support for FileAttachments, Secrets, custom standard libraries, and more!
Community Discussions
Trending Discussions on dataflow
QUESTION
I have a Dataflow pipeline written in Python; this is what it does:
Read messages from Pub/Sub. The messages are zipped protocol buffers, and one message received on Pub/Sub contains multiple types of messages. See the parent message's protocol specification below:
...
ANSWER
Answered 2021-Apr-16 at 18:49: How about using TaggedOutput?
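As a rough illustration of that suggestion (the field names, tags, and sample data below are hypothetical, not from the original question), a DoFn can route each decoded sub-message to a tagged output, and each tag then becomes its own PCollection for downstream processing:

```python
import apache_beam as beam
from apache_beam import pvalue

class SplitByMessageType(beam.DoFn):
    """Routes each sub-message of a decoded parent message to a tagged output."""

    def process(self, parent):
        # 'events' and 'type' are placeholder field names; in the real
        # pipeline the Pub/Sub payload would be unzipped and parsed into
        # 'parent' before this step.
        for msg in parent['events']:
            if msg['type'] == 'click':
                yield pvalue.TaggedOutput('clicks', msg)
            else:
                yield pvalue.TaggedOutput('other', msg)

with beam.Pipeline() as p:
    parents = p | beam.Create(
        [{'events': [{'type': 'click'}, {'type': 'view'}]}])
    results = parents | beam.ParDo(
        SplitByMessageType()).with_outputs('clicks', 'other')
    clicks = results.clicks  # PCollection of 'click' messages
    other = results.other    # PCollection of everything else
```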
QUESTION
I have a Python Apache Beam streaming pipeline running in Dataflow. It's reading from PubSub and writing to GCS. Sometimes I get errors like "Error in _start_upload while inserting file ...", which comes from:
...
ANSWER
Answered 2021-Jun-14 at 18:49: In a streaming pipeline, Dataflow retries work items that run into errors indefinitely. The code itself does not need retry logic.
QUESTION
My data flow job has a Synapse database as both source and sink.
In the data flow I have a source query with joins and transformations that extracts data from the Synapse database.
As we know, the data flow will under the hood spin up a Databricks cluster to execute the data flow code.
My question: will the source query I am using in the data flow be executed on the Synapse DB or on the Databricks cluster?
...
ANSWER
Answered 2021-Jun-10 at 19:03: The data flow requires a compute context, which is Spark. When you use a query in the transformation, that query is executed from that Spark cluster, and it essentially gets pushed down into the database engine for resolution.
QUESTION
Getting the following error when I try to launch a Dataflow SQL job:
Failed to start the VM, launcher-____, used for launching because of status code: INVALID_ARGUMENT, reason: Error: Message: Invalid value for field 'resource.networkInterfaces[0].network': 'global/networks/default'. The referenced network resource cannot be found. HTTP Code: 400.
This issue just started today.
...
ANSWER
Answered 2021-Jun-09 at 23:02: Adding the default network solved the issue.
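For context (an inference beyond the original answer): Dataflow places worker VMs on the VPC network named default unless told otherwise, so the error above typically means that network was deleted. Recreating it is one fix; for pipelines launched from the Beam Python SDK, another is to point the job at an existing network explicitly. A minimal sketch, with placeholder project, bucket, and network names:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# 'my-project', 'my-bucket', and 'my-vpc' are placeholders.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    network='my-vpc',  # equivalent to passing --network on the command line
)
```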
QUESTION
Just to get it right, I would like your opinion on whether my mental model of the data flow between a C# program and a C++ DLL it calls with delegates as parameters is correct.
- The system gives memory to the C# program.
- The C# program loads the DLL and gives some of its space to the C++ DLL. In this space there is no C# garbage collection; only when the DLL is unloaded can the whole space be freed.
- A C++ function is called. This function has a delegate as a parameter. We dive into the C++ memory area and declare some variables. Somewhere in its code, the C++ function will call the C# delegate.
- The C# delegate operates on the C# memory. It will have a copy of its input parameters in the C# memory if they are native types, or a reference to the variables in the C++ memory if it is a complex type. If we have native types, I can just save the value into the C# world and all will be fine. But if it is a reference and I just save it into my C# memory, I will get undefined behavior once my C++ function ends, because the variables will go out of scope and be destroyed.
- The C# function ends and we get the return value in C++ as a copy (or, if it is a complex type, a pointer to the return value; the pointer will point into the C# memory).
- The C++ function ends and the memory used by the C++ function is released.
Am I right with this?
...
ANSWER
Answered 2021-Jun-09 at 08:31: This should be described in the documentation for the marshaller. Quoting the question:
"if they are native types or a reference to the variables in the C++ memory, if it is a complex type. If we have native types I can just save it into the C# world and all will be fine. But if it is a reference and I just save it into my C# memory, I will get undefined behavior, if I end my C++ function, because the variables will get out of scope and will be destroyed"
My understanding is that the marshaller will either convert complex types to structs or expose them as pointers (IntPtr). Structs are passed by value, so you would have a copy in managed memory (probably on the stack). Pointers would need unsafe code to access, so you would be responsible for handling them safely.
Quoting the question again: "C++ as copy (or a pointer to the returnvalue, if it is a complex type, the pointer will point into the C# memory)"
There is not really a way a managed function can return a pointer to managed memory in a safe way. To create a pointer you would need to fix the object to prevent the GC from moving it, but fixing is scoped, so it would not work for return values.
I personally consider the marshalling rules a bit complicated, and I would prefer to keep any P/Invoke simple, if for no other reason than to avoid questions about safety. For more complicated interoperability between C# and C++ I would suggest C++/CLI. This allows you to do the type conversion yourself, and it adds a whole host of tools you can use to ensure correct functioning.
QUESTION
I am trying to run a Beam job on Dataflow using the Python SDK.
My directory structure is:
...
ANSWER
Answered 2021-Jun-08 at 09:22: Probably the wrapper-runner script generated by Bazel (you can find the path to it by calling bazel build on a target) restricts the set of modules available to your script. The proper approach is to fetch the PyPI dependencies with Bazel; look at the example.
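A rough sketch of that approach with rules_python (the load paths and names like pypi vary by rules_python version, so treat the exact API as an assumption to verify): declare the PyPI dependencies in WORKSPACE and reference them from the py_binary, instead of relying on whatever modules happen to be importable inside the Bazel-generated wrapper:

```python
# WORKSPACE (Starlark): resolve PyPI packages from a lock file.
load("@rules_python//python:pip.bzl", "pip_parse")

pip_parse(
    name = "pypi",
    requirements_lock = "//:requirements.txt",  # pin apache-beam[gcp] here
)

load("@pypi//:requirements.bzl", "install_deps")
install_deps()

# BUILD (Starlark): make the dependency explicit for the Beam job.
load("@pypi//:requirements.bzl", "requirement")

py_binary(
    name = "pipeline",
    srcs = ["pipeline.py"],
    deps = [requirement("apache-beam")],
)
```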
QUESTION
I'm using Apache Beam 2.28.0 on Google Cloud Dataflow (with the Scio SDK). I have a large bounded input PCollection and I want to limit/sample it to a fixed number of elements, but I want to start the downstream processing as soon as possible. Currently, when my input PCollection has e.g. 20M elements and I want to limit it to 1M by using https://beam.apache.org/releases/javadoc/2.28.0/org/apache/beam/sdk/transforms/Sample.html#any-long-
ANSWER
Answered 2021-Jun-08 at 13:40: OK, so my initial solution for that is to use a stateful DoFn like this (I'm using Scio's Scala SDK as mentioned in the question):
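The Scala snippet itself was not preserved in this snapshot. To convey the shape of the idea, here is a rough Beam Python equivalent (an assumption, not the author's code): a stateful DoFn requires keyed input, so every element is mapped onto one dummy key, which makes this step single-threaded but lets elements flow downstream as soon as they are counted, without the global barrier a Sample/combine would introduce:

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

class TakeUpTo(beam.DoFn):
    """Forwards elements as they arrive, stopping after `limit` of them."""

    COUNT = ReadModifyWriteStateSpec('count', VarIntCoder())

    def __init__(self, limit):
        self._limit = limit

    def process(self, keyed_element, count=beam.DoFn.StateParam(COUNT)):
        _, element = keyed_element
        seen = count.read() or 0
        if seen < self._limit:
            count.write(seen + 1)
            yield element  # emitted immediately, no waiting for the full input

limited = (
    big_input                      # an existing bounded PCollection
    | beam.Map(lambda x: (0, x))   # single key: state is per key (and window)
    | beam.ParDo(TakeUpTo(1_000_000)))
```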
QUESTION
I am trying to set up a new custom condition for an Azure Monitor alert rule, but when I enter my KQL query it doesn't show the expected data. When I run the same query in Logs it outputs 9 rows that fulfil my condition, but for some reason no data is shown in Monitor Alerts.
I can see that the problem is in the last condition, | where Anomaly has "1", as I get data when I delete this condition, but I need to have it (or at least a similar version of it) included in the query. Any suggestions? (I have also tried contains and ==, but they give the same problem.)
ANSWER
Answered 2021-Jun-03 at 00:01: The most general answer: start by working backwards and validate your assumptions.
- Remove the final | where... line and see what the query returns. Does it have 1s?
- has, has_any, and contains all have subtly different semantics, so you may need to use one rather than another.
- If your result doesn't have 1s, work back one more line: does your array_slice call return the items you think it does?
- If you just want the 0th item, why even use slice? Why not just use Anomaly=anomalies[0]?
Without having your exact data set, there's no way for us to reproduce the query/results exactly.
QUESTION
With Dataflow SQL I would like to read a Pub/Sub topic, enrich the message and write the message to a Pub/Sub topic.
Which Dataflow SQL query will create my desired output message?
Pub/Sub input message: {"event_timestamp":1619784049000, "device":{"ID":"some_id"}}
Desired Pub/Sub output message: {"event_timestamp":1619784049000, "device":{"ID":"some_id", "NAME":"some_name"}}
What I get is: {"event_timestamp":1619784049000, "device":{"ID":"some_id"}, "NAME":"some_name"}
but I need the NAME inside the "device" attribute.
...
ANSWER
Answered 2021-May-07 at 14:16: You need to create a struct in the projection (the SELECT part), along the lines of STRUCT(device.ID AS ID, NAME AS NAME) AS device, so that NAME is nested inside device.
QUESTION
I am using a Python POST request to geocode the addresses of my company's branches, but I'm getting wildly inaccurate results.
I looked at this answer, but the problem is that some results aren't being processed. My problem is different in that all of my results are inaccurate, even ones with Confidence="High". And I do have an enterprise account.
Here's the documentation that shows how to create a geocode Job and upload data:
https://docs.microsoft.com/en-us/bingmaps/spatial-data-services/geocode-dataflow-api/create-a-geocode-job-and-upload-data
Here's a basic version of my code to upload:
...
ANSWER
Answered 2021-Jun-02 at 15:28: I see several issues in your request data:
- The "query" value you are passing in is a combination of a point of interest name and a location. Geocoders only work with addresses. So in this case the point of interest name is being dropped and only "Los Angeles" is being used by the geocoder, thus the result.
- You are mixing two different geocode query types into a single query. Either use just "query" or just the individual address parts (AddressLine, Locality, AdminDistrict, CountryRegion, PostalCode). In this case, the "query" value is being used an everything else in being ignored, using the individual address parts will be much more accurate than your query.
- You are passing in the full address into the AddressLine field. That should only be the street address (i.e. "8830 Slauson Ave").
Here is a modified version of the request that will likely return the information you are expecting:
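The modified request from the original answer was not captured in this snapshot. Purely as an illustrative sketch of "use the individual address parts" (the endpoint and XML schema follow the Geocode Dataflow documentation linked above, but verify them there; the locality and region values are assumptions):

```python
import requests

KEY = 'YOUR_BING_MAPS_KEY'  # placeholder

# One GeocodeEntity per address, using address parts instead of a single
# combined "query" string. Schema per the Geocode Dataflow docs above.
body = """<?xml version="1.0" encoding="utf-8"?>
<GeocodeFeed xmlns="http://schemas.microsoft.com/search/local/2010/5/geocode">
  <GeocodeEntity Id="1">
    <GeocodeRequest Culture="en-US">
      <Address AddressLine="8830 Slauson Ave" Locality="Los Angeles"
               AdminDistrict="CA" CountryRegion="US" />
    </GeocodeRequest>
  </GeocodeEntity>
</GeocodeFeed>"""

resp = requests.post(
    'https://spatial.virtualearth.net/REST/v1/Dataflows/Geocode',
    params={'input': 'xml', 'key': KEY},
    headers={'Content-Type': 'application/xml'},
    data=body.encode('utf-8'))
resp.raise_for_status()
# The job is asynchronous; the Location header points at its status URL.
print(resp.headers.get('Location'))
```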
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported