datafusion | DataFusion has now been donated to the Apache Arrow project
kandi X-RAY | datafusion Summary
DataFusion is an attempt at building a modern distributed compute platform in Rust, leveraging Apache Arrow as the memory model. See my article How To Build a Modern Distributed Compute Platform to learn about the design and my motivation for building this. The TL;DR is that this project is a great way to learn about building a query engine, but it is quite early and not usable for any real-world work just yet.
Community Discussions
Trending Discussions on datafusion
QUESTION
Context: I am using datafusion to build a data validator for a CSV file input.
Requirement: I want to add the row number where the error occurred to the output report. In pandas, I have the ability to add a row index which can be used for this purpose. Is there a way to achieve a similar result in datafusion?
ANSWER
Answered 2021-Aug-09 at 02:35: There doesn't appear to be any easy way to do this within datafusion after opening the CSV file. But you could instead open the CSV file directly with arrow, produce a new RecordBatch that incorporates the index column, and then feed this to datafusion using a MemTable. Here's the example assuming we are only processing one batch ...
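A hedged sketch of that approach (not the original answer's code): build the RecordBatch by hand in place of the arrow CSV read, prepend a row_num column, and register it as a MemTable. It assumes datafusion ~4/5-era names (ExecutionContext, MemTable, register_table); exact module paths and signatures vary between releases.

// Sketch only: assumes datafusion ~4.x/5.x with the matching arrow crate.
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use arrow::util::pretty::print_batches;
use datafusion::datasource::MemTable;
use datafusion::prelude::ExecutionContext;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // In the real validator this batch would come from arrow's CSV reader;
    // it is built by hand here to keep the sketch self-contained.
    let values: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "c"]));
    // The extra column carrying the original row number for error reporting.
    let row_num: ArrayRef = Arc::new(UInt64Array::from(vec![1u64, 2, 3]));

    let schema = Arc::new(Schema::new(vec![
        Field::new("row_num", DataType::UInt64, false),
        Field::new("value", DataType::Utf8, false),
    ]));
    let batch = RecordBatch::try_new(schema.clone(), vec![row_num, values])?;

    // Wrap the augmented batch in a MemTable and register it, so queries
    // (and the validation report) can refer to row_num.
    let table = MemTable::try_new(schema, vec![vec![batch]])?;
    let mut ctx = ExecutionContext::new();
    ctx.register_table("input", Arc::new(table))?;

    let df = ctx.sql("SELECT row_num, value FROM input")?;
    let results = df.collect().await?;
    print_batches(&results)?;
    Ok(())
}

Because the row number travels with the data as an ordinary column, any failing validation rule expressed over the table can include it in its output.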
QUESTION
I need to trigger a Data Fusion pipeline located on a GCP project called myDataFusionProject through a Data Fusion operator (CloudDataFusionStartPipelineOperator) inside a DAG whose Cloud Composer instance is located on another project called myCloudComposerProject.
I have used the official documentation as well as the source code to write the code that roughly resembles the below snippet:
ANSWER
Answered 2021-Dec-09 at 13:19: As a recommendation when developing operators on Airflow, we should check the classes that implement the operators, as the documentation may lack some information due to versioning.
As commented, if you check CloudDataFusionStartPipelineOperator you will find that it makes use of a hook that gets the instance based on a project_id. This project_id is optional, so you can pass your own project_id.
QUESTION
I am trying to check the state of a Data Fusion pipeline with Cloud Composer. In the DAG, I have the following code, which is a copy from the Airflow website:
ANSWER
Answered 2021-Nov-02 at 15:50: The failure_statuses parameter for the CloudDataFusionPipelineStateSensor wasn't introduced until v6.0.0 of the Google provider in Airflow. The example DAG assumes a provider with that version. Try upgrading to the latest Google provider and the example should work.
Be aware that there were some breaking changes between v5.1.0 and v6.0.0 of the provider.
A side note on looking at source code in Airflow: as of Airflow 2, releases of core Airflow and of functionality related to service providers (e.g. hooks, operators, sensors for Google, Databricks, etc.) have been decoupled. This means that provider functionality can be released independently from core Airflow; providers are typically released monthly. The main branch in Airflow reflects the latest code base, but that does not mean it reflects the latest released code. To make sure you are looking at the correct code for the provider version you have installed, use tags when searching through the source code.
QUESTION
I need to execute Data Fusion pipelines from Composer, using the operators for this:
ANSWER
Answered 2021-Oct-18 at 08:34: There is no need to install apache-airflow-backport-providers-google in Airflow 2.0+. That package actually backports Airflow 2 operators into Airflow 1.10.*. In addition, in Composer version composer-1.17.0-airflow-2.1.2 the apache-airflow-providers-google==5.0.0 package is already installed according to the documentation. You should be able to import the Data Fusion operators with the code snippet you posted as is.
However, if this is not the case, you should probably handle the conflict shown in the logs when trying to reinstall apache-airflow-providers-google==5.0.0:
QUESTION
I am reading an Excel file with the Google Data Fusion Wrangler plugin. In the Excel file the first row needs to be discarded, as the headers and data start from the second row.
The problem is that when Wrangler reads and parses a file as Excel (parse-as-excel), it defaults to using the first row as the header.
I need some help so that the first row is skipped and the header is taken from the second row, with the data following.
Thanks for the help!
ANSWER
Answered 2021-Sep-27 at 18:08: This behavior is currently not supported by the Wrangler plugin. As you are already aware, Wrangler will only look at the first row to decode headers.
In this case, pre-processing the file to remove the first row is the easiest solution.
QUESTION
We have a Data Fusion pipeline which is triggered by a Cloud Composer DAG. This pipeline provisions an ephemeral Dataproc cluster which, in an ideal scenario, terminates after finishing the tasks.
In our case, sometimes, though not always, this ephemeral Dataproc cluster gets stuck in a running state. The job inside the cluster is also in a running state, and the last log messages are the following:
ANSWER
Answered 2021-Mar-29 at 20:41: Which version of Data Fusion are you running? Also, what is the amount of memory for the Dataproc cluster? Sometimes we observe this issue when the Dataproc cluster runs out of memory. I would suggest increasing the amount of memory.
QUESTION
My Data Fusion resources are hosted in Project Y.
The issue is that when I use the Redact plugin in Data Fusion and provide the template ID or path in the form
projects/X/locations/{LOCATION}/inspectTemplates/DLPTest or
projects/X/inspectTemplates/DLPTest
Data Fusion fails to find the template, as it keeps searching for the template in Project Y, even though all permissions have been provided to the Data Fusion SA, the Compute Engine SA, and the DLP service account.
Error logs:
Caused by: com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Invalid path:
Data Fusion is expecting the template in location projects/Y/inspectTemplates/projects/DLPTest
How do I enable Data Fusion to look for the template in the correct location in the separate project? Thanks.
ANSWER
Answered 2021-Mar-23 at 03:09: When you want Project Y (where your Data Fusion instance is) to use resources from Project X (where the DLP template is), you need to add the Data Fusion and Compute Engine service accounts of Project Y to Project X.
Notes:
- Data Fusion service account: service-xxxxxxx@gcp-sa-datafusion.iam.gserviceaccount.com
- Default compute engine service account: xxxxx-compute@developer.gserviceaccount.com
Project Y:
- Go to IAM & Admin -> IAM
- Click View by: "Members"
- Tick checkbox "Include Google-provided role grants"
- Look for service-(project number of Project Y)@gcp-sa-datafusion.iam.gserviceaccount.com and (project number of Project Y)-compute@developer.gserviceaccount.com
- Add role "DLP Administrator" for service-(project number of Project Y)@gcp-sa-datafusion.iam.gserviceaccount.com
Project X:
- Go to IAM & Admin -> IAM
- Click Add
- Under New Members, put service-(project number of Project Y)@gcp-sa-datafusion.iam.gserviceaccount.com
- Grant the role of "DLP Administrator"
- Repeat steps 2 to 4, but this time put in (project number of Project Y)-compute@developer.gserviceaccount.com
Now that you have set the permissions, go back to Project Y and update your Redact plugin to point to Project X.
QUESTION
I'm using Tokio 1.1 to do async things. I have an async main with #[tokio::main], so I'm already operating with a runtime.
main invokes a non-async method where I'd like to await a future (specifically, I'm collecting from a datafusion dataframe). This non-async method has a signature prescribed by a trait which returns a struct, not a Future. As far as I'm aware, I can't mark it async.
If I try and call df.collect().await;, I get the "only allowed inside async functions and blocks" error from the compiler, pointing out that the method I'm calling await within is not async.
If I try and block_on the future from a new runtime like this:
ANSWER
Answered 2021-Feb-19 at 16:05: "I'm within an async context, and it feels like the compiler should know that and allow me to call .await from within the method"
It is fundamentally impossible to await inside of a synchronous function, whether or not you are within the context of a runtime. awaits are transformed into yield points, and async functions are transformed into state machines that make use of these yield points to perform asynchronous computations. Without marking your function as async, this transformation is impossible.
If I understand your question correctly, you have the following code:
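(The asker's snippet itself is not reproduced here.) One common workaround when a trait forces a synchronous signature but the body must drive a future inside an already-running runtime is tokio::task::block_in_place combined with Handle::block_on. Below is a minimal, hedged sketch of that pattern; it assumes a multi-thread Tokio 1.x runtime (block_in_place panics on a current_thread runtime), and the trait and the async call are illustrative stand-ins rather than the asker's actual code or the datafusion API.

use tokio::runtime::Handle;

// Illustrative trait with a synchronous signature; a stand-in for the
// asker's trait, not taken from their code.
trait RowSource {
    fn row_count(&self) -> usize;
}

struct CsvSource;

impl RowSource for CsvSource {
    fn row_count(&self) -> usize {
        // Re-enter the runtime that is already running instead of creating
        // a new one. block_in_place tells the runtime this worker thread is
        // about to block, so block_on does not deadlock it.
        tokio::task::block_in_place(|| {
            Handle::current().block_on(async {
                // In the real code this would be something like `df.collect().await`.
                fake_async_count().await
            })
        })
    }
}

// Stand-in for the async datafusion call the asker wants to await.
async fn fake_async_count() -> usize {
    42
}

#[tokio::main]
async fn main() {
    let source = CsvSource;
    println!("rows = {}", source.row_count());
}

If blocking is unacceptable, an alternative is to spawn the async work as a task and hand the result back through a channel, or to redesign the trait so the method can be async.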
QUESTION
I need to remove dots from a number in Google Data Fusion. For this I'm using the Wrangler transformation, but I'm having trouble with one file: if I replace the dots, the whole cell gets empty. If I replace any other character, it works. What could be the problem?
Thanks!
ANSWER
Answered 2020-Dec-17 at 07:27: The find-and-replace function of Wrangler is similar to "sed" in that it applies regular expressions.
A period (.) matches any character except a newline character.
I tried this on my own project, and here is the result when using the unescaped period:
You need to escape the period symbol (\.) so it is treated as a literal period. Here is the result when escaping the period:
As you can see, the period (.) was removed before "jpg".
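The same distinction can be illustrated outside Wrangler. The short sketch below uses Rust's regex crate (an assumption made purely for illustration; it is not Wrangler's directive syntax) to show why the unescaped pattern empties the cell while the escaped one removes only the dots.

// Sketch using the regex crate, not Wrangler, to show the regex behavior.
use regex::Regex;

fn main() {
    let cell = "1.234.567";

    // Unescaped period: "." matches any single character, so every
    // character gets replaced and the whole cell becomes empty.
    let any_char = Regex::new(".").unwrap();
    assert_eq!(any_char.replace_all(cell, ""), "");

    // Escaped period: "\." matches only literal dots.
    let literal_dot = Regex::new(r"\.").unwrap();
    assert_eq!(literal_dot.replace_all(cell, ""), "1234567");
}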
QUESTION
I'm using Google Cloud Platform Data Fusion.
Assuming that the bucket's path is as follows:
test_buk/...
In the test_buk bucket there are four files:
20190901, 20190902
20191001, 20191002
Let's say there is a directory inside test_buk called dir.
I have a prefix-based bundle based on 201909 (e.g. 20190901, 20190902);
also, I have a prefix-based bundle based on 201910 (e.g. 20191001, 20191002).
I'd like to complete the data pipeline for the 201909 and 201910 bundles.
Here's what I've tried: running the data pipeline with the regex path filter gs://test_buk/dir//2019.
If the regex path filter is inserted, the Input value is not read, and likewise there is no Output value.
When I want to create a data pipeline with a specific directory in a bundle, how do I handle it in Data Fusion?
ANSWER
Answered 2020-Nov-17 at 12:14: If you use the raw path directly (gs://test_buk/dir/), you might be getting an error when escaping special characters in the regex. That might be the reason why no input file matching your filter gets into the pipeline.
I suggest instead that you use ".*" to match the initial part (given that you are also specifying the path, no additional files in other folders will match the filter).
Therefore, I would use the following expressions depending on the group of files you want to use (feel free to change the extension of the files):
path = gs://test_buk/dir/
regex path filter = .*201909.*\.csv or .*201910.*\.csv
If you would like to know more about the regex used, you can take a look at (1)
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install datafusion
Rust is installed and managed by the rustup tool. Rust has a 6-week rapid release process and supports a great number of platforms, so there are many builds of Rust available at any time. Please refer to rust-lang.org for more information.