hudi | Upserts, Deletes and Incremental Processing on Big Data

by apache | Java | Version: release-0.13.1 | License: Apache-2.0

kandi X-RAY | hudi Summary

hudi is a Java library typically used in Big Data, Spark, and Amazon S3 applications. hudi has a build file available, a Permissive License, and medium support. However, hudi has 275 bugs and 1 vulnerability. You can download it from GitHub or Maven.

Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage).

Support

hudi has a moderately active ecosystem.
It has 4276 stars, 1976 forks, and 1182 watchers.
There was 1 major release in the last 12 months.
There are 306 open issues and 2140 closed issues. On average, issues are closed in 38 days. There are 383 open pull requests and 0 closed pull requests.
It has a neutral sentiment in the developer community.
The latest version of hudi is release-0.13.1.

Quality

              hudi has 275 bugs (21 blocker, 14 critical, 48 major, 192 minor) and 2579 code smells.

Security

hudi's dependent libraries have no vulnerabilities reported.
hudi code analysis shows 1 unresolved vulnerability (1 blocker, 0 critical, 0 major, 0 minor).
There are 32 security hotspots that need review.

License

              hudi is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

hudi releases are available to install and integrate.
A deployable package is available in Maven.
A build file is available, so you can build the component from source.
Installation instructions, examples, and code snippets are available.
hudi saves you an estimated 84469 person-hours of effort in developing the same functionality from scratch.
It has 92879 lines of code, 6730 functions, and 934 files.
It has medium code complexity. Code complexity directly impacts the maintainability of the code.

            Top functions reviewed by kandi - BETA

kandi has reviewed hudi and discovered the functions listed below as its top functions. This is intended to give you an instant insight into the functionality hudi implements and to help you decide if it suits your requirements.
• Adds a column to a column vector.
• Gets multiple input splits.
• Creates a converter for the given type.
• Creates a column vector from a constant value.
• Registers the latest file slices.
• Reads information from the source.
• Converts the given logical type into a schema.
• Gets the input format.
• Converts a parquet field to a string.
• Checks whether a given path is accepted.

            hudi Key Features

            No Key Features are available at this moment for hudi.

            hudi Examples and Code Snippets

            No Code Snippets are available at this moment for hudi.

            Community Discussions

            QUESTION

            Unable to run spark.sql on AWS Glue Catalog in EMR when using Hudi
            Asked 2021-Apr-16 at 22:29

Our setup is a default Data Lake on AWS, using S3 as storage and the Glue Catalog as our metastore.

We are starting to use Apache Hudi, and we could get it working by following the AWS documentation. The issue is that, when using the configuration and JARs indicated in the docs, we are unable to run spark.sql on our Glue metastore.

            Here follows some information.

            We are creating the cluster with boto3:

            ...

            ANSWER

            Answered 2021-Apr-12 at 11:46

Please open an issue at github.com/apache/hudi/issues to get help from the Hudi community.

            Source https://stackoverflow.com/questions/67027525

            QUESTION

Writing a Spark DataFrame to an Apache Hudi Table
            Asked 2021-Mar-21 at 15:58

I am new to Apache Hudi and am trying to write my DataFrame to my Hudi table using the Spark shell. Since this is the first time, I am not creating any table and am writing in overwrite mode, so I expect it will create the Hudi table. I am writing the code below.

            ...

            ANSWER

            Answered 2021-Mar-21 at 15:58

            Here is a working sample for your question in pyspark:
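Along those lines, here is a minimal sketch of such a write, assuming the standard Hudi write options (the table name, key fields, and paths below are illustrative, not from the original answer):

    from pyspark.sql import SparkSession

    # A Spark session with Hudi support; the hudi-spark-bundle and
    # spark-avro packages must be on the classpath (e.g. via --packages).
    spark = (SparkSession.builder
        .appName("hudi-write-sample")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate())

    df = spark.createDataFrame(
        [(1, "foo", 1718000000), (2, "bar", 1718000001)],
        ["id", "name", "ts"],
    )

    # Standard Hudi write options; all names here are illustrative.
    hudi_options = {
        "hoodie.table.name": "my_hudi_table",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.operation": "upsert",
    }

    # The first write in overwrite mode creates the table at the base path.
    (df.write.format("hudi")
        .options(**hudi_options)
        .mode("overwrite")
        .save("s3://my-bucket/my_hudi_table"))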

            Source https://stackoverflow.com/questions/66705961

            QUESTION

            Issue for Integrating Hudi with Kafka using Avro Schema
            Asked 2021-Mar-18 at 10:14

            I am trying to integrate Hudi with Kafka topic.

Steps followed:

            1. Created Kafka topic in Confluent with schema defined in schema registry.
            2. Using kafka-avro-console-producer, I am trying to produce data.
            3. Running Hudi Delta Streamer in continuous mode to consume the data.

Infrastructure:

            1. AWS EMR
            2. Spark 2.4.4
            3. Hudi Utility ( Tried with 0.6.0 and 0.7.0 )
            4. Avro ( Tried avro-1.8.2, avro-1.9.2 and avro-1.10.0 )

            I am getting the below error stacktrace. Can someone please help me out with this?

            ...

            ANSWER

            Answered 2021-Mar-02 at 11:15

Please open a GitHub issue (https://github.com/apache/hudi/issues) to get a timely reply.

            Source https://stackoverflow.com/questions/66372649

            QUESTION

            Why does Delta Lake seem to store so much redundant information?
            Asked 2020-Oct-19 at 15:29

            I just started using delta lake so my mental model might be off - I'm asking this question to validate/refute it.

            My understanding of delta lake is that it only stores incremental changes to data (the "deltas"). Kind of like git - every time you make a commit, you're not storing an entire snapshot of the codebase - a commit only contains the changes you made. Similarly, I would imagine that if I create a Delta table and then I attempt to "update" the table with everything it already contains (i.e. an "empty commit") then I would not expect to see any new data created as a result of that update.

            However this is not what I observe: such an update appears to duplicate the existing table. What's going on? That doesn't seem very "incremental" to me.

            (I'll replace the actual UUID values in the filenames for readability)

            ...

            ANSWER

            Answered 2020-Oct-19 at 15:29

That is not a completely correct understanding. Delta won't check the existing data for duplicates automatically. If you want to store only new/updated data, you need to use the merge operation, which checks for existing data; you can then decide what to do with the existing data: overwrite it with new data, or just ignore it.

You may find more information on Delta's site, or in the 9th chapter of the Learning Spark, 2nd edition book (it's freely available from Databricks).
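As an illustration, a minimal sketch of such a merge using the DeltaTable API (the table path and column names are assumptions):

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Path and column names are illustrative.
    target = DeltaTable.forPath(spark, "/data/events")
    updates = spark.read.format("parquet").load("/data/incoming")

    # Matched rows are rewritten and unmatched rows are inserted; adding a
    # condition to whenMatchedUpdateAll would skip rows that are unchanged.
    (target.alias("t")
        .merge(updates.alias("u"), "t.id = u.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())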

            Source https://stackoverflow.com/questions/64429977

            QUESTION

            More than 1 column in record key in spark Hudi Job while making an upsert
            Asked 2020-Sep-02 at 19:14

I am currently doing a POC on Delta Lake, where I came across this framework called Apache Hudi. Below is the data I am trying to write using the Apache Spark framework.

            ...

            ANSWER

            Answered 2020-Sep-02 at 05:29

This can be solved by using ComplexKeyGenerator instead of SimpleKeyGenerator.
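For illustration, a sketch of the relevant write options (the column names are assumptions); ComplexKeyGenerator takes a comma-separated list of fields as the record key:

    # Hudi write options for a composite record key; "id" and "region"
    # are illustrative column names.
    hudi_options = {
        "hoodie.table.name": "my_hudi_table",
        "hoodie.datasource.write.recordkey.field": "id,region",
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.ComplexKeyGenerator",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.operation": "upsert",
    }

    (df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://my-bucket/my_hudi_table"))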

            Source https://stackoverflow.com/questions/63645977

            QUESTION

            spark-submit Error: java.util.NoSuchElementException: spark.scheduler.mode
            Asked 2020-May-30 at 16:48

I am trying to set up Apache Hudi on an Ubuntu 16.04 server. I cloned the repo https://github.com/apache/incubator-hudi.git and then built it as

            ...

            ANSWER

            Answered 2019-Jun-28 at 10:04

            QUESTION

            Spark streaming - Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file
            Asked 2020-May-30 at 16:46

I'm using Spark to write my JSON data to S3. However, I keep getting the error below. We are using Apache Hudi for updates. This only happens for some data; everything else works fine.

            ...

            ANSWER

            Answered 2019-Dec-28 at 14:13

Found the issue: a schema mismatch between the existing parquet files and the incoming data. One of the fields was a string in the existing parquet schema, but it was being sent as a long in the newer chunk of data.
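One common fix is to cast the incoming column back to the type already on disk before writing; a sketch (the column name "amount" is an assumption):

    from pyspark.sql import functions as F

    # Cast the incoming column to the type recorded in the existing
    # parquet schema before the Hudi upsert.
    incoming_df = incoming_df.withColumn("amount", F.col("amount").cast("string"))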

            Source https://stackoverflow.com/questions/59492879

            QUESTION

            Spark structured streaming with Apache Hudi
            Asked 2020-May-29 at 18:00

I have a requirement where I need to write a stream to a Hudi dataset using structured streaming. I found there is a provision to do this in the Apache Hudi Jira issues, but I wanted to know if anyone has successfully implemented this and has an example. I am trying to stream data from AWS Kinesis Firehose to Apache Hudi using Spark structured streaming.

            Quick help is appreciated.

            ...

            ANSWER

            Answered 2019-Aug-27 at 13:09
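A minimal sketch of a structured-streaming write to Hudi, assuming the hudi-spark-bundle is on the classpath (every option, name, and path below is illustrative, not from the original answer):

    # stream_df is a streaming DataFrame, e.g. obtained from spark.readStream.
    hudi_options = {
        "hoodie.table.name": "my_hudi_table",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.operation": "upsert",
    }

    (stream_df.writeStream
        .format("hudi")
        .options(**hudi_options)
        .option("checkpointLocation", "s3://my-bucket/checkpoints/hudi")
        .outputMode("append")
        .start("s3://my-bucket/my_hudi_table"))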

            QUESTION

            Delta Lake : How does upsert internally work?
            Asked 2019-Dec-27 at 23:55

In our data pipelines, we ingest CDC events from data sources and write these changes into an "incremental data" folder in AVRO format.

            Then periodically, we run Spark jobs to merge this "incremental data" with our current version of the "snapshot table" (ORC format) to get the latest version of the upstream snapshot.

During this merge logic:

1) We load the "incremental data" as a DataFrame df1.

2) We load the current "snapshot table" as a DataFrame df2.

3) We merge df1 and df2, de-duplicating ids and taking the latest version of each row (using the update_timestamp column).
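For reference, step 3 can be sketched with a window function (column names as described above):

    from pyspark.sql import functions as F, Window

    # Union the incremental data with the snapshot, then keep only the
    # latest version of each id by update_timestamp.
    merged = df1.unionByName(df2)
    w = Window.partitionBy("id").orderBy(F.col("update_timestamp").desc())
    latest = (merged
        .withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn"))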

This logic loads the entire data for both the "incremental data" and the current "snapshot table" into Spark memory, which can be quite large depending on the database.

I noticed that in Delta Lake, a similar operation is done using the following code:

            ...

            ANSWER

            Answered 2019-Dec-27 at 23:55
   1) How does merge/upsert work internally? Does it load the entire "updatedDF" and "/data/events/" into Spark memory?

            Source https://stackoverflow.com/questions/59476892

            QUESTION

            ng-container => blank page bug in Ionic
            Asked 2018-Nov-28 at 19:26

            In order to have a clean and DRY ionic project, I just wanted my navbar code to be written in one place, instead of writing the whole HTML in every Ionic page.

For this purpose, I created an Angular component (not an Ionic page) named navbar, and I inject it into my pages. To keep a clean layout with no additional stuff in the DOM, I created a component with the bracket notation, like this:

            ...

            ANSWER

            Answered 2018-Nov-28 at 19:26

            You can apply the navbar attribute to the ion-header element:

            Source https://stackoverflow.com/questions/53524482

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

No vulnerabilities have been publicly reported; however, code analysis flags 1 unresolved vulnerability (see the Security section above).

            Install hudi

The default Scala version supported is 2.11. To build for Scala 2.12, build using the scala-2.12 profile.
The default Spark version supported is 2.4.4. To build for a different Spark 3 version, use the corresponding profile.
The default hudi jar bundles the spark-avro module. To build without the spark-avro module, build using the spark-shade-unbundle-avro profile.
Please visit https://hudi.apache.org/docs/quick-start-guide.html to quickly explore Hudi's capabilities using spark-shell.

            Support

For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.