parquet-format | Apache Parquet

 by apache | Java | Version: apache-parquet-format-2.9.0 | License: Apache-2.0

kandi X-RAY | parquet-format Summary

parquet-format is a Java library typically used in Big Data, Spark, and Hadoop applications. parquet-format has no bugs, no vulnerabilities, a build file available, a Permissive License, and medium support. You can download it from GitHub or Maven.

Apache Parquet

            kandi-support Support

              parquet-format has a medium active ecosystem.
              It has 1357 stars, 402 forks, and 62 watchers.
              It had no major release in the last 6 months.
              parquet-format has no issues reported. There are 15 open pull requests and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of parquet-format is apache-parquet-format-2.9.0.

            kandi-Quality Quality

              parquet-format has no bugs reported.

            kandi-Security Security

              parquet-format has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            kandi-License License

              parquet-format is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              parquet-format releases are not available. You will need to build from source code and install.
              A deployable package is available in Maven.
              A build file is available. You can build the component from source.
              Installation instructions are not available. Examples and code snippets are available.

            Top functions reviewed by kandi - BETA

            kandi has reviewed parquet-format and discovered the functions below as its top functions. This is intended to give you an instant insight into the functionality parquet-format implements, and to help you decide if it suits your requirements.
            • Consume a struct
            • Create a new object
            • Read a key-value pair
            • Read data from the stream
            • Write a 64-bit integer
            • Write a set end
            • Write a string
            • Write a message end
            • Write a field end
            • Write a field stop
            • Write a set begin
            • Write a 32-bit integer
            • Write a double
            • Write a 16-bit integer
            • Write a list begin
            • Write a list end
            • Read the contents of a set
            • Read the contents of a list and pass them to the event consumer
            • Read the contents of a map and pass them to the event consumer
            • Write a message begin
            • Write a map end
            • Write a map begin

            parquet-format Key Features

            No Key Features are available at this moment for parquet-format.

            parquet-format Examples and Code Snippets

            No Code Snippets are available at this moment for parquet-format.

            Community Discussions

            QUESTION

            Reading a zst archive in Scala & Spark: native zStandard library not available
            Asked 2021-Apr-18 at 21:25

            I'm trying to read a zst-compressed file using Spark on Scala.

            ...

            ANSWER

            Answered 2021-Apr-18 at 21:25

             Since I didn't want to build Hadoop myself, I took inspiration from the workaround used here and configured Spark to use the Hadoop native libraries:
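
             The answer's configuration snippet is not reproduced in this excerpt. A hedged sketch of the same idea in PySpark (the question uses Scala, but the configuration keys are the same) might look like the following, assuming the native libraries live in a hypothetical /opt/hadoop/lib/native directory:

             # A hedged sketch of the workaround described above: point Spark at the
             # Hadoop native libraries so the zstd codec can be found at runtime.
             # NATIVE_LIB_DIR is an assumption -- substitute the directory reported
             # by `hadoop checknative` on your own installation.
             from pyspark.sql import SparkSession

             NATIVE_LIB_DIR = "/opt/hadoop/lib/native"  # hypothetical location

             spark = (
                 SparkSession.builder
                 .appName("read-zst")
                 .config("spark.driver.extraLibraryPath", NATIVE_LIB_DIR)
                 .config("spark.executor.extraLibraryPath", NATIVE_LIB_DIR)
                 .getOrCreate()
             )

             # With the native codec on the library path, a .zst file can be read directly.
             df = spark.read.text("data/example.zst")
             df.show(5, truncate=False)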

            Source https://stackoverflow.com/questions/67099204

            QUESTION

            Is there a Parquet equivalent for Python?
            Asked 2020-Dec-18 at 12:01

            I just discovered Parquet and it met my "big" data processing / (local) storage needs:

            • faster than relational databases, which are designed to run over the network (creating overhead) and just aren't as fast as a solution designed for local storage
            • compared to JSON or CSV: good for storing data efficiently into types (instead of everything being a string) and can read specific chunks from the file more dynamically than JSON or CSV

            But to my dismay while Node.js has a fully functioning library for it, the only Parquet lib for Python seems to be quite literally a half-measure:

            parquet-python is a pure-python implementation (currently with only read-support) of the parquet format ... Not all parts of the parquet-format have been implemented yet or tested e.g. nested data

            So what gives? Is there something better than Parquet already supported by Python that lowers interest in developing a library to support it? Is there some close alternative?

            ...

            ANSWER

            Answered 2020-Dec-18 at 12:01

             Actually, you can read and write Parquet with pandas, which is commonly used for data jobs (though not for ETL on big data). For handling Parquet, pandas uses two common packages:

             pyarrow is a cross-platform tool providing an in-memory columnar format. Since Parquet is also a columnar format, pyarrow supports it, although pyarrow handles a variety of formats and is a broader library.

             fastparquet is designed to focus solely on the Parquet format, for use in Python-based big data workflows.
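
             A minimal sketch of what this looks like in practice (file name and columns are hypothetical); the engine argument selects between the two packages described above:

             # Read and write Parquet from pandas; engine can be "pyarrow" or "fastparquet".
             import pandas as pd

             df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

             # Write with an explicit engine choice (assumes pyarrow is installed).
             df.to_parquet("example.parquet", engine="pyarrow")

             # Read back only the columns you need -- one advantage over CSV/JSON.
             subset = pd.read_parquet("example.parquet", columns=["name"], engine="pyarrow")
             print(subset)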

            Source https://stackoverflow.com/questions/65356595

            QUESTION

            How to convert a float to a Parquet TIMESTAMP Logical Type?
            Asked 2020-Sep-23 at 14:48

             Let's say I have a pyarrow table with a column Timestamp containing float64 values. These floats are actually timestamps expressed in seconds. For instance:

            ...

            ANSWER

            Answered 2020-Sep-23 at 09:20

             I don't think you'll be able to convert from floats to timestamps within Arrow.

             Arrow assumes timestamps are 64-bit integers of a given precision (ms, us, ns). In your case you have to multiply your float seconds by the precision you want (1000 for ms), then convert to int64 and cast to timestamp.

            Here's an example using pandas:
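
             The answer's original snippet is not reproduced in this excerpt; a hedged sketch of the described conversion, using pandas together with pyarrow (column name and values are made up), might look like this:

             # Scale float seconds to milliseconds, cast to int64, then cast the Arrow
             # column to timestamp[ms], as described in the answer above.
             import pandas as pd
             import pyarrow as pa

             df = pd.DataFrame({"Timestamp": [1600000000.123, 1600000001.456]})  # seconds as float64
             df["Timestamp"] = (df["Timestamp"] * 1000).astype("int64")          # milliseconds as int64

             table = pa.Table.from_pandas(df)
             idx = table.schema.get_field_index("Timestamp")
             table = table.set_column(idx, "Timestamp",
                                      table.column("Timestamp").cast(pa.timestamp("ms")))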

            Source https://stackoverflow.com/questions/63991411

            QUESTION

            Apache Avro - Internal Representation
            Asked 2020-Jun-13 at 12:14

             I am in the process of learning Apache Avro and I would like to know how it is represented internally. If I were to describe Apache Parquet for the same question, I could say each Parquet file is composed of row groups, each row group contains column chunks, and column chunks have multiple pages with different encodings. Finally, the metadata about all of these is stored in the file footer. This file representation is clearly documented on the GitHub page as well as on its official Apache page.

             To find the same internal representation for Apache Avro I looked into multiple pages like the GitHub page, Apache Avro's home page, the book Hadoop: The Definitive Guide, and many more tutorials online, but I was not able to find what I was looking for. I understand Apache Avro is a row-oriented file format and each file carries the schema along with the data. All of that is fine, but I wanted to know how the data is further broken down for internal organization, perhaps like pages for RDBMS tables.

            Any pointers related to this will be highly appreciated.

            ...

            ANSWER

            Answered 2020-Jun-13 at 12:14

            The Avro container file format is specified in their documentation here. If you're into the whole brevity thing, then Wikipedia has a more pithy description:

            An Avro Object Container File consists of:

            • A file header, followed by
            • one or more file data blocks.

            A file header consists of:

            • Four bytes, ASCII 'O', 'b', 'j', followed by the Avro version number which is 1 (0x01) (Binary values 0x4F 0x62 0x6A 0x01).
            • File metadata, including the schema definition.
            • The 16-byte, randomly-generated sync marker for this file.

            For data blocks Avro specifies two serialization encodings, binary and JSON. Most applications will use the binary encoding, as it is smaller and faster. For debugging and web-based applications, the JSON encoding may sometimes be appropriate.

            You can verify this against their reference implementation, e.g. in DataFileWriter.java - start with the main create method and then look at the append(D datum) method.

            The binary object encoding is described in their documentation here. The encoded data is simply a traversal of the encoded object (or objects), with each object and field encoded as described in the documentation.
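
             As a small illustration (not part of the answer), the header layout quoted above can be checked in a few lines of Python; the file name is a placeholder:

             # An Avro object container file begins with the four bytes
             # 'O', 'b', 'j', 0x01 (0x4F 0x62 0x6A 0x01).
             with open("example.avro", "rb") as f:  # placeholder path
                 magic = f.read(4)

             if magic == b"Obj\x01":
                 print("looks like an Avro object container file")
             else:
                 print("not an Avro object container file")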

            Source https://stackoverflow.com/questions/62268076

            QUESTION

            How to use the Parquet UUID Logical Type in a schema
            Asked 2020-Apr-14 at 20:21

            Somewhat recently, the parquet-format project added a UUID logical type. Specifically, this was added in revision 2.4 of the parquet format. I'm interested in using the parquet-mr library in Java to create some parquet files, but I can't seem to figure out how to use the UUID logical type in a parquet schema. A simple schema like this does not seem to work as I would hope:

            ...

            ANSWER

            Answered 2020-Mar-25 at 20:03

            The parquet-mr library currently doesn't support the UUID logical type. There's an issue to track progress in implementing this feature here.

            Source https://stackoverflow.com/questions/60673164

            QUESTION

            NoSuchMethodError: com.fasterxml.jackson.datatype.jsr310.deser.JSR310DateTimeDeserializerBase.findFormatOverrides on Databricks
            Asked 2020-Feb-19 at 08:46

             I'm working on a rather big project. I need to use azure-security-keyvault-secrets, so I added the following to my pom.xml file:

            ...

            ANSWER

            Answered 2019-Dec-27 at 18:36

             So I managed to fix the problem with the maven-shade-plugin. I added the following piece of code to my pom.xml file:

            Source https://stackoverflow.com/questions/59498535

            QUESTION

            Read/Write Parquet with Struct column type
            Asked 2020-Feb-18 at 08:59

            I am trying to write a Dataframe like this to Parquet:

            ...

            ANSWER

            Answered 2020-Feb-17 at 04:37

            You have at least 3 options here:

            Option 1:

            You don't need to use any extra libraries like fastparquet since Spark provides that functionality already:
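
             The answer's own snippet is not reproduced in this excerpt; a hedged PySpark sketch of "Option 1" (column names are hypothetical) could look like this:

             # Spark writes struct columns to Parquet natively, so no extra library
             # such as fastparquet is needed.
             from pyspark.sql import SparkSession, Row

             spark = SparkSession.builder.appName("struct-to-parquet").getOrCreate()

             df = spark.createDataFrame([
                 Row(id=1, address=Row(city="Berlin", zip="10115")),
                 Row(id=2, address=Row(city="Paris", zip="75001")),
             ])

             df.printSchema()                                      # address is a struct<city, zip>
             df.write.mode("overwrite").parquet("people.parquet")
             spark.read.parquet("people.parquet").show(truncate=False)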

            Source https://stackoverflow.com/questions/60227123

            QUESTION

            How to fix 'ClassCastException: cannot assign instance of' - Works local but not in standalone on cluster
            Asked 2019-Dec-04 at 16:49

             I have a Spring web application (built with Maven) with which I connect to my Spark cluster (4 workers and 1 master) and to my Cassandra cluster (4 nodes). The application starts, the workers communicate with the master, and the Cassandra cluster is also running. However, when I run a PCA (Spark MLlib) or any other calculation (clustering, Pearson, Spearman) through the interface of my web app, I get the following error:

            java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

            which appears on this command:

            ...

            ANSWER

            Answered 2019-Oct-29 at 03:20

             Try replacing logback with log4j (remove the logback dependency); at least, that helped in our similar case.

            Source https://stackoverflow.com/questions/57412125

            QUESTION

            How to handle null values when writing to parquet from Spark
            Asked 2019-Oct-28 at 09:42

            Until recently parquet did not support null values - a questionable premise. In fact a recent version did finally add that support:

            https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

            However it will be a long time before spark supports that new parquet feature - if ever. Here is the associated (closed - will not fix) JIRA:

            https://issues.apache.org/jira/browse/SPARK-10943

             So what are folks doing with regard to null column values today when writing out dataframes to Parquet? I can only think of very ugly, horrible hacks like writing empty strings and .. well .. I have no idea what to do with numerical values to indicate null - short of putting in some sentinel value and having my code check for it (which is inconvenient and bug-prone).

            ...

            ANSWER

            Answered 2018-May-03 at 17:39

            You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns.

             The problem is that null alone carries no type information at all.
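
             A hedged PySpark illustration of that point (names are made up): a bare null has no type, but a null cast to a concrete type is written to Parquet as an ordinary nullable column, with no sentinel values or empty strings required.

             from pyspark.sql import SparkSession
             from pyspark.sql.functions import lit

             spark = SparkSession.builder.appName("typed-nulls").getOrCreate()

             # Cast the null to a concrete type so the Parquet schema gets a real column type.
             df = spark.range(3).withColumn("maybe_value", lit(None).cast("double"))
             df.printSchema()                                       # maybe_value: double (nullable = true)
             df.write.mode("overwrite").parquet("nullable.parquet")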

            Source https://stackoverflow.com/questions/50160682

            QUESTION

            Exception in thread "main" java.lang.NoClassDefFoundError: scala/Cloneable
            Asked 2019-Jul-20 at 20:00

             I have a project where I am using Spark with Scala. The code does not give any compilation issues, but when I run it I get the exception below:

            ...

            ANSWER

            Answered 2019-Jul-20 at 20:00

             You are using Scala version 2.13; however, Apache Spark has not yet been compiled for 2.13. Try changing your build.sbt to the following:

            Source https://stackoverflow.com/questions/57127875

             Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install parquet-format

            You can download it from GitHub, Maven.
             You can use parquet-format like any standard Java library. Please include the jar files in your classpath. You can also use any IDE to run and debug the parquet-format component as you would with any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.

            Support

            Comment on the issue and/or contact the parquet-dev mailing list with your questions and ideas. Changes to this core format definition are proposed and discussed in depth on the mailing list. You may also be interested in contributing to the Parquet-MR subproject, which contains all the Java-side implementation and APIs. See the "How To Contribute" section of the Parquet-MR project.
            Find more information at:

            CLONE
          • HTTPS

            https://github.com/apache/parquet-format.git

          • CLI

            gh repo clone apache/parquet-format

          • sshUrl

            git@github.com:apache/parquet-format.git
