parquet-mr | Parquet-MR contains the Java implementation of the Parquet format
kandi X-RAY | parquet-mr Summary
Apache Parquet
Top functions reviewed by kandi - BETA
- Binds the data column.
- Converts a primitive type to its corresponding Java type.
- Create a conversion container for the given conversion.
- Infer schema from input stream.
- Converts the specified JsonNode to an avro representation.
- Sets the file crypto metadata.
- Adds the row groups to the parquet.
- Set the glob pattern.
- Merge two schemas.
- Gets converter.
parquet-mr Key Features
parquet-mr Examples and Code Snippets
Community Discussions
Trending Discussions on parquet-mr
QUESTION
I have a spark job that writes data to parquet files with snappy compression. One of the columns in parquet is a repeated INT64.
When upgrading from spark 2.2 with parquet 1.8.2 to spark 3.1.1 with parquet 1.10.1, I witnessed a severe degradation in compression ratio.
For this file for example (saved with spark 2.2) I have the following metadata:
...ANSWER
Answered 2021-May-09 at 08:32
So as updated above, snappy-java 1.1.2.6 resolved my issue. Any version higher than this results in degraded compression. I also tried the purejava flag, but this results in an exception when reading the parquet. Will open tickets for snappy-java.
QUESTION
I'm using Apache Parquet Hadoop - ParquetRecordWriter with MapReduce and hit ParquetEncodingException: writing empty page. Although I found that this happens in ColumnWriterBase when the valueCount is 0, I don't understand the real reason why this property is 0, what it has to do with encoding, and how such a state can happen. Any idea? Thanks for any tip.
ANSWER
Answered 2020-May-19 at 08:39
I think the problem is with the start/end calls. One issue is that startMessage() and endMessage() are invoked twice, once in write(MyData) and again in writeData(MyData).
I would suggest using the ValidatingRecordConsumer as a wrapper for the recordConsumer you use. This way you may get more meaningful exceptions if something is wrong with the record serialization.
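If it helps to see the wiring, here is a minimal sketch, assuming your write path already has a RecordConsumer and a MessageType in scope; ValidatingRecordConsumer comes from parquet-mr's org.apache.parquet.io package, while the wrap() helper and names are illustrative only.

```java
import org.apache.parquet.io.RecordConsumer;
import org.apache.parquet.io.ValidatingRecordConsumer;
import org.apache.parquet.schema.MessageType;

public class ValidatingConsumerExample {
    // Wrap the consumer you already write to; every startMessage/endMessage,
    // field and value call is then checked against the schema, so mismatched
    // start/end calls fail fast with a descriptive exception instead of
    // surfacing later as "writing empty page".
    static RecordConsumer wrap(RecordConsumer recordConsumer, MessageType schema) {
        return new ValidatingRecordConsumer(recordConsumer, schema);
    }
}
```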
QUESTION
I'm building a spring-boot powered service that writes data to Hadoop using the filesystem API. Some data is written to parquet files and large blocks are cached in memory, so when the service is shut down, potentially several hundred MB of data have to be written to Hadoop.
FileSystem closes automatically by default, so when the service is shut down, sometimes the FileSystem gets closed before all the writers are closed, resulting in corrupted parquet files.
There is an fs.automatic.close flag in the filesystem Configuration, but the FileSystem instance is used from multiple threads and I don't know any clean way to wait for them all to finish before closing the FileSystem manually. I tried using a dedicated filesystem-closing bean implementing Spring SmartLifeCycle with the maximum phase so it is destroyed last, but actually it is not destroyed last but notified of shutdown last while other beans are still in the process of shutting down.
Ideally every object that needs a FileSystem would get one and would be responsible for closing it. The problem is that FileSystem.get(conf) returns a cached instance. There is FileSystem.newInstance(conf), but it is not clear what the consequences of using multiple FileSystem instances are performance-wise. There is another issue with that: there is no way to pass a FileSystem instance to ParquetWriter - it gets one using path.getFileSystem(conf). One would think that line would return a FileSystem instance assigned to that file only, but one would be wrong - most likely the same cached instance would be returned, so closing it would be wrong.
Is there a recommended way of managing the lifecycle of a FileSystem? What would happen if a FileSystem is created with fs.automatic.close set to true and never closed manually? Maybe spring-boot supports a clean way to close the FileSystem after all other beans are actually destroyed (not being destroyed)?
Thanks!
...ANSWER
Answered 2020-May-08 at 21:33
You can disable the FileSystem cache using the fs.<scheme>.impl.disable.cache configuration (found here, some discussion here), where <scheme> in your case would be hdfs (assuming you are using HDFS). This will force ParquetWriter to create a new FileSystem instance when it calls path.getFileSystem(conf). This configuration is undocumented for good reason: while widely used in unit tests within Hadoop itself, it can be very dangerous to use in a production system. To answer your question regarding performance, assuming you are using HDFS, each FileSystem instance will create a separate TCP connection to the HDFS NameNode. Application and library code is typically written with the assumption that calls like path.getFileSystem(conf) and FileSystem.get(conf) are cheap and lightweight, so they are used frequently. In a production system, I have seen a client system DDoS a NameNode server because it disabled caching. You need to carefully manage the lifecycle of not just the FileSystem instances your code creates, but also those created by the libraries you use. I would generally recommend against it.
It sounds like the issue is really coming from a bad interaction between the JVM shutdown hooks used by Spring and those employed by Hadoop, which is the mechanism used to automatically close FileSystem instances. Hadoop includes its own ShutdownHookManager which is used to sequence events during shutdown; FileSystem shutdown is purposefully placed at the end so that other shutdown hooks (e.g. cleaning up after a MapReduce task) can be completed first. But Hadoop's ShutdownHookManager is only aware of shutdown tasks that have been registered with it, so it will not be aware of Spring's lifecycle management. It does sound like leveraging Spring's shutdown sequences together with fs.automatic.close=false may be the right fit for your application; I don't have Spring experience so I can't help you in that regard. You may also be able to register Spring's entire shutdown sequence with Hadoop's ShutdownHookManager, using a very high priority to ensure that Spring's shutdown sequence is first in the shutdown queue.
To answer this portion specifically:
Is there a recommended way of managing a lifecycle of a FileSystem?
The recommended way is generally to not manage it, and let the system do it for you. There be dragons whenever you try to manage it yourself, so proceed with caution.
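For orientation, here is a minimal sketch of the two options discussed above, assuming HDFS; the property key fs.hdfs.impl.disable.cache and the FileSystem.newInstance(conf) call are standard Hadoop, but the surrounding class and paths are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemLifecycleSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Option 1: disable the cache for hdfs:// URIs so that every
        // path.getFileSystem(conf) call returns a fresh instance.
        // Use with caution, per the warning above.
        conf.setBoolean("fs.hdfs.impl.disable.cache", true);

        // Option 2: keep the cache enabled but obtain a private, uncached
        // instance that this component alone is responsible for closing.
        try (FileSystem privateFs = FileSystem.newInstance(conf)) {
            privateFs.mkdirs(new Path("/tmp/example")); // illustrative path
        } // closing privateFs does not touch the shared cached instance
    }
}
```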
QUESTION
I'm trying to read a local Parquet file, however the only APIs I can find are tightly coupled with Hadoop, and require a Hadoop Path as input (even for pointing to a local file).
ANSWER
Answered 2020-Feb-04 at 08:50
You can use the ParquetFileReader class for that.
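As a hedged sketch of what that might look like (the path and the printed fields are illustrative; ParquetFileReader, HadoopInputFile and the footer accessors are parquet-mr classes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class LocalParquetRead {
    public static void main(String[] args) throws Exception {
        // A local file still goes through the Hadoop Path abstraction,
        // but the file:// scheme keeps everything on the local filesystem.
        Path path = new Path("file:///tmp/data.parquet"); // illustrative path
        Configuration conf = new Configuration();

        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            System.out.println("Schema: " + schema);
            System.out.println("Rows:   " + reader.getRecordCount());
        }
    }
}
```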
QUESTION
Somewhat recently, the parquet-format project added a UUID logical type. Specifically, this was added in revision 2.4 of the parquet format. I'm interested in using the parquet-mr library in Java to create some parquet files, but I can't seem to figure out how to use the UUID logical type in a parquet schema. A simple schema like this does not seem to work as I would hope:
...ANSWER
Answered 2020-Mar-25 at 20:03
The parquet-mr library currently doesn't support the UUID logical type. There's an issue to track progress in implementing this feature here.
QUESTION
I am running a spark job to write to parquet. I want to enable dictionary encoding for the files written. When I check the files, I see they are 'plain dictionary'. However, I do not see any stats for these columns.
Let me know if I am missing anything.
...ANSWER
Answered 2020-Mar-27 at 23:18
Got the answer. The parquet-tools version I was using was 1.6. Upgrading to 1.10 solved the issue.
QUESTION
I have a spark data-frame with a small number of fields. Some of the fields are huge binary blobs. The size of the entire row is approx 50 MB.
I am saving the data frame in parquet format. I am controlling the size of the row group using the parquet.block.size parameter.
Spark will generate a parquet file, however I will always get at least 100 rows in a row group. This is a problem for me since chunk sizes could become gigabytes, which does not work well with my application.
parquet.block.size works as expected as long as the size is big enough to accommodate more than 100 rows.
I modified InternalParquetRecordWriter.java to set MINIMUM_RECORD_COUNT_FOR_CHECK = 2, which fixed the issue; however, there is no configuration value I can find that would support tuning this hardcoded constant.
Is there a different/better way to get row-group sizes smaller than 100 rows?
This is a snippet of my code:
...ANSWER
Answered 2018-Jan-10 at 02:33
Unfortunately I haven't found a way to do so. I reported this issue to remove the hard-coded values and make them configurable. I have a patch for it if you're interested.
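For context, a minimal sketch of where parquet.block.size is typically set when writing from Spark; the 16 MB value and the paths are illustrative, and as noted above the writer still buffers at least 100 records before it checks the size, so very large rows can overshoot the limit.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SmallRowGroups {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("small-row-groups")
                .master("local[*]") // local mode just for this sketch
                .getOrCreate();

        // Cap the target row-group size in bytes via the Hadoop configuration
        // that Spark hands to the Parquet writer.
        spark.sparkContext().hadoopConfiguration()
                .setLong("parquet.block.size", 16L * 1024 * 1024); // 16 MB, illustrative

        Dataset<Row> df = spark.read().parquet("/tmp/input");  // illustrative input
        df.write().parquet("/tmp/output");                     // illustrative output
    }
}
```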
QUESTION
I'm generating Parquet files via two methods: a Kinesis Firehose and a Spark job. They are both written into the same partition structure on S3. Both sets of data can be queried using the same Athena table definition. Both use gzip compression.
I'm noticing, however, that the Parquet files generated by Spark are about 3x as large as those from Firehose. Any reason this should be the case? I do notice some schema and metadata differences when I load them using Pyarrow:
...ANSWER
Answered 2019-Jul-05 at 01:57
Two things that I can think of that could contribute to the difference.
1. Parquet properties.
In Spark, you could find all the properties related to Parquet using the following snippets.
If properties were set using Hadoop configs,
QUESTION
I'm trying to run a parquet-tools command to only view the file schema of my parquet file.
I am currently running:
...ANSWER
Answered 2019-Jun-06 at 13:13
Try
QUESTION
The parquet file was generated by Spark v2.4, parquet-mr v1.10.
...ANSWER
Answered 2019-Mar-18 at 17:12
The offset of the first data page is always larger than the offset of the dictionary. In other words, the dictionary comes first and only then the data pages. There are two metadata fields meant to store these offsets: dictionary_page_offset (aka DO) and data_page_offset (aka FPO).
Unfortunately, these metadata fields are not filled in correctly by parquet-mr.
For example, if the dictionary starts at offset 1000 and the first data page starts at offset 2000, then the correct values would be:
dictionary_page_offset = 1000
data_page_offset = 2000
Instead, parquet-mr stores:
dictionary_page_offset = 0
data_page_offset = 1000
Applied to your example, this means that in spite of parquet-tools showing DO: 0, columns x and y are dictionary encoded nonetheless (column z is not).
It is worth mentioning that Impala follows the specification correctly, so you cannot rely on every file having this deficiency.
This is how parquet-mr handles this situation during reading:
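In the spirit of that reading-side handling, a hedged sketch (not the verbatim parquet-mr source): a reader effectively has to treat the smaller of the two offsets as the real start of the column chunk.

```java
public final class ColumnChunkStart {
    /**
     * Illustrative only: choose the byte offset where a column chunk really
     * begins, tolerating files where dictionary_page_offset (DO) was written
     * as 0 even though a dictionary page is present.
     */
    static long startingOffset(long dictionaryPageOffset, long firstDataPageOffset) {
        if (dictionaryPageOffset > 0 && dictionaryPageOffset < firstDataPageOffset) {
            // Correctly written file: the dictionary really starts here.
            return dictionaryPageOffset;
        }
        // DO == 0 (or inconsistent): fall back to data_page_offset, which in
        // files written by parquet-mr actually points at the dictionary page.
        return firstDataPageOffset;
    }
}
```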
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install parquet-mr
You can use parquet-mr like any standard Java library. Please include the jar files in your classpath. You can also use any IDE, and you can run and debug the parquet-mr component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.
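As a quick orientation, here is a minimal sketch of writing a file with parquet-mr's bundled example Group API from the parquet-hadoop module; the schema, output path and values are illustrative only.

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        // Illustrative schema and output location.
        MessageType schema = MessageTypeParser.parseMessageType(
                "message example { required int32 id; required binary name (UTF8); }");
        Path out = new Path("file:///tmp/example.parquet");

        SimpleGroupFactory groups = new SimpleGroupFactory(schema);
        try (ParquetWriter<Group> writer =
                 ExampleParquetWriter.builder(out).withType(schema).build()) {
            writer.write(groups.newGroup().append("id", 1).append("name", "alice"));
        }
    }
}
```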