parquet-mr | Parquet-MR contains the Java implementation of the Parquet format
kandi X-RAY | parquet-mr Summary
Apache Parquet
Top functions reviewed by kandi - BETA
- Binds the data column.
- Converts a primitive type to its corresponding Java type.
- Create a conversion container for the given conversion.
- Infer schema from input stream.
- Converts the specified JsonNode to an avro representation.
- Sets the file crypto metadata.
- Adds the row groups to the parquet.
- Set the glob pattern.
- Merge two schemas.
- Gets converter.
parquet-mr Key Features
parquet-mr Examples and Code Snippets
Community Discussions
Trending Discussions on parquet-mr
QUESTION
I have a spark job that writes data to parquet files with snappy compression. One of the columns in parquet is a repeated INT64.
When upgrading from spark 2.2 with parquet 1.8.2 to spark 3.1.1 with parquet 1.10.1, I witnessed a severe degradation in compression ratio.
For this file for example (saved with spark 2.2) I have the following metadata:
...ANSWER
Answered 2021-May-09 at 08:32
So as updated above, snappy-java 1.1.2.6 resolved my issue. Any version higher than this results in degraded compression. I also tried the purejava flag, but this results in an exception when reading the parquet. Will open tickets for snappy-java.
QUESTION
I'm using Apache Parquet Hadoop - ParquetRecordWriter with MapReduce and hit ParquetEncodingException: writing empty page. Although I found that this happens in ColumnWriterBase when the valueCount is 0, I don't understand the real reason why this property is 0, what it has to do with encoding, and how such a state can happen. Any idea? Thanks for any tip.
ANSWER
Answered 2020-May-19 at 08:39
I think the problem is with the start/end calls. One issue is that startMessage() and endMessage() are invoked twice, once in write(MyData) and again in writeData(MyData).
I would suggest using the ValidatingRecordConsumer as a wrapper for the recordConsumer you use. This way you may get more meaningful exceptions if something is wrong with the record serialization.
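If it helps to see the wiring, here is a minimal sketch, assuming your write path already has a RecordConsumer and a MessageType in scope; ValidatingRecordConsumer comes from parquet-mr's org.apache.parquet.io package, while the wrap() helper and names are illustrative only.

```java
import org.apache.parquet.io.RecordConsumer;
import org.apache.parquet.io.ValidatingRecordConsumer;
import org.apache.parquet.schema.MessageType;

public class ValidatingConsumerExample {
    // Wrap the consumer you already write to; every startMessage/endMessage,
    // field and value call is then checked against the schema, so mismatched
    // start/end calls fail fast with a descriptive exception instead of
    // surfacing later as "writing empty page".
    static RecordConsumer wrap(RecordConsumer recordConsumer, MessageType schema) {
        return new ValidatingRecordConsumer(recordConsumer, schema);
    }
}
```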
QUESTION
I'm building a spring-boot powered service that writes data to Hadoop using the filesystem API. Some data is written to parquet files and large blocks are cached in memory, so when the service is shut down, potentially several hundred MB of data have to be written to Hadoop.
FileSystem closes automatically by default, so when the service is shut down, sometimes the FileSystem gets closed before all the writers are closed, resulting in corrupted parquet files.
There is an fs.automatic.close flag in the filesystem Configuration, but the FileSystem instance is used from multiple threads and I don't know any clean way to wait for them all to finish before closing the FileSystem manually. I tried using a dedicated filesystem-closing bean implementing Spring SmartLifeCycle with the maximum phase so it is destroyed last, but actually it is not destroyed last but notified of shutdown last while other beans are still in the process of shutting down.
Ideally every object that needs a FileSystem would get one and would be responsible for closing it. The problem is that FileSystem.get(conf) returns a cached instance. There is FileSystem.newInstance(conf), but it is not clear what the consequences of using multiple FileSystem instances are performance-wise. There is another issue with that: there is no way to pass a FileSystem instance to ParquetWriter - it gets one using path.getFileSystem(conf). One would think that line would return a FileSystem instance assigned to that file only, but one would be wrong - most likely the same cached instance would be returned, so closing it would be wrong.
Is there a recommended way of managing the lifecycle of a FileSystem? What would happen if a FileSystem is created with fs.automatic.close set to true and never closed manually? Maybe spring-boot supports a clean way to close the FileSystem after all other beans are actually destroyed (not being destroyed)?
Thanks!
...ANSWER
Answered 2020-May-08 at 21:33
You can disable the FileSystem cache using the fs.<scheme>.impl.disable.cache configuration (found here, some discussion here), where <scheme> in your case would be hdfs (assuming you are using HDFS). This will force ParquetWriter to create a new FileSystem instance when it calls path.getFileSystem(conf). This configuration is undocumented for good reason: while widely used in unit tests within Hadoop itself, it can be very dangerous to use in a production system. To answer your question regarding performance, assuming you are using HDFS, each FileSystem instance will create a separate TCP connection to the HDFS NameNode. Application and library code is typically written with the assumption that calls like path.getFileSystem(conf) and FileSystem.get(conf) are cheap and lightweight, so they are used frequently. In a production system, I have seen a client system DDoS a NameNode server because it disabled caching. You need to carefully manage the lifecycle of not just the FileSystem instances your code creates, but also those created by the libraries you use. I would generally recommend against it.
It sounds like the issue is really coming from a bad interaction between the JVM shutdown hooks used by Spring and those employed by Hadoop, which is the mechanism used to automatically close FileSystem instances. Hadoop includes its own ShutdownHookManager which is used to sequence events during shutdown; FileSystem shutdown is purposefully placed at the end so that other shutdown hooks (e.g. cleaning up after a MapReduce task) can be completed first. But Hadoop's ShutdownHookManager is only aware of shutdown tasks that have been registered with it, so it will not be aware of Spring's lifecycle management. It does sound like leveraging Spring's shutdown sequences together with fs.automatic.close=false may be the right fit for your application; I don't have Spring experience so I can't help you in that regard. You may also be able to register Spring's entire shutdown sequence with Hadoop's ShutdownHookManager, using a very high priority to ensure that Spring's shutdown sequence is first in the shutdown queue.
To answer this portion specifically:
Is there a recommended way of managing a lifecycle of a FileSystem?
The recommended way is generally to not manage it, and let the system do it for you. There be dragons whenever you try to manage it yourself, so proceed with caution.
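For orientation, here is a minimal sketch of the two options discussed above, assuming HDFS; the property key fs.hdfs.impl.disable.cache and the FileSystem.newInstance(conf) call are standard Hadoop, but the surrounding class and paths are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemLifecycleSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Option 1: disable the cache for hdfs:// URIs so that every
        // path.getFileSystem(conf) call returns a fresh instance.
        // Use with caution, per the warning above.
        conf.setBoolean("fs.hdfs.impl.disable.cache", true);

        // Option 2: keep the cache enabled but obtain a private, uncached
        // instance that this component alone is responsible for closing.
        try (FileSystem privateFs = FileSystem.newInstance(conf)) {
            privateFs.mkdirs(new Path("/tmp/example")); // illustrative path
        } // closing privateFs does not touch the shared cached instance
    }
}
```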
QUESTION
I'm trying to read a local Parquet file, however the only APIs I can find are tightly coupled with Hadoop, and require a Hadoop Path as input (even for pointing to a local file).
ANSWER
Answered 2020-Feb-04 at 08:50
You can use the ParquetFileReader class for that.
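As a hedged sketch of what that might look like (the path and the printed fields are illustrative; ParquetFileReader, HadoopInputFile and the footer accessors are parquet-mr classes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class LocalParquetRead {
    public static void main(String[] args) throws Exception {
        // A local file still goes through the Hadoop Path abstraction,
        // but the file:// scheme keeps everything on the local filesystem.
        Path path = new Path("file:///tmp/data.parquet"); // illustrative path
        Configuration conf = new Configuration();

        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            System.out.println("Schema: " + schema);
            System.out.println("Rows:   " + reader.getRecordCount());
        }
    }
}
```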
QUESTION
Somewhat recently, the parquet-format project added a UUID logical type. Specifically, this was added in revision 2.4 of the parquet format. I'm interested in using the parquet-mr library in Java to create some parquet files, but I can't seem to figure out how to use the UUID logical type in a parquet schema. A simple schema like this does not seem to work as I would hope:
...ANSWER
Answered 2020-Mar-25 at 20:03
The parquet-mr library currently doesn't support the UUID logical type. There's an issue to track progress in implementing this feature here.
QUESTION
I am running a spark job to write to parquet. I want to enable dictionary encoding for the files written. When I check the files, I see they are 'plain dictionary'. However, I do not see any stats for these columns.
Let me know if I am missing anything.
...ANSWER
Answered 2020-Mar-27 at 23:18
Got the answer. The parquet-tools version I was using was 1.6. Upgrading to 1.10 solved the issue.
QUESTION
I have a spark data-frame with a small number of fields. Some of the fields are huge binary blobs. The size of the entire row is approx 50 MB.
I am saving the data frame in parquet format. I am controlling the size of the row group using the parquet.block.size parameter.
Spark will generate a parquet file, however I will always get at least 100 rows in a row group. This is a problem for me since chunk sizes could become gigabytes, which does not work well with my application.
parquet.block.size works as expected as long as the size is big enough to accommodate more than 100 rows.
I modified InternalParquetRecordWriter.java to set MINIMUM_RECORD_COUNT_FOR_CHECK = 2, which fixed the issue; however, there is no configuration value I can find that would support tuning this hardcoded constant.
Is there a different/better way to get row-group sizes smaller than 100 rows?
This is a snippet of my code:
...ANSWER
Answered 2018-Jan-10 at 02:33
Unfortunately I haven't found a way to do so. I reported this issue to remove the hard-coded values and make them configurable. I have a patch for it if you're interested.
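For context, a minimal sketch of where parquet.block.size is typically set when writing from Spark; the 16 MB value and the paths are illustrative, and as noted above the writer still buffers at least 100 records before it checks the size, so very large rows can overshoot the limit.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SmallRowGroups {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("small-row-groups")
                .master("local[*]") // local mode just for this sketch
                .getOrCreate();

        // Cap the target row-group size in bytes via the Hadoop configuration
        // that Spark hands to the Parquet writer.
        spark.sparkContext().hadoopConfiguration()
                .setLong("parquet.block.size", 16L * 1024 * 1024); // 16 MB, illustrative

        Dataset<Row> df = spark.read().parquet("/tmp/input");  // illustrative input
        df.write().parquet("/tmp/output");                     // illustrative output
    }
}
```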
QUESTION
I'm generating Parquet files via two methods: a Kinesis Firehose and a Spark job. They are both written into the same partition structure on S3. Both sets of data can be queried using the same Athena table definition. Both use gzip compression.
I'm noticing, however, that the Parquet files generated by Spark are about 3x as large as those from Firehose. Any reason this should be the case? I do notice some schema and metadata differences when I load them using Pyarrow:
...ANSWER
Answered 2019-Jul-05 at 01:57
Two things that I can think of that could contribute to the difference.
1. Parquet properties.
In Spark, you could find all the properties related to Parquet using the following snippets.
If properties were set using Hadoop configs,
QUESTION
I'm trying to run a parquet-tools command to only view the file schema of my parquet file.
I am currently running:
...ANSWER
Answered 2019-Jun-06 at 13:13
Try
QUESTION
The parquet file was generated by Spark v2.4, parquet-mr v1.10.
...ANSWER
Answered 2019-Mar-18 at 17:12
The offset of the first data page is always larger than the offset of the dictionary. In other words, the dictionary comes first and only then the data pages. There are two metadata fields meant to store these offsets: dictionary_page_offset (aka DO) and data_page_offset (aka FPO).
Unfortunately, these metadata fields are not filled in correctly by parquet-mr.
For example, if the dictionary starts at offset 1000 and the first data page starts at offset 2000, then the correct values would be:
dictionary_page_offset = 1000
data_page_offset = 2000
Instead, parquet-mr stores:
dictionary_page_offset = 0
data_page_offset = 1000
Applied to your example, this means that in spite of parquet-tools showing DO: 0, columns x and y are dictionary encoded nonetheless (column z is not).
It is worth mentioning that Impala follows the specification correctly, so you cannot rely on every file having this deficiency.
This is how parquet-mr handles this situation during reading:
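In the spirit of that reading-side handling, a hedged sketch (not the verbatim parquet-mr source): a reader effectively has to treat the smaller of the two offsets as the real start of the column chunk.

```java
public final class ColumnChunkStart {
    /**
     * Illustrative only: choose the byte offset where a column chunk really
     * begins, tolerating files where dictionary_page_offset (DO) was written
     * as 0 even though a dictionary page is present.
     */
    static long startingOffset(long dictionaryPageOffset, long firstDataPageOffset) {
        if (dictionaryPageOffset > 0 && dictionaryPageOffset < firstDataPageOffset) {
            // Correctly written file: the dictionary really starts here.
            return dictionaryPageOffset;
        }
        // DO == 0 (or inconsistent): fall back to data_page_offset, which in
        // files written by parquet-mr actually points at the dictionary page.
        return firstDataPageOffset;
    }
}
```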
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install parquet-mr
You can use parquet-mr like any standard Java library. Please include the jar files in your classpath. You can also use any IDE, and you can run and debug the parquet-mr component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.
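As a quick orientation, here is a minimal sketch of writing a file with parquet-mr's bundled example Group API from the parquet-hadoop module; the schema, output path and values are illustrative only.

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        // Illustrative schema and output location.
        MessageType schema = MessageTypeParser.parseMessageType(
                "message example { required int32 id; required binary name (UTF8); }");
        Path out = new Path("file:///tmp/example.parquet");

        SimpleGroupFactory groups = new SimpleGroupFactory(schema);
        try (ParquetWriter<Group> writer =
                 ExampleParquetWriter.builder(out).withType(schema).build()) {
            writer.write(groups.newGroup().append("id", 1).append("name", "alice"));
        }
    }
}
```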