parquet-cpp | Apache Parquet
kandi X-RAY | parquet-cpp Summary
The Apache Arrow and Parquet projects have merged their development processes and build systems in the Arrow repository. Please submit pull requests to the Arrow repository; JIRA issues should continue to be opened in the PARQUET JIRA project.
Community Discussions
Trending Discussions on parquet-cpp
QUESTION
I have essentially row-oriented/streaming data (Netflow) coming into my C++ application and I want to write the data to Parquet-gzip files.
Looking at the sample reader-writer.cc program in the parquet-cpp project, it seems that I can only feed the data to parquet-cpp in a columnar way:
...ANSWER
Answered 2017-Aug-09 at 19:21
You will never be able to avoid buffering entirely, as we need to transform from a row-wise to a columnar representation. The best possible path at the time of writing is to construct Apache Arrow tables that are then fed into parquet-cpp. parquet-cpp provides special Arrow APIs that can then directly operate on these tables, mostly without any additional data copies. You can find the API in parquet/arrow/reader.h and parquet/arrow/writer.h.
The optimal but yet to be implemented solution could save some bytes by doing the following:
- ingest row-by-row in a new parquet-cpp API
- directly encode these values per column with the specified encoding and compression settings
- only buffer this in memory
- at the end of the row group, write out column after column
While this optimal solution may save you some memory, there are still some steps that need to be implemented (feel free to contribute them or ask for help on implementing them). Until then, you are probably fine using the Apache Arrow based API.
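For illustration, here is a minimal sketch of that Arrow-based path, assuming a recent Apache Arrow release (exact signatures have changed across versions); the column names, values and output path are placeholders:

// Hedged sketch: buffer row data into Arrow builders, assemble a table,
// and write it with gzip compression via parquet-cpp's Arrow API.
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

arrow::Status WriteNetflowBatch() {
  // Append incoming rows column by column (placeholder fields/values).
  arrow::Int64Builder ts_builder;
  arrow::StringBuilder src_ip_builder;
  ARROW_RETURN_NOT_OK(ts_builder.AppendValues({1502300000, 1502300001}));
  ARROW_RETURN_NOT_OK(src_ip_builder.AppendValues({"10.0.0.1", "10.0.0.2"}));

  std::shared_ptr<arrow::Array> ts_array, ip_array;
  ARROW_RETURN_NOT_OK(ts_builder.Finish(&ts_array));
  ARROW_RETURN_NOT_OK(src_ip_builder.Finish(&ip_array));

  auto schema = arrow::schema({arrow::field("timestamp", arrow::int64()),
                               arrow::field("src_ip", arrow::utf8())});
  auto table = arrow::Table::Make(schema, {ts_array, ip_array});

  // Write the table as a gzip-compressed Parquet file.
  ARROW_ASSIGN_OR_RAISE(auto outfile,
                        arrow::io::FileOutputStream::Open("netflow.parquet"));
  auto props = parquet::WriterProperties::Builder()
                   .compression(parquet::Compression::GZIP)
                   ->build();
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                    outfile, /*chunk_size=*/65536, props);
}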
QUESTION
I installed Basemap with Conda (Windows 10 64-bit, Python 3.7.3) in a non-root environment, but ended up with the problem that there is no epsg file in the proj folder. Following the advice from GitHub, I found out I had version 1.2.0 and tried to install 1.2.1, without success.
EDIT: Apparently it is an incompatibility issue with proj, as can be seen when trying this:
conda create -n test python proj basemap=1.2.1 -c defaults -c conda-forge
First I set the conda-forge channel to the highest priority and half of my environment got updated due to this; Basemap didn't, however.
Then I tried to force an install of 1.2.1, which led to a detailed report of which packages are in conflict with each other:
...ANSWER
Answered 2019-Oct-30 at 21:40
You may want to try destroying your environment and starting fresh. Also, it looks like you have almost cloned the base environment; are you sure you need all of those packages?
To remove the environment:
QUESTION
The Parquet file was generated by Spark v2.4 with parquet-mr v1.10.
...ANSWER
Answered 2019-Mar-18 at 17:12
The offset of the first data page is always larger than the offset of the dictionary. In other words, the dictionary comes first and only then the data pages. There are two metadata fields meant to store these offsets: dictionary_page_offset (aka DO) and data_page_offset (aka FPO).
Unfortunately, these metadata fields are not filled in correctly by parquet-mr.
For example, if the dictionary starts at offset 1000 and the first data page starts at offset 2000, then the correct values would be:
dictionary_page_offset = 1000
data_page_offset = 2000
Instead, parquet-mr stores:
dictionary_page_offset = 0
data_page_offset = 1000
Applied to your example, this means that in spite of parquet-tools showing DO: 0, columns x and y are dictionary encoded nonetheless (column z is not).
It is worth mentioning that Impala follows the specification correctly, so you cannot rely on every file having this deficiency.
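If you want to check a particular file yourself, parquet-cpp exposes these fields through its metadata API; the following is a minimal sketch under that assumption, with the file path as a placeholder:

// Hedged sketch: print has_dictionary_page, DO and FPO for every column chunk.
#include <iostream>
#include <memory>
#include <string>
#include <parquet/file_reader.h>
#include <parquet/metadata.h>

void InspectOffsets(const std::string& path) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
  for (int rg = 0; rg < metadata->num_row_groups(); ++rg) {
    auto row_group = metadata->RowGroup(rg);
    for (int col = 0; col < row_group->num_columns(); ++col) {
      auto chunk = row_group->ColumnChunk(col);
      std::cout << "row group " << rg << ", column " << col
                << ": has_dictionary_page=" << chunk->has_dictionary_page()
                << " DO=" << chunk->dictionary_page_offset()
                << " FPO=" << chunk->data_page_offset() << "\n";
    }
  }
}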
This is how parquet-mr handles this situation during reading:
QUESTION
Using AWS Firehose I am converting incoming records to Parquet. In one example, I have 150k identical records enter Firehose, and a single 30 KB Parquet file gets written to S3. Because of how Firehose partitions data, we have a secondary process (a Lambda triggered by an S3 put event) that reads in the Parquet file and repartitions it based on the date within the event itself. After this repartitioning process, the 30 KB file size jumps to 900 KB.
Inspecting both Parquet files:
- The meta doesn't change
- The data doesn't change
- They both use SNAPPY compression
- The Firehose Parquet file is created by parquet-mr, the pyarrow-generated Parquet file is created by parquet-cpp
- The pyarrow-generated Parquet file has additional pandas headers
The full repartitioning process:
...ANSWER
Answered 2018-Oct-31 at 13:34
Parquet uses different column encodings to store low-entropy data very efficiently. For example:
- It can use delta encoding to only store differences between values. For example, 9192631770, 9192631773, 9192631795, 9192631797 would be stored effectively as 9192631770, +3, +12, +2.
- It can use dictionary encoding to refer to common values compactly. For example, Los Angeles, Los Angeles, Los Angeles, San Francisco, San Francisco would be stored as a dictionary of 0 = Los Angeles, 1 = San Francisco and the references 0, 0, 0, 1, 1.
- It can use run-length encoding to only store the number of repeating values. For example, Los Angeles, Los Angeles, Los Angeles would be effectively stored as Los Angeles×3. (Actually, as far as I know, pure RLE is only used for boolean types at this moment, but the idea is the same.)
- A combination of the above, specifically RLE and dictionary encoding. For example, Los Angeles, Los Angeles, Los Angeles, San Francisco, San Francisco would be stored as a dictionary of 0 = Los Angeles, 1 = San Francisco and the references 0×3, 1×2.
With the 3 to 5 values of the examples above, the savings are not that significant, but the more values you have the bigger the gain. Since you have 150k identical records, the gains will be huge, since with RLE dictionary encoding, each column value will only have to be stored once, and then marked as repeating 150k times.
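If you control the writer, parquet-cpp lets you request dictionary encoding and the compression codec through its writer properties; a minimal sketch, assuming the parquet::WriterProperties builder API (the settings shown are only illustrative):

// Hedged sketch: build writer properties with dictionary encoding enabled
// and SNAPPY compression; pass the result to parquet::arrow::WriteTable().
#include <memory>
#include <parquet/properties.h>

std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
  return parquet::WriterProperties::Builder()
      .compression(parquet::Compression::SNAPPY)  // same codec as the Firehose file
      ->enable_dictionary()                       // dictionary-encode columns
      ->build();
}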
However, it seems that pyarrow does not use these space-saving encodings. You can confirm this by taking a look at the metadata of the two files using parquet-tools meta. Here is a sample output:
QUESTION
I'm using Docker to build a Python container with the intention of having a reproducible environment on several machines: a bunch of development MacBooks and several AWS EC2 servers.
The container is based on continuumio/miniconda3, i.e. the Dockerfile starts with:
ANSWER
Answered 2018-Mar-06 at 11:35
Perhaps you're using an outdated miniconda on one of the build machines; try doing docker build --pull --no-cache.
Docker doesn't necessarily pull the latest image from the repository, so unless you do a --pull, it is possible that some of your machines may be starting the build with an outdated base image.
QUESTION
From the parquet-cpp home page:
By default, Parquet links to Arrow's shared libraries. If you wish to statically-link the Arrow symbols instead, pass -DPARQUET_ARROW_LINKAGE=static.
I do want to statically link Arrow, because I want to use my program on other servers that won't have Arrow installed. I tried -DPARQUET_ARROW_LINKAGE=static, but I get an error about "missing transitive dependencies":
ANSWER
Answered 2018-Jan-19 at 18:26
I arranged a script to download the dependency sources, set the environment variables, and run your cmake line at the end. Just change the DEPDIR variable value, setting it to a directory of your choice.
QUESTION
I'm trying to compile the project parquet-cpp: https://github.com/apache/parquet-cpp
When I run make, this is the error I get:
...ANSWER
Answered 2017-Jun-19 at 14:52
So cmake does use its own version of curl. I had to download the cmake sources from https://cmake.org/download/ and use ./bootstrap --system-curl, make, and make install to get a cmake version which uses the system curl. I also needed to install the package libcurl-devel.
QUESTION
I'm trying to use the following GitHub project: https://github.com/apache/parquet-cpp. I was able to build it, and the .so files are available in parquet-cpp/build/latest. I copied the .so files (both libparquet and libarrow, which had been built) into a separate directory and wrote a simple hello world, simply importing the library as:
...ANSWER
Answered 2017-Jun-12 at 20:10
You forgot to include the path for the header files in the compilation instruction. You need to find the directory containing parquet/api/reader.h and include it in the compilation command.
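As an illustration, here is a minimal sketch of such a hello world together with a hypothetical compile command; the include and library paths are placeholders for wherever the parquet-cpp sources and the built .so files live:

// main.cc -- only pulls in the parquet-cpp headers to verify the build setup.
// Hypothetical compile command (adjust the placeholder paths):
//   g++ main.cc -std=c++11 \
//       -I/path/to/parquet-cpp/src \
//       -L/path/to/parquet-cpp/build/latest -lparquet -larrow \
//       -o hello_parquet
#include <parquet/api/reader.h>

int main() {
  // Nothing Parquet-specific yet; a successful build confirms that the
  // headers are found (-I) and the shared libraries link (-L, -l).
  return 0;
}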
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install parquet-cpp
cmake .
You can customize build dependency locations through various environment variables:
ARROW_HOME customizes the Apache Arrow installed location.
THRIFT_HOME customizes the Apache Thrift (C++ libraries and compiler) installed location.
GTEST_HOME customizes the googletest installed location (if you are building the unit tests).
GBENCHMARK_HOME customizes the Google Benchmark installed location (if you are building the benchmarks).
make