parquet-cpp | Apache Parquet
kandi X-RAY | parquet-cpp Summary
The Apache Arrow and Parquet projects have merged their development processes and build systems in the Arrow repository. Please submit pull requests to the Arrow repository; JIRA issues should continue to be opened in the PARQUET JIRA project.
Community Discussions
Trending Discussions on parquet-cpp
QUESTION
I have essentially row-oriented/streaming data (Netflow) coming into my C++ application and I want to write the data to Parquet-gzip files.
Looking at the sample reader-writer.cc program in the parquet-cpp project, it seems that I can only feed the data to parquet-cpp in a columnar way:
...ANSWER
Answered 2017-Aug-09 at 19:21
You will never be able to avoid buffering entirely, as we need to transform from a row-wise to a columnar representation. The best possible path at the time of writing is to construct Apache Arrow tables that are then fed into parquet-cpp. parquet-cpp provides special Arrow APIs that can then directly operate on these tables, mostly without any additional data copies. You can find the API in parquet/arrow/reader.h and parquet/arrow/writer.h.
The optimal but yet to be implemented solution could save some bytes by doing the following:
- ingest row-by-row in a new parquet-cpp API
- directly encode these values per column with the specified encoding and compression settings
- only buffer this in memory
- at the end of the row group, write out column after column
While this optimal solution may save you some memory, there are still some steps that need to be implemented (feel free to contribute them or ask for help on implementing them). Until then, you are probably fine using the Apache Arrow based API.
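For illustration, here is a minimal sketch of that Arrow-based path, assuming a recent Apache Arrow release (exact signatures have changed across versions); the column names, values and output path are placeholders:

// Hedged sketch: buffer row data into Arrow builders, assemble a table,
// and write it with gzip compression via parquet-cpp's Arrow API.
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

arrow::Status WriteNetflowBatch() {
  // Append incoming rows column by column (placeholder fields/values).
  arrow::Int64Builder ts_builder;
  arrow::StringBuilder src_ip_builder;
  ARROW_RETURN_NOT_OK(ts_builder.AppendValues({1502300000, 1502300001}));
  ARROW_RETURN_NOT_OK(src_ip_builder.AppendValues({"10.0.0.1", "10.0.0.2"}));

  std::shared_ptr<arrow::Array> ts_array, ip_array;
  ARROW_RETURN_NOT_OK(ts_builder.Finish(&ts_array));
  ARROW_RETURN_NOT_OK(src_ip_builder.Finish(&ip_array));

  auto schema = arrow::schema({arrow::field("timestamp", arrow::int64()),
                               arrow::field("src_ip", arrow::utf8())});
  auto table = arrow::Table::Make(schema, {ts_array, ip_array});

  // Write the table as a gzip-compressed Parquet file.
  ARROW_ASSIGN_OR_RAISE(auto outfile,
                        arrow::io::FileOutputStream::Open("netflow.parquet"));
  auto props = parquet::WriterProperties::Builder()
                   .compression(parquet::Compression::GZIP)
                   ->build();
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                    outfile, /*chunk_size=*/65536, props);
}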
QUESTION
I installed Basemap with Conda (Windows 10 64-bit, Python 3.7.3) in a non-root environment, but ended up with the problem that there is no epsg file in the proj folder. Following the advice from GitHub, I found out I had version 1.2.0 and tried to install 1.2.1, without success.
EDIT: Apparently it is an incompatibility issue with proj, as can be seen when trying this:
conda create -n test python proj basemap=1.2.1 -c defaults -c conda-forge
First I set the conda-forge channel to the highest priority and half of my environment got updated due to this; Basemap didn't, however.
Then I tried to force an install of 1.2.1, which led to a detailed report of which packages are in conflict with each other:
...ANSWER
Answered 2019-Oct-30 at 21:40
You may want to try destroying your environment and starting fresh. Also, it looks like you have almost cloned the base environment; are you sure you need all of those packages?
To remove the environment:
QUESTION
The Parquet file was generated by Spark v2.4 with parquet-mr v1.10.
...ANSWER
Answered 2019-Mar-18 at 17:12
The offset of the first data page is always larger than the offset of the dictionary. In other words, the dictionary comes first and only then the data pages. There are two metadata fields meant to store these offsets: dictionary_page_offset (aka DO) and data_page_offset (aka FPO).
Unfortunately, these metadata fields are not filled in correctly by parquet-mr.
For example, if the dictionary starts at offset 1000 and the first data page starts at offset 2000, then the correct values would be:
dictionary_page_offset = 1000
data_page_offset = 2000
Instead, parquet-mr stores:
dictionary_page_offset = 0
data_page_offset = 1000
Applied to your example, this means that in spite of parquet-tools showing DO: 0, columns x and y are dictionary encoded nonetheless (column z is not).
It is worth mentioning that Impala follows the specification correctly, so you cannot rely on every file having this deficiency.
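If you want to check a particular file yourself, parquet-cpp exposes these fields through its metadata API; the following is a minimal sketch under that assumption, with the file path as a placeholder:

// Hedged sketch: print has_dictionary_page, DO and FPO for every column chunk.
#include <iostream>
#include <memory>
#include <string>
#include <parquet/file_reader.h>
#include <parquet/metadata.h>

void InspectOffsets(const std::string& path) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
  for (int rg = 0; rg < metadata->num_row_groups(); ++rg) {
    auto row_group = metadata->RowGroup(rg);
    for (int col = 0; col < row_group->num_columns(); ++col) {
      auto chunk = row_group->ColumnChunk(col);
      std::cout << "row group " << rg << ", column " << col
                << ": has_dictionary_page=" << chunk->has_dictionary_page()
                << " DO=" << chunk->dictionary_page_offset()
                << " FPO=" << chunk->data_page_offset() << "\n";
    }
  }
}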
This is how parquet-mr handles this situation during reading:
QUESTION
Using AWS Firehose I am converting incoming records to Parquet. In one example, I have 150k identical records enter Firehose, and a single 30 KB Parquet file gets written to S3. Because of how Firehose partitions data, we have a secondary process (a Lambda triggered by an S3 put event) that reads in the Parquet file and repartitions it based on the date within the event itself. After this repartitioning process, the 30 KB file size jumps to 900 KB.
Inspecting both Parquet files:
- The meta doesn't change
- The data doesn't change
- They both use SNAPPY compression
- The Firehose Parquet file is created by parquet-mr, the pyarrow-generated Parquet file is created by parquet-cpp
- The pyarrow-generated Parquet file has additional pandas headers
The full repartitioning process:
...ANSWER
Answered 2018-Oct-31 at 13:34
Parquet uses different column encodings to store low-entropy data very efficiently. For example:
- It can use delta encoding to only store differences between values. For example, 9192631770, 9192631773, 9192631795, 9192631797 would be stored effectively as 9192631770, +3, +12, +2.
- It can use dictionary encoding to refer to common values compactly. For example, Los Angeles, Los Angeles, Los Angeles, San Francisco, San Francisco would be stored as a dictionary of 0 = Los Angeles, 1 = San Francisco and the references 0, 0, 0, 1, 1.
- It can use run-length encoding to only store the number of repeating values. For example, Los Angeles, Los Angeles, Los Angeles would be effectively stored as Los Angeles×3. (Actually, as far as I know, pure RLE is only used for boolean types at this moment, but the idea is the same.)
- A combination of the above, specifically RLE and dictionary encoding. For example, Los Angeles, Los Angeles, Los Angeles, San Francisco, San Francisco would be stored as a dictionary of 0 = Los Angeles, 1 = San Francisco and the references 0×3, 1×2.
With the 3 to 5 values of the examples above, the savings are not that significant, but the more values you have the bigger the gain. Since you have 150k identical records, the gains will be huge, since with RLE dictionary encoding, each column value will only have to be stored once, and then marked as repeating 150k times.
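If you control the writer, parquet-cpp lets you request dictionary encoding and the compression codec through its writer properties; a minimal sketch, assuming the parquet::WriterProperties builder API (the settings shown are only illustrative):

// Hedged sketch: build writer properties with dictionary encoding enabled
// and SNAPPY compression; pass the result to parquet::arrow::WriteTable().
#include <memory>
#include <parquet/properties.h>

std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
  return parquet::WriterProperties::Builder()
      .compression(parquet::Compression::SNAPPY)  // same codec as the Firehose file
      ->enable_dictionary()                       // dictionary-encode columns
      ->build();
}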
However, it seems that pyarrow does not use these space-saving encodings. You can confirm this by taking a look at the metadata of the two files using parquet-tools meta. Here is a sample output:
QUESTION
I'm using Docker to build a Python container with the intention of having a reproducible environment on several machines: a bunch of development MacBooks and several AWS EC2 servers.
The container is based on continuumio/miniconda3, i.e. the Dockerfile starts with:
ANSWER
Answered 2018-Mar-06 at 11:35
Perhaps you're using an outdated miniconda on one of the build machines; try doing docker build --pull --no-cache.
Docker doesn't necessarily pull the latest image from the repository, so unless you do a --pull, it is possible that some of your machines may be starting the build with an outdated base image.
QUESTION
From the parquet-cpp home page:
By default, Parquet links to Arrow's shared libraries. If you wish to statically-link the Arrow symbols instead, pass -DPARQUET_ARROW_LINKAGE=static.
I do want to statically link Arrow, because I want to use my program on other servers that won't have Arrow installed. I tried -DPARQUET_ARROW_LINKAGE=static, but I get an error about "missing transitive dependencies":
ANSWER
Answered 2018-Jan-19 at 18:26
I arranged a script to download the dependency sources, set the environment variables, and run your cmake line at the end. Just change the DEPDIR variable value, setting it to a directory of your choice.
QUESTION
I'm trying to compile the project parquet-cpp: https://github.com/apache/parquet-cpp
When I run make, this is the error I get:
...ANSWER
Answered 2017-Jun-19 at 14:52
So cmake does use its own version of curl. I had to download the cmake sources from https://cmake.org/download/ and use ./bootstrap --system-curl, make, and make install to get a cmake version which uses the system curl. I also needed to install the package libcurl-devel.
QUESTION
I'm trying to use the following GitHub project: https://github.com/apache/parquet-cpp. I was able to build it, and the .so files are available in parquet-cpp/build/latest. I copied the .so files (both libparquet and libarrow, which had been built) into a separate directory and wrote a simple hello world, simply importing the library as:
...ANSWER
Answered 2017-Jun-12 at 20:10
You forgot to include the path for the header files in the compilation instruction. You need to find the directory containing parquet/api/reader.h and include it in the compilation command.
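As an illustration, here is a minimal sketch of such a hello world together with a hypothetical compile command; the include and library paths are placeholders for wherever the parquet-cpp sources and the built .so files live:

// main.cc -- only pulls in the parquet-cpp headers to verify the build setup.
// Hypothetical compile command (adjust the placeholder paths):
//   g++ main.cc -std=c++11 \
//       -I/path/to/parquet-cpp/src \
//       -L/path/to/parquet-cpp/build/latest -lparquet -larrow \
//       -o hello_parquet
#include <parquet/api/reader.h>

int main() {
  // Nothing Parquet-specific yet; a successful build confirms that the
  // headers are found (-I) and the shared libraries link (-L, -l).
  return 0;
}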
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install parquet-cpp
cmake .
You can customize build dependency locations through various environment variables:
ARROW_HOME customizes the Apache Arrow installed location.
THRIFT_HOME customizes the Apache Thrift (C++ libraries and compiler) installed location.
GTEST_HOME customizes the googletest installed location (if you are building the unit tests).
GBENCHMARK_HOME customizes the Google Benchmark installed location (if you are building the benchmarks).
make