parquet-tools | easy install parquet-tools | Cloud Storage library
kandi X-RAY | parquet-tools Summary
easy install parquet-tools
Top functions reviewed by kandi - BETA
- Implements the command line interface
- Execute select
- Return a pandas dataframe
- Get the dataframe from a list of objects
- Read the data from an IProt
parquet-tools Key Features
parquet-tools Examples and Code Snippets
spark-shell --conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS
2020-03-16 11:37:50 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
count = sqlCtx.sql("SELECT COUNT(*) AS pageCount FROM table1 WHERE pp_count>=500") \
    .collect()
$> parquet-tools head data.parquet/
a = 1
pp_count = 500
a = 2
pp_count = 750
a = 3
pp_count = 400
val df=Seq((1,"a"),(2,"b")).toDF("id","name")
df.coalesce(1).write.partitionBy("id").csv("/user/shu/temporary2") //write csv file.
hadoop fs -cat /user/shu/temporary2/id=1/part-00000-dc55f08e-9143-4b60-a94e-e28b1d7
Community Discussions
Trending Discussions on parquet-tools
QUESTION
I'm using parquetjs to create Parquet files and push them to Google Cloud Storage.
The problem is that BigQuery cannot read the data from the file, but when I use parquet-tools
everything looks healthy.
ANSWER
Answered 2021-Nov-29 at 15:07
Just pass useDataPageV2: false as an option to parquet.ParquetWriter.openFile(...).
QUESTION
I am new to Snowflake, but my company has been using it successfully.
Parquet files are currently being written with an existing Avro Schema, using Java parquet-avro v1.10.1.
I have been updating the dependencies in order to use latest Avro, and part of that bumped Parquet to 1.11.0.
The Avro schema is unchanged. However, when using the COPY INTO Snowflake command, I receive a LOAD FAILED with the error "Error parsing the parquet file: Logical type Null can not be applied to group node",
but no other error details.
The problem is that there are no null columns in the files.
I've cut the Avro schema down, and found that the presence of a MAP type in the Avro schema is causing the issue.
The field is
...ANSWER
Answered 2020-Jun-22 at 09:19
Logical type Null can not be applied to group node
Looking up the error above, it appears that a version of Apache Arrow's parquet libraries is being used to read the file.
However, looking closer, the real problem lies in the use of legacy types within the Avro based Parquet Writer implementation (the following assumes Java was used to write the files).
The new logicalTypes schema metadata introduced in Parquet defines many types, including a singular MAP type. Historically, the former convertedTypes schema field supported use of MAP and MAP_KEY_VALUE for legacy readers. The new writers that use logicalTypes (1.11.0+) should not be using the legacy map type anymore, but work hasn't been done yet to update the Avro-to-Parquet schema conversions to drop the MAP_KEY_VALUE types entirely.
As a result, the schema field for MAP_KEY_VALUE gets written out with an UNKNOWN value of logicalType, which trips up Arrow's implementation that only understands logicalType values of MAP and LIST (understandably).
Consider logging this as a bug against the Apache Parquet project to update their Avro writers to stop nesting the legacy MAP_KEY_VALUE type when transforming an Avro schema to a Parquet one. It should've ideally been done as part of PARQUET-1410.
Unfortunately this is hard-coded behaviour and there are no configuration options that influence map-types that can aid in producing a correct file for Apache Arrow (and for Snowflake by extension). You'll need to use an older version of the writer until a proper fix is released by the Apache Parquet developers.
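If it helps to confirm how the writer actually annotated the map field before loading into Snowflake, here is a minimal sketch (the file name is hypothetical) that dumps the raw Parquet schema tree with pyarrow, where the group-level annotations written by parquet-avro are visible:
import pyarrow.parquet as pq

# Print the raw Parquet schema tree; the annotations on the map's inner
# group show how the Avro-based writer labelled it. File name is hypothetical.
parquet_file = pq.ParquetFile("exported_from_avro.parquet")
print(parquet_file.schema)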
QUESTION
I have an S3 bucket full of .gz.parquet files. I want to make them accessible in Athena. In order to do this I am creating a table in Athena that points at the s3 bucket:
...ANSWER
Answered 2020-Jun-17 at 20:16
You can use an AWS Glue Crawler to automatically derive the schema from your Parquet files.
Defining AWS Glue Crawlers: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
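As a rough sketch of the same idea done programmatically (every name here -- the crawler, IAM role, database, and bucket path -- is hypothetical), a Glue crawler can also be created and started with boto3 instead of the console:
import boto3

# Create a Glue crawler over the .gz.parquet files and run it once, so the
# derived table schema becomes queryable from Athena. All names are placeholders.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="parquet-bucket-crawler",
    Role="AWSGlueServiceRole-demo",
    DatabaseName="my_athena_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/parquet-prefix/"}]},
)
glue.start_crawler(Name="parquet-bucket-crawler")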
QUESTION
I am using BigQuery to query an external data source (also known as a federated table), where the source data is a hive-partitioned Parquet table stored in Google Cloud Storage. I used this guide to define the table.
My first query to test this table looks like the following
...ANSWER
Answered 2020-Apr-13 at 15:53
Note that the schema of the external table is inferred from the last file, sorted lexicographically by file name, among the list of all files that match the table's source URI. So there is a chance that this particular Parquet file has a different schema than the one you described, e.g. an INT32 column with a DATE logical type for the "visitor_partition" field, which BigQuery would infer as a DATE type.
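To check what that last file actually contains (and therefore what BigQuery's inference will see), a small sketch using pyarrow; the file name is hypothetical and assumes a local copy of the lexicographically last file under the table's URI:
import pyarrow.parquet as pq

# Read only the schema of the file BigQuery will use for inference and
# print each column's type (e.g. int32 vs date32). File name is hypothetical.
schema = pq.read_schema("last_file_by_name.parquet")
print(schema)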
QUESTION
I am running a Spark job that writes Parquet. I want to enable dictionary encoding for the files written. When I check the files, I see they are 'plain dictionary'; however, I do not see any stats for these columns.
Let me know if I am missing anything.
...ANSWER
Answered 2020-Mar-27 at 23:18
Got the answer. The parquet-tools version I was using was 1.6. Upgrading to 1.10 solved the issue.
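An alternative way to verify the encodings and column statistics without depending on a particular parquet-tools build is to read the file metadata with pyarrow; a minimal sketch with a hypothetical file name:
import pyarrow.parquet as pq

# Inspect the first row group: for each column chunk, print its path,
# the encodings that were written, and the min/max statistics (if any).
md = pq.ParquetFile("part-00000.parquet").metadata
rg = md.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.encodings, col.statistics)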
QUESTION
I'd like to convert an INT96 value such as ACIE4NxJAAAKhSUA into a readable timestamp format like 2020-03-02 14:34:22, or whatever it would normally be interpreted as. I mostly use Python, so I'm looking to build a function that does this conversion. If there's another function that can do the reverse, even better.
Background
I'm using parquet-tools to convert a raw parquet file (with snappy compression) to raw JSON via this command:
...ANSWER
Answered 2020-Mar-16 at 19:01
parquet-tools will not be able to change the format type from INT96 to INT64. What you are observing in the JSON output is a string representation of the timestamp stored in an INT96 TimestampType. You will need Spark to re-write this parquet with the timestamp in an INT64 TimestampType, and then the JSON output will produce a timestamp (in the format you desire).
You will need to set a specific config in Spark:
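The config in question is the spark.sql.parquet.outputTimestampType setting shown in the spark-shell snippet earlier on this page. A minimal PySpark sketch of the rewrite, with hypothetical input and output paths:
from pyspark.sql import SparkSession

# Write timestamps as INT64 (TIMESTAMP_MICROS) instead of the legacy INT96.
spark = (
    SparkSession.builder
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    .getOrCreate()
)

df = spark.read.parquet("input_int96.parquet")            # hypothetical input path
df.write.mode("overwrite").parquet("output_int64.parquet")  # hypothetical output path
Since the question also asks for a Python decoder, here is a rough illustrative sketch (not part of the accepted answer): a Parquet INT96 timestamp is 8 bytes of nanoseconds-within-the-day followed by a 4-byte Julian day, both little-endian, so the base64 string from the JSON output can be decoded like this:
import base64
import struct
from datetime import datetime, timedelta, timezone

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number for 1970-01-01

def int96_to_datetime(b64_value: str) -> datetime:
    """Decode a base64-encoded Parquet INT96 timestamp, e.g. 'ACIE4NxJAAAKhSUA'."""
    raw = base64.b64decode(b64_value)                  # 12 bytes, little-endian
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(
        days=days_since_epoch, microseconds=nanos_of_day // 1000
    )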
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install parquet-tools
You can use parquet-tools like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.