parquet | A library for reading and writing parquet files | Serialization library

by parsyl | Go | Version: v0.7.1 | License: MIT

kandi X-RAY | parquet Summary

parquet is a Go library typically used in Utilities and Serialization applications. parquet has no bugs, it has no reported vulnerabilities, it has a Permissive License and it has low support. You can download it from GitHub.

Parquet generates a parquet reader and writer based on a struct. The struct can be defined by you or it can be generated by reading an existing parquet file. We (Parsyl) will respond to pull requests and issues to the best of our abilities. However, sometimes we will have higher priorities and the response might not be immediate.

NOTE: If you generate the code based on a parquet file, there are quite a few limitations. The PageType of each PageHeader must be DATA_PAGE and the Codec (defined in ColumnMetaData) must be PLAIN or SNAPPY. Also, the parquet file's schema must consist of the currently supported types. But wait, there's more! Some of the encodings, like DELTA_BINARY_PACKED, BIT_PACKED, PLAIN_DICTIONARY, and DELTA_BYTE_ARRAY, are also not supported. I would guess there are other parquet options that will cause problems since there are so many possibilities.

            kandi-support Support

parquet has a low active ecosystem.
It has 68 stars, 12 forks, and 7 watchers.
It had no major release in the last 12 months.
There are 0 open issues and 4 have been closed. On average, issues are closed in 18 days. There are no pull requests.
It has a neutral sentiment in the developer community.
The latest version of parquet is v0.7.1.

            kandi-Quality Quality

              parquet has 0 bugs and 0 code smells.

            kandi-Security Security

              parquet has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              parquet code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            kandi-License License

              parquet is licensed under the MIT License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              parquet releases are available to install and integrate.
              Installation instructions, examples and code snippets are available.
              It has 18824 lines of code, 1221 functions and 48 files.
It has high code complexity, which directly impacts the maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed parquet and discovered the below as its top functions. This is intended to give you an instant insight into parquet implemented functionality, and help decide if they suit your requirements.
• FromStruct creates a struct from a struct.
• newPerson generates a random person.
• ConvertedTypeFromString converts a string to a converted type.
• FromParquet converts a Parquet to a struct.
• getField returns the field corresponding to x.
• pageData extracts the page data from r.
• getChildren returns all children of a parent field.
• doReadRepeated reads a variable definition.
• NewParquetReader returns a new ParquetReader.
• write writes a ParquetWriter.

            parquet Key Features

            No Key Features are available at this moment for parquet.

            parquet Examples and Code Snippets

            No Code Snippets are available at this moment for parquet.

            Community Discussions

            QUESTION

            Can I convert RDD to DataFrame in Glue?
            Asked 2022-Mar-20 at 13:58

My Lambda function triggers a Glue job via boto3's glue.start_job_run,

and here is my Glue job script:

            ...

            ANSWER

            Answered 2022-Mar-20 at 13:58

You can't define schema types using toDF(); with toDF() you have no control over schema customization. Using createDataFrame(), on the other hand, you have complete control over the schema.

See the logic below:
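The answer's original snippet is not reproduced in this excerpt. Purely as an illustration, here is a minimal sketch of going from an RDD to a DataFrame with an explicit schema; the column names, types, and sample data are made up, and in a real Glue script the Spark session would come from the GlueContext.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # In a Glue script the session usually comes from the GlueContext; a plain
    # SparkSession behaves the same way for this illustration.
    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])

    # An explicit schema gives full control over column names and types,
    # unlike rdd.toDF(), which infers them.
    schema = StructType([
        StructField("id", IntegerType(), nullable=False),
        StructField("name", StringType(), nullable=True),
    ])

    df = spark.createDataFrame(rdd, schema=schema)
    df.printSchema()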

            Source https://stackoverflow.com/questions/71547278

            QUESTION

            Write custom metadata to Parquet file in Julia
            Asked 2022-Mar-05 at 18:36

I am currently storing the output (a Julia DataFrame) of my Julia simulation in a Parquet file using Parquet.jl. I would also like to save some of the simulation parameters (e.g. a list of (byte-)strings) to that same output file.

Preferably, these parameters would be different for each column, as each column is the result of different starting conditions of my code. However, I could also work with a global parameter list and then untangle it afterwards by indexing.

            I have found a solution for Python using pyarrow

            https://mungingdata.com/pyarrow/arbitrary-metadata-parquet-table/.
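For reference, a minimal sketch of the pyarrow approach described at that link: custom file-level metadata is attached to the Arrow schema before writing. The metadata keys, values, and file name here are illustrative.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"x": [1, 2, 3]})
    table = pa.Table.from_pandas(df)

    # Merge custom key/value pairs into the existing schema metadata
    # (parquet key/value metadata is stored as bytes).
    custom = {b"simulation_seed": b"42", b"git_rev": b"abc123"}
    merged = {**(table.schema.metadata or {}), **custom}
    pq.write_table(table.replace_schema_metadata(merged), "output.parquet")

    # Reading the metadata back:
    print(pq.read_table("output.parquet").schema.metadata)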

            Do you know a way how to do it in Julia?

            ...

            ANSWER

            Answered 2022-Mar-05 at 18:36

It's not quite done yet, and it's not registered, but my rewrite of the Julia parquet package, Parquet2.jl, does support both custom file metadata and individual column metadata (the keyword arguments metadata and column_metadata in Parquet2.writefile).

            I haven't gotten to documentation for writing yet, but if you are feeling adventurous you can give it a shot. I do expect to finish up this package and register it within the next couple of weeks. I don't have unit tests in for writing yet, so of course, if you try it and have problems, please open an issue.

It's probably also worth mentioning that the main use case I recommend parquet for is when you must have parquet for compatibility reasons. Most of the time, Julia users are probably better off with Arrow.jl, as the format has a number of advantages over parquet for most use cases; please see my FAQ answer on this. Of course, the reason I undertook writing the package is that parquet is arguably the only ubiquitous binary format in the "big data world", so a robust writer is desperately needed.

            Source https://stackoverflow.com/questions/71310140

            QUESTION

            Read / Write Parquet files without reading into memory (using Python)
            Asked 2022-Feb-28 at 11:12

            I looked at the standard documentation that I would expect to capture my need (Apache Arrow and Pandas), and I could not seem to figure it out.

            I know Python best, so I would like to use Python, but it is not a strict requirement.

            Problem

            I need to move Parquet files from one location (a URL) to another (an Azure storage account, in this case using the Azure machine learning platform, but this is irrelevant to my problem).

            These files are too large to simply perform pd.read_parquet("https://my-file-location.parquet"), since this reads the whole thing into an object.

            Expectation

            I thought that there must be a simple way to create a file object and stream that object line by line -- or maybe column chunk by column chunk. Something like

            ...

            ANSWER

            Answered 2021-Aug-24 at 06:21

This is possible but takes a little bit of work because, in addition to being columnar, Parquet also requires a schema.

            The rough workflow is:

            1. Open a parquet file for reading.

            2. Then use iter_batches to read back chunks of rows incrementally (you can also pass specific columns you want to read from the file to save IO/CPU).

            3. You can then transform each pa.RecordBatch from iter_batches further. Once you are done transforming the first batch you can get its schema and create a new ParquetWriter.

            4. For each transformed batch call write_table. You have to first convert it to a pa.Table.

            5. Close the files.

Parquet requires random access, so it can't easily be streamed from a URI (pyarrow should support reading it if you open the file via an HTTP FSSpec filesystem), but I think you might get blocked on writes.
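As a rough sketch of the workflow above with pyarrow (file paths, the batch size, and the transform step are placeholders):

    import pyarrow as pa
    import pyarrow.parquet as pq

    source = pq.ParquetFile("in.parquet")
    writer = None
    try:
        # Optionally pass columns=[...] to iter_batches to save IO/CPU.
        for batch in source.iter_batches(batch_size=64_000):
            transformed = batch  # transform the pa.RecordBatch here
            if writer is None:
                # Create the writer from the first transformed batch's schema.
                writer = pq.ParquetWriter("out.parquet", transformed.schema)
            # write_table expects a pa.Table, so convert the batch first.
            writer.write_table(pa.Table.from_batches([transformed]))
    finally:
        if writer is not None:
            writer.close()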

            Source https://stackoverflow.com/questions/68819790

            QUESTION

            Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0
            Asked 2022-Feb-10 at 13:45

            When switching from Glue 2.0 to 3.0, which means also switching from Spark 2.4 to 3.1.1, my jobs start to fail when processing timestamps prior to 1900 with this error:

            ...

            ANSWER

            Answered 2022-Feb-10 at 13:45

            I made it work by setting --conf to spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED.
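For illustration only (and not necessarily the workaround the Glue team suggests), the same four settings can also be applied from inside a PySpark script:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # in a Glue job this session already exists

    # Apply the legacy rebase settings from the script instead of via --conf.
    for key in (
        "spark.sql.legacy.parquet.int96RebaseModeInRead",
        "spark.sql.legacy.parquet.int96RebaseModeInWrite",
        "spark.sql.legacy.parquet.datetimeRebaseModeInRead",
        "spark.sql.legacy.parquet.datetimeRebaseModeInWrite",
    ):
        spark.conf.set(key, "CORRECTED")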

This is a workaround though, and the Glue dev team is working on a fix, although there is no ETA.

Also, this is still very buggy. You cannot call .show() on a DynamicFrame, for example; you need to call it on a DataFrame. All my jobs also failed where I call data_frame.rdd.isEmpty(), don't ask me why.

Update 24.11.2021: I reached out to the Glue dev team and they told me that this is the intended way of fixing it. There is a workaround that can be done inside of the script though:

            Source https://stackoverflow.com/questions/68891312

            QUESTION

            How can I have nice file names & efficient storage usage in my Foundry Magritte dataset export?
            Asked 2022-Feb-10 at 05:12

            I'm working on exporting data from Foundry datasets in parquet format using various Magritte export tasks to an ABFS system (but the same issue occurs with SFTP, S3, HDFS, and other file based exports).

            The datasets I'm exporting are relatively small, under 512 MB in size, which means they don't really need to be split across multiple parquet files, and putting all the data in one file is enough. I've done this by ending the previous transform with a .coalesce(1) to get all of the data in a single file.

            The issues are:

• By default the file name is part-0000-<rid>.snappy.parquet, with a different rid on every build. This means that, whenever a new file is uploaded, it appears in the same folder as an additional file; the only way to tell which is the newest version is by last modified date.
• Every version of the data is stored in my external system; this takes up unnecessary storage unless I frequently go in and delete old files.

All of this is unnecessary complexity being added to my downstream system; I just want to be able to pull the latest version of the data in a single step.

            ...

            ANSWER

            Answered 2022-Jan-13 at 15:27

            This is possible by renaming the single parquet file in the dataset so that it always has the same file name, that way the export task will overwrite the previous file in the external system.

            This can be done using raw file system access. The write_single_named_parquet_file function below validates its inputs, creates a file with a given name in the output dataset, then copies the file in the input dataset to it. The result is a schemaless output dataset that contains a single named parquet file.

            Notes

• The build will fail if the input contains more than one parquet file; as pointed out in the question, calling .coalesce(1) (or .repartition(1)) is necessary in the upstream transform.
• If you require transaction history in your external store, or your dataset is much larger than 512 MB, this method is not appropriate, as only the latest version is kept, and you likely want multiple parquet files for use in your downstream system. The createTransactionFolders (put each new export in a different folder) and flagFile (create a flag file once all files have been written) options can be useful in this case.
            • The transform does not require any spark executors, so it is possible to use @configure() to give it a driver only profile. Giving the driver additional memory should fix out of memory errors when working with larger datasets.
            • shutil.copyfileobj is used because the 'files' that are opened are actually just file objects.

            Full code snippet

            example_transform.py
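The full example_transform.py is not reproduced in this excerpt. As a rough, simplified sketch of just the copy step: fs_in and fs_out below are hypothetical stand-ins for the input and output dataset filesystem handles, and the file name is illustrative; the real Foundry transforms API is not shown here.

    import shutil

    def write_single_named_parquet_file(fs_in, fs_out, output_name="export.parquet"):
        # Validate that the input dataset holds exactly one parquet file.
        paths = [f.path for f in fs_in.ls(glob="*.parquet")]
        if len(paths) != 1:
            raise ValueError(f"expected exactly one parquet file, found {len(paths)}")
        # The opened 'files' are file objects, hence shutil.copyfileobj.
        with fs_in.open(paths[0], "rb") as src, fs_out.open(output_name, "wb") as dst:
            shutil.copyfileobj(src, dst)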

            Source https://stackoverflow.com/questions/70652943

            QUESTION

            Loading pandas DataFrame from parquet - lists are deserialized as numpy's ndarrays
            Asked 2022-Jan-19 at 13:46
            import pandas as pd
            df = pd.DataFrame({
                "col1" : ["a", "b", "c"],
                "col2" : [[1,2,3], [4,5,6,7], [8,9,10,11,12]]
            })
            df.to_parquet("./df_as_pq.parquet")
            df = pd.read_parquet("./df_as_pq.parquet")
            [type(val) for val in df["col2"].tolist()]
            
            ...

            ANSWER

            Answered 2021-Dec-15 at 09:24

You can't change this behavior in the API, either when loading the parquet file into an arrow table or when converting the arrow table to pandas.

But you can write your own function that looks at the schema of the arrow table and converts every list field to a Python list.
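A minimal sketch of such a helper (not part of the pandas or pyarrow API): it inspects the arrow schema and turns every list column back into Python lists after to_pandas().

    import pyarrow as pa
    import pyarrow.parquet as pq

    def read_parquet_with_lists(path):
        table = pq.read_table(path)
        df = table.to_pandas()
        # Convert list-typed columns from numpy arrays back to Python lists.
        for field in table.schema:
            if pa.types.is_list(field.type) or pa.types.is_large_list(field.type):
                df[field.name] = df[field.name].apply(list)
        return df

    df = read_parquet_with_lists("./df_as_pq.parquet")
    print([type(val) for val in df["col2"].tolist()])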

            Source https://stackoverflow.com/questions/70351937

            QUESTION

            FileNotFoundException on _temporary/0 directory when saving Parquet files
            Asked 2021-Dec-17 at 16:58

            Using Python on an Azure HDInsight cluster, we are saving Spark dataframes as Parquet files to an Azure Data Lake Storage Gen2, using the following code:

            ...

            ANSWER

            Answered 2021-Dec-17 at 16:58

            ABFS is a "real" file system, so the S3A zero rename committers are not needed. Indeed, they won't work. And the client is entirely open source - look into the hadoop-azure module.

The ADLS Gen2 store does have scale problems, but unless you are trying to commit 10,000 files or clean up massively deep directory trees, you won't hit these. If you do get error messages about failures to rename individual files and you are doing jobs of that scale, (a) talk to Microsoft about increasing your allocated capacity and (b) pick this up https://github.com/apache/hadoop/pull/2971

This isn't it, though. I would guess that you actually have multiple jobs writing to the same output path, and one is cleaning up while the other is setting up. In particular, they both seem to have a job ID of "0". Because the same job ID is being used, it isn't only task setup and task cleanup getting mixed up; it is possible that when job 1 commits, it includes the output from job 2 from all task attempts which have successfully been committed.

I believe this has been a known problem with Spark standalone deployments, though I can't find the relevant JIRA. SPARK-24552 is close, but should have been fixed in your version. SPARK-33402 ("Jobs launched in same second have duplicate MapReduce JobIDs") is about job IDs coming from the system current time, not 0. But you can try upgrading your Spark version to see if it goes away.

            My suggestions

1. Make sure your jobs are not writing to the same table simultaneously. Things will get in a mess.
2. Grab the most recent version of Spark you are happy with.

            Source https://stackoverflow.com/questions/70393987

            QUESTION

            Schema for pyarrow.ParquetDataset > partition columns
            Asked 2021-Dec-11 at 20:37
            1. I have a pandas DataFrame:
            ...

            ANSWER

            Answered 2021-Dec-11 at 12:02

I think you need to give ParquetDataset a hint of the partition keys' schema.
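A sketch of what that hint can look like, assuming a recent pyarrow where ParquetDataset accepts a partitioning argument; the partition column names, types, and dataset path are illustrative.

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # Declare the partition key types explicitly instead of letting them be inferred.
    partitioning = ds.partitioning(
        pa.schema([("year", pa.int16()), ("month", pa.int8())]),
        flavor="hive",
    )
    dataset = pq.ParquetDataset("path/to/dataset", partitioning=partitioning)
    table = dataset.read()
    print(table.schema)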

            Source https://stackoverflow.com/questions/70308728

            QUESTION

            How reproducible / deterministic is Parquet format?
            Asked 2021-Dec-09 at 03:55

            I'm seeking advice from people deeply familiar with the binary layout of Apache Parquet:

Having a data transformation F(a) = b, where F is fully deterministic and the exact same versions of the entire software stack (framework, arrow & parquet libraries) are used, how likely am I to get an identical binary representation of dataframe b on different hosts every time b is saved into Parquet?

In other words, how reproducible is Parquet at the binary level? When data is logically the same, what can cause binary differences?

• Can there be some uninitialized memory in between values due to alignment?
• Assuming all serialization settings (compression, chunking, use of dictionaries, etc.) are the same, can the result still drift?
            Context

            I'm working on a system for fully reproducible and deterministic data processing and computing dataset hashes to assert these guarantees.

My key goal has been to ensure that dataset b contains an identical set of records to dataset b' - this is of course very different from hashing a binary representation of Arrow/Parquet. Not wanting to deal with the reproducibility of storage formats, I've been computing logical data hashes in memory. This is slow but flexible, e.g. my hash stays the same even if records are re-ordered (which I consider an equivalent dataset).

            But when thinking about integrating with IPFS and other content-addressable storages that rely on hashes of files - it would simplify the design a lot to have just one hash (physical) instead of two (logical + physical), but this means I have to guarantee that Parquet files are reproducible.

            Update

            I decided to continue using logical hashing for now.

            I've created a new Rust crate arrow-digest that implements the stable hashing for Arrow arrays and record batches and tries hard to hide the encoding-related differences. The crate's README describes the hashing algorithm if someone finds it useful and wants to implement it in another language.

            I'll continue to expand the set of supported types as I'm integrating it into the decentralized data processing tool I'm working on.

            In the long term, I'm not sure logical hashing is the best way forward - a subset of Parquet that makes some efficiency sacrifices just to make file layout deterministic might be a better choice for content-addressability.

            ...

            ANSWER

            Answered 2021-Dec-05 at 04:30

At least in Arrow's implementation, I would expect (but haven't verified) the exact same input (including identical metadata), written in the same order and with the same configuration, to yield deterministic output (we try not to leave uninitialized values, for security reasons), assuming the compression algorithm chosen also makes a determinism guarantee. It is possible there is some hash-map iteration for metadata or elsewhere that might also break this assumption.

As @Pace pointed out, I would not rely on this and recommend against relying on it. There is nothing in the spec that guarantees this, and since the writer version is persisted when writing a file, you are guaranteed a breakage if you ever decide to upgrade. Things will also break if additional metadata is added or removed (I believe in the past there have been some bug fixes for round-tripping data sets that would have caused non-determinism).

So, in summary, this might or might not work today, but even if it does, I would expect it to be very brittle.

            Source https://stackoverflow.com/questions/70220970

            QUESTION

            Spark Dataset - "edit" parquet file for each row
            Asked 2021-Nov-26 at 09:09
            Context

            I am trying to use Spark/Scala in order to "edit" multiple parquet files (potentially 50k+) efficiently. The only edit that needs to be done is deletion (i.e. deleting records/rows) based on a given set of row IDs.

            The parquet files are stored in s3 as a partitioned DataFrame where an example partition looks like this:

            ...

            ANSWER

            Answered 2021-Nov-25 at 17:11

The s3path and ids parameters that are passed to deleteIDs are not actually a string and a set, respectively; they are Columns.

In order to operate over these values, you can instead create a UDF that accepts columns rather than intrinsic types, or you can collect your dataset, if it is small enough, so that you can use the values in the deleteIDs function directly. The former is likely your best bet if you want to take advantage of Spark's parallelism.
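For illustration, a rough sketch of the UDF approach in PySpark terms (the question uses Scala, but the idea is the same); the function body, column names, and sample data below are hypothetical.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("s3://bucket/part-1", ["id1", "id2"])], ["s3path", "ids"])

    # Hypothetical row-level function: it receives plain Python values
    # (one path string and a list of ids per row), not Column expressions.
    def delete_ids(s3path, ids):
        # ... perform the per-file deletion here ...
        return True

    delete_ids_udf = F.udf(delete_ids, BooleanType())

    # Spark evaluates the UDF per row, handing it values rather than Columns.
    result = df.withColumn("deleted", delete_ids_udf(F.col("s3path"), F.col("ids")))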

            You can read about UDFs here

            Source https://stackoverflow.com/questions/70113356

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install parquet

This will also install parquet's only two dependencies: thrift and snappy.

            Support

The struct used to define the parquet data can have the following types:
            Find more information at:
