parquet | A library for reading and writing parquet files | Serialization library

by parsyl | Go | Version: v0.7.1 | License: MIT

kandi X-RAY | parquet Summary

parquet is a Go library typically used in Utilities and Serialization applications. parquet has no bugs, it has no reported vulnerabilities, it has a Permissive License and it has low support. You can download it from GitHub.

Parquet generates a parquet reader and writer based on a struct. The struct can be defined by you or it can be generated by reading an existing parquet file. We (Parsyl) will respond to pull requests and issues to the best of our abilities. However, sometimes we will have higher priorities and the response might not be immediate.

NOTE: If you generate the code based on a parquet file, there are quite a few limitations. The PageType of each PageHeader must be DATA_PAGE and the Codec (defined in ColumnMetaData) must be PLAIN or SNAPPY. Also, the parquet file's schema must consist of the currently supported types. But wait, there's more! Some of the encodings, like DELTA_BINARY_PACKED, BIT_PACKED, PLAIN_DICTIONARY, and DELTA_BYTE_ARRAY, are also not supported. I would guess there are other parquet options that will cause problems since there are so many possibilities.

            kandi-support Support

parquet has a low active ecosystem.
It has 68 stars, 12 forks, and 7 watchers.
It had no major release in the last 12 months.
There are 0 open issues and 4 have been closed. On average, issues are closed in 18 days. There are no pull requests.
It has a neutral sentiment in the developer community.
The latest version of parquet is v0.7.1.

            kandi-Quality Quality

              parquet has 0 bugs and 0 code smells.

            kandi-Security Security

              parquet has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              parquet code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            kandi-License License

              parquet is licensed under the MIT License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              parquet releases are available to install and integrate.
              Installation instructions, examples and code snippets are available.
              It has 18824 lines of code, 1221 functions and 48 files.
It has high code complexity, which directly impacts the maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed parquet and discovered the below as its top functions. This is intended to give you an instant insight into parquet implemented functionality, and help decide if they suit your requirements.
• FromStruct creates a struct from a struct.
• newPerson generates a random person.
• ConvertedTypeFromString converts a string to a converted type.
• FromParquet converts a Parquet to a struct.
• getField returns the field corresponding to x.
• pageData extracts the page data from r.
• getChildren returns all children of a parent field.
• doReadRepeated reads a variable definition.
• NewParquetReader returns a new ParquetReader.
• write writes a ParquetWriter.

            parquet Key Features

            No Key Features are available at this moment for parquet.

            parquet Examples and Code Snippets

            No Code Snippets are available at this moment for parquet.

            Community Discussions

            QUESTION

            Can I convert RDD to DataFrame in Glue?
            Asked 2022-Mar-20 at 13:58

My Lambda function triggers a Glue job via boto3's glue.start_job_run,

and here is my Glue job script:

            ...

            ANSWER

            Answered 2022-Mar-20 at 13:58

You can't define schema types using toDF(); with toDF() you have no control over schema customization. Using createDataFrame(), on the other hand, you have complete control over the schema.

See the logic below:
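The answer's original snippet is not reproduced in this excerpt. Purely as an illustration, here is a minimal sketch of going from an RDD to a DataFrame with an explicit schema; the column names, types, and sample data are made up, and in a real Glue script the Spark session would come from the GlueContext.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # In a Glue script the session usually comes from the GlueContext; a plain
    # SparkSession behaves the same way for this illustration.
    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])

    # An explicit schema gives full control over column names and types,
    # unlike rdd.toDF(), which infers them.
    schema = StructType([
        StructField("id", IntegerType(), nullable=False),
        StructField("name", StringType(), nullable=True),
    ])

    df = spark.createDataFrame(rdd, schema=schema)
    df.printSchema()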

            Source https://stackoverflow.com/questions/71547278

            QUESTION

            Write custom metadata to Parquet file in Julia
            Asked 2022-Mar-05 at 18:36

I am currently storing the output (a Julia DataFrame) of my Julia simulation in a Parquet file using Parquet.jl. I would also like to save some of the simulation parameters (e.g. a list of (byte-)strings) to that same output file.

Preferably, these parameters would be different for each column, as each column is the result of different starting conditions of my code. However, I could also work with a global parameter list and then untangle it afterwards by indexing.

            I have found a solution for Python using pyarrow

            https://mungingdata.com/pyarrow/arbitrary-metadata-parquet-table/.
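For reference, a minimal sketch of the pyarrow approach described at that link: custom file-level metadata is attached to the Arrow schema before writing. The metadata keys, values, and file name here are illustrative.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"x": [1, 2, 3]})
    table = pa.Table.from_pandas(df)

    # Merge custom key/value pairs into the existing schema metadata
    # (parquet key/value metadata is stored as bytes).
    custom = {b"simulation_seed": b"42", b"git_rev": b"abc123"}
    merged = {**(table.schema.metadata or {}), **custom}
    pq.write_table(table.replace_schema_metadata(merged), "output.parquet")

    # Reading the metadata back:
    print(pq.read_table("output.parquet").schema.metadata)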

            Do you know a way how to do it in Julia?

            ...

            ANSWER

            Answered 2022-Mar-05 at 18:36

It's not quite done yet, and it's not registered, but my rewrite of the Julia parquet package, Parquet2.jl, does support both custom file metadata and individual column metadata (the keyword arguments metadata and column_metadata in Parquet2.writefile).

            I haven't gotten to documentation for writing yet, but if you are feeling adventurous you can give it a shot. I do expect to finish up this package and register it within the next couple of weeks. I don't have unit tests in for writing yet, so of course, if you try it and have problems, please open an issue.

It's probably also worth mentioning that the main use case I recommend parquet for is when you must have parquet for compatibility reasons. Most of the time, Julia users are probably better off with Arrow.jl, as the format has a number of advantages over parquet for most use cases; please see my FAQ answer on this. Of course, the reason I undertook writing the package is that parquet is arguably the only ubiquitous binary format in the "big data world", so a robust writer is desperately needed.

            Source https://stackoverflow.com/questions/71310140

            QUESTION

            Read / Write Parquet files without reading into memory (using Python)
            Asked 2022-Feb-28 at 11:12

            I looked at the standard documentation that I would expect to capture my need (Apache Arrow and Pandas), and I could not seem to figure it out.

            I know Python best, so I would like to use Python, but it is not a strict requirement.

            Problem

            I need to move Parquet files from one location (a URL) to another (an Azure storage account, in this case using the Azure machine learning platform, but this is irrelevant to my problem).

            These files are too large to simply perform pd.read_parquet("https://my-file-location.parquet"), since this reads the whole thing into an object.

            Expectation

            I thought that there must be a simple way to create a file object and stream that object line by line -- or maybe column chunk by column chunk. Something like

            ...

            ANSWER

            Answered 2021-Aug-24 at 06:21

This is possible but takes a little bit of work because, in addition to being columnar, Parquet also requires a schema.

            The rough workflow is:

            1. Open a parquet file for reading.

            2. Then use iter_batches to read back chunks of rows incrementally (you can also pass specific columns you want to read from the file to save IO/CPU).

            3. You can then transform each pa.RecordBatch from iter_batches further. Once you are done transforming the first batch you can get its schema and create a new ParquetWriter.

            4. For each transformed batch call write_table. You have to first convert it to a pa.Table.

            5. Close the files.

Parquet requires random access, so it can't easily be streamed from a URI (pyarrow should support reading it if you open the file via an HTTP FSSpec filesystem), but I think you might get blocked on writes.
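As a rough sketch of the workflow above with pyarrow (file paths, the batch size, and the transform step are placeholders):

    import pyarrow as pa
    import pyarrow.parquet as pq

    source = pq.ParquetFile("in.parquet")
    writer = None
    try:
        # Optionally pass columns=[...] to iter_batches to save IO/CPU.
        for batch in source.iter_batches(batch_size=64_000):
            transformed = batch  # transform the pa.RecordBatch here
            if writer is None:
                # Create the writer from the first transformed batch's schema.
                writer = pq.ParquetWriter("out.parquet", transformed.schema)
            # write_table expects a pa.Table, so convert the batch first.
            writer.write_table(pa.Table.from_batches([transformed]))
    finally:
        if writer is not None:
            writer.close()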

            Source https://stackoverflow.com/questions/68819790

            QUESTION

            Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0
            Asked 2022-Feb-10 at 13:45

            When switching from Glue 2.0 to 3.0, which means also switching from Spark 2.4 to 3.1.1, my jobs start to fail when processing timestamps prior to 1900 with this error:

            ...

            ANSWER

            Answered 2022-Feb-10 at 13:45

            I made it work by setting --conf to spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED.
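For illustration only (and not necessarily the workaround the Glue team suggests), the same four settings can also be applied from inside a PySpark script:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # in a Glue job this session already exists

    # Apply the legacy rebase settings from the script instead of via --conf.
    for key in (
        "spark.sql.legacy.parquet.int96RebaseModeInRead",
        "spark.sql.legacy.parquet.int96RebaseModeInWrite",
        "spark.sql.legacy.parquet.datetimeRebaseModeInRead",
        "spark.sql.legacy.parquet.datetimeRebaseModeInWrite",
    ):
        spark.conf.set(key, "CORRECTED")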

This is a workaround though, and the Glue dev team is working on a fix, although there is no ETA.

Also, this is still very buggy. You cannot call .show() on a DynamicFrame, for example; you need to call it on a DataFrame. All my jobs also failed where I call data_frame.rdd.isEmpty(), don't ask me why.

Update 24.11.2021: I reached out to the Glue dev team and they told me that this is the intended way of fixing it. There is a workaround that can be done inside of the script though:

            Source https://stackoverflow.com/questions/68891312

            QUESTION

            How can I have nice file names & efficient storage usage in my Foundry Magritte dataset export?
            Asked 2022-Feb-10 at 05:12

            I'm working on exporting data from Foundry datasets in parquet format using various Magritte export tasks to an ABFS system (but the same issue occurs with SFTP, S3, HDFS, and other file based exports).

            The datasets I'm exporting are relatively small, under 512 MB in size, which means they don't really need to be split across multiple parquet files, and putting all the data in one file is enough. I've done this by ending the previous transform with a .coalesce(1) to get all of the data in a single file.

            The issues are:

• By default the file name is part-0000-<rid>.snappy.parquet, with a different rid on every build. This means that, whenever a new file is uploaded, it appears in the same folder as an additional file; the only way to tell which is the newest version is by last modified date.
• Every version of the data is stored in my external system; this takes up unnecessary storage unless I frequently go in and delete old files.

All of this is unnecessary complexity being added to my downstream system; I just want to be able to pull the latest version of the data in a single step.

            ...

            ANSWER

            Answered 2022-Jan-13 at 15:27

            This is possible by renaming the single parquet file in the dataset so that it always has the same file name, that way the export task will overwrite the previous file in the external system.

            This can be done using raw file system access. The write_single_named_parquet_file function below validates its inputs, creates a file with a given name in the output dataset, then copies the file in the input dataset to it. The result is a schemaless output dataset that contains a single named parquet file.

            Notes

• The build will fail if the input contains more than one parquet file; as pointed out in the question, calling .coalesce(1) (or .repartition(1)) is necessary in the upstream transform.
• If you require transaction history in your external store, or your dataset is much larger than 512 MB, this method is not appropriate, as only the latest version is kept, and you likely want multiple parquet files for use in your downstream system. The createTransactionFolders (put each new export in a different folder) and flagFile (create a flag file once all files have been written) options can be useful in this case.
            • The transform does not require any spark executors, so it is possible to use @configure() to give it a driver only profile. Giving the driver additional memory should fix out of memory errors when working with larger datasets.
            • shutil.copyfileobj is used because the 'files' that are opened are actually just file objects.

            Full code snippet

            example_transform.py
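The full example_transform.py is not reproduced in this excerpt. As a rough, simplified sketch of just the copy step: fs_in and fs_out below are hypothetical stand-ins for the input and output dataset filesystem handles, and the file name is illustrative; the real Foundry transforms API is not shown here.

    import shutil

    def write_single_named_parquet_file(fs_in, fs_out, output_name="export.parquet"):
        # Validate that the input dataset holds exactly one parquet file.
        paths = [f.path for f in fs_in.ls(glob="*.parquet")]
        if len(paths) != 1:
            raise ValueError(f"expected exactly one parquet file, found {len(paths)}")
        # The opened 'files' are file objects, hence shutil.copyfileobj.
        with fs_in.open(paths[0], "rb") as src, fs_out.open(output_name, "wb") as dst:
            shutil.copyfileobj(src, dst)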

            Source https://stackoverflow.com/questions/70652943

            QUESTION

            Loading pandas DataFrame from parquet - lists are deserialized as numpy's ndarrays
            Asked 2022-Jan-19 at 13:46
            import pandas as pd
            df = pd.DataFrame({
                "col1" : ["a", "b", "c"],
                "col2" : [[1,2,3], [4,5,6,7], [8,9,10,11,12]]
            })
            df.to_parquet("./df_as_pq.parquet")
            df = pd.read_parquet("./df_as_pq.parquet")
            [type(val) for val in df["col2"].tolist()]
            
            ...

            ANSWER

            Answered 2021-Dec-15 at 09:24

You can't change this behavior in the API, either when loading the parquet file into an arrow table or when converting the arrow table to pandas.

But you can write your own function that looks at the schema of the arrow table and converts every list field to a Python list.
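A minimal sketch of such a helper (not part of the pandas or pyarrow API): it inspects the arrow schema and turns every list column back into Python lists after to_pandas().

    import pyarrow as pa
    import pyarrow.parquet as pq

    def read_parquet_with_lists(path):
        table = pq.read_table(path)
        df = table.to_pandas()
        # Convert list-typed columns from numpy arrays back to Python lists.
        for field in table.schema:
            if pa.types.is_list(field.type) or pa.types.is_large_list(field.type):
                df[field.name] = df[field.name].apply(list)
        return df

    df = read_parquet_with_lists("./df_as_pq.parquet")
    print([type(val) for val in df["col2"].tolist()])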

            Source https://stackoverflow.com/questions/70351937

            QUESTION

            FileNotFoundException on _temporary/0 directory when saving Parquet files
            Asked 2021-Dec-17 at 16:58

            Using Python on an Azure HDInsight cluster, we are saving Spark dataframes as Parquet files to an Azure Data Lake Storage Gen2, using the following code:

            ...

            ANSWER

            Answered 2021-Dec-17 at 16:58

            ABFS is a "real" file system, so the S3A zero rename committers are not needed. Indeed, they won't work. And the client is entirely open source - look into the hadoop-azure module.

The ADLS Gen2 store does have scale problems, but unless you are trying to commit 10,000 files or clean up massively deep directory trees, you won't hit these. If you do get error messages about failures to rename individual files and you are doing jobs of that scale, (a) talk to Microsoft about increasing your allocated capacity and (b) pick this up https://github.com/apache/hadoop/pull/2971

This isn't it, though. I would guess that you actually have multiple jobs writing to the same output path, and one is cleaning up while the other is setting up. In particular, they both seem to have a job ID of "0". Because the same job ID is being used, it isn't only task setup and task cleanup getting mixed up; it is possible that when job 1 commits, it includes the output from job 2 from all task attempts which have successfully been committed.

I believe this has been a known problem with Spark standalone deployments, though I can't find the relevant JIRA. SPARK-24552 is close, but should have been fixed in your version. SPARK-33402 ("Jobs launched in same second have duplicate MapReduce JobIDs") is about job IDs coming from the system current time, not 0. But you can try upgrading your Spark version to see if it goes away.

            My suggestions

1. Make sure your jobs are not writing to the same table simultaneously. Things will get in a mess.
2. Grab the most recent version of Spark you are happy with.

            Source https://stackoverflow.com/questions/70393987

            QUESTION

            Schema for pyarrow.ParquetDataset > partition columns
            Asked 2021-Dec-11 at 20:37
            1. I have a pandas DataFrame:
            ...

            ANSWER

            Answered 2021-Dec-11 at 12:02

I think you need to give ParquetDataset a hint of the partition keys' schema.
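A sketch of what that hint can look like, assuming a recent pyarrow where ParquetDataset accepts a partitioning argument; the partition column names, types, and dataset path are illustrative.

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # Declare the partition key types explicitly instead of letting them be inferred.
    partitioning = ds.partitioning(
        pa.schema([("year", pa.int16()), ("month", pa.int8())]),
        flavor="hive",
    )
    dataset = pq.ParquetDataset("path/to/dataset", partitioning=partitioning)
    table = dataset.read()
    print(table.schema)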

            Source https://stackoverflow.com/questions/70308728

            QUESTION

            How reproducible / deterministic is Parquet format?
            Asked 2021-Dec-09 at 03:55

            I'm seeking advice from people deeply familiar with the binary layout of Apache Parquet:

Having a data transformation F(a) = b, where F is fully deterministic and the exact same versions of the entire software stack (framework, arrow & parquet libraries) are used, how likely am I to get an identical binary representation of dataframe b on different hosts every time b is saved into Parquet?

In other words, how reproducible is Parquet at the binary level? When data is logically the same, what can cause binary differences?

• Can there be some uninitialized memory in between values due to alignment?
• Assuming all serialization settings (compression, chunking, use of dictionaries, etc.) are the same, can the result still drift?
            Context

            I'm working on a system for fully reproducible and deterministic data processing and computing dataset hashes to assert these guarantees.

My key goal has been to ensure that dataset b contains an identical set of records to dataset b' - this is of course very different from hashing a binary representation of Arrow/Parquet. Not wanting to deal with the reproducibility of storage formats, I've been computing logical data hashes in memory. This is slow but flexible, e.g. my hash stays the same even if records are re-ordered (which I consider an equivalent dataset).

            But when thinking about integrating with IPFS and other content-addressable storages that rely on hashes of files - it would simplify the design a lot to have just one hash (physical) instead of two (logical + physical), but this means I have to guarantee that Parquet files are reproducible.

            Update

            I decided to continue using logical hashing for now.

            I've created a new Rust crate arrow-digest that implements the stable hashing for Arrow arrays and record batches and tries hard to hide the encoding-related differences. The crate's README describes the hashing algorithm if someone finds it useful and wants to implement it in another language.

            I'll continue to expand the set of supported types as I'm integrating it into the decentralized data processing tool I'm working on.

            In the long term, I'm not sure logical hashing is the best way forward - a subset of Parquet that makes some efficiency sacrifices just to make file layout deterministic might be a better choice for content-addressability.

            ...

            ANSWER

            Answered 2021-Dec-05 at 04:30

At least in Arrow's implementation, I would expect (but haven't verified) the exact same input (including identical metadata), written in the same order and with the same configuration, to yield deterministic output (we try not to leave uninitialized values, for security reasons), assuming the compression algorithm chosen also makes a determinism guarantee. It is possible there is some hash-map iteration for metadata or elsewhere that might also break this assumption.

As @Pace pointed out, I would not rely on this and recommend against relying on it. There is nothing in the spec that guarantees this, and since the writer version is persisted when writing a file, you are guaranteed a breakage if you ever decide to upgrade. Things will also break if additional metadata is added or removed (I believe in the past there have been some bug fixes for round-tripping data sets that would have caused non-determinism).

So, in summary, this might or might not work today, but even if it does, I would expect it to be very brittle.

            Source https://stackoverflow.com/questions/70220970

            QUESTION

            Spark Dataset - "edit" parquet file for each row
            Asked 2021-Nov-26 at 09:09
            Context

            I am trying to use Spark/Scala in order to "edit" multiple parquet files (potentially 50k+) efficiently. The only edit that needs to be done is deletion (i.e. deleting records/rows) based on a given set of row IDs.

            The parquet files are stored in s3 as a partitioned DataFrame where an example partition looks like this:

            ...

            ANSWER

            Answered 2021-Nov-25 at 17:11

The s3path and ids parameters that are passed to deleteIDs are not actually a string and a set, respectively; they are Columns.

In order to operate over these values, you can instead create a UDF that accepts columns rather than intrinsic types, or you can collect your dataset, if it is small enough, so that you can use the values in the deleteIDs function directly. The former is likely your best bet if you want to take advantage of Spark's parallelism.
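For illustration, a rough sketch of the UDF approach in PySpark terms (the question uses Scala, but the idea is the same); the function body, column names, and sample data below are hypothetical.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("s3://bucket/part-1", ["id1", "id2"])], ["s3path", "ids"])

    # Hypothetical row-level function: it receives plain Python values
    # (one path string and a list of ids per row), not Column expressions.
    def delete_ids(s3path, ids):
        # ... perform the per-file deletion here ...
        return True

    delete_ids_udf = F.udf(delete_ids, BooleanType())

    # Spark evaluates the UDF per row, handing it values rather than Columns.
    result = df.withColumn("deleted", delete_ids_udf(F.col("s3path"), F.col("ids")))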

            You can read about UDFs here

            Source https://stackoverflow.com/questions/70113356

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install parquet

This will also install parquet's only two dependencies: thrift and snappy.

            Support

The struct used to define the parquet data can have the following types:
            Find more information at:
