parquet-tools | parquet-tools and dependency jar files

 by viirya · Shell · Version: Current · License: No License

kandi X-RAY | parquet-tools Summary

parquet-tools is a Shell library typically used in Big Data, Node.js, and Hadoop applications. parquet-tools has no bugs and no vulnerabilities, and it has low support. You can download it from GitHub.

parquet-tools and dependency jar files

            kandi-support Support

              parquet-tools has a low active ecosystem.
              It has 8 star(s) with 1 fork(s). There is 1 watcher for this library.
              It had no major release in the last 6 months.
              parquet-tools has no issues reported. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of parquet-tools is current.

            kandi-Quality Quality

              parquet-tools has no bugs reported.

            kandi-Security Security

              parquet-tools has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            kandi-License License

              parquet-tools does not have a standard license declared.
              Check the repository for any license declaration and review the terms closely.
              Without a license, all rights are reserved, and you cannot use the library in your applications.

            kandi-Reuse Reuse

              parquet-tools releases are not available. You will need to build from source code and install.
              Installation instructions are not available. Examples and code snippets are available.

            parquet-tools Key Features

            No Key Features are available at this moment for parquet-tools.

            parquet-tools Examples and Code Snippets

            No Code Snippets are available at this moment for parquet-tools.

            Community Discussions

            QUESTION

            Issue with loading Parquet data into Snowflake Cloud Database when written with v1.11.0
            Asked 2020-Jun-22 at 09:19

            I am new to Snowflake, but my company has been using it successfully.

            Parquet files are currently being written with an existing Avro Schema, using Java parquet-avro v1.10.1.

            I have been updating the dependencies in order to use latest Avro, and part of that bumped Parquet to 1.11.0.

            The Avro Schema is unchanged. However, when using the COPY INTO Snowflake command, I receive a LOAD FAILED with the error Error parsing the parquet file: Logical type Null can not be applied to group node, but no other error details.

            The problem is that there are no null columns in the files.

            I've cut the Avro schema down, and found that the presence of a MAP type in the Avro schema is causing the issue.

            The field is

            ...

            ANSWER

            Answered 2020-Jun-22 at 09:19

            Logical type Null can not be applied to group node

            Looking up the error above, it appears that a version of Apache Arrow's parquet libraries is being used to read the file.

            However, looking closer, the real problem lies in the use of legacy types within the Avro based Parquet Writer implementation (the following assumes Java was used to write the files).

            The new logicalTypes schema metadata introduced in Parquet defines many types, including a singular MAP type. Historically, the former convertedTypes schema field supported use of MAP and MAP_KEY_VALUE for legacy readers. New writers that use logicalTypes (1.11.0+) should not be using the legacy map type anymore, but work hasn't been done yet to update the Avro-to-Parquet schema conversions to drop the MAP_KEY_VALUE types entirely.

            As a result, the schema field for MAP_KEY_VALUE gets written out with an UNKNOWN value of logicalType, which trips up Arrow's implementation that only understands logicalType values of MAP and LIST (understandably).

            Consider logging this as a bug against the Apache Parquet project to update their Avro writers to stop nesting the legacy MAP_KEY_VALUE type when transforming an Avro schema to a Parquet one. It should've ideally been done as part of PARQUET-1410.

            Unfortunately this is hard-coded behaviour and there are no configuration options that influence map-types that can aid in producing a correct file for Apache Arrow (and for Snowflake by extension). You'll need to use an older version of the writer until a proper fix is released by the Apache Parquet developers.

            Source https://stackoverflow.com/questions/62504757

            QUESTION

            How to load key-value pairs (MAP) into Athena from Parquet file?
            Asked 2020-Jun-21 at 20:49

            I have an S3 bucket full of .gz.parquet files. I want to make them accessible in Athena. In order to do this I am creating a table in Athena that points at the s3 bucket:

            ...

            ANSWER

            Answered 2020-Jun-17 at 20:16

            You can use an AWS Glue Crawler to automatically derive the schema from your Parquet files.

            Defining AWS Glue Crawlers: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html

            Source https://stackoverflow.com/questions/62436918
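
            If you do want to define the table by hand instead of crawling it, Athena's Hive-style DDL supports MAP columns directly. A sketch with hypothetical database, table, column, and bucket names:

```sql
-- Hypothetical names; adjust the columns and S3 location to your data.
CREATE EXTERNAL TABLE my_db.events (
  id string,
  attributes map<string,string>
)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/'
TBLPROPERTIES ('parquet.compression' = 'GZIP');
```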

            QUESTION

            INT32 type error when scanning parquet federated table. Bug or Expected behavior?
            Asked 2020-Apr-13 at 15:53

            I am using BigQuery to query an external data source (also known as a federated table), where the source data is a hive-partitioned parquet table stored in google cloud storage. I used this guide to define the table.

            My first query to test this table looks like the following

            ...

            ANSWER

            Answered 2020-Apr-13 at 15:53

            Note that the schema of the external table is inferred from the last file, sorted lexicographically by file name, among all files that match the source URI of the table. It is therefore possible that one particular Parquet file in your case has a different schema than the one you described, e.g., an INT32 column with a DATE logical type for the "visitor_partition" field, which BigQuery would infer as DATE type.

            Source https://stackoverflow.com/questions/61119928

            QUESTION

            spark parquet enable dictionary
            Asked 2020-Mar-27 at 23:18

            I am running a spark job to write to parquet. I want to enable dictionary encoding for the files written. When I check the files, I see they are 'plain dictionary'. However, I do not see any stats for these columns

            Let me know if I am missing anything

            ...

            ANSWER

            Answered 2020-Mar-27 at 23:18

            Got the answer: the parquet-tools version I was using was 1.6. Upgrading to 1.10 solved the issue.

            Source https://stackoverflow.com/questions/60819720

            QUESTION

            parquet int96 timestamp conversion to datetime/date via python
            Asked 2020-Mar-16 at 19:01
            TL;DR

            I'd like to convert an int96 value such as ACIE4NxJAAAKhSUA into a readable timestamp format like 2020-03-02 14:34:22, or whatever it could normally be interpreted as. I mostly use Python, so I'm looking to build a function that does this conversion; if there's another function that can do the reverse, even better.

            Background

            I'm using parquet-tools to convert a raw parquet file (with snappy compression) to raw JSON via this command:

            ...

            ANSWER

            Answered 2020-Mar-16 at 19:01

            parquet-tools will not be able to change the format type from INT96 to INT64. What you are observing in the JSON output is a string representation of the timestamp stored as INT96 TimestampType. You will need Spark to re-write this parquet with the timestamp as INT64 TimestampType, and then the JSON output will produce a timestamp in the format you desire.

            You will need to set a specific config in Spark -

            Source https://stackoverflow.com/questions/60698056
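
            If you only need to read such values in Python, the INT96 layout is decodable by hand: the first 8 bytes are little-endian nanoseconds within the day and the last 4 bytes are the little-endian Julian day number. A sketch of the forward conversion:

```python
import base64
import struct
from datetime import datetime, timedelta

JULIAN_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01

def int96_to_datetime(encoded: str) -> datetime:
    """Decode a base64-encoded Parquet INT96 timestamp."""
    raw = base64.b64decode(encoded)            # 12 bytes, little-endian
    nanos, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_UNIX_EPOCH
    return datetime(1970, 1, 1) + timedelta(days=days, microseconds=nanos // 1000)

print(int96_to_datetime("ACIE4NxJAAAKhSUA"))  # 2020-02-10 22:33:33
```

Note this truncates sub-microsecond precision, since Python datetimes only resolve to microseconds.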

            QUESTION

            How to edit Parquet file header programmatically or using editor?
            Asked 2019-Nov-08 at 03:16

            Using parquet-tools I can view the header, but I don't have a way to edit it.

            parquet-tools head file.parquet

            Do we have a way to edit the header using java code programmatically or using editor?

            ...

            ANSWER

            Answered 2019-Nov-08 at 03:16

            Parquet files are immutable, so if you need to modify a file you generally need to create a new file with the modifications and replace the old file with it.

            Source https://stackoverflow.com/questions/58759845

            QUESTION

            Effectively merge big parquet files
            Asked 2019-Sep-11 at 14:36

            I'm using parquet-tools to merge parquet files, but it seems that parquet-tools needs an amount of memory as big as the merged file. Do we have other ways, or configurable options in parquet-tools, to use memory more effectively? I run the merge job as a map job in a Hadoop environment, and the container gets killed every time because it uses more memory than it is given.

            Thank you.

            ...

            ANSWER

            Answered 2019-Sep-11 at 14:36

            I wouldn't recommend using parquet-tools merge, since it just places row groups one after another, so you will still have small groups, just packed together in a single file. The resulting file will typically not have noticeably better performance, and under certain circumstances it may even perform worse than separate files. See PARQUET-1115 for details.

            Currently the only proper way to merge Parquet files is to read all data from them and write it to a new Parquet file. You can do it with a MapReduce job (requires writing custom code for this purpose) or using Spark, Hive or Impala.

            Source https://stackoverflow.com/questions/50299815

            QUESTION

            parquet file size, firehose vs. spark
            Asked 2019-Jul-11 at 12:12

            I'm generating Parquet files via two methods: a Kinesis Firehose and a Spark job. They are both written into the same partition structure on S3. Both sets of data can be queried using the same Athena table definition. Both use gzip compression.

            I'm noticing, however, that the Parquet files generated by Spark are about 3x as large as those from Firehose. Any reason this should be the case? I do notice some schema and metadata differences when I load them using Pyarrow:

            ...

            ANSWER

            Answered 2019-Jul-05 at 01:57

            Two things that I can think of that could contribute to the difference.
            1. Parquet properties.
            In Spark, you could find all the properties related to Parquet using the following snippets.
            If properties were set using Hadoop configs,

            Source https://stackoverflow.com/questions/56813435

            QUESTION

            View schema in parquet with command-line parquet-tools
            Asked 2019-Jun-06 at 13:30

            I'm trying to run a parquet-tools command to only view the file schema of my parquet file.

            I am currently running:

            ...

            ANSWER

            Answered 2019-Jun-06 at 13:13

            QUESTION

            HIVE_METASTORE_ERROR expected 'STRING' but 'STRING' is found
            Asked 2019-Apr-03 at 23:13

            I've been unable to get any query to work against my AWS Glue Partitioned table. The error I'm getting is

            HIVE_METASTORE_ERROR: com.facebook.presto.spi.PrestoException: Error: type expected at the position 0 of 'STRING' but 'STRING' is found. (Service: null; Status Code: 0; Error Code: null; Request ID: null)

            I've found one other thread that brings up the fact that the database name and table cannot have characters other than alphanumeric and underscores. So, I made sure the database name, table name and all column names adhere to this restriction. The only object that does not adhere to this restriction is my s3 bucket name which would be very difficult to change.

            Here are the table definitions and parquet-tools dumps of the data.

            AWS Glue Table Definition ...

            ANSWER

            Answered 2019-Jan-18 at 17:13

            I was declaring the partition keys as fields in the table. I also ran into the Parquet vs Hive difference in TIMESTAMP and switched those to ISO8601 strings. From there, I pretty much gave up, because Athena throws a schema error if any parquet file in the S3 bucket does not have the same schema as the Athena table. However, with optional fields and sparse columns this is guaranteed to happen.

            Source https://stackoverflow.com/questions/53936037

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install parquet-tools

            You can download it from GitHub.

            Support

            For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.
            CLONE
          • HTTPS: https://github.com/viirya/parquet-tools.git
          • CLI: gh repo clone viirya/parquet-tools
          • SSH: git@github.com:viirya/parquet-tools.git
