tpcds-kit | TPC-DS benchmark kit with some modifications/fixes | SQL Database library

by gregrahn · C · Version: Current · License: No License

kandi X-RAY | tpcds-kit Summary

tpcds-kit is a C library typically used in Database and SQL Database applications. It has no reported bugs or vulnerabilities, and it has low support. You can download it from GitHub.

TPC-DS benchmark kit with some modifications/fixes

Support

tpcds-kit has a low-activity ecosystem.
It has 268 stars, 181 forks, and 11 watchers.
It had no major release in the last 6 months.
There are 15 open issues and 38 closed issues; on average, issues are closed in 154 days. There are 2 open pull requests and 0 closed pull requests.
It has a neutral sentiment in the developer community.
The latest version of tpcds-kit is current.

Quality

              tpcds-kit has 0 bugs and 0 code smells.

Security

              tpcds-kit has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              tpcds-kit code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

License

              tpcds-kit does not have a standard license declared.
              Check the repository for any license declaration and review the terms closely.
              Without a license, all rights are reserved, and you cannot use the library in your applications.

Reuse

              tpcds-kit releases are not available. You will need to build from source code and install.
              Installation instructions are not available. Examples and code snippets are available.


            tpcds-kit Key Features

            No Key Features are available at this moment for tpcds-kit.

            tpcds-kit Examples and Code Snippets

            No Code Snippets are available at this moment for tpcds-kit.

            Community Discussions

            QUESTION

            Snowflake query performance is unexpectedly slower for external Parquet tables vs. internal tables
            Asked 2022-Feb-07 at 14:34

When I run queries on external Parquet tables in Snowflake, the queries are orders of magnitude slower than on the same tables copied into Snowflake, and slower than on any other cloud data warehouse I have tested against the same files.

            Context:

I have tables belonging to the 10TB TPC-DS dataset in Parquet format on GCS, and a Snowflake account in the same region (US Central). I have loaded those tables into Snowflake using CREATE TABLE AS SELECT. I can run TPC-DS queries (here #28) on these internal tables with excellent performance. I was also able to query those files on GCS directly with data lake engines with excellent performance, as the files are "optimally" sized and internally sorted. However, when I query the same external tables on Snowflake, the query does not seem to finish in reasonable time (>4 minutes and counting, as opposed to 30 seconds, on the same virtual warehouse). Looking at the query profile, it seems that the number of records read in the table scans keeps growing indefinitely, resulting in a proportional amount of spilling to disk.

The table happens to be partitioned, but that does not matter for the query of interest (which I tested with other engines).

            What I would expect:

Assuming proper data "formatting", I would expect no major performance degradation compared to internal tables, as the setup is technically the same (data stored in columnar format in a cloud object store), and it is advertised as such by Snowflake. For example, I saw no performance degradation with BigQuery in the exact same experiment.

Other than double-checking my setup, I don't see many things to try...

This is what the "in progress" part of the plan looks like 4 minutes into execution on the external table. All other operators are at 0% progress. You can see that external bytes scanned equals bytes spilled, and that 26G (!) rows are produced. And this is what it looked like on a finished execution on the internal table, which completed in ~20 seconds. You can see that the left-most table scan should produce 1.4G rows but had already produced 23G rows with the external table.

            This is a sample of the DDL I used (I also tested without defining the partitioning column):

            ...
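(The DDL itself is elided above. As a rough illustration only, a Snowflake external table over Parquet files on GCS typically looks like the sketch below; the stage, integration, bucket, and partition-column derivation are all hypothetical, not the asker's actual DDL.)

    -- All names here are hypothetical; adjust to your account and bucket.
    CREATE OR REPLACE STAGE tpcds_stage
      URL = 'gcs://my-tpcds-bucket/10tb/store_sales/'
      STORAGE_INTEGRATION = my_gcs_integration;

    -- External table whose partition column is parsed from the file path.
    CREATE OR REPLACE EXTERNAL TABLE store_sales_ext (
      ss_sold_date_sk NUMBER AS
        (TO_NUMBER(SPLIT_PART(SPLIT_PART(METADATA$FILENAME, 'ss_sold_date_sk=', 2), '/', 1)))
    )
    PARTITION BY (ss_sold_date_sk)
    LOCATION = @tpcds_stage
    FILE_FORMAT = (TYPE = PARQUET);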

            ANSWER

            Answered 2022-Jan-18 at 12:20

The Snowflake plan probably assumes it must read every Parquet file, because it cannot tell beforehand whether the files are sorted, or what the number of unique values, null counts, and minimum and maximum values are for each column.

This information is stored as optional fields in Parquet, but you need to read the Parquet metadata first to find out whether it is there.

When Snowflake uses internal tables, it has full control over storage, and it has information about indexes (if any) and column statistics, so it can optimize a query from both a logical and a physical perspective.
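(To make the Parquet point concrete: a short Python sketch using pyarrow, with a hypothetical file name, that reads the optional per-column statistics an engine would need for pruning.)

    # Hypothetical file name; any Parquet file written with statistics will do.
    import pyarrow.parquet as pq

    md = pq.ParquetFile("store_sales_part0.parquet").metadata
    for i in range(md.num_row_groups):
        stats = md.row_group(i).column(0).statistics  # first column; None if the writer omitted stats
        if stats is not None and stats.has_min_max:
            print(i, stats.min, stats.max, stats.null_count)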

            Source https://stackoverflow.com/questions/70755218

            QUESTION

            Spark error when running TPCDS benchmark datasets - Could not find dsdgen
            Asked 2020-Mar-29 at 08:29

I'm trying to build the TPC-DS benchmark datasets by following this website:

            https://xuechendi.github.io/2019/07/12/Prepare-TPCDS-For-Spark

When I run this:

            ...

            ANSWER

            Answered 2020-Mar-29 at 08:29
            Could not find dsdgen at /home/troberts/spark-sql-perf/tpcds-kit/tools/dsdgen or //home/troberts/spark-sql-perf/tpcds-kit/tools/dsdgen
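The error means the spark-sql-perf harness could not find a compiled dsdgen binary at the configured path. A minimal sketch of a likely fix, assuming a Linux host and taking the checkout path from the error message, is to clone tpcds-kit and build the tools first (build steps per the tpcds-kit README):

    # Path taken from the error message; adjust to your own checkout.
    cd /home/troberts/spark-sql-perf
    git clone https://github.com/gregrahn/tpcds-kit.git
    cd tpcds-kit/tools
    make OS=LINUX      # per the README; use OS=MACOS on macOS
    ls dsdgen          # the harness expects the binary at tpcds-kit/tools/dsdgen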
            

            Source https://stackoverflow.com/questions/60906687

Community Discussions and Code Snippets contain sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install tpcds-kit

            You can download it from GitHub.
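Since no binary releases are published, the tools are built from source. A minimal sketch for a Debian/Ubuntu host, following the build steps in the repository README (package names may differ on other platforms):

    # Build dependencies named in the tpcds-kit README (Debian/Ubuntu package names).
    sudo apt-get install gcc make flex bison byacc git
    git clone https://github.com/gregrahn/tpcds-kit.git
    cd tpcds-kit/tools
    make OS=LINUX

    # Smoke test: generate scale factor 1 (about 1 GB) into an existing directory.
    mkdir -p /tmp/tpcds-data
    ./dsdgen -scale 1 -dir /tmp/tpcds-data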

            Support

For any new features, suggestions, or bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.
            CLONE
          • HTTPS

            https://github.com/gregrahn/tpcds-kit.git

          • CLI

            gh repo clone gregrahn/tpcds-kit

• SSH

            git@github.com:gregrahn/tpcds-kit.git
