tpcds-kit | TPC-DS benchmark kit with some modifications/fixes | SQL Database library
kandi X-RAY | tpcds-kit Summary
TPC-DS benchmark kit with some modifications/fixes
tpcds-kit Key Features
tpcds-kit Examples and Code Snippets
Community Discussions
Trending Discussions on tpcds-kit
QUESTION
When I run queries on external Parquet tables in Snowflake, the queries are orders of magnitude slower than on the same tables copied into Snowflake, or than on any other cloud data warehouse I have tested against the same files.
Context:
I have tables belonging to the 10TB TPC-DS dataset in Parquet format on GCS and a Snowflake account in the same region (US Central). I have loaded those tables into Snowflake using create as select. I can run TPC-DS queries (here #28) on these internal tables with excellent performance. I was also able to query those files on GCS directly with data lake engines with excellent performance, as the files are "optimally" sized and internally sorted. However, when I query the same external tables on Snowflake, the query does not seem to finish in a reasonable time (>4 minutes and counting, as opposed to 30 seconds, on the same virtual warehouse). Looking at the query profile, it seems that the number of records read in the table scans keeps growing indefinitely, resulting in a proportional amount of spilling to disk.
The table happens to be partitioned, but it does not matter for the query of interest (which I tested with other engines).
What I would expect:
Assuming proper data "formatting", I would expect no major performance degradation compared to internal tables, as the setup is technically the same - data stored in columnar format in a cloud object store - and Snowflake advertises it as such. For example, I saw no performance degradation with BigQuery in the exact same experiment.
Other than double-checking my setup, I don't see many things to try...
This is what the "in progress" part of the plan looks like 4 minutes into execution on the external table. All other operators are at 0% progress. You can see that external bytes scanned = bytes spilled and that 26G(!) rows have been produced. And this is what it looked like on a finished execution against the internal table, which completed in ~20 seconds. You can see that the left-most table scan should produce 1.4G rows, but it had already produced 23G rows with the external table.
This is a sample of the DDL I used (I also tested without defining the partitioning column):
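The asker's actual DDL is truncated in this excerpt. As a rough illustration only, a Snowflake external table over Parquet files staged on GCS is typically declared along the following lines; the stage, table, and column names here are hypothetical, not the asker's:

    -- Illustrative sketch (not the asker's DDL): external table over Parquet
    -- on a GCS stage, with columns projected from the Parquet VALUE variant.
    -- Stage and column names are hypothetical.
    CREATE OR REPLACE EXTERNAL TABLE ext_store_sales (
        ss_sold_date_sk NUMBER      AS (value:ss_sold_date_sk::NUMBER),
        ss_item_sk      NUMBER      AS (value:ss_item_sk::NUMBER),
        ss_customer_sk  NUMBER      AS (value:ss_customer_sk::NUMBER),
        ss_quantity     NUMBER      AS (value:ss_quantity::NUMBER),
        ss_list_price   NUMBER(7,2) AS (value:ss_list_price::NUMBER(7,2))
    )
    LOCATION = @tpcds_gcs_stage/store_sales/
    FILE_FORMAT = (TYPE = PARQUET)
    AUTO_REFRESH = FALSE;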
...ANSWER
Answered 2022-Jan-18 at 12:20
Probably the Snowflake plan assumes it must read every Parquet file, because it cannot tell beforehand whether the files are sorted, or what the number of unique values, null counts, and minimum and maximum values are for each column.
This information is stored in optional fields of the Parquet format, but you'd need to read the Parquet metadata first to find out.
When Snowflake uses internal tables, it has full control over storage, and it has information about indexes (if any) and column statistics, so it knows how to optimize a query from both a logical and a physical perspective.
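For comparison, the internal tables mentioned in the question were created with create as select. A minimal sketch of that step (table names hypothetical, continuing the example above) looks like this:

    -- Minimal sketch: materialize the external table into a native Snowflake
    -- table, so Snowflake manages the storage and collects micro-partition
    -- metadata and column statistics it can use for pruning.
    CREATE OR REPLACE TABLE store_sales AS
    SELECT *
    FROM ext_store_sales;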
QUESTION
I'm trying to build the TPC-DS benchmark datasets by following this website:
https://xuechendi.github.io/2019/07/12/Prepare-TPCDS-For-Spark
When I run this:
...ANSWER
Answered 2020-Mar-29 at 08:29
Could not find dsdgen at /home/troberts/spark-sql-perf/tpcds-kit/tools/dsdgen or //home/troberts/spark-sql-perf/tpcds-kit/tools/dsdgen
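Only the error message survives in this extract; it typically means the dsdgen binary has not been compiled at the path spark-sql-perf was pointed to. A minimal sketch of building the kit's tools on Linux, assuming the gregrahn/tpcds-kit repository (which matches this page's description) and a Debian/Ubuntu toolchain:

    # Build prerequisites (Debian/Ubuntu; package names may differ elsewhere)
    sudo apt-get install -y gcc make flex bison byacc git

    # Clone the kit and build the data/query generators
    git clone https://github.com/gregrahn/tpcds-kit.git
    cd tpcds-kit/tools
    make OS=LINUX

    # Sanity check: dsdgen should now exist and generate a small dataset
    mkdir -p /tmp/tpcds-data
    ./dsdgen -SCALE 1 -DIR /tmp/tpcds-data

The resulting tools directory (containing dsdgen) is what the path configured in spark-sql-perf should point to.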
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install tpcds-kit
Support