kandi X-RAY | spark-sql-perf Summary
This is a performance testing framework for Spark SQL in Apache Spark 2.2+.
Community Discussions
Trending Discussions on spark-sql-perf
QUESTION
I am running a TPC-DS benchmark for Spark 3.0.1 in local mode and using sparkMeasure to get workload statistics. I have 16 cores in total, and the SparkContext is reported as:
Spark context available as 'sc' (master = local[*], app id = local-1623251009819)
Q1. For local[*], the driver and executors are created in a single JVM with 16 threads. Considering Spark's configuration, which of the following will be true?
- 1 worker instance, 1 executor having 16 cores/threads
- 1 worker instance, 16 executors each having 1 core
For a particular query, sparkMeasure reports shuffle data as follows
shuffleRecordsRead => 183364403
shuffleTotalBlocksFetched => 52582
shuffleLocalBlocksFetched => 52582
shuffleRemoteBlocksFetched => 0
shuffleTotalBytesRead => 1570948723 (1498.0 MB)
shuffleLocalBytesRead => 1570948723 (1498.0 MB)
shuffleRemoteBytesRead => 0 (0 Bytes)
shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
shuffleBytesWritten => 1570948723 (1498.0 MB)
shuffleRecordsWritten => 183364480
Q2. Regardless of the query specifics, why is there data shuffling when everything is inside a single JVM?
...ANSWER
Answered 2021-Jun-11 at 05:56
- An executor is a JVM process. When you use local[*], you run Spark locally with as many worker threads as there are logical cores on your machine, so you get 1 executor with as many worker threads as logical cores. When you set SPARK_WORKER_INSTANCES=5 in spark-env.sh and run start-master.sh and start-slave.sh spark://local:7077 to bring up a standalone Spark cluster on your local machine, you get one master and 5 workers. To submit your application to this cluster, you must configure it like SparkSession.builder().appName("app").master("spark://localhost:7077"); in this case you cannot specify [*] or [2], for example. When you set the master to local[*], however, a single JVM process is created, the master and all workers live in that process, and the JVM instance is destroyed after your application finishes. local[*] and spark://localhost:7077 are two separate things.
- Workers do their job using tasks, and each task is actually a thread, i.e. task = thread. Workers have memory, and they assign a memory partition to each task so that it can do its job, such as reading part of a dataset into its own memory partition or applying a transformation to the data it has read. When a task such as a join needs other partitions, a shuffle occurs regardless of whether the job runs on a cluster or locally. On a cluster, two tasks may be on different machines, so network transmission is added to the other work, such as one task writing its result and another task reading it. In local mode, if task B needs the data in task A's partition, task A has to write it out and task B then reads it to do its job.
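The write-then-read handoff described in the answer can be imitated with plain Python threads. This is a toy model only, not Spark's shuffle implementation: the names map_task, reduce_task, run_local_shuffle and NUM_REDUCERS are made up for illustration, and the block files are a stand-in for Spark's shuffle files. It shows that tasks running as threads in one process still write partitioned blocks and read them back, and that every block fetched is local.

```python
import os
import tempfile
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_REDUCERS = 4  # hypothetical number of shuffle partitions


def map_task(map_id, records, shuffle_dir):
    """Task A: hash-partition (key, value) records and write one block per reducer."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % NUM_REDUCERS].append((key, value))
    for reduce_id, rows in buckets.items():
        path = os.path.join(shuffle_dir, f"shuffle_{map_id}_{reduce_id}.txt")
        with open(path, "w") as f:
            for key, value in rows:
                f.write(f"{key}\t{value}\n")
    return sum(len(rows) for rows in buckets.values())  # records written


def reduce_task(reduce_id, num_maps, shuffle_dir):
    """Task B: read every block written for this partition and sum values per key."""
    totals = defaultdict(int)
    blocks = 0
    for map_id in range(num_maps):
        path = os.path.join(shuffle_dir, f"shuffle_{map_id}_{reduce_id}.txt")
        if not os.path.exists(path):
            continue
        blocks += 1
        with open(path) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                totals[int(key)] += int(value)
    return dict(totals), blocks


def run_local_shuffle(partitions):
    """Run all map tasks, then all reduce tasks, as threads in this one process."""
    with tempfile.TemporaryDirectory() as shuffle_dir:
        with ThreadPoolExecutor(max_workers=os.cpu_count() or 1) as pool:
            written = sum(pool.map(
                lambda args: map_task(args[0], args[1], shuffle_dir),
                enumerate(partitions)))
            results = list(pool.map(
                lambda r: reduce_task(r, len(partitions), shuffle_dir),
                range(NUM_REDUCERS)))
    merged, local_blocks = {}, 0
    for totals, blocks in results:
        merged.update(totals)  # each key lands in exactly one reducer
        local_blocks += blocks
    return merged, written, local_blocks


# Three input partitions, each holding the keys 0..9 with value 1.
partitions = [[(k, 1) for k in range(10)] for _ in range(3)]
totals, records_written, local_blocks = run_local_shuffle(partitions)
print(totals, records_written, local_blocks)
```

Nothing crosses a network here, which mirrors the metrics in the question: the local block and byte counters equal the totals, and the remote counters are 0.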
QUESTION
I'm trying to build the TPCDS benchmark datasets by following this website:
https://xuechendi.github.io/2019/07/12/Prepare-TPCDS-For-Spark
When I run this:
...ANSWER
Answered 2020-Mar-29 at 08:29
Could not find dsdgen at /home/troberts/spark-sql-perf/tpcds-kit/tools/dsdgen or //home/troberts/spark-sql-perf/tpcds-kit/tools/dsdgen
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install spark-sql-perf
Before running any query, a dataset needs to be set up by creating a Benchmark object. Generating the TPCDS data requires dsdgen to be built and available on the machines. We have a fork of dsdgen that you will need. The fork includes changes that write TPCDS data to stdout, so that this library can pipe it directly to Spark without intermediate files. Therefore, this library will not work with the vanilla TPCDS kit.
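Because data generation depends on a built dsdgen binary, a quick pre-flight check can catch a missing or non-executable binary before a job is launched. The sketch below is a minimal example, not part of spark-sql-perf: find_dsdgen is a hypothetical helper, and the ~/tpcds-kit/tools location is an assumed path that should be replaced with wherever the forked kit was actually built.

```python
import os


def find_dsdgen(dsdgen_dir):
    """Return the path to an executable dsdgen inside dsdgen_dir, or None."""
    candidate = os.path.join(dsdgen_dir, "dsdgen")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


# Assumed location for illustration only; point this at your own build.
tools_dir = os.path.expanduser("~/tpcds-kit/tools")
path = find_dsdgen(tools_dir)
print(path if path else f"dsdgen not found in {tools_dir}")
```

The check only covers the machine it runs on, so on a cluster it would need to be run on every machine that generates data.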