spark-sql-perf

by databricks | Scala | Version: v0.2.4 | License: Apache-2.0

kandi X-RAY | spark-sql-perf Summary

spark-sql-perf is a Scala library typically used in Big Data and Spark applications. It has no bugs, no vulnerabilities, a permissive license, and medium support. You can download it from GitHub.

This is a performance testing framework for Spark SQL in Apache Spark 2.2+.

Support

spark-sql-perf has a medium active ecosystem.

It has 525 star(s) with 373 fork(s). There are 311 watchers for this library.

It had no major release in the last 6 months.

There are 41 open issues and 19 have been closed. On average issues are closed in 18 days. There are 13 open pull requests and 0 closed requests.

It has a neutral sentiment in the developer community.

The latest version of spark-sql-perf is v0.2.4

Quality

spark-sql-perf has 0 bugs and 0 code smells.

Security

spark-sql-perf has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

spark-sql-perf code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

spark-sql-perf is licensed under the Apache-2.0 License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

spark-sql-perf releases are not available. You will need to build from source code and install.

Installation instructions, examples and code snippets are available.

It has 10463 lines of code, 282 functions and 75 files.

It has medium code complexity. Code complexity directly impacts maintainability of the code.

spark-sql-perf Key Features

No Key Features are available at this moment for spark-sql-perf.

spark-sql-perf Examples and Code Snippets

No Code Snippets are available at this moment for spark-sql-perf.

Community Discussions

Trending Discussions on spark-sql-perf

Spark executors and shuffle in local mode

Spark error when running TPCDS benchmark datasets - Could not find dsdgen

QUESTION

Spark executors and shuffle in local mode

Asked 2021-Jun-12 at 16:13

I am running a TPC-DS benchmark for Spark 3.0.1 in local mode and using sparkMeasure to get workload statistics. I have 16 total cores and SparkContext is available as

Spark context available as 'sc' (master = local[*], app id = local-1623251009819)

Q1. For local[*], the driver and executors are created in a single JVM with 16 threads. Considering Spark's configuration, which of the following will be true?

1 worker instance, 1 executor having 16 cores/threads
1 worker instance, 16 executors each having 1 core

For a particular query, sparkMeasure reports shuffle data as follows

shuffleRecordsRead => 183364403
shuffleTotalBlocksFetched => 52582
shuffleLocalBlocksFetched => 52582
shuffleRemoteBlocksFetched => 0
shuffleTotalBytesRead => 1570948723 (1498.0 MB)
shuffleLocalBytesRead => 1570948723 (1498.0 MB)
shuffleRemoteBytesRead => 0 (0 Bytes)
shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
shuffleBytesWritten => 1570948723 (1498.0 MB)
shuffleRecordsWritten => 183364480

Q2. Regardless of the query specifics, why is there data shuffling when everything is inside a single JVM?

...

ANSWER

Answered 2021-Jun-11 at 05:56

An executor is a JVM process. When you use local[*], Spark runs locally with as many worker threads as there are logical cores on your machine, so you get 1 executor with as many worker threads as logical cores.
If you instead set SPARK_WORKER_INSTANCES=5 in spark-env.sh and run start-master.sh and start-slave.sh spark://local:7077 to bring up a standalone Spark cluster on your local machine, you have one master and 5 workers. To submit your application to this cluster, configure it like SparkSession.builder().appName("app").master("spark://localhost:7077"); in this case you can't specify [*] or [2], for example. When you set the master to local[*], a single JVM process is created, the master and all workers live in that process, and the JVM is destroyed after your application finishes. local[*] and spark://localhost:7077 are two separate things.
Workers do their job using tasks, and each task is actually a thread, i.e. task = thread. Workers have memory and assign a memory partition to each task so that it can do its work, such as reading part of a dataset into its own memory partition or applying a transformation to the data it has read. When a task such as a join needs other partitions, a shuffle occurs regardless of whether the job is run on a cluster or locally. On a cluster, two tasks may be on different machines, so network transmission is added to the other work, such as one task writing its result and another task reading it. In local mode, if task B needs the data in task A's partition, task A must write it out, and task B then reads it to do its job.
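
The write-then-read handoff described in this answer can be illustrated with a toy model. The sketch below is not Spark code and does not reflect Spark's internals: the names map_task, reduce_task, run_job, NUM_THREADS and NUM_REDUCERS are hypothetical, a thread pool stands in for the worker threads of local[N], and a plain dictionary stands in for local shuffle files. It only shows why an aggregation by key needs a shuffle even when every task runs in one process: records with the same key start out in different partitions, so map tasks must write them out by target reducer before reduce tasks can read them.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_THREADS = 4    # stands in for the N worker threads of local[N] (assumption)
NUM_REDUCERS = 3   # number of post-shuffle partitions (arbitrary choice)


def map_task(map_id, records, shuffle_store):
    """Bucket (key, value) records by target reducer and 'write' them out."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % NUM_REDUCERS].append((key, value))
    for reduce_id, block in buckets.items():
        # One shuffle block per (map task, reduce task) pair.
        shuffle_store[(map_id, reduce_id)] = block


def reduce_task(reduce_id, num_maps, shuffle_store):
    """'Fetch' every block addressed to this reducer and sum values per key."""
    totals = defaultdict(int)
    blocks_fetched = 0
    for map_id in range(num_maps):
        block = shuffle_store.get((map_id, reduce_id))
        if block is None:
            continue
        blocks_fetched += 1
        for key, value in block:
            totals[key] += value
    return dict(totals), blocks_fetched


def run_job(partitions):
    """Run a two-stage sum-by-key job entirely inside one process."""
    shuffle_store = {}  # stands in for local shuffle files
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        # Stage 1: every map task finishes before any reduce task starts.
        list(pool.map(lambda p: map_task(p[0], p[1], shuffle_store),
                      enumerate(partitions)))
        # Stage 2: reduce tasks read what the map tasks wrote.
        results = list(pool.map(
            lambda r: reduce_task(r, len(partitions), shuffle_store),
            range(NUM_REDUCERS)))
    merged = {}
    for totals, _ in results:
        merged.update(totals)  # each key belongs to exactly one reducer
    local_blocks = sum(n for _, n in results)
    return merged, local_blocks


if __name__ == "__main__":
    data = [[(1, 10), (2, 20)], [(1, 5), (3, 7)], [(2, 1), (4, 4)]]
    print(run_job(data))
```

In this model every block is fetched from the same process, which loosely mirrors the sparkMeasure output in the question, where all fetched blocks are local and the remote counters are 0.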

Source https://stackoverflow.com/questions/67923596

QUESTION

Spark error when running TPCDS benchmark datasets - Could not find dsdgen

Asked 2020-Mar-29 at 08:29

I'm trying to build the TPCDS benchmark datasets by following this website.

https://xuechendi.github.io/2019/07/12/Prepare-TPCDS-For-Spark

When I run this:

...

ANSWER

Answered 2020-Mar-29 at 08:29

Could not find dsdgen at /home/troberts/spark-sql-perf/tpcds-kit/tools/dsdgen or //home/troberts/spark-sql-perf/tpcds-kit/tools/dsdgen

Source https://stackoverflow.com/questions/60906687

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install spark-sql-perf

Use sbt package or sbt assembly to build the library jar. Use sbt +package to build for Scala 2.11 and 2.12.
Before running any query, a dataset needs to be set up by creating a Benchmark object. Generating the TPCDS data requires dsdgen to be built and available on the machines. We have a fork of dsdgen that you will need. The fork includes changes to generate TPCDS data to stdout, so that this library can pipe it directly to Spark without intermediate files. Therefore, this library will not work with the vanilla TPCDS kit.
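
Because data generation depends on a dsdgen binary being present on each machine, a small pre-flight check can catch the "Could not find dsdgen" error quoted in the discussion above before a long job starts. This is a minimal sketch and not part of spark-sql-perf: the helper name find_dsdgen is hypothetical, and it assumes the binary is named dsdgen and sits directly inside the tools directory you pass in, as the path in that error message suggests.

```python
import os


def find_dsdgen(tools_dir):
    """Return the path to an executable dsdgen in tools_dir, or None.

    Hypothetical helper: assumes the binary is called 'dsdgen' and lives
    directly under tools_dir (for example a tpcds-kit 'tools' directory).
    """
    candidate = os.path.join(tools_dir, "dsdgen")
    # The file must exist and be executable by the current user.
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None
```

Running the check on every machine that will generate data, with the same directory you intend to pass to the benchmark setup, would show whether the binary was built there and whether it has execute permission.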

Support

For new features, suggestions, and bugs, create an issue on GitHub. If you have questions, check and ask on the Stack Overflow community page.

Find more information at: