sparkMeasure | development repository of sparkMeasure, a tool for performance troubleshooting of Apache Spark workloads
kandi X-RAY | sparkMeasure Summary
This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task metrics data.
Community Discussions
Trending Discussions on sparkMeasure
QUESTION
I am running a TPC-DS benchmark for Spark 3.0.1 in local mode and using sparkMeasure to get workload statistics. I have 16 total cores and SparkContext is available as
Spark context available as 'sc' (master = local[*], app id = local-1623251009819)
Q1. For local[*], driver and executors are created in a single JVM with 16 threads. Considering Spark's configuration, which of the following will be true?
- 1 worker instance, 1 executor having 16 cores/threads
- 1 worker instance, 16 executors each having 1 core
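A quick way to check Q1 empirically from the same spark-shell session (a minimal sketch; statusTracker and defaultParallelism are standard SparkContext APIs, and the values in the comments are what one would expect on a 16-core machine, not measured output):
// Scala, spark-shell started with master = local[*]
val execs = sc.statusTracker.getExecutorInfos    // in local mode this returns a single entry: the driver doubles as the only executor
println(s"executors: ${execs.length}")           // expected: 1 (driver and executor share one JVM)
println(s"task slots: ${sc.defaultParallelism}") // expected: 16 (one worker thread per logical core)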
For a particular query, sparkMeasure reports shuffle data as follows
shuffleRecordsRead => 183364403
shuffleTotalBlocksFetched => 52582
shuffleLocalBlocksFetched => 52582
shuffleRemoteBlocksFetched => 0
shuffleTotalBytesRead => 1570948723 (1498.0 MB)
shuffleLocalBytesRead => 1570948723 (1498.0 MB)
shuffleRemoteBytesRead => 0 (0 Bytes)
shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
shuffleBytesWritten => 1570948723 (1498.0 MB)
shuffleRecordsWritten => 183364480
Q2. Regardless of the query specifics, why is there data shuffling when everything is inside a single JVM?
...
ANSWER
Answered 2021-Jun-11 at 05:56
- An executor is a JVM process. When you use local[*] you run Spark locally with as many worker threads as there are logical cores on your machine, so: 1 executor and as many worker threads as logical cores. When you configure SPARK_WORKER_INSTANCES=5 in spark-env.sh and run start-master.sh and start-slave.sh spark://localhost:7077 to bring up a standalone Spark cluster on your local machine, you get one master and 5 workers; to submit an application to that cluster you must configure it like SparkSession.builder().appName("app").master("spark://localhost:7077"), and in that case you cannot specify [*] or [2], for example. But when you set the master to local[*], a single JVM process is created, the master and all workers live inside that JVM, and once your application finishes that JVM instance is destroyed. local[*] and spark://localhost:7077 are two separate things.
- Workers do their work through tasks, and each task is in fact a thread (task = thread). Workers have memory and assign a memory partition to each task so it can do its job, such as reading part of a dataset into its own memory partition or transforming the data it has read. When a task such as a join needs other partitions, a shuffle occurs regardless of whether the job runs on a cluster or locally. On a cluster, two tasks may sit on different machines, so network transmission is added on top of writing the result and having another task read it back. Locally, if task B needs data from task A's partition, task A still has to write it out and task B then reads it to do its job.
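To see the second point locally, here is a small sketch in the spirit of the answer (the query is made up; StageMetrics and runAndMeasure are the sparkMeasure APIs shown elsewhere on this page). A wide transformation such as groupBy forces an exchange, so the shuffle counters come out non-zero even with master = local[*], while the remote counters stay at zero:
// Scala, spark-shell --master local[*] --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
// groupBy repartitions the data by key: each task writes its shuffle output locally
// and the tasks of the next stage read it back, all within the single local JVM
stageMetrics.runAndMeasure(
  spark.range(0, 10000000).groupBy($"id" % 100).count().show()
)
// the printed report should show shuffleBytesWritten and shuffleLocalBytesRead > 0
// and shuffleRemoteBytesRead = 0, matching the pattern in the question's metrics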
QUESTION
I am trying to install libraries on my Databricks cluster, based on an external JSON file.
The file is constructed like this:
...
ANSWER
Answered 2021-Mar-18 at 16:42
You can just use a simple replace in that case, applied to the output you are getting, like below:
I am just using your output and taking it forward from there.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install sparkMeasure
Note: sparkMeasure is available on Maven Central
Spark 3.0.x and 2.4.x with Scala 2.12:
Scala: bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Python: bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Note: pip install sparkmeasure to get the Python wrapper API.
Spark 2.x with Scala 2.11:
Scala: bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.11:0.17
Python: bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.11:0.17
Note: pip install sparkmeasure to get the Python wrapper API.
Bleeding edge: build the sparkMeasure jar using sbt (sbt +package) and use --jars with the jar just built instead of --packages. Note: the latest jars built by CI are available as artifacts in the GitHub Actions.
Scala notebook on Databricks
Python notebook on Databricks
Jupyter notebook on Google Colab Research
Jupyter notebook hosted on Microsoft Azure Notebooks
Local Python/Jupyter Notebook
CLI: spark-shell and PySpark
# Scala CLI, Spark 3.0
bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
stageMetrics.runAndMeasure(spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show())
# Python CLI, Spark 3.0
pip install sparkmeasure
bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
CLI: spark-shell, measure workload metrics aggregating from raw task metrics
# Scala CLI, Spark 3.0
bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
val taskMetrics = ch.cern.sparkmeasure.TaskMetrics(spark)
taskMetrics.runAndMeasure(spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show())
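In addition to runAndMeasure, metrics can be collected around an arbitrary block of code (a sketch assuming the Scala API's begin/end/printReport methods; the query is a toy cross join like the ones above):
// Scala CLI, started as above with --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
stageMetrics.begin()         // start collecting stage metrics
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
stageMetrics.end()           // stop collecting
stageMetrics.printReport()   // print the aggregated metrics report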