sparkMeasure | development repository of sparkMeasure, a tool for performance troubleshooting of Apache Spark workloads
kandi X-RAY | sparkMeasure Summary
This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task metrics data.
Community Discussions
Trending Discussions on sparkMeasure
QUESTION
I am running a TPC-DS benchmark for Spark 3.0.1 in local mode and using sparkMeasure to get workload statistics. I have 16 total cores and SparkContext is available as
Spark context available as 'sc' (master = local[*], app id = local-1623251009819)
Q1. For local[*], driver and executors are created in a single JVM with 16 threads. Considering Spark's configuration, which of the following will be true?
- 1 worker instance, 1 executor having 16 cores/threads
- 1 worker instance, 16 executors each having 1 core
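A quick way to check Q1 empirically from the same spark-shell session (a minimal sketch; statusTracker and defaultParallelism are standard SparkContext APIs, and the values in the comments are what one would expect on a 16-core machine, not measured output):
// Scala, spark-shell started with master = local[*]
val execs = sc.statusTracker.getExecutorInfos    // in local mode this returns a single entry: the driver doubles as the only executor
println(s"executors: ${execs.length}")           // expected: 1 (driver and executor share one JVM)
println(s"task slots: ${sc.defaultParallelism}") // expected: 16 (one worker thread per logical core)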
For a particular query, sparkMeasure reports shuffle data as follows
shuffleRecordsRead => 183364403
shuffleTotalBlocksFetched => 52582
shuffleLocalBlocksFetched => 52582
shuffleRemoteBlocksFetched => 0
shuffleTotalBytesRead => 1570948723 (1498.0 MB)
shuffleLocalBytesRead => 1570948723 (1498.0 MB)
shuffleRemoteBytesRead => 0 (0 Bytes)
shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
shuffleBytesWritten => 1570948723 (1498.0 MB)
shuffleRecordsWritten => 183364480
Q2. Regardless of the query specifics, why is there data shuffling when everything is inside a single JVM?
...
ANSWER
Answered 2021-Jun-11 at 05:56
- An executor is a JVM process. When you use local[*] you run Spark locally with as many worker threads as there are logical cores on your machine, so: 1 executor and as many worker threads as logical cores. When you configure SPARK_WORKER_INSTANCES=5 in spark-env.sh and run start-master.sh and start-slave.sh spark://localhost:7077 to bring up a standalone Spark cluster on your local machine, you get one master and 5 workers; to submit an application to that cluster you must configure it like SparkSession.builder().appName("app").master("spark://localhost:7077"), and in that case you cannot specify [*] or [2], for example. But when you set the master to local[*], a single JVM process is created, the master and all workers live inside that JVM, and once your application finishes that JVM instance is destroyed. local[*] and spark://localhost:7077 are two separate things.
- Workers do their work through tasks, and each task is in fact a thread (task = thread). Workers have memory and assign a memory partition to each task so it can do its job, such as reading part of a dataset into its own memory partition or transforming the data it has read. When a task such as a join needs other partitions, a shuffle occurs regardless of whether the job runs on a cluster or locally. On a cluster, two tasks may sit on different machines, so network transmission is added on top of writing the result and having another task read it back. Locally, if task B needs data from task A's partition, task A still has to write it out and task B then reads it to do its job.
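To see the second point locally, here is a small sketch in the spirit of the answer (the query is made up; StageMetrics and runAndMeasure are the sparkMeasure APIs shown elsewhere on this page). A wide transformation such as groupBy forces an exchange, so the shuffle counters come out non-zero even with master = local[*], while the remote counters stay at zero:
// Scala, spark-shell --master local[*] --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
// groupBy repartitions the data by key: each task writes its shuffle output locally
// and the tasks of the next stage read it back, all within the single local JVM
stageMetrics.runAndMeasure(
  spark.range(0, 10000000).groupBy($"id" % 100).count().show()
)
// the printed report should show shuffleBytesWritten and shuffleLocalBytesRead > 0
// and shuffleRemoteBytesRead = 0, matching the pattern in the question's metrics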
QUESTION
I am trying to install libraries on my Databricks cluster, based on an external JSON file.
The file is constructed like this:
...
ANSWER
Answered 2021-Mar-18 at 16:42
You can just use a simple replace in that case, applied to the output you are getting, like below:
I am just using your output and taking it forward from there.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install sparkMeasure
Note: sparkMeasure is available on Maven Central
Spark 3.0.x and 2.4.x with Scala 2.12:
Scala: bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Python: bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Note: pip install sparkmeasure to get the Python wrapper API.
Spark 2.x with Scala 2.11:
Scala: bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.11:0.17
Python: bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.11:0.17
Note: pip install sparkmeasure to get the Python wrapper API.
Bleeding edge: build the sparkMeasure jar using sbt (sbt +package) and use --jars with the jar just built instead of --packages. Note: the latest jars built by CI are available as artifacts in the GitHub Actions.
Scala notebook on Databricks
Python notebook on Databricks
Jupyter notebook on Google Colab Research
Jupyter notebook hosted on Microsoft Azure Notebooks
Local Python/Jupyter Notebook
CLI: spark-shell and PySpark
# Scala CLI, Spark 3.0
bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
stageMetrics.runAndMeasure(spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show())
# Python CLI, Spark 3.0
pip install sparkmeasure
bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
CLI: spark-shell, measure workload metrics aggregating from raw task metrics
# Scala CLI, Spark 3.0
bin/spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
val taskMetrics = ch.cern.sparkmeasure.TaskMetrics(spark)
taskMetrics.runAndMeasure(spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show())
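In addition to runAndMeasure, metrics can be collected around an arbitrary block of code (a sketch assuming the Scala API's begin/end/printReport methods; the query is a toy cross join like the ones above):
// Scala CLI, started as above with --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
stageMetrics.begin()         // start collecting stage metrics
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
stageMetrics.end()           // stop collecting
stageMetrics.printReport()   // print the aggregated metrics report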