livy | open source REST interface for interacting with Apache Spark
kandi X-RAY | livy Summary
Livy is an open source REST interface for interacting with Apache Spark from anywhere
Community Discussions
Trending Discussions on livy
QUESTION
The cluster is HDInsight 4.0 and has 250 GB RAM and 75 VCores.
I am running only one job, and the cluster always allocates 66 GB, 7 VCores, and 7 containers to the job even though we have 250 GB and 75 VCores available for use. This is not particular to one job; I have run 3 different jobs and all have this issue. When I run 3 jobs in parallel, the cluster still allocates 66 GB RAM to each job. It looks like some static setting is configured.
The following is the queue setup
I am using a tool called Talend (an ETL tool similar to Informatica) where I have a GUI to create a job. The tool is Eclipse-based, and below is the code generated for the Spark configuration. This is then given to Livy for submission to the cluster.
...ANSWER
Answered 2022-Feb-12 at 19:54
The behavior is expected: 6 executors * 10 GB per executor memory = 60 GB. If you want to allocate more resources, try increasing the executor count, for example to 24.
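As a minimal sketch of how those knobs map onto a Livy batch submission (the host, file path, and values here are placeholders, not the tool's generated config):

import requests

# POST /batches creates a Spark batch job; numExecutors/executorMemory are the
# standard Livy request fields that override static defaults.
payload = {
    "file": "hdfs:///jobs/my_job.py",   # placeholder application file
    "numExecutors": 24,                  # raised from the static 6
    "executorMemory": "10g",
    "executorCores": 2,
    "driverMemory": "4g",
}
resp = requests.post("http://livy-host:8998/batches", json=payload)
print(resp.json())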
QUESTION
All,
We have an Apache Spark v3.1.2 + YARN on AKS (SQL Server 2019 BDC) setup. We ran Python code refactored to PySpark, which resulted in the error below:
Application application_1635264473597_0181 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1635264473597_0181_000001 exited with exitCode: -104
Failing this attempt.Diagnostics: [2021-11-12 15:00:16.915]Container [pid=12990,containerID=container_1635264473597_0181_01_000001] is running 7282688B beyond the 'PHYSICAL' memory limit. Current usage: 2.0 GB of 2 GB physical memory used; 4.9 GB of 4.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_1635264473597_0181_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 13073 12999 12990 12990 (python3) 7333 112 1516236800 235753 /opt/bin/python3 /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp/3677222184783620782
|- 12999 12990 12990 12990 (java) 6266 586 3728748544 289538 /opt/mssql/lib/zulu-jre-8/bin/java -server -XX:ActiveProcessorCount=1 -Xmx1664m -Djava.io.tmpdir=/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.livy.rsc.driver.RSCDriverBootstrapper --properties-file /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_conf.properties --dist-cache-conf /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_dist_cache.properties
|- 12990 12987 12990 12990 (bash) 0 0 4304896 775 /bin/bash -c /opt/mssql/lib/zulu-jre-8/bin/java -server -XX:ActiveProcessorCount=1 -Xmx1664m -Djava.io.tmpdir=/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.livy.rsc.driver.RSCDriverBootstrapper' --properties-file /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_conf.properties --dist-cache-conf /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_dist_cache.properties 1> /var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001/stdout 2> /var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001/stderr
[2021-11-12 15:00:16.921]Container killed on request. Exit code is 143
[2021-11-12 15:00:16.940]Container exited with a non-zero exit code 143.
For more detailed output, check the application tracking page: https://sparkhead-0.mssql-cluster.everestre.net:8090/cluster/app/application_1635264473597_0181 Then click on links to logs of each attempt.
. Failing the application.
The default settings are as below, and there are no runtime settings:
"settings": {
"spark-defaults-conf.spark.driver.cores": "1",
"spark-defaults-conf.spark.driver.memory": "1664m",
"spark-defaults-conf.spark.driver.memoryOverhead": "384",
"spark-defaults-conf.spark.executor.instances": "1",
"spark-defaults-conf.spark.executor.cores": "2",
"spark-defaults-conf.spark.executor.memory": "3712m",
"spark-defaults-conf.spark.executor.memoryOverhead": "384",
"yarn-site.yarn.nodemanager.resource.memory-mb": "12288",
"yarn-site.yarn.nodemanager.resource.cpu-vcores": "6",
"yarn-site.yarn.scheduler.maximum-allocation-mb": "12288",
"yarn-site.yarn.scheduler.maximum-allocation-vcores": "6",
"yarn-site.yarn.scheduler.capacity.maximum-am-resource-percent": "0.34".
}
Does "AM Container" here mean the Application Master container or the Application Manager (of YARN)? If it is the Application Master, then in cluster mode, do the Driver and the Application Master run in the same container?
Which runtime parameter should I change to make the PySpark code run successfully?
Thanks,
grajee
ANSWER
Answered 2021-Nov-19 at 13:36
Likely you don't need to change any settings. Exit code 143 could mean a lot of things, including that you ran out of memory. To test whether you ran out of memory, I'd reduce the amount of data you are using and see if your code starts to work. If it does, it's likely you ran out of memory and should consider refactoring your code. In general, I suggest trying code changes before making Spark config changes.
For an understanding of how spark driver works on yarn, here's a reasonable explanation: https://sujithjay.com/spark/with-yarn
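For what it's worth, the 2 GB limit in the log follows directly from the settings above: spark.driver.memory (1664m) + spark.driver.memoryOverhead (384m) = 2048 MB, which is exactly the "2.0 GB of 2 GB physical memory used" the container was killed at. If memory really is the cause, one possible adjustment (the values here are assumptions, not a confirmed fix) would be:

"spark-defaults-conf.spark.driver.memory": "3328m",
"spark-defaults-conf.spark.driver.memoryOverhead": "768",

keeping the total (4096 MB) under yarn.scheduler.maximum-allocation-mb (12288).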
QUESTION
I am trying to write a template to create an EMR cluster using CloudFormation.
So far I have come up with this
...ANSWER
Answered 2021-Nov-03 at 01:17
There is an Ec2KeyName parameter:
The name of the EC2 key pair that can be used to connect to the master node using SSH as the user called "hadoop."
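A minimal fragment showing where that parameter sits in a CloudFormation template (the resource name and key-pair name are placeholders; the other required cluster properties are omitted):

"MyEmrCluster": {
  "Type": "AWS::EMR::Cluster",
  "Properties": {
    "Instances": {
      "Ec2KeyName": "my-key-pair"
    }
  }
}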
QUESTION
I have the PySpark dataframe df below. It has the schema shown below, and I've also supplied some sample data and the desired output I'm looking for. The problem I'm having is that the attributes column has values that are dictionaries. I would like to create a new column for each key in the dictionaries, but the values in the attributes column are strings, so I'm having trouble using explode or from_json.
I made an attempt based on another SO post using explode; the code I ran and the error are below the example data and desired output.
Also, I don't know what all the keys in the dict might be, since different records have dicts of different lengths.
Does anyone have a suggestion for how to do this? I was thinking of converting it to pandas and trying to solve it that way, but I'm hoping there's a better/faster PySpark solution.
...ANSWER
Answered 2021-Oct-05 at 01:58
Try using the from_json function with the corresponding schema to parse the JSON string.
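A minimal sketch of that approach: since the key set varies per record, parsing the string column into a MapType (rather than a fixed StructType) and then projecting the collected keys into columns is one way to do it (the column names here are assumptions):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1", '{"a": "x", "b": "y"}')], ["id", "attributes"])

# Parse each JSON string into a map, since different records have different keys.
parsed = df.withColumn("attrs", F.from_json("attributes", MapType(StringType(), StringType())))

# Gather the distinct keys across all records, then give each key its own column.
keys = [r["key"] for r in parsed.select(F.explode("attrs")).select("key").distinct().collect()]
parsed.select("id", *[F.col("attrs").getItem(k).alias(k) for k in keys]).show()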
QUESTION
I am wondering, is there any way to create the SparkContext once in the YARN cluster so that incoming jobs reuse that context? Context creation takes 20s, sometimes more, in my cluster. I am using PySpark for scripting and Livy to submit jobs.
...ANSWER
Answered 2021-Jul-28 at 12:44
No, you can't just have a standing SparkContext running in YARN. Maybe another idea is to run in client mode, where the client has its own SparkContext (this is the method used by tools like Apache Zeppelin and the spark-shell).
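That said, Livy's interactive sessions come close to the stated goal: one session holds a long-lived SparkContext, and each subsequent statement reuses it. A hedged sketch (the host is a placeholder):

import requests, time

host = "http://livy-host:8998"

# Create one interactive session; its SparkContext starts once and stays up.
session_id = requests.post(f"{host}/sessions", json={"kind": "pyspark"}).json()["id"]

# Wait for the context to finish starting (the ~20s cost is paid once).
while requests.get(f"{host}/sessions/{session_id}").json()["state"] != "idle":
    time.sleep(2)

# Every statement posted afterwards reuses the same running context.
st = requests.post(f"{host}/sessions/{session_id}/statements",
                   json={"code": "sc.parallelize(range(10)).sum()"})
print(requests.get(f"{host}/sessions/{session_id}/statements/{st.json()['id']}").json())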
QUESTION
I'm trying to run my application using Livy, which resides inside GCP Dataproc, but I'm getting this: "Caused by: java.lang.ClassNotFoundException: bigquery.DefaultSource"
I'm able to run hadoop fs -ls gs://xxxx inside Dataproc, and I checked that Spark is pointing to the right location to find gcs-connector.jar; that's OK too.
I included Livy in Dataproc using initialization (https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/livy/)
How can I include bigquery-connector in Livy's classpath? Could you help me, please? Thank you all!
...ANSWER
Answered 2021-Jul-02 at 18:48
It looks like your application depends on the BigQuery connector, not the GCS connector (bigquery.DefaultSource).
The GCS connector should always be included in the HADOOP classpath by default, but you will have to manually add the BigQuery connector jar to your application.
Assuming this is a Spark application, you can set the spark.jars property to pull in the BigQuery connector jar from GCS at runtime: spark.jars='gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'
For more installation options, see https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/README.md
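Since the job is submitted through Livy here, the property can also ride along in the batch request's conf map; a sketch with a placeholder application path:

{
  "file": "hdfs:///apps/my-job.py",
  "conf": {
    "spark.jars": "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"
  }
}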
QUESTION
For reference, Livy is a REST endpoint used to pull data from a cluster. Within the same account, my Lambda function always times out when attempting to access the Livy endpoint used by EMR.
Endpoint: http://ip-xxxx-xx-xxx-xx.ec2.internal:8998/
Error:
...ANSWER
Answered 2021-Jul-16 at 18:46
Firstly, your HTTP endpoint is .ec2.internal, so I'm assuming it's in some VPC (not public). Your Lambda must therefore be within the same VPC.
You are trying to access an HTTP endpoint over a TCP connection, so your Lambda's security group must be whitelisted in the EMR master node's security group.
Lastly, you don't need AmazonEMRFullAccessPolicy_v2, because you are not accessing the AWS API.
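A quick probe that separates network problems from Livy problems, run from a Lambda inside the VPC (the endpoint is the placeholder from the question):

import requests

# A short timeout turns a silent Lambda hang into an explicit error; a timeout
# here almost always means a security-group or routing misconfiguration.
try:
    r = requests.get("http://ip-xxxx-xx-xxx-xx.ec2.internal:8998/sessions", timeout=5)
    print(r.status_code, r.json())
except requests.exceptions.RequestException as e:
    print("Blocked at the network layer:", e)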
QUESTION
I created MWAA using the public network option (version 2.0.2) and created a sample Airflow DAG which starts EMR with the following properties:
...ANSWER
Answered 2021-Jun-23 at 21:34
Issue resolved by setting:
QUESTION
I am a little confused about how to pass the arguments as REST API JSON.
Consider the spark-submit command below.
...ANSWER
Answered 2021-Jun-22 at 15:04
Posting here in case it helps someone.
We found out that we can pass args as a list in an HTTP request (to the Livy server). In args, we can pass all the Hudi-related confs, like ["key1","value1","key2","value2","--hoodie-conf","confname=value", ...]. We are able to submit jobs via the Livy server.
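A hedged example of such a payload (the file, class name, and Hudi conf values are placeholders):

{
  "file": "hdfs:///apps/hudi-job.jar",
  "className": "com.example.HudiJob",
  "args": ["key1", "value1",
           "key2", "value2",
           "--hoodie-conf", "hoodie.datasource.write.recordkey.field=id"]
}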
QUESTION
Zeppelin 0.9.0 does not work with Kerberos.
I have added "zeppelin.server.kerberos.keytab" and "zeppelin.server.kerberos.principal" in zeppelin-site.xml,
but I also get the error "Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "bigdser5/10.3.87.27"; destination host is: "bigdser1":8020;"
Adding "spark.yarn.keytab" and "spark.yarn.principal" in the Spark interpreter does not work either.
My spark-shell does work with Kerberos.
My Kerberos steps:
1. kadmin.local -q "addprinc jzyc/hadoop"
2. kadmin.local -q "xst -k jzyc.keytab jzyc/hadoop@JJKK.COM"
3. Copy jzyc.keytab to the other servers.
4. kinit -kt jzyc.keytab jzyc/hadoop@JJKK.COM
In my Livy I get the error "javax.servlet.ServletException: org.apache.hadoop.security.authentication.client.AuthenticationException: javax.security.auth.login.LoginException: No key to store"
...ANSWER
Answered 2021-Apr-15 at 09:01
INFO [2021-04-15 16:44:46,522] ({dispatcher-event-loop-1} Logging.scala[logInfo]:57) - Got an error when resolving hostNames. Falling back to /default-rack for all
INFO [2021-04-15 16:44:46,561] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Attempting to login to KDC using principal: jzyc/bigdser4@JOIN.COM
INFO [2021-04-15 16:44:46,574] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Successfully logged into KDC.
INFO [2021-04-15 16:44:47,124] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1346508100_40, ugi=jzyc/bigdser4@JOIN.COM (auth:KERBEROS)]] with renewer yarn/bigdser1@JOIN.COM
INFO [2021-04-15 16:44:47,265] ({FIFOScheduler-interpreter_1099886208-Worker-1} DFSClient.java[getDelegationToken]:700) - Created token for jzyc: HDFS_DELEGATION_TOKEN owner=jzyc/bigdser4@JOIN.COM, renewer=yarn, realUser=, issueDate=1618476287222, maxDate=1619081087222, sequenceNumber=171, masterKeyId=21 on ha-hdfs:nameservice1
INFO [2021-04-15 16:44:47,273] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1346508100_40, ugi=jzyc/bigdser4@JOIN.COM (auth:KERBEROS)]] with renewer jzyc/bigdser4@JOIN.COM
INFO [2021-04-15 16:44:47,278] ({FIFOScheduler-interpreter_1099886208-Worker-1} DFSClient.java[getDelegationToken]:700) - Created token for jzyc: HDFS_DELEGATION_TOKEN owner=jzyc/bigdser4@JOIN.COM, renewer=jzyc, realUser=, issueDate=1618476287276, maxDate=1619081087276, sequenceNumber=172, masterKeyId=21 on ha-hdfs:nameservice1
INFO [2021-04-15 16:44:47,331] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Renewal interval is 86400051 for token HDFS_DELEGATION_TOKEN
INFO [2021-04-15 16:44:47,492] ({dispatcher-event-loop-0} Logging.scala[logInfo]:57) - Got an error when resolving hostNames. Falling back to /default-rack for all
INFO [2021-04-15 16:44:47,493] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Scheduling renewal in 18.0 h.
INFO [2021-04-15 16:44:47,494] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Updating delegation tokens.
INFO [2021-04-15 16:44:47,521] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Updating delegation tokens for current user.
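For reference, the Livy-side Kerberos settings live in livy.conf; a sketch using the key names from livy.conf.template (the principals and keytab paths below are assumptions based on the realm in the question):

livy.server.auth.type = kerberos
livy.server.auth.kerberos.principal = HTTP/_HOST@JJKK.COM
livy.server.auth.kerberos.keytab = /etc/security/keytabs/spnego.service.keytab
livy.server.launch.kerberos.principal = jzyc/hadoop@JJKK.COM
livy.server.launch.kerberos.keytab = /path/to/jzyc.keytab

If the configured principal does not match what is actually in the keytab, a LoginException like the "No key to store" in the question is a plausible outcome.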
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported