livy | open source REST interface for interacting with Apache Spark
kandi X-RAY | livy Summary
Livy is an open source REST interface for interacting with Apache Spark from anywhere
Community Discussions
Trending Discussions on livy
QUESTION
The cluster is HDInsight 4.0 and has 250 GB RAM and 75 VCores.
I am running only one job, and the cluster always allocates 66 GB, 7 VCores, and 7 containers to the job even though we have 250 GB and 75 VCores available for use. This is not particular to one job; I have run 3 different jobs and all have this issue. When I run 3 jobs in parallel, the cluster still allocates 66 GB RAM to each job. It looks like some static setting is configured.
The following is the queue setup
I am using a tool called Talend (an ETL tool similar to Informatica) where I have a GUI to create a job. The tool is Eclipse-based, and below is the code generated for the Spark configuration. This is then given to Livy for submission to the cluster.
...ANSWER
Answered 2022-Feb-12 at 19:54
The behavior is expected: 6 executors * 10 GB per executor memory = 60 GB. If you want to allocate more resources, try increasing the executor count, for example to 24.
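As a minimal sketch of how those knobs map onto a Livy batch submission (the host, file path, and values here are placeholders, not the tool's generated config):

import requests

# POST /batches creates a Spark batch job; numExecutors/executorMemory are the
# standard Livy request fields that override static defaults.
payload = {
    "file": "hdfs:///jobs/my_job.py",   # placeholder application file
    "numExecutors": 24,                  # raised from the static 6
    "executorMemory": "10g",
    "executorCores": 2,
    "driverMemory": "4g",
}
resp = requests.post("http://livy-host:8998/batches", json=payload)
print(resp.json())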
QUESTION
All,
We have an Apache Spark v3.1.2 + YARN on AKS (SQL Server 2019 BDC) setup. We ran Python code refactored to PySpark, which resulted in the error below:
Application application_1635264473597_0181 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1635264473597_0181_000001 exited with exitCode: -104
Failing this attempt.Diagnostics: [2021-11-12 15:00:16.915]Container [pid=12990,containerID=container_1635264473597_0181_01_000001] is running 7282688B beyond the 'PHYSICAL' memory limit. Current usage: 2.0 GB of 2 GB physical memory used; 4.9 GB of 4.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_1635264473597_0181_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 13073 12999 12990 12990 (python3) 7333 112 1516236800 235753 /opt/bin/python3 /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp/3677222184783620782
|- 12999 12990 12990 12990 (java) 6266 586 3728748544 289538 /opt/mssql/lib/zulu-jre-8/bin/java -server -XX:ActiveProcessorCount=1 -Xmx1664m -Djava.io.tmpdir=/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.livy.rsc.driver.RSCDriverBootstrapper --properties-file /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_conf.properties --dist-cache-conf /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_dist_cache.properties
|- 12990 12987 12990 12990 (bash) 0 0 4304896 775 /bin/bash -c /opt/mssql/lib/zulu-jre-8/bin/java -server -XX:ActiveProcessorCount=1 -Xmx1664m -Djava.io.tmpdir=/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.livy.rsc.driver.RSCDriverBootstrapper' --properties-file /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_conf.properties --dist-cache-conf /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_dist_cache.properties 1> /var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001/stdout 2> /var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001/stderr
[2021-11-12 15:00:16.921]Container killed on request. Exit code is 143
[2021-11-12 15:00:16.940]Container exited with a non-zero exit code 143.
For more detailed output, check the application tracking page: https://sparkhead-0.mssql-cluster.everestre.net:8090/cluster/app/application_1635264473597_0181 Then click on links to logs of each attempt.
. Failing the application.
The default settings are as below, and there are no runtime settings:
"settings": {
"spark-defaults-conf.spark.driver.cores": "1",
"spark-defaults-conf.spark.driver.memory": "1664m",
"spark-defaults-conf.spark.driver.memoryOverhead": "384",
"spark-defaults-conf.spark.executor.instances": "1",
"spark-defaults-conf.spark.executor.cores": "2",
"spark-defaults-conf.spark.executor.memory": "3712m",
"spark-defaults-conf.spark.executor.memoryOverhead": "384",
"yarn-site.yarn.nodemanager.resource.memory-mb": "12288",
"yarn-site.yarn.nodemanager.resource.cpu-vcores": "6",
"yarn-site.yarn.scheduler.maximum-allocation-mb": "12288",
"yarn-site.yarn.scheduler.maximum-allocation-vcores": "6",
"yarn-site.yarn.scheduler.capacity.maximum-am-resource-percent": "0.34".
}
Does "AM Container" here mean the Application Master container or the Application Manager (of YARN)? If it is the Application Master, then in cluster mode, do the Driver and the Application Master run in the same container?
Which runtime parameter should I change to make the PySpark code run successfully?
Thanks,
grajee
ANSWER
Answered 2021-Nov-19 at 13:36
Likely you don't need to change any settings. Exit code 143 could mean a lot of things, including that you ran out of memory. To test whether you ran out of memory, I'd reduce the amount of data you are using and see if your code starts to work. If it does, it's likely you ran out of memory and should consider refactoring your code. In general, I suggest trying code changes before making Spark config changes.
For an understanding of how spark driver works on yarn, here's a reasonable explanation: https://sujithjay.com/spark/with-yarn
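For what it's worth, the 2 GB limit in the log follows directly from the settings above: spark.driver.memory (1664m) + spark.driver.memoryOverhead (384m) = 2048 MB, which is exactly the "2.0 GB of 2 GB physical memory used" the container was killed at. If memory really is the cause, one possible adjustment (the values here are assumptions, not a confirmed fix) would be:

"spark-defaults-conf.spark.driver.memory": "3328m",
"spark-defaults-conf.spark.driver.memoryOverhead": "768",

keeping the total (4096 MB) under yarn.scheduler.maximum-allocation-mb (12288).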
QUESTION
I am trying to write a template to create an EMR cluster using CloudFormation.
So far I have come up with this
...ANSWER
Answered 2021-Nov-03 at 01:17
There is an Ec2KeyName parameter:
The name of the EC2 key pair that can be used to connect to the master node using SSH as the user called "hadoop."
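A minimal fragment showing where that parameter sits in a CloudFormation template (the resource name and key-pair name are placeholders; the other required cluster properties are omitted):

"MyEmrCluster": {
  "Type": "AWS::EMR::Cluster",
  "Properties": {
    "Instances": {
      "Ec2KeyName": "my-key-pair"
    }
  }
}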
QUESTION
I have the PySpark dataframe df below. It has the schema shown below, and I've also supplied some sample data and the desired output I'm looking for. The problem I'm having is that the attributes column has values that are dictionaries. I would like to create a new column for each key in the dictionaries, but the values in the attributes column are strings, so I'm having trouble using explode or from_json.
I made an attempt based on another SO post using explode; the code I ran and the error are below the example data and desired output.
Also, I don't know what all the keys in the dict might be, since different records have dicts of different lengths.
Does anyone have a suggestion for how to do this? I was thinking of converting it to pandas and trying to solve it that way, but I'm hoping there's a better/faster PySpark solution.
...ANSWER
Answered 2021-Oct-05 at 01:58
Try using the from_json function with the corresponding schema to parse the JSON string.
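A minimal sketch of that approach: since the key set varies per record, parsing the string column into a MapType (rather than a fixed StructType) and then projecting the collected keys into columns is one way to do it (the column names here are assumptions):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1", '{"a": "x", "b": "y"}')], ["id", "attributes"])

# Parse each JSON string into a map, since different records have different keys.
parsed = df.withColumn("attrs", F.from_json("attributes", MapType(StringType(), StringType())))

# Gather the distinct keys across all records, then give each key its own column.
keys = [r["key"] for r in parsed.select(F.explode("attrs")).select("key").distinct().collect()]
parsed.select("id", *[F.col("attrs").getItem(k).alias(k) for k in keys]).show()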
QUESTION
I am wondering, is there any way to create the SparkContext once in the YARN cluster so that incoming jobs reuse that context? Context creation takes 20s, sometimes more, in my cluster. I am using PySpark for scripting and Livy to submit jobs.
...ANSWER
Answered 2021-Jul-28 at 12:44
No, you can't just have a standing SparkContext running in YARN. Maybe another idea is to run in client mode, where the client has its own SparkContext (this is the method used by tools like Apache Zeppelin and the spark-shell).
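That said, Livy's interactive sessions come close to the stated goal: one session holds a long-lived SparkContext, and each subsequent statement reuses it. A hedged sketch (the host is a placeholder):

import requests, time

host = "http://livy-host:8998"

# Create one interactive session; its SparkContext starts once and stays up.
session_id = requests.post(f"{host}/sessions", json={"kind": "pyspark"}).json()["id"]

# Wait for the context to finish starting (the ~20s cost is paid once).
while requests.get(f"{host}/sessions/{session_id}").json()["state"] != "idle":
    time.sleep(2)

# Every statement posted afterwards reuses the same running context.
st = requests.post(f"{host}/sessions/{session_id}/statements",
                   json={"code": "sc.parallelize(range(10)).sum()"})
print(requests.get(f"{host}/sessions/{session_id}/statements/{st.json()['id']}").json())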
QUESTION
I'm trying to run my application using Livy, which resides inside GCP Dataproc, but I'm getting this: "Caused by: java.lang.ClassNotFoundException: bigquery.DefaultSource"
I'm able to run hadoop fs -ls gs://xxxx inside Dataproc, and I checked that Spark is pointing to the right location to find gcs-connector.jar; that's OK too.
I included Livy in Dataproc using initialization (https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/livy/)
How can I include bigquery-connector in Livy's classpath? Could you help me, please? Thank you all!
...ANSWER
Answered 2021-Jul-02 at 18:48
It looks like your application depends on the BigQuery connector, not the GCS connector (bigquery.DefaultSource).
The GCS connector should always be included in the HADOOP classpath by default, but you will have to manually add the BigQuery connector jar to your application.
Assuming this is a Spark application, you can set the spark.jars property to pull in the BigQuery connector jar from GCS at runtime: spark.jars='gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'
For more installation options, see https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/README.md
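Since the job is submitted through Livy here, the property can also ride along in the batch request's conf map; a sketch with a placeholder application path:

{
  "file": "hdfs:///apps/my-job.py",
  "conf": {
    "spark.jars": "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"
  }
}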
QUESTION
For reference, Livy is a REST endpoint used to pull data from a cluster. Within the same account, my Lambda function always times out when attempting to access the Livy endpoint used by EMR.
Endpoint: http://ip-xxxx-xx-xxx-xx.ec2.internal:8998/
Error:
...ANSWER
Answered 2021-Jul-16 at 18:46
Firstly, your HTTP endpoint is .ec2.internal, so I'm assuming it's in some VPC (not public). Your Lambda must therefore be within the same VPC.
You are trying to access an HTTP endpoint over a TCP connection, so your Lambda's security group must be whitelisted in the EMR master node's security group.
Lastly, you don't need AmazonEMRFullAccessPolicy_v2, because you are not accessing the AWS API.
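A quick probe that separates network problems from Livy problems, run from a Lambda inside the VPC (the endpoint is the placeholder from the question):

import requests

# A short timeout turns a silent Lambda hang into an explicit error; a timeout
# here almost always means a security-group or routing misconfiguration.
try:
    r = requests.get("http://ip-xxxx-xx-xxx-xx.ec2.internal:8998/sessions", timeout=5)
    print(r.status_code, r.json())
except requests.exceptions.RequestException as e:
    print("Blocked at the network layer:", e)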
QUESTION
I created MWAA using the public network option (version 2.0.2) and created a sample Airflow DAG which starts EMR with the following properties:
...ANSWER
Answered 2021-Jun-23 at 21:34
Issue resolved by setting:
QUESTION
I am a little confused about how to pass the arguments as REST API JSON.
Consider the spark-submit command below.
...ANSWER
Answered 2021-Jun-22 at 15:04
Posting here in case it helps someone.
We found out that we can pass args as a list in an HTTP request (to the Livy server). In args, we can pass all the Hudi-related confs, like ["key1","value1","key2","value2","--hoodie-conf","confname=value", ...]. We are able to submit jobs via the Livy server.
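A hedged example of such a payload (the file, class name, and Hudi conf values are placeholders):

{
  "file": "hdfs:///apps/hudi-job.jar",
  "className": "com.example.HudiJob",
  "args": ["key1", "value1",
           "key2", "value2",
           "--hoodie-conf", "hoodie.datasource.write.recordkey.field=id"]
}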
QUESTION
Zeppelin 0.9.0 does not work with Kerberos.
I have added "zeppelin.server.kerberos.keytab" and "zeppelin.server.kerberos.principal" in zeppelin-site.xml,
but I also get the error "Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "bigdser5/10.3.87.27"; destination host is: "bigdser1":8020;"
Adding "spark.yarn.keytab" and "spark.yarn.principal" in the Spark interpreter does not work either.
My spark-shell does work with Kerberos.
My Kerberos steps:
1. kadmin.local -q "addprinc jzyc/hadoop"
2. kadmin.local -q "xst -k jzyc.keytab jzyc/hadoop@JJKK.COM"
3. Copy jzyc.keytab to the other servers.
4. kinit -kt jzyc.keytab jzyc/hadoop@JJKK.COM
In my Livy I get the error "javax.servlet.ServletException: org.apache.hadoop.security.authentication.client.AuthenticationException: javax.security.auth.login.LoginException: No key to store"
...ANSWER
Answered 2021-Apr-15 at 09:01
INFO [2021-04-15 16:44:46,522] ({dispatcher-event-loop-1} Logging.scala[logInfo]:57) - Got an error when resolving hostNames. Falling back to /default-rack for all
INFO [2021-04-15 16:44:46,561] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Attempting to login to KDC using principal: jzyc/bigdser4@JOIN.COM
INFO [2021-04-15 16:44:46,574] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Successfully logged into KDC.
INFO [2021-04-15 16:44:47,124] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1346508100_40, ugi=jzyc/bigdser4@JOIN.COM (auth:KERBEROS)]] with renewer yarn/bigdser1@JOIN.COM
INFO [2021-04-15 16:44:47,265] ({FIFOScheduler-interpreter_1099886208-Worker-1} DFSClient.java[getDelegationToken]:700) - Created token for jzyc: HDFS_DELEGATION_TOKEN owner=jzyc/bigdser4@JOIN.COM, renewer=yarn, realUser=, issueDate=1618476287222, maxDate=1619081087222, sequenceNumber=171, masterKeyId=21 on ha-hdfs:nameservice1
INFO [2021-04-15 16:44:47,273] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1346508100_40, ugi=jzyc/bigdser4@JOIN.COM (auth:KERBEROS)]] with renewer jzyc/bigdser4@JOIN.COM
INFO [2021-04-15 16:44:47,278] ({FIFOScheduler-interpreter_1099886208-Worker-1} DFSClient.java[getDelegationToken]:700) - Created token for jzyc: HDFS_DELEGATION_TOKEN owner=jzyc/bigdser4@JOIN.COM, renewer=jzyc, realUser=, issueDate=1618476287276, maxDate=1619081087276, sequenceNumber=172, masterKeyId=21 on ha-hdfs:nameservice1
INFO [2021-04-15 16:44:47,331] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Renewal interval is 86400051 for token HDFS_DELEGATION_TOKEN
INFO [2021-04-15 16:44:47,492] ({dispatcher-event-loop-0} Logging.scala[logInfo]:57) - Got an error when resolving hostNames. Falling back to /default-rack for all
INFO [2021-04-15 16:44:47,493] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Scheduling renewal in 18.0 h.
INFO [2021-04-15 16:44:47,494] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Updating delegation tokens.
INFO [2021-04-15 16:44:47,521] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Updating delegation tokens for current user.
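For reference, the Livy-side Kerberos settings live in livy.conf; a sketch using the key names from livy.conf.template (the principals and keytab paths below are assumptions based on the realm in the question):

livy.server.auth.type = kerberos
livy.server.auth.kerberos.principal = HTTP/_HOST@JJKK.COM
livy.server.auth.kerberos.keytab = /etc/security/keytabs/spnego.service.keytab
livy.server.launch.kerberos.principal = jzyc/hadoop@JJKK.COM
livy.server.launch.kerberos.keytab = /path/to/jzyc.keytab

If the configured principal does not match what is actually in the keytab, a LoginException like the "No key to store" in the question is a plausible outcome.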
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported