livy | open source REST interface for interacting with Apache Spark

by cloudera | Scala | Version: Current | License: No License

kandi X-RAY | livy Summary

livy is a Scala library typically used in Big Data and Spark applications. livy has no bugs and has medium support. However, livy has 1 reported vulnerability. You can download it from GitHub.

Livy is an open source REST interface for interacting with Apache Spark from anywhere

Support

              livy has a medium active ecosystem.
              It has 990 star(s) with 326 fork(s). There are 91 watchers for this library.
              It had no major release in the last 6 months.
              livy has no issues reported. There are 24 open pull requests and 0 closed requests.
              It has a neutral sentiment in the developer community.
              The latest version of livy is current.

Quality

              livy has 0 bugs and 0 code smells.

Security

livy has 1 vulnerability issue reported (0 critical, 0 high, 1 medium, 0 low).
              livy code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

License

              livy does not have a standard license declared.
              Check the repository for any license declaration and review the terms closely.
Without a license, all rights are reserved, and you cannot use the library in your applications without the author's permission.

Reuse

Prebuilt livy releases are not available. You will need to build it from source and install it.
              It has 21197 lines of code, 1368 functions and 224 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.


            livy Key Features

            No Key Features are available at this moment for livy.

            livy Examples and Code Snippets

            No Code Snippets are available at this moment for livy.

            Community Discussions

            QUESTION

            Spark cluster is not dynamically allocating resources to jobs
            Asked 2022-Feb-12 at 19:54

            The cluster is HDInsight 4.0 and has 250 GB RAM and 75 VCores.

I am running only one job, and the cluster always allocates 66 GB, 7 VCores and 7 containers to the job even though we have 250 GB and 75 VCores available for use. This is not particular to one job; I have run 3 different jobs and all have this issue. When I run 3 jobs in parallel, the cluster still allocates 66 GB RAM to each job. It looks like some static setting is configured.

            The following is the queue setup

I am using a tool called Talend (an ETL tool similar to Informatica) that gives me a GUI to create a job. The tool is Eclipse-based, and below is the code it generates for the Spark configuration. This is then given to Livy for submission to the cluster.

            ...

            ANSWER

            Answered 2022-Feb-12 at 19:54

The behavior is expected: 6 executors * 10 GB per executor = 60 GB.

If you want to allocate more resources, try increasing the executor count, for example to 24.
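
For illustration (not from the original answer), here is a minimal sketch of requesting more executors in the Livy /batches payload that a tool like Talend ultimately produces; the endpoint, jar path and class name are placeholders:

# Minimal sketch: ask Livy for more executors when submitting the batch.
# The endpoint, jar path and class name below are placeholders.
import requests

livy_url = "http://livy-host:8998/batches"     # hypothetical Livy endpoint
payload = {
    "file": "hdfs:///jobs/my-talend-job.jar",  # placeholder application jar
    "className": "com.example.MyTalendJob",    # placeholder main class
    "numExecutors": 24,                        # raise to use more of the 250 GB / 75 VCores
    "executorMemory": "10g",
    "executorCores": 3,
}
resp = requests.post(livy_url, json=payload)
print(resp.status_code, resp.json())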

            Source https://stackoverflow.com/questions/71081007

            QUESTION

pyspark erroring with an AM Container limit error
            Asked 2021-Nov-19 at 13:36

            All,

We have an Apache Spark v3.12 + YARN setup on AKS (SQL Server 2019 BDC). We ran Python code refactored into PySpark, which resulted in the error below:

            Application application_1635264473597_0181 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1635264473597_0181_000001 exited with exitCode: -104

            Failing this attempt.Diagnostics: [2021-11-12 15:00:16.915]Container [pid=12990,containerID=container_1635264473597_0181_01_000001] is running 7282688B beyond the 'PHYSICAL' memory limit. Current usage: 2.0 GB of 2 GB physical memory used; 4.9 GB of 4.2 GB virtual memory used. Killing container.

            Dump of the process-tree for container_1635264473597_0181_01_000001 :

            |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE

            |- 13073 12999 12990 12990 (python3) 7333 112 1516236800 235753 /opt/bin/python3 /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp/3677222184783620782

            |- 12999 12990 12990 12990 (java) 6266 586 3728748544 289538 /opt/mssql/lib/zulu-jre-8/bin/java -server -XX:ActiveProcessorCount=1 -Xmx1664m -Djava.io.tmpdir=/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.livy.rsc.driver.RSCDriverBootstrapper --properties-file /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_conf.properties --dist-cache-conf /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_dist_cache.properties

            |- 12990 12987 12990 12990 (bash) 0 0 4304896 775 /bin/bash -c /opt/mssql/lib/zulu-jre-8/bin/java -server -XX:ActiveProcessorCount=1 -Xmx1664m -Djava.io.tmpdir=/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.livy.rsc.driver.RSCDriverBootstrapper' --properties-file /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_conf.properties --dist-cache-conf /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_dist_cache.properties 1> /var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001/stdout 2> /var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001/stderr

            [2021-11-12 15:00:16.921]Container killed on request. Exit code is 143

            [2021-11-12 15:00:16.940]Container exited with a non-zero exit code 143.

            For more detailed output, check the application tracking page: https://sparkhead-0.mssql-cluster.everestre.net:8090/cluster/app/application_1635264473597_0181 Then click on links to logs of each attempt.

            . Failing the application.

The default settings are as below and there are no runtime settings:

            "settings": {
            "spark-defaults-conf.spark.driver.cores": "1",
            "spark-defaults-conf.spark.driver.memory": "1664m",
            "spark-defaults-conf.spark.driver.memoryOverhead": "384",
            "spark-defaults-conf.spark.executor.instances": "1",
            "spark-defaults-conf.spark.executor.cores": "2",
            "spark-defaults-conf.spark.executor.memory": "3712m",
            "spark-defaults-conf.spark.executor.memoryOverhead": "384",
            "yarn-site.yarn.nodemanager.resource.memory-mb": "12288",
            "yarn-site.yarn.nodemanager.resource.cpu-vcores": "6",
            "yarn-site.yarn.scheduler.maximum-allocation-mb": "12288",
            "yarn-site.yarn.scheduler.maximum-allocation-vcores": "6",
            "yarn-site.yarn.scheduler.capacity.maximum-am-resource-percent": "0.34".
            }

Is the AM Container mentioned here the Application Master container or the Application Manager (of YARN)? If it is the Application Master, then in a cluster-mode setting, do the driver and the Application Master run in the same container?

What runtime parameter do I change to make the PySpark code run successfully?

            Thanks,
            grajee

            ...

            ANSWER

            Answered 2021-Nov-19 at 13:36

Likely you don't need to change any settings. Exit code 143 could mean a lot of things, including that you ran out of memory. To test whether you ran out of memory, I'd reduce the amount of data you are using and see if your code starts to work. If it does, it's likely you ran out of memory and should consider refactoring your code. In general, I suggest trying code changes before making Spark config changes.

For an understanding of how the Spark driver works on YARN, here's a reasonable explanation: https://sujithjay.com/spark/with-yarn
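
One detail worth noting from the posted settings (my reading, not part of the original answer): the Application Master / driver container size is spark.driver.memory plus spark.driver.memoryOverhead, which is exactly the 2 GB limit the diagnostics report. A quick sketch of that arithmetic:

# Container sizing implied by the posted BDC settings.
driver_memory_mb = 1664    # spark-defaults-conf.spark.driver.memory
driver_overhead_mb = 384   # spark-defaults-conf.spark.driver.memoryOverhead

container_mb = driver_memory_mb + driver_overhead_mb
print(container_mb)        # 2048 MB == 2 GB -- matches "2.0 GB of 2 GB physical memory used"

# If reducing the data volume is not enough, raising spark.driver.memory and/or
# spark.driver.memoryOverhead (within yarn.scheduler.maximum-allocation-mb) is
# one configuration knob to experiment with.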

            Source https://stackoverflow.com/questions/69960411

            QUESTION

            adding ec2 key pair to EMR cluster using cloudformation
            Asked 2021-Nov-03 at 01:17

I am trying to write a template to create an EMR cluster using CloudFormation.

            So far I came up with this

            ...

            ANSWER

            Answered 2021-Nov-03 at 01:17

There is an Ec2KeyName parameter:

            The name of the EC2 key pair that can be used to connect to the master node using SSH as the user called "hadoop."
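
For illustration (a sketch of mine, not from the original template), this is where Ec2KeyName sits in an AWS::EMR::Cluster resource, written as a Python dict mirroring the CloudFormation JSON; all names and values are placeholders:

# Sketch of an AWS::EMR::Cluster resource fragment as a Python dict
# mirroring CloudFormation JSON. All values are illustrative placeholders.
emr_cluster = {
    "Type": "AWS::EMR::Cluster",
    "Properties": {
        "Name": "example-cluster",
        "ReleaseLabel": "emr-6.4.0",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Instances": {
            "Ec2KeyName": "my-key-pair",  # EC2 key pair used to SSH in as user "hadoop"
            "MasterInstanceGroup": {"InstanceCount": 1, "InstanceType": "m5.xlarge"},
            "CoreInstanceGroup": {"InstanceCount": 2, "InstanceType": "m5.xlarge"},
        },
    },
}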

            Source https://stackoverflow.com/questions/69813218

            QUESTION

            convert column of dictionaries to columns in pyspark dataframe
            Asked 2021-Oct-05 at 01:58

I have the PySpark dataframe df below. It has the schema shown below, and I've also supplied some sample data and the desired output I'm looking for. The problem I'm having is that the attributes column has values that are dictionaries. I would like to create new columns for each key in the dictionaries, but the values in the attributes column are strings, so I'm having trouble using explode or from_json.

I made an attempt based on another SO post using explode; the code I ran and the error are below the example data and desired output.

            Also I don't know what all the keys in the dict might be, since different records have different length dicts.

Does anyone have a suggestion for how to do this? I was thinking of converting it to pandas and trying to solve it that way, but I'm hoping there's a better/faster PySpark solution.

            ...

            ANSWER

            Answered 2021-Oct-05 at 01:58

Try using the from_json function with the corresponding schema to parse the JSON string.
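
Since the keys are not known up front, here is a hedged sketch (mine, not from the original answer) that parses the attributes column as a string-to-string map and then promotes every observed key to its own column; the sample rows are made up:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.getOrCreate()

# Made-up sample: an id plus a JSON string of attributes with unknown keys.
df = spark.createDataFrame(
    [("1", '{"color": "red", "size": "M"}'),
     ("2", '{"color": "blue", "weight": "2kg"}')],
    ["id", "attributes"],
)

# Parse the JSON string into a map so that unknown keys are preserved.
parsed = df.withColumn(
    "attrs", F.from_json("attributes", MapType(StringType(), StringType()))
)

# Collect the distinct keys, then promote each one to its own column.
keys = sorted({r["k"] for r in
               parsed.select(F.explode(F.map_keys("attrs")).alias("k")).collect()})
result = parsed.select("id", *[F.col("attrs")[k].alias(k) for k in keys])
result.show()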

            Source https://stackoverflow.com/questions/69443677

            QUESTION

            Spark context is created every time a batch job is executed in yarn
            Asked 2021-Aug-01 at 20:52

I am wondering: is there any way I can create the SparkContext once in the YARN cluster, so that incoming jobs reuse that context? Context creation takes 20 seconds or sometimes more in my cluster. I am using PySpark for scripting and Livy to submit jobs.

            ...

            ANSWER

            Answered 2021-Jul-28 at 12:44

No, you can't just have a standing SparkContext running in YARN. Maybe another idea is to run in client mode, where the client has its own SparkContext (this is the method used by tools like Apache Zeppelin and the spark-shell).
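
A related option, beyond the original answer: a Livy interactive session keeps one SparkContext alive for the session's lifetime, so multiple statements can reuse it without paying the startup cost each time. A minimal sketch against the Livy REST API (the host is a placeholder and the polling is deliberately naive):

import time
import requests

LIVY = "http://livy-host:8998"   # placeholder Livy endpoint

# Create one interactive PySpark session; its SparkContext stays up until the
# session is deleted or times out.
session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()
session_id = session["id"]

# Wait for the session to become idle, then submit statements against it.
while requests.get(f"{LIVY}/sessions/{session_id}").json()["state"] != "idle":
    time.sleep(5)

stmt = requests.post(f"{LIVY}/sessions/{session_id}/statements",
                     json={"code": "sc.parallelize(range(100)).count()"}).json()
print(stmt)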

            Source https://stackoverflow.com/questions/68559885

            QUESTION

            How to include BigQuery Connector inside Dataproc using Livy
            Asked 2021-Jul-17 at 05:00

I'm trying to run my application using Livy, which resides inside GCP Dataproc, but I'm getting this: "Caused by: java.lang.ClassNotFoundException: bigquery.DefaultSource".

            I'm able to run hadoop fs -ls gs://xxxx inside Dataproc and I checked if Spark is pointing to the right location in order to find gcs-connector.jar and that's ok too.

            I included Livy in Dataproc using initialization (https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/livy/)

            How can I include bigquery-connector in Livy's classpath? Could you help me, please? Thank you all!

            ...

            ANSWER

            Answered 2021-Jul-02 at 18:48

            It looks like your application is depending on the BigQuery connector, not the GCS connector (bigquery.DefaultSource).

            The GCS connector should always be included in the HADOOP classpath by default, but you will have to manually add the BigQuery connector jar to your application.

            Assuming this is a Spark application, you can set the Spark jar property to pull in the bigquery connector jar from GCS at runtime: spark.jars='gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'
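
For example (a sketch of mine, not from the answer), the connector jar can be attached when the Livy session or batch is created, either through the conf map or the request's jars field; the application path is a placeholder:

# Two sketches of a Livy request body that pulls in the BigQuery connector jar.

# 1) Via a Spark conf entry when creating an interactive session:
session_body = {
    "kind": "pyspark",
    "conf": {"spark.jars": "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"},
}

# 2) Via the "jars" field when submitting a batch:
batch_body = {
    "file": "gs://my-bucket/my_app.py",   # placeholder application
    "jars": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
}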

            For more installation options, see https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/README.md

            Source https://stackoverflow.com/questions/68215418

            QUESTION

            AWS - Lambda cannot access Livy endpoint for EMR
            Asked 2021-Jul-16 at 18:46

For reference, Livy is a REST endpoint used to pull data from a cluster. Within the same account, my Lambda function always times out when attempting to access the Livy endpoint exposed by EMR.

            Endpoint: http://ip-xxxx-xx-xxx-xx.ec2.internal:8998/

            Error:

            ...

            ANSWER

            Answered 2021-Jul-16 at 18:46

Firstly, your HTTP endpoint is .ec2.internal, so I'm assuming it's in some VPC (not public). Your Lambda must therefore be in the same VPC.

You are trying to access an HTTP endpoint over a TCP connection, so your Lambda's security group must be whitelisted in the EMR master node's security group.

Lastly, you don't need AmazonEMRFullAccessPolicy_v2 because you are not accessing the AWS API.
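
To illustrate the security-group point (a sketch under my own assumptions, not part of the answer), the EMR master node's security group needs an inbound rule allowing the Lambda's security group on the Livy port; both group IDs below are placeholders:

import boto3

ec2 = boto3.client("ec2")

EMR_MASTER_SG = "sg-0123456789abcdef0"   # placeholder: EMR master node's security group
LAMBDA_SG = "sg-0fedcba9876543210"       # placeholder: Lambda function's security group

# Allow the Lambda's security group to reach Livy (port 8998) on the master node.
ec2.authorize_security_group_ingress(
    GroupId=EMR_MASTER_SG,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8998,
        "ToPort": 8998,
        "UserIdGroupPairs": [{"GroupId": LAMBDA_SG}],
    }],
)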

            Source https://stackoverflow.com/questions/68413768

            QUESTION

AWS MWAA - connection timed out when trying to make a REST call to Livy
            Asked 2021-Jun-23 at 21:34

I created MWAA using the public network option (version 2.0.2) and created a sample Airflow DAG that starts EMR with the following properties:

            ...

            ANSWER

            Answered 2021-Jun-23 at 21:34

Issue resolved by setting:

            Source https://stackoverflow.com/questions/68089560

            QUESTION

            need help on submitting hudi delta streamer job via apache livy
            Asked 2021-Jun-22 at 15:04

I am a little confused about how to pass the arguments as REST API JSON.

Consider the spark-submit command below.

            ...

            ANSWER

            Answered 2021-Jun-22 at 15:04

Posting here in case it helps someone.

We found out that we can pass args as a list in an HTTP request to the Livy server. In args, we can pass all the Hudi-related confs, like ["key1","value1","key2","value2","--hoodie-conf","confname=value", ... etc.]. We are able to submit jobs via the Livy server.
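
A sketch of what such a Livy /batches request body might look like (my own illustration; the jar path, table names and conf values are placeholders, while the class name is the Hudi DeltaStreamer entry point):

# Illustrative Livy /batches request body for a Hudi DeltaStreamer job.
# Paths, table names and conf values are placeholders.
batch_body = {
    "file": "hdfs:///libs/hudi-utilities-bundle.jar",
    "className": "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
    "args": [
        "--table-type", "COPY_ON_WRITE",
        "--target-base-path", "hdfs:///data/hudi/my_table",
        "--target-table", "my_table",
        "--hoodie-conf", "hoodie.datasource.write.recordkey.field=id",
    ],
    "conf": {"spark.serializer": "org.apache.spark.serializer.KryoSerializer"},
}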

            Source https://stackoverflow.com/questions/68021329

            QUESTION

How can I run Zeppelin with Kerberos in CDH 6.3.2
            Asked 2021-May-26 at 03:24

            zeppelin 0.9.0 does not work with Kerberos

I have added "zeppelin.server.kerberos.keytab" and "zeppelin.server.kerberos.principal" in zeppelin-site.xml.

But I also get the error "Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "bigdser5/10.3.87.27"; destination host is: "bigdser1":8020;"

I also added "spark.yarn.keytab" and "spark.yarn.principal" in the Spark interpreter, but it still does not work.

My spark-shell works with Kerberos.

My Kerberos steps:

1. kadmin.local -q "addprinc jzyc/hadoop"

2. kadmin.local -q "xst -k jzyc.keytab jzyc/hadoop@JJKK.COM"

3. Copy jzyc.keytab to the other servers

4. kinit -kt jzyc.keytab jzyc/hadoop@JJKK.COM

In my Livy I get the error "javax.servlet.ServletException: org.apache.hadoop.security.authentication.client.AuthenticationException: javax.security.auth.login.LoginException: No key to store"

            ...

            ANSWER

            Answered 2021-Apr-15 at 09:01
            INFO [2021-04-15 16:44:46,522] ({dispatcher-event-loop-1} Logging.scala[logInfo]:57) - Got an error when resolving hostNames. Falling back to /default-rack for all
             INFO [2021-04-15 16:44:46,561] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Attempting to login to KDC using principal: jzyc/bigdser4@JOIN.COM
             INFO [2021-04-15 16:44:46,574] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Successfully logged into KDC.
             INFO [2021-04-15 16:44:47,124] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1346508100_40, ugi=jzyc/bigdser4@JOIN.COM (auth:KERBEROS)]] with renewer yarn/bigdser1@JOIN.COM
             INFO [2021-04-15 16:44:47,265] ({FIFOScheduler-interpreter_1099886208-Worker-1} DFSClient.java[getDelegationToken]:700) - Created token for jzyc: HDFS_DELEGATION_TOKEN owner=jzyc/bigdser4@JOIN.COM, renewer=yarn, realUser=, issueDate=1618476287222, maxDate=1619081087222, sequenceNumber=171, masterKeyId=21 on ha-hdfs:nameservice1
             INFO [2021-04-15 16:44:47,273] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1346508100_40, ugi=jzyc/bigdser4@JOIN.COM (auth:KERBEROS)]] with renewer jzyc/bigdser4@JOIN.COM
             INFO [2021-04-15 16:44:47,278] ({FIFOScheduler-interpreter_1099886208-Worker-1} DFSClient.java[getDelegationToken]:700) - Created token for jzyc: HDFS_DELEGATION_TOKEN owner=jzyc/bigdser4@JOIN.COM, renewer=jzyc, realUser=, issueDate=1618476287276, maxDate=1619081087276, sequenceNumber=172, masterKeyId=21 on ha-hdfs:nameservice1
             INFO [2021-04-15 16:44:47,331] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Renewal interval is 86400051 for token HDFS_DELEGATION_TOKEN
             INFO [2021-04-15 16:44:47,492] ({dispatcher-event-loop-0} Logging.scala[logInfo]:57) - Got an error when resolving hostNames. Falling back to /default-rack for all
             INFO [2021-04-15 16:44:47,493] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Scheduling renewal in 18.0 h.
             INFO [2021-04-15 16:44:47,494] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Updating delegation tokens.
             INFO [2021-04-15 16:44:47,521] ({FIFOScheduler-interpreter_1099886208-Worker-1} Logging.scala[logInfo]:57) - Updating delegation tokens for current user.
            

            Source https://stackoverflow.com/questions/67023694

Community Discussions and Code Snippets contain sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install livy

            You can download it from GitHub.

            Support

For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.
            CLONE
          • HTTPS

            https://github.com/cloudera/livy.git

          • CLI

            gh repo clone cloudera/livy

          • sshUrl

            git@github.com:cloudera/livy.git
