initialization-actions | Run in all nodes of your cluster before the cluster starts

 by GoogleCloudDataproc | Shell | Version: Current | License: Apache-2.0

kandi X-RAY | initialization-actions Summary

initialization-actions is a Shell library. It has no reported bugs or vulnerabilities, a permissive license, and low support activity. You can download it from GitHub.

When creating a Dataproc cluster, you can specify initialization actions in executables and/or scripts that Dataproc will run on all nodes in your Dataproc cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.
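
A minimal, illustrative example of attaching one of this repo's scripts at cluster creation is shown below; the cluster name, region, and package list are placeholders, and the path assumes the repo's python/pip-install.sh action hosted in the public regional bucket.

    # Hedged sketch: run the pip-install initialization action on every node at cluster startup.
    gcloud dataproc clusters create my-cluster \
        --region=us-central1 \
        --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh \
        --metadata='PIP_PACKAGES=pandas scikit-learn'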

            Support

              initialization-actions has a low active ecosystem.
              It has 575 stars and 507 forks. There are 67 watchers for this library.
              It had no major release in the last 6 months.
              There are 26 open issues and 263 closed issues. On average, issues are closed in 114 days. There are 18 open pull requests and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of initialization-actions is current.

            Quality

              initialization-actions has 0 bugs and 0 code smells.

            Security

              initialization-actions has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              initialization-actions code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            License

              initialization-actions is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              initialization-actions releases are not available. You will need to build from source code and install.
              Installation instructions are not available. Examples and code snippets are available.
              It has 3405 lines of code, 205 functions and 87 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi's functional review helps you automatically verify the functionalities of the libraries and avoid rework.
            Currently covering the most popular Java, JavaScript and Python libraries.

            initialization-actions Key Features

            No Key Features are available at this moment for initialization-actions.

            initialization-actions Examples and Code Snippets

            No Code Snippets are available at this moment for initialization-actions.

            Community Discussions

            QUESTION

            GCP Dataproc - cluster creation failing when using connectors.sh in initialization-actions
            Asked 2022-Feb-01 at 20:01

             I'm creating a Dataproc cluster, and it times out when I add connectors.sh to the initialization actions.

             Here is the command and the error:

            ...

            ANSWER

            Answered 2022-Feb-01 at 20:01

            It seems you are using an old version of the init action script. Based on the documentation from the Dataproc GitHub repo, you can set the version of the Hadoop GCS connector without the script in the following manner:
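
             For illustration, a hedged sketch of that approach, assuming the GCS_CONNECTOR_VERSION cluster metadata key from the Dataproc documentation; the cluster name, region, image version, and connector version below are placeholders:

                 # Pin the GCS connector through cluster metadata instead of the connectors.sh init action.
                 gcloud dataproc clusters create my-cluster \
                     --region=us-central1 \
                     --image-version=2.0-debian10 \
                     --metadata=GCS_CONNECTOR_VERSION=2.2.2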

            Source https://stackoverflow.com/questions/70944833

            QUESTION

            Dataproc Cluster creation is failing with PIP error "Could not build wheels"
            Asked 2022-Jan-24 at 13:04

             We used to spin up a cluster with the configuration below. It ran fine until last week but is now failing with: ERROR: Failed cleaning build dir for libcst Failed to build libcst ERROR: Could not build wheels for libcst which use PEP 517 and cannot be installed directly

            ...

            ANSWER

            Answered 2022-Jan-19 at 21:50

            Seems you need to upgrade pip, see this question.

             But there can be multiple pips in a Dataproc cluster, so you need to choose the right one.

             1. For init actions, at cluster creation time, /opt/conda/default is a symbolic link to either /opt/conda/miniconda3 or /opt/conda/anaconda, depending on which Conda env you choose; the default is Miniconda3, but in your case it is Anaconda. So you can run either /opt/conda/default/bin/pip install --upgrade pip or /opt/conda/anaconda/bin/pip install --upgrade pip.

            2. For custom images, at image creation time, you want to use the explicit full path, /opt/conda/anaconda/bin/pip install --upgrade pip for Anaconda, or /opt/conda/miniconda3/bin/pip install --upgrade pip for Miniconda3.

            So, you can simply use /opt/conda/anaconda/bin/pip install --upgrade pip for both init actions and custom images.
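
             A minimal sketch of an initialization action that applies this fix, assuming the Anaconda component as in the question; the package list is a placeholder:

                 #!/bin/bash
                 # Upgrade pip first so that PEP 517 packages such as libcst can build their wheels.
                 set -euxo pipefail
                 /opt/conda/anaconda/bin/pip install --upgrade pip
                 /opt/conda/anaconda/bin/pip install libcst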

            Source https://stackoverflow.com/questions/70743642

            QUESTION

            Facing Issue with DataprocCreateClusterOperator (Airflow 2.0)
            Asked 2022-Jan-04 at 22:26

             I'm trying to migrate from Airflow 1.10 to Airflow 2, which renames some operators, including DataprocClusterCreateOperator. Here is an extract of the code.

            ...

            ANSWER

            Answered 2022-Jan-04 at 22:26

            It seems that in this version the type of metadata parameter is no longer dict. From the docs:

            metadata (Sequence[Tuple[str, str]]) -- Additional metadata that is provided to the method.

            Try with:

            Source https://stackoverflow.com/questions/70423687

            QUESTION

            Facing Issue in passing metadata field with DataprocCreateClusterOperator (Airflow 2.0)
            Asked 2021-Dec-22 at 20:29

             I am facing some issues while installing packages in the Dataproc cluster using DataprocCreateClusterOperator. I am trying to upgrade to Airflow 2.0.

            Error Message:

            ...

            ANSWER

            Answered 2021-Dec-22 at 20:29

             The following DAG works as expected after changing:

             • the cluster name (cluster_name -> cluster-name).
             • the path for the scripts.
             • the DAG definition.

            Source https://stackoverflow.com/questions/70434588

            QUESTION

            creating dataproc cluster with multiple jars
            Asked 2021-Nov-27 at 22:40

             I am trying to create a Dataproc cluster that will connect Dataproc to Pub/Sub. I need to add multiple jars at cluster creation via the spark.jars flag.

            ...

            ANSWER

            Answered 2021-Nov-27 at 22:40

            The answer you linked is the correct way to do it: How can I include additional jars when starting a Google DataProc cluster to use with Jupyter notebooks?

             If you also post the command you tried with the escaping syntax and the resulting error message, then others could more easily verify what you did wrong. It looks like you're specifying an additional Spark property (spark:spark.driver.memory=3000m) in addition to your list of jars, and tried to just space-separate it from your jars flag, which isn't allowed.

            Per the linked result, you'd need to use the newly assigned separator character to separate the second spark property:
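
             A hedged sketch of that syntax, using '#' as the custom delimiter (see gcloud topic escaping); the cluster, bucket, and jar names are placeholders:

                 # '^#^' switches the list delimiter to '#', so the comma-separated jar list stays intact.
                 gcloud dataproc clusters create my-cluster \
                     --region=us-central1 \
                     --properties='^#^spark:spark.jars=gs://my-bucket/pubsub-lib.jar,gs://my-bucket/other-lib.jar#spark:spark.driver.memory=3000m'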

            Source https://stackoverflow.com/questions/70139181

            QUESTION

            How to include BigQuery Connector inside Dataproc using Livy
            Asked 2021-Jul-17 at 05:00

             I'm trying to run my application using Livy, which resides inside GCP Dataproc, but I'm getting this: "Caused by: java.lang.ClassNotFoundException: bigquery.DefaultSource"

            I'm able to run hadoop fs -ls gs://xxxx inside Dataproc and I checked if Spark is pointing to the right location in order to find gcs-connector.jar and that's ok too.

             I included Livy in Dataproc using the initialization action (https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/livy/)

            How can I include bigquery-connector in Livy's classpath? Could you help me, please? Thank you all!

            ...

            ANSWER

            Answered 2021-Jul-02 at 18:48

            It looks like your application is depending on the BigQuery connector, not the GCS connector (bigquery.DefaultSource).

            The GCS connector should always be included in the HADOOP classpath by default, but you will have to manually add the BigQuery connector jar to your application.

            Assuming this is a Spark application, you can set the Spark jar property to pull in the bigquery connector jar from GCS at runtime: spark.jars='gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'
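
             For example, a hedged sketch of a Livy session request that passes this property; the host and port assume Livy's default of 8998, so adjust to your setup:

                 # Create a Livy session whose Spark conf pulls the BigQuery connector jar from GCS.
                 curl -s -X POST http://localhost:8998/sessions \
                   -H 'Content-Type: application/json' \
                   -d '{"kind": "pyspark", "conf": {"spark.jars": "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"}}'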

            For more installation options, see https://github.com/GoogleCloudDataproc/spark-bigquery-connector/blob/master/README.md

            Source https://stackoverflow.com/questions/68215418

            QUESTION

            BigQuery TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'
            Asked 2021-Apr-05 at 08:02
            Environment details
            • OS type and version: 1.5.29-debian10
            • Python version: 3.7
            • google-cloud-bigquery version: 2.8.0

             I'm provisioning a Dataproc cluster which gets data from BigQuery into a pandas dataframe. As my data is growing, I was looking to boost performance and heard about using the BigQuery Storage client.

             I had the same problem in the past, and it was solved by pinning google-cloud-bigquery to version 1.26.1. If I use that version, I get the following message.

            ...

            ANSWER

            Answered 2021-Feb-15 at 14:42

             By default, Dataproc installs pyarrow 0.15.0, while the BigQuery Storage API needs a more recent version. Manually setting pyarrow to 3.0.0 at install time solved the issue. That said, PySpark has a compatibility setting for pyarrow >= 0.15.0 (https://spark.apache.org/docs/3.0.0-preview/sql-pyspark-pandas-with-arrow.html#apache-arrow-in-spark). I've looked at the Dataproc release notes, and this env variable has been set by default since May 2020.
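
             As an illustrative sketch, pyarrow (and google-cloud-bigquery) can be pinned at cluster creation with this repo's pip-install action; the cluster name, region, image version, and bucket path are placeholders:

                 # Install pinned package versions on every node via the python/pip-install.sh action.
                 gcloud dataproc clusters create my-cluster \
                     --region=us-central1 \
                     --image-version=1.5-debian10 \
                     --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh \
                     --metadata='PIP_PACKAGES=pyarrow==3.0.0 google-cloud-bigquery==2.8.0'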

            Source https://stackoverflow.com/questions/66152733

            QUESTION

            How to pass spark parameter to a dataproc workflow template?
            Asked 2021-Mar-12 at 14:58

            Here's what I have:

            ...

            ANSWER

            Answered 2021-Jan-21 at 20:28
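
             As an illustrative sketch only (not necessarily the accepted approach), Spark properties can be attached to a workflow-template job step with --properties; every name below is a placeholder:

                 gcloud dataproc workflow-templates add-job spark \
                     --workflow-template=my-template \
                     --region=us-central1 \
                     --step-id=spark-step \
                     --class=com.example.SparkApp \
                     --jars=gs://my-bucket/spark-app.jar \
                     --properties=spark.executor.memory=4g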

            QUESTION

            Where do I configure spark executors and executor memory of a spark job in a dataproc cluster?
            Asked 2021-Feb-22 at 09:58

             I am new to GCP and have been asked to work on Dataproc to create a Spark application that brings data from a source database to BigQuery on GCP. I created a Dataproc cluster with the following options:

            ...

            ANSWER

            Answered 2021-Feb-22 at 09:56

            You can pass them via the --properties option:

            --properties=[PROPERTY=VALUE,…] List of key value pairs to configure Spark. For a list of available properties, see: https://spark.apache.org/docs/latest/configuration.html#available-properties.

            Example using gcloud command:
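
             A hedged illustration of such a command, submitting a Spark job with executor settings via --properties; the cluster, region, class, and jar names are placeholders:

                 gcloud dataproc jobs submit spark \
                     --cluster=my-cluster \
                     --region=us-central1 \
                     --class=com.example.SparkApp \
                     --jars=gs://my-bucket/spark-app.jar \
                     --properties=spark.executor.instances=5,spark.executor.memory=4g,spark.executor.cores=2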

            Source https://stackoverflow.com/questions/66312936

            QUESTION

            How do I specify multiple shell scripts as initialization actions for Dataproc cluster creation?
            Asked 2021-Jan-22 at 20:22

            Google's documentation says that --initialization-actions takes a list of GCS URLs. If I specify one:

            ...

            ANSWER

            Answered 2021-Jan-22 at 18:36

            Just figured it out, the format needs to be:
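
             That is, a single comma-separated list of GCS URIs, as in this illustrative sketch; the cluster, bucket, and script names are placeholders:

                 gcloud dataproc clusters create my-cluster \
                     --region=us-central1 \
                     --initialization-actions=gs://my-bucket/first-action.sh,gs://my-bucket/second-action.sh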

            Source https://stackoverflow.com/questions/65850509

             Community Discussions and Code Snippets contain sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install initialization-actions

            You can download it from GitHub.

            Support

            See CONTRIBUTING.md
            Find more information at:

            CLONE
          • HTTPS

            https://github.com/GoogleCloudDataproc/initialization-actions.git

          • CLI

            gh repo clone GoogleCloudDataproc/initialization-actions

          • SSH

            git@github.com:GoogleCloudDataproc/initialization-actions.git


            Consider Popular Shell Libraries

          • awesome by sindresorhus
          • ohmyzsh by ohmyzsh
          • realworld by gothinkster
          • nvm by nvm-sh
          • papers-we-love by papers-we-love

            Try Top Libraries by GoogleCloudDataproc

          • spark-bigquery-connector by GoogleCloudDataproc (Java)
          • hadoop-connectors by GoogleCloudDataproc (Java)
          • cloud-dataproc by GoogleCloudDataproc (Jupyter Notebook)
          • bdutil by GoogleCloudDataproc (Shell)
          • hive-bigquery-storage-handler by GoogleCloudDataproc (Java)