docker-spark | repository contains a Dockerfile | Continuous Deployment library

by sequenceiq | Shell | Version: Current | License: Apache-2.0

kandi X-RAY | docker-spark Summary

docker-spark is a Shell library typically used in DevOps, Continuous Deployment, Nginx, Docker, Kafka, and Spark applications. docker-spark has no bugs, it has no vulnerabilities, it has a Permissive License and it has medium support. You can download it from GitHub.

Apache Spark on Docker.

            kandi-support Support

              docker-spark has a medium active ecosystem.
              It has 769 star(s) with 296 fork(s). There are 65 watchers for this library.
              It had no major release in the last 6 months.
There are 22 open issues and 25 have been closed. On average, issues are closed in 64 days. There are 6 open pull requests and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of docker-spark is current.

            kandi-Quality Quality

              docker-spark has no bugs reported.

            kandi-Security Security

              docker-spark has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            kandi-License License

              docker-spark is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              docker-spark releases are not available. You will need to build from source code and install.
              Installation instructions are not available. Examples and code snippets are available.

            Top functions reviewed by kandi - BETA

            kandi's functional review helps you automatically verify the functionalities of the libraries and avoid rework.
Currently covering the most popular Java, JavaScript and Python libraries.

            docker-spark Key Features

            No Key Features are available at this moment for docker-spark.

            docker-spark Examples and Code Snippets

            No Code Snippets are available at this moment for docker-spark.

            Community Discussions

            QUESTION

            spark app socket communication between container on docker spark cluster
            Asked 2020-Dec-21 at 09:24

            So I have a Spark cluster running in Docker using Docker Compose. I'm using docker-spark images.

Then I add 2 more containers: one behaves as a server (plain Python) and one as a client (a Spark Streaming app). They both run on the same network.

For the server (plain Python) I have something like

            ...

            ANSWER

            Answered 2020-Dec-20 at 16:17

Okay, so I found that I can use the IP of the container, as long as all my containers are on the same network. So I check the IP by running
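A hedged illustration of looking up a container's IP on the shared network (the container and network names below are assumptions):

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' spark-client   # prints the container's IP on each network it is attached to
docker network inspect spark_default   # lists every container attached to the network together with its address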

            Source https://stackoverflow.com/questions/65340921

            QUESTION

            Docker Images - What are these layers?
            Asked 2020-Sep-03 at 23:39

I am looking at this image and it seems the layers are redundant, and these redundant layers ended up in the image? If they did, how did they end up in the image, leading to a large amount of space? How could I strip these layers?

            https://microbadger.com/images/openkbs/docker-spark-bde2020-zeppelin:latest

            ...

            ANSWER

            Answered 2020-Sep-03 at 23:39

            What you are seeing are not layers, but images that were pushed to the same registry. Basically, those are the different versions of one image.

In a repository, each image is accessible through a unique ID, its SHA value. Furthermore, one can tag images with convenient names, e.g. V1.0 or latest. These tags are not fixed, however. When an image is pushed with a tag that is already assigned to another image, the old image loses the tag and the new image gains it. Thus, a tag can move from one image to another. The tag latest is no exception. It has, however, one special property: the tag is always assigned to the most recently pushed version of an image.

            The person/owner of the registry has pushed new versions of the image and not tagged the old versions. Thus, all old versions show up as "untagged".

            If we pull a specific image, we will receive this image and this image only, not the complete registry.
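A small shell illustration of the difference (the image name is the one from the question; the commands are standard Docker CLI):

docker images --digests openkbs/docker-spark-bde2020-zeppelin   # local copies listed with their immutable SHA digests
docker pull openkbs/docker-spark-bde2020-zeppelin:latest        # pulling a tag fetches whatever image the tag currently points to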

            Source https://stackoverflow.com/questions/63655488

            QUESTION

            Docker - sharing layers between images
            Asked 2020-Aug-30 at 13:52

I downloaded two images and the sizes are as follows:

            ...

            ANSWER

            Answered 2020-Aug-30 at 13:04

Yes, identical layers are "shared". Docker uses hashes (covering the filesystem contents and the commands) to identify these layers.

So Docker shows you the full size of each image (including the base images), but that doesn't mean they need that much disk space in total.
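A quick way to see this locally (a rough sketch; the image name is only an example, reused from the earlier question):

docker history openkbs/docker-spark-bde2020-zeppelin:latest   # per-layer IDs and sizes; layers with the same ID are stored only once
docker system df -v   # per-image breakdown of shared vs. unique size on disk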

            Source https://stackoverflow.com/questions/63657644

            QUESTION

            Unable to access Spark nodes in Docker
            Asked 2020-Jul-22 at 16:45

I am using this setup (https://github.com/mvillarrealb/docker-spark-cluster.git) to establish a Spark cluster, but none of the IPs mentioned there, like 10.5.0.2, are accessible via browser; they just time out. I am unable to figure out what I am doing wrong.

            I am using Docker 2.3 on macOS Catalina.

In the spark-base Dockerfile I am using the following settings instead of the ones given there:

            ...

            ANSWER

            Answered 2020-Jul-22 at 16:45

            The Dockerfile tells the container what port to expose.
The compose file tells the host which ports to expose and to which ports inside the container the traffic should be forwarded.
If the source (host) port is not specified, a random port is assigned. That helps in this scenario, because you have multiple workers and you cannot map them all to the same fixed host port - that would result in a conflict.
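A rough shell equivalent of that behaviour (the image name and port below are assumptions, not taken from the compose file in question):

docker run -d -p 8080:8080 my-spark-worker   # fixed mapping: host port 8080 -> container port 8080; only one container can claim it
CID=$(docker run -d -p 8080 my-spark-worker)   # host port omitted: Docker assigns a random free host port
docker port "$CID" 8080   # shows which host port was chosen for container port 8080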

            Source https://stackoverflow.com/questions/63035419

            QUESTION

            Copying avro jars into docker jars directory
            Asked 2020-Apr-17 at 23:28

I'm learning Spark and I'd like to use an Avro data file, since Avro is external to Spark. I've downloaded the jar, but my problem is how to copy it into that specific place, the 'jars' dir, inside my container. I've read a related post here but I do not understand it.

I've also seen the command below on the main Spark website, but I think I need the jar file copied before running it.

            ...

            ANSWER

            Answered 2020-Apr-17 at 23:28

            Quoting docker cp Documentation,

            docker cp SRC_PATH CONTAINER:DEST_PATH

            If SRC_PATH specifies a file and DEST_PATH does not exist then the file is saved to a file created at DEST_PATH

            From the command you tried,

            The destination path /jars does not exist in the container since the actual destination should have been /usr/spark-2.4.1/jars/. Thus the jar was copied to the container with the name jars under the root (/) directory.

Try this command instead to add the jar to the Spark jars directory,
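A hedged sketch of such a copy (the jar file name and container name are assumptions; the destination path is the one from the explanation above):

docker cp spark-avro_2.11-2.4.1.jar spark-master:/usr/spark-2.4.1/jars/   # copy into the existing jars directory rather than to /jars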

            Source https://stackoverflow.com/questions/61282034

            QUESTION

            Apache Spark - ModuleNotFoundError: No module named 'mysql'
            Asked 2019-Nov-13 at 21:25

I'm trying to submit an Apache Spark driver program to a remote cluster. I'm having difficulties with the Python package called mysql. I installed this package on all Spark nodes. The cluster is running via docker-compose; the images are based on bde2020.

            ...

            ANSWER

            Answered 2019-Nov-13 at 21:25

            While the node has mysql installed, the container does not. What the logs are telling you is that impressions-agg_1 contains a script at /app/app.py which is trying to load mysql but cannot find it.

            Did you create impressions-agg_1? Add a RUN pip install mysql step to its Dockerfile.
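A quick, hedged way to confirm the diagnosis from the host (the compose service name is an assumption; the durable fix remains the RUN step in the Dockerfile):

docker-compose exec impressions-agg pip install mysql   # temporary: installs into the running container only
docker-compose exec impressions-agg python -c "import mysql"   # should now succeed; bake the install into the image for a permanent fix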

            Source https://stackoverflow.com/questions/58843895

            QUESTION

            Running Spark driver program in Docker container - no connection back from executor to the driver?
            Asked 2019-Oct-14 at 19:50

            UPDATE: The problem is resolved. The Docker image is here: docker-spark-submit

            I run spark-submit with a fat jar inside a Docker container. My standalone Spark cluster runs on 3 virtual machines - one master and two workers. From an executor log on a worker machine, I see that the executor has the following driver URL:

            "--driver-url" "spark://CoarseGrainedScheduler@172.17.0.2:5001"

172.17.0.2 is actually the address of the container with the driver program, not the host machine where the container is running. This IP is not accessible from the worker machine, therefore the worker is not able to communicate with the driver program. As I see from the source code of StandaloneSchedulerBackend, it builds driverUrl using the spark.driver.host setting:

            ...

            ANSWER

            Answered 2017-Aug-21 at 19:49

            So the working configuration is:

            • set spark.driver.host to the IP address of the host machine
            • set spark.driver.bindAddress to the IP address of the container

            The working Docker image is here: docker-spark-submit.
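A minimal spark-submit sketch of that split (all addresses, the class name, and the jar are placeholders for your own environment):

# spark.driver.host        = address of the Docker host, reachable from the workers
# spark.driver.bindAddress = address of the container the driver runs in
spark-submit \
  --master spark://192.168.1.10:7077 \
  --conf spark.driver.host=192.168.1.20 \
  --conf spark.driver.bindAddress=172.17.0.2 \
  --class com.example.Main \
  app.jar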

            Source https://stackoverflow.com/questions/45489248

            QUESTION

            Using numpy from the host OS for a spark container
            Asked 2019-May-30 at 06:35

            I want to use the Docker image with Apache Spark on Ubuntu 18.04.

The more popular image from the hub has Spark 1.6. The second image has a more recent version, Spark 2.2.

Neither image has numpy installed, and the basic examples in the Spark MLlib main guide require it.

I've tried, unsuccessfully, to install numpy via the Dockerfile, adding this to the original Dockerfile for the Spark 2.2 image:

            ...

            ANSWER

            Answered 2019-May-30 at 06:35

            QUESTION

            How to pass arguments to spark-submit using docker
            Asked 2019-Mar-19 at 17:31

I have a Docker container running on my laptop, with a master and three workers. I can launch the typical wordcount example by entering the IP of the master, using a command like this:

            ...

            ANSWER

            Answered 2019-Mar-19 at 17:31

            This is the command that solves my problem:
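As an illustration only, passing the master URL and arguments through docker exec might look like this (the container name, Spark paths, and input file are assumptions):

docker exec -it spark-master bin/spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  --master spark://172.17.0.2:7077 \
  examples/jars/spark-examples_2.11-2.4.4.jar \
  /tmp/words.txt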

            Source https://stackoverflow.com/questions/55242533

            QUESTION

            Spark on Fargate can't find local IP
            Asked 2018-Jun-24 at 15:40

            I have a build job I'm trying to set up in an AWS Fargate cluster of 1 node. When I try to run Spark to build my data, I get an error that seems to be about Java not being able to find "localHost".

            I set up the config by running a script that adds the spark-env.sh file, updates the /etc/hosts file and updates the spark-defaults.conf file.

            In the $SPARK_HOME/conf/spark-env.sh file, I add:

            • SPARK_LOCAL_IP
            • SPARK_MASTER_HOST

            In the $SPARK_HOME/conf/spark-defaults.conf

            • spark.jars.packages
            • spark.master
            • spark.driver.bindAddress
            • spark.driver.host

            In the /etc/hosts file, I append:

            • master

Invoking the spark-submit script and passing in the --master argument with an IP or URL doesn't seem to help.

I've tried using local[*], spark://:, and : variations, to no avail. Using 127.0.0.1 and localhost doesn't seem to make a difference, compared to using things like master and the IP returned from metadata.

On the AWS side, the Fargate cluster is running in a private subnet with a NAT gateway attached, so it does have egress and ingress network routes, as far as I can tell. I've tried using a public network and enabling the setting for ECS to automatically attach a public IP to the container. All the standard ports from the Spark docs are opened up on the container too.

            It seems to run fine up until the point at which it tries to gather its own IP.

            The error that I get back has this, in the stack:

            ...

            ANSWER

            Answered 2018-Jun-24 at 15:40

            The solution is to avoid user error...

            This was a total face-palm situation but I hope my misunderstanding of the Spark system can help some poor fool, like myself, who has spent too much time stuck on the same type of problem.

The answer for the last iteration (the gettyimages/docker-spark Docker image) was that I was trying to run the spark-submit command without having a master or worker(s) started. In the gettyimages/docker-spark repo, you can find a docker-compose file that shows you that it creates the master and the worker nodes before any Spark work is done. The way that image creates a master or a worker is by using the spark-class script and passing in the org.apache.spark.deploy.master.Master or org.apache.spark.deploy.worker.Worker class, respectively.

            So, putting it all together, I can use the configuration I was using but I have to create the master and worker(s) first, then execute the spark-submit command the same as I was already doing.

This is a quick-and-dirty sketch of one implementation, although I guarantee there are better ones, done by folks who actually know what they're doing:

The first 3 steps happen in a cluster boot script. I do this in an AWS Lambda, triggered by API Gateway.

            1. create a cluster and a queue or some sort of message brokerage system, like zookeeper/kafka. (I'm using API-Gateway -> lambda for this)
            2. pick a master node (logic in the lambda)
            3. create a message with some basic information, like the master's IP or domain and put it in the queue from step 1 (happens in the lambda)

            Everything below this happens in the startup script on the Spark nodes

4. create a step in the startup script that has the nodes check the queue for the message from step 3
5. add SPARK_MASTER_HOST and SPARK_LOCAL_IP to the $SPARK_HOME/conf/spark-env.sh file, using the information from the message you picked up in step 4
6. add spark.driver.bindAddress to the $SPARK_HOME/conf/spark-defaults.conf file, using the information from the message you picked up in step 4
7. use some logic in your startup script to decide whether "this" node is a master or a worker
8. start the master or worker. In the gettyimages/docker-spark image, you can start a master with $SPARK_HOME/bin/spark-class org.apache.spark.deploy.master.Master -h and you can start a worker with $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker -h spark://:7077
9. Now you can run the spark-submit command, which will deploy the work to the cluster.
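A rough shell sketch of steps 4-8 on one node (the queue-reading helper is hypothetical, and the gettyimages/docker-spark layout with $SPARK_HOME set is assumed):

#!/bin/sh
MASTER_HOST="$(fetch_master_host_from_queue)"   # hypothetical helper that reads the message from step 3
MY_IP="$(hostname -i)"

# steps 5 and 6: point Spark at the elected master and bind to this node's own address
echo "export SPARK_MASTER_HOST=$MASTER_HOST" >> "$SPARK_HOME/conf/spark-env.sh"
echo "export SPARK_LOCAL_IP=$MY_IP" >> "$SPARK_HOME/conf/spark-env.sh"
echo "spark.driver.bindAddress $MY_IP" >> "$SPARK_HOME/conf/spark-defaults.conf"

# steps 7 and 8: decide this node's role, then start the master or a worker
if [ "$MY_IP" = "$MASTER_HOST" ]; then
  "$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.master.Master -h "$MASTER_HOST"
else
  "$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.worker.Worker "spark://$MASTER_HOST:7077"
fi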

            Edit: (some code for reference) This is the addition to the lambda

            Source https://stackoverflow.com/questions/50627748

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install docker-spark

            You can download it from GitHub.

            Support

For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.
            CLONE
          • HTTPS

            https://github.com/sequenceiq/docker-spark.git

          • CLI

            gh repo clone sequenceiq/docker-spark

          • sshUrl

            git@github.com:sequenceiq/docker-spark.git
