docker-hadoop | Docker containers for Hadoop/HDFS

by actionml · Shell · Version: Current · License: No License

kandi X-RAY | docker-hadoop Summary

docker-hadoop is a Shell library typically used in Big Data, Docker, Kafka, Spark, Amazon S3, and Hadoop applications. docker-hadoop has no reported bugs or vulnerabilities, and it has low support. You can download it from GitHub.

These containers currently provide the ability to deploy a standalone Hadoop cluster consisting of two types of nodes: namenodes and datanodes. The Hadoop architecture presumes that the namenode (as well as the secondary namenode) runs on a separate host from the datanodes. The namenode holds the entire HDFS cluster metadata, which requires a large amount of RAM, while datanodes are usually placed on commodity hardware and primarily provide storage.
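As a rough sketch of that topology, a compose file might declare one namenode and several datanodes; the image name and command values below are assumptions for illustration, not taken from this repository (check its compose files for the real ones):

```yaml
# Hypothetical topology: one RAM-heavy namenode for metadata,
# two datanodes on commodity hardware for storage.
# Image name and commands are assumptions.
services:
  namenode:
    image: actionml/hadoop
    command: namenode
  datanode-1:
    image: actionml/hadoop
    command: datanode
  datanode-2:
    image: actionml/hadoop
    command: datanode
```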
Support · Quality · Security · License · Reuse

            kandi-support Support

              docker-hadoop has a low active ecosystem.
              It has 8 star(s) with 6 fork(s). There are 6 watchers for this library.
              It had no major release in the last 6 months.
              docker-hadoop has no issues reported. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of docker-hadoop is current.

            kandi-Quality Quality

              docker-hadoop has no bugs reported.

            kandi-Security Security

              docker-hadoop has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            kandi-License License

              docker-hadoop does not have a standard license declared.
              Check the repository for any license declaration and review the terms closely.
              Without a license, all rights are reserved, and you cannot use the library in your applications.

            kandi-Reuse Reuse

              docker-hadoop releases are not available. You will need to build from source code and install.
              Installation instructions are not available. Examples and code snippets are available.


            docker-hadoop Key Features

            No Key Features are available at this moment for docker-hadoop.

            docker-hadoop Examples and Code Snippets

            No Code Snippets are available at this moment for docker-hadoop.

            Community Discussions

            QUESTION

Why does every processing slot in Flink 1.4 use a separate core?
            Asked 2020-Sep-07 at 08:31

I have a docker-compose setup with big-data-europe Hadoop and Flink 1.10 and 1.4, which I try to start in separate containers. I use the YARN Setup reference, which contains this example:

            Example: Issue the following command to allocate 10 Task Managers, with 8 GB of memory and 32 processing slots each:

            ...

            ANSWER

            Answered 2020-Sep-07 at 08:31

            As stated in the documentation for configuration parameters in yarn deployment mode, yarn.containers.vcores specifies the number of virtual cores (vcores) per YARN container. By default, the number of vcores is set to the number of slots per TaskManager, if set, or to 1, otherwise. In order for this parameter to be used your cluster must have CPU scheduling enabled.

In your case, you specify the -s 32 taskmanager.numberOfTaskSlots parameter without overriding the yarn.containers.vcores setting, so the application acquires containers with 32 vcores. To run with 32 slots per TaskManager and only 8 cores, set yarn.containers.vcores to 8 in flink/conf/flink-conf.yaml.

Regarding the resources: yes, every TaskManager corresponds to one acquired YARN container, but a container has a number of vcores, specified by yarn.containers.vcores (or defaulting to the number of slots per container). Regarding the slot: it is more like a resource group, and each slot can have multiple tasks, each running in a separate thread. So a slot itself is not limited to a single thread. You can find more on the Task Slots and Resources documentation page.
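The fix described above can be sketched as a flink-conf.yaml fragment (the keys follow the answer; treat this as an illustration, not verified configuration):

```yaml
# flink/conf/flink-conf.yaml (fragment)
# Keep 32 slots per TaskManager, but request only 8 vcores per YARN container.
# Requires CPU scheduling to be enabled in YARN for vcores to be enforced.
taskmanager.numberOfTaskSlots: 32
yarn.containers.vcores: 8
```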

            Source https://stackoverflow.com/questions/63772535

            QUESTION

            Docker - where these will be logged
            Asked 2020-Sep-03 at 17:52

I see a lot of echo statements in one of the entrypoint.sh files.

1. Where will these logs be stored?

2. I believe these will be logged automatically. Are they useful for debugging, e.g. to see which environment variables were ingested?

            A Sample entrypoint.sh file https://github.com/big-data-europe/docker-hadoop/blob/master/base/entrypoint.sh

            ...

            ANSWER

            Answered 2020-Sep-03 at 17:52
1. If entrypoint.sh is the image's entrypoint, its output will appear in the docker logs output and in the container's log files (usually at /var/lib/docker/containers/<container-id>/<container-id>-json.log).

2. That's usually done to expose the configuration the container is running with. In this case the container is only reporting what it's doing, as half of the echo lines are just setting up the Hadoop configuration files.
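As a minimal sketch of the mechanism (simulated locally, no Docker required): whatever the entrypoint writes to stdout or stderr is exactly what `docker logs <container>` later shows.

```shell
# Simulate an entrypoint's config reporting. In a real container, this
# output would be captured by Docker and shown by `docker logs`.
log() { echo "[entrypoint] $*"; }
captured=$(log "Configuring core-site.xml")
echo "$captured"
```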

            Source https://stackoverflow.com/questions/63729012

            QUESTION

            Link Kafka and HDFS with docker containers
            Asked 2020-Jan-21 at 19:26

I'm trying to connect Kafka and HDFS with Kafka Connect, but I keep facing an issue that I can't get rid of.

            I'm using this example: https://clubhouse.io/developer-how-to/how-to-set-up-a-hadoop-cluster-in-docker/

            I start the HDFS first with: docker-compose up -d

Then I launch ZooKeeper, Kafka, and MySQL with images from the Debezium website: https://debezium.io/documentation/reference/1.0/tutorial.html

            docker run -it --rm --name zookeeper --network docker-hadoop-master_default -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper:1.0

            docker run -it --rm --name kafka --network docker-hadoop-master_default -e ZOOKEEPER_CONNECT=zookeeper -p 9092:9092 --link zookeeper:zookeeper debezium/kafka:1.0

docker run -it --rm --name mysql --network docker-hadoop-master_default -p 3306:3306 -e MYSQL_ROOT_PASSWORD=debezium -e MYSQL_USER=mysqluser -e MYSQL_PASSWORD=mysqlpw debezium/example-mysql:1.0

I use this network on these runs because when I tried to change the network of the HDFS services in docker-compose.yml, the resource manager shut down and I could not find a way to bring it back up and keep it stable. So I attached the zookeeper, kafka, and mysql containers to the HDFS network directly.

Then comes the trickiest part, Kafka Connect; I used the same network in this case as well, which makes sense.

            docker run -it --rm --name connect --network docker-hadoop-master_default -p 8083:8083 -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=my_connect_configs -e OFFSET_STORAGE_TOPIC=my_connect_offsets -e STATUS_STORAGE_TOPIC=my_connect_statuses -e BOOTSTRAP_SERVERS="172.18.0.10:9092" -e CORE_CONF_fs_defaultFS=hdfs://172.18.0.2:9000 --link namenode:namenode --link zookeeper:zookeeper --link mysql:mysql debezium/connect:1.0

To link the source (MySQL) to Kafka, I use the connector from the Debezium tutorial, shown below.

            curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{ "name": "inventory-connector", "config": { "connector.class": "io.debezium.connector.mysql.MySqlConnector", "tasks.max": "1", "database.hostname": "mysql", "database.port": "3306", "database.user": "debezium", "database.password": "dbz", "database.server.id": "184054", "database.server.name": "dbserver1", "database.whitelist": "inventory", "database.history.kafka.bootstrap.servers": "kafka:9092", "database.history.kafka.topic": "dbhistory.inventory" } }'

I tested whether Kafka receives events from the source, and it works fine.

After setting this up, I moved on to installing the plugin, which I downloaded from the Confluent website and pasted onto my local Linux machine; then I installed the Confluent Hub client, and after that the plugin on my local machine. Then I created the user kafka and changed the ownership of all the plugin directory contents to kafka:kafka.

After all this, I used docker cp :/kafka/connect to copy it into Kafka Connect.

Then I checked that it was there and restarted Kafka Connect to install it.

We can use this to check whether it is installed: curl -i -X GET -H "Accept:application/json" localhost:8083/connector-plugins

You should see something like this: [{"class":"io.confluent.connect.hdfs.HdfsSinkConnector","type":"sink","version":"5.4.0"},…

            After this step I believe is where my problem resides: curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{"name":"hdfs-sink","config":{"connector.class":"io.confluent.connect.hdfs.HdfsSinkConnector","tasks.max":1,"topics":"dbserver1,dbserver1.inventory.products,dbserver1.inventory.products_on_hand,dbserver1.inventory.customers,dbserver1.inventory.orders, dbserver1.inventory.geom,dbserver1.inventory.addresses","hdfs.url":"hdfs://172.18.0.2:9000","flush.size":3,"logs.dir":"logs","topics.dir":"kafka","format.class":"io.confluent.connect.hdfs.parquet.ParquetFormat","partitioner.class":"io.confluent.connect.hdfs.partitioner.DefaultPartitioner","partition.field.name":"day"}}'

I have no idea how to convince Kafka Connect that I want a specific IP address for the namenode; it just keeps throwing messages that it found a different IP when it expected hdfs://namenode:9000.

Also, whether I add -e CORE_CONF_fs_defaultFS=hdfs://172.18.0.2:9000 to the docker run or set it inside Kafka Connect, when I POST the hdfs-sink curl request it throws the message below.

            Log from Kafka Connect:

            ...

            ANSWER

            Answered 2020-Jan-21 at 19:26

By default, Docker Compose prefixes names with the directory where you ran the command plus an underscore, and an underscore is not allowed in a hostname. Hadoop prefers hostnames by default in the hdfs-site.xml config file.

I have no idea how to convince Kafka Connect that I want a specific IP address for the namenode; it just keeps throwing messages that it found a different IP when it expected hdfs://namenode:9000.

            Ideally, you wouldn't use an IP within Docker anyway, you would use the service name and exposed port.

            For the HDFS Connector, you also need to define 1) HADOOP_CONF_DIR env-var 2) mount your XML configs as a volume for remote clients such as Connect to interact with the Hadoop cluster and 3) define hadoop.conf.dir in connector property.
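Putting the answer's points together, a hypothetical compose fragment for the Connect service might look like this (the service names, image tag, and mount paths here are assumptions, not taken from the question):

```yaml
# Hypothetical sketch: address services by name, not IP, and hand the
# Hadoop client configs to Connect so the HDFS connector can reach the cluster.
connect:
  image: debezium/connect:1.0
  networks:
    - docker-hadoop-master_default
  environment:
    BOOTSTRAP_SERVERS: kafka:9092
    CORE_CONF_fs_defaultFS: hdfs://namenode:9000   # service name, not 172.18.0.2
    HADOOP_CONF_DIR: /etc/hadoop/conf
  volumes:
    - ./hadoop-conf:/etc/hadoop/conf:ro            # core-site.xml, hdfs-site.xml
```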

            Source https://stackoverflow.com/questions/59845575

            QUESTION

            Access the IP from a docker container on a bridge network on mac osx with docker-compose
            Asked 2019-Nov-30 at 16:04

            I am trying to access the webUIs from the containers on the docker-compose from a hadoop-cluster. Link: https://github.com/big-data-europe/docker-hadoop

            The Docker-Compose File had the following content:

            ...

            ANSWER

            Answered 2019-Nov-30 at 16:04

To access the Namenode UI, you'd use localhost:9870, since that's the port you've forwarded. You shouldn't need to access the UIs "from other containers".

And that's the only container you've opened a port forward for.
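A hypothetical docker-compose.yml fragment forwarding the UI ports to the host (port numbers assume Hadoop 3 defaults; check your image's documentation):

```yaml
# Forwarded ports are reachable from the host as localhost:<host-port>.
namenode:
  ports:
    - "9870:9870"   # Namenode web UI -> http://localhost:9870
datanode:
  ports:
    - "9864:9864"   # Datanode web UI (only if you also need it)
```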

            Source https://stackoverflow.com/questions/59111876

            QUESTION

            network `hbase` is declared as external, but could not be found. You need to create a swarm-scoped network before the stack is deployed
            Asked 2019-Sep-30 at 05:59

I have the Docker swarm cluster below.

            ...

            ANSWER

            Answered 2019-Sep-30 at 05:59

You need to create your network first:
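As the error message says, the network must exist and be swarm-scoped before the stack is deployed. As a hypothetical sketch: first run `docker network create -d overlay --attachable hbase` on a manager node, then reference it in the stack file:

```yaml
# Stack file fragment (hypothetical): the pre-created overlay network
# is declared external so `docker stack deploy` does not try to create it.
networks:
  hbase:
    external: true
```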

            Source https://stackoverflow.com/questions/58161371

            QUESTION

            Hue access to HDFS: bypass default hue.ini?
            Asked 2019-Jul-20 at 11:06
            The set up

I am trying to compose a lightweight, minimal Hadoop stack with the images provided by bde2020 (for learning purposes). Right now, the stack includes (among others)

            • a namenode
• a datanode
            • hue

            Basically, I started from Big Data Europe official docker compose, and added a hue image based on their documentation

            The issue

            Hue's file browser can't access HDFS:

            ...

            ANSWER

            Answered 2019-Jul-19 at 21:09

            I was able to get the Filebrowser working with this INI
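The INI itself is not reproduced here; as a hedged sketch, the relevant hue.ini override typically points Hue at the namenode's WebHDFS endpoint (the service name and ports below are assumptions based on the bde2020 compose setup):

```ini
# Hypothetical hue.ini fragment; verify key names against the Hue docs.
[hadoop]
  [[hdfs_clusters]]
    [[[default]]]
      fs_defaultfs=hdfs://namenode:9000
      webhdfs_url=http://namenode:9870/webhdfs/v1
```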

            Source https://stackoverflow.com/questions/57116402

            QUESTION

            Hadoop Spark docker swarm where pyspark gives BlockMissingException but file is fine
            Asked 2019-Apr-01 at 01:33

            Based on https://github.com/gotthardsen/docker-hadoop-spark-workbench/tree/master/swarm I have a docker swarm setup with hadoop, spark, hue and a jupyter notebook setup.

Using Hue I uploaded a file to HDFS, and I have no problem downloading or viewing the file from Hue or in HDFS on the namenode. There are no missing blocks, and a file check says everything is fine.

            But when I try to access it using pyspark in jupyter I get a:

            org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-296583215-10.0.0.6-1542473394293:blk_1073741833_1009 file=/20170930.csv

I know this is not about a missing block but more likely something else, but I cannot figure out what. The Python code from the workbook, using the python2 kernel, is:

            ...

            ANSWER

            Answered 2018-Nov-18 at 10:28

            Since Docker containers are ephemeral, it's possible the datanode container died, and therefore the data within, but the namenode still knows that the file used to exist.

            I don't know about node-affinity rules in Swarm, but you should try to add volume mounts to the namenode and datanode containers, plus make sure they can only be scheduled on single machines (assuming you have more than one, since you are using Swarm rather than just Compose)

Probably the same, but I have made my own Docker Compose with Hue, Jupyter, NameNode, and Datanode, and I did test it with PySpark.
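A hypothetical swarm stack fragment implementing both suggestions (the image tag, volume path, and hostname below are assumptions for illustration):

```yaml
# Persist the datanode's blocks in a named volume and pin the service
# to a single node so the data and the container stay together.
datanode:
  image: bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8   # assumed tag
  volumes:
    - datanode-data:/hadoop/dfs/data
  deploy:
    placement:
      constraints:
        - node.hostname == worker-1   # assumed hostname
volumes:
  datanode-data:
```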

            Source https://stackoverflow.com/questions/53359692

            QUESTION

            spark-submit to a docker container
            Asked 2018-Feb-20 at 14:56

I created a Spark cluster using this repository and its README.md file.

Now I'm trying to submit a job through spark-submit to the Docker container of the Spark master, so the command I use is something like this:

            ...

            ANSWER

            Answered 2018-Feb-20 at 14:56

The problem was solved by using the hostname: spark://spark-master:7077

            So inside the Spark Master is something like this:
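A hedged sketch of the shape of the command (the class and jar names are placeholders; only the master URL matters here, and the actual spark-submit invocation is shown commented out):

```shell
# The master is addressed by its compose service name, not by IP.
MASTER_URL="spark://spark-master:7077"
# spark-submit --master "$MASTER_URL" --class com.example.App /path/to/app.jar
echo "$MASTER_URL"
```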

            Source https://stackoverflow.com/questions/47579962

            QUESTION

            Accessing hdfs from docker-hadoop-spark--workbench via zeppelin
            Asked 2017-Dec-26 at 05:14

            I have installed https://github.com/big-data-europe/docker-hadoop-spark-workbench

Then I started it up with docker-compose up. I navigated to the various URLs mentioned in the Git README, and everything appears to be up.

            I then started a local apache zeppelin with:

            ...

            ANSWER

            Answered 2017-Dec-26 at 03:46

The reason for the exception is that the sparkSession object is null in Zeppelin for some reason.

            Reference: https://github.com/apache/zeppelin/blob/master/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java

            Source https://stackoverflow.com/questions/47836204

            QUESTION

            How to use hdfs shell commands with apache zeppelin?
            Asked 2017-Dec-16 at 13:15

            I have installed apache zeppelin by downloading and extracting the binary with all interpreters

            I then started it up with:

            ...

            ANSWER

            Answered 2017-Dec-16 at 13:15

It does, but you are using the shell interpreter.

Make sure that the file interpreter is installed:

            Source https://stackoverflow.com/questions/47843565

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install docker-hadoop

            You can download it from GitHub.

            Support

For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask on the community page at Stack Overflow.
            Find more information at:

            CLONE
          • HTTPS

            https://github.com/actionml/docker-hadoop.git

          • CLI

            gh repo clone actionml/docker-hadoop

          • sshUrl

            git@github.com:actionml/docker-hadoop.git
