docker-hadoop | Docker container with a full Hadoop cluster setup
kandi X-RAY | docker-hadoop Summary
This Docker container contains a full Hadoop distribution with the following components:
Community Discussions
QUESTION
I have a docker-compose setup with the big-data-europe Hadoop images and Flink 1.10 and 1.4, which I try to start in a separate container. I use this reference, YARN Setup, in which there is an example.
Example: Issue the following command to allocate 10 Task Managers, with 8 GB of memory and 32 processing slots each:
...ANSWER
Answered 2020-Sep-07 at 08:31
As stated in the documentation for configuration parameters in YARN deployment mode, yarn.containers.vcores specifies the number of virtual cores (vcores) per YARN container. By default, the number of vcores is set to the number of slots per TaskManager, if set, or to 1 otherwise. For this parameter to be used, your cluster must have CPU scheduling enabled.
In your case, you specify -s 32 (taskmanager.numberOfTaskSlots) without overriding the yarn.containers.vcores setting, thus the app acquires a container with 32 vcores. In order to be able to run with 32 slots per TM and only 8 cores, please set yarn.containers.vcores to 8 in flink/conf/flink-conf.yaml.
Regarding the resources: yes, every task manager corresponds to one acquired YARN container, but a container has a number of vcores specified by yarn.containers.vcores (or the number of slots per container). Regarding the slot: it is more like a resource group, and each slot can have multiple tasks, each running in a separate thread. So a slot itself is not limited to only one thread. Please find more at the Task Slots and Resources docs page.
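The fix described above can be sketched as a flink-conf.yaml fragment, using the 8 GB / 32-slot numbers from the question (a sketch of the relevant keys only, not a complete config):

```yaml
# flink/conf/flink-conf.yaml (fragment)
# With 32 slots per TaskManager but only 8 physical cores available,
# decouple the YARN vcore request from the slot count:
taskmanager.numberOfTaskSlots: 32
yarn.containers.vcores: 8
```

Without the second line, yarn.containers.vcores silently defaults to the slot count (32), which is why the container request exceeds the available cores.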
QUESTION
I see a lot of echo statements in one of the entrypoint.sh files.
Where will these logs be stored?
I believe these will be automatically logged. Useful in debugging, to see which environment variables were ingested, etc.?
A sample entrypoint.sh file: https://github.com/big-data-europe/docker-hadoop/blob/master/base/entrypoint.sh
...ANSWER
Answered 2020-Sep-03 at 17:52
If entrypoint.sh is the image's entrypoint, it'll be logged in the docker logs output and in the container's log files (usually at /var/lib/docker/containers//-json.log). That's usually done for exposing the configuration the container is running with. In this case the container is only reporting what it's doing, as half of the echo lines are just setting up the Hadoop configuration files.
QUESTION
Hello guys, I'm trying to connect Kafka and HDFS with Kafka Connect, but I still face an issue that I can't get rid of.
I'm using this example: https://clubhouse.io/developer-how-to/how-to-set-up-a-hadoop-cluster-in-docker/
I start the HDFS first with: docker-compose up -d
Then I launch the zookeeper kafka and mysql with images from debezium website. https://debezium.io/documentation/reference/1.0/tutorial.html
docker run -it --rm --name zookeeper --network docker-hadoop-master_default -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper:1.0
docker run -it --rm --name kafka --network docker-hadoop-master_default -e ZOOKEEPER_CONNECT=zookeeper -p 9092:9092 --link zookeeper:zookeeper debezium/kafka:1.0
docker run -it --rm --name mysql --network docker-hadoop-master_default -p 3306:3306 -e MYSQL_ROOT_PASSWORD=debezium -e MYSQL_USER=mysqluser -e MYSQL_PASSWORD=mysqlpw debezium/example-mysql:1.0
I use the network on these runs because when I tried to change the network of HDFS in docker-compose.yml, the resource manager shut down and I couldn't find out how to bring it up again and make it stable. So I added the network directly on these containers: zookeeper, kafka and mysql.
Then the trickiest part, Kafka Connect: I used the same network in this case too, which makes sense.
docker run -it --rm --name connect --network docker-hadoop-master_default -p 8083:8083 -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=my_connect_configs -e OFFSET_STORAGE_TOPIC=my_connect_offsets -e STATUS_STORAGE_TOPIC=my_connect_statuses -e BOOTSTRAP_SERVERS="172.18.0.10:9092" -e CORE_CONF_fs_defaultFS=hdfs://172.18.0.2:9000 --link namenode:namenode --link zookeeper:zookeeper --link mysql:mysql debezium/connect:1.0
To link the source (MySQL) to Kafka I used the connector from the Debezium tutorial, the one below.
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{ "name": "inventory-connector", "config": { "connector.class": "io.debezium.connector.mysql.MySqlConnector", "tasks.max": "1", "database.hostname": "mysql", "database.port": "3306", "database.user": "debezium", "database.password": "dbz", "database.server.id": "184054", "database.server.name": "dbserver1", "database.whitelist": "inventory", "database.history.kafka.bootstrap.servers": "kafka:9092", "database.history.kafka.topic": "dbhistory.inventory" } }'
I tested if Kafka receives any event from the source and works fine.
After setting this up, I moved on to installing the plugin, which I downloaded from the Confluent web site and pasted onto my local Linux machine; then I installed the Confluent Hub client, and after that the plugin itself. Then I created the user kafka and changed the ownership of everything in the plugin directory to kafka:kafka.
After all this I used docker cp :/kafka/connect to copy it into Kafka Connect.
Then I checked that it was there and restarted Kafka Connect to install it.
We can use this to check if it is installed: curl -i -X GET -H "Accept:application/json" localhost:8083/connector-plugins
You should see something like this in the output: [{"class":"io.confluent.connect.hdfs.HdfsSinkConnector","type":"sink","version":"5.4.0"},…
After this step I believe is where my problem resides: curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{"name":"hdfs-sink","config":{"connector.class":"io.confluent.connect.hdfs.HdfsSinkConnector","tasks.max":1,"topics":"dbserver1,dbserver1.inventory.products,dbserver1.inventory.products_on_hand,dbserver1.inventory.customers,dbserver1.inventory.orders, dbserver1.inventory.geom,dbserver1.inventory.addresses","hdfs.url":"hdfs://172.18.0.2:9000","flush.size":3,"logs.dir":"logs","topics.dir":"kafka","format.class":"io.confluent.connect.hdfs.parquet.ParquetFormat","partitioner.class":"io.confluent.connect.hdfs.partitioner.DefaultPartitioner","partition.field.name":"day"}}'
I have no idea how to convince Kafka Connect that I want a specific IP address for the namenode; it just keeps throwing messages that it found a different IP when the expected one is hdfs://namenode:9000
Also, whether I add -e CORE_CONF_fs_defaultFS=hdfs://172.18.0.2:9000 to the docker run or set it inside Kafka Connect, when I POST the curl for hdfs-sink it throws me the message below.
Log from Kafka Connect:
...ANSWER
Answered 2020-Jan-21 at 19:26
By default, Docker Compose prefixes names with the directory where you ran the command plus an underscore, and an underscore is not allowed in a hostname. Hadoop prefers hostnames by default in the hdfs-site.xml config file.
I have no idea how to convince Kafka Connect that I want a specific IP address for the namenode; it just keeps throwing messages that it found a different IP when the expected one is hdfs://namenode:9000
Ideally, you wouldn't use an IP within Docker anyway; you would use the service name and exposed port.
For the HDFS Connector, you also need to 1) define the HADOOP_CONF_DIR env-var, 2) mount your XML configs as a volume so that remote clients such as Connect can interact with the Hadoop cluster, and 3) define hadoop.conf.dir in the connector properties.
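A hedged sketch of points 1) and 3) as they might look in practice; the mount path /etc/hadoop/conf and the topic list here are assumptions, not values from the original question:

```json
{
  "name": "hdfs-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": 1,
    "topics": "dbserver1.inventory.customers",
    "hdfs.url": "hdfs://namenode:9000",
    "hadoop.conf.dir": "/etc/hadoop/conf",
    "flush.size": 3
  }
}
```

Point 2) would then correspond to adding something like -e HADOOP_CONF_DIR=/etc/hadoop/conf and -v ./hadoop-conf:/etc/hadoop/conf to the docker run command for the Connect container, so the client-side XML configs resolve the namenode by service name.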
QUESTION
I am trying to access the web UIs of the containers from the docker-compose setup of a Hadoop cluster. Link: https://github.com/big-data-europe/docker-hadoop
The Docker Compose file has the following content:
...ANSWER
Answered 2019-Nov-30 at 16:04
To access the Namenode UI, you'd use localhost:9870, since that's the port you've forwarded. You shouldn't need to access the UIs "from other containers", and that's the only container you've opened a port forward for.
QUESTION
I have the Docker swarm cluster below.
...ANSWER
Answered 2019-Sep-30 at 05:59
You need to create your network first:
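For example (a sketch; the network name hadoop-net is an assumption), the overlay network is created once on the swarm manager and then referenced as external in the stack file so services attach to it instead of an auto-generated default:

```yaml
# Created beforehand on a manager node with:
#   docker network create --driver overlay hadoop-net
networks:
  hadoop-net:
    external: true
```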
QUESTION
I am trying to compose a lightweight minimal hadoop stack with the images provided by bde2020 (learning purpose). Right now, the stack includes (among others)
- a namenode
- a datanode
- hue
Basically, I started from Big Data Europe official docker compose, and added a hue image based on their documentation
The issue: Hue's file browser can't access HDFS:
...ANSWER
Answered 2019-Jul-19 at 21:09I was able to get the Filebrowser working with this INI
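The INI itself is not reproduced on this page; a minimal sketch of the relevant hue.ini section, assuming the namenode service is called namenode and WebHDFS is exposed on Hadoop 3's default port 9870:

```ini
[hadoop]
  [[hdfs_clusters]]
    [[[default]]]
      fs_defaultfs=hdfs://namenode:9000
      webhdfs_url=http://namenode:9870/webhdfs/v1
```

The key point is that Hue's File Browser talks to WebHDFS over HTTP, so webhdfs_url must point at a hostname the Hue container can resolve, not localhost.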
QUESTION
Based on https://github.com/gotthardsen/docker-hadoop-spark-workbench/tree/master/swarm I have a docker swarm setup with hadoop, spark, hue and a jupyter notebook setup.
Using Hue I uploaded a file to HDFS, and I have no problem downloading or viewing the file from Hue or in HDFS on the namenode. There are no missing blocks, and a file check says everything is fine.
But when I try to access it using pyspark in jupyter I get a:
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-296583215-10.0.0.6-1542473394293:blk_1073741833_1009 file=/20170930.csv
I know this is not about a missing block but more likely something else, but I cannot figure out why. The Python code from the workbook, using the python2 kernel, is:
...ANSWER
Answered 2018-Nov-18 at 10:28Since Docker containers are ephemeral, it's possible the datanode container died, and therefore the data within, but the namenode still knows that the file used to exist.
I don't know about node-affinity rules in Swarm, but you should try to add volume mounts to the namenode and datanode containers, plus make sure they can only be scheduled on single machines (assuming you have more than one, since you are using Swarm rather than just Compose)
Probably the same, but I have made my own Docker Compose with Hue, Jupyter, NameNode, and Datanode, and I did test it with PySpark.
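The volume-mount suggestion above can be sketched in Compose terms like this (service names, image names, and mount paths are assumptions based on the bde2020 images):

```yaml
services:
  namenode:
    image: bde2020/hadoop-namenode
    volumes:
      - namenode-data:/hadoop/dfs/name   # persists HDFS metadata across container restarts
  datanode:
    image: bde2020/hadoop-datanode
    volumes:
      - datanode-data:/hadoop/dfs/data   # persists the actual blocks

volumes:
  namenode-data:
  datanode-data:
```

With named volumes like these, a recreated datanode container comes back with its blocks intact, so the namenode's metadata no longer points at data that vanished with the old container.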
QUESTION
I created a Spark cluster using this repository and the relative README.md file.
Now I'm trying to submit a job through spark-submit to the Docker container of the Spark Master, so the command that I use is something similar to:
ANSWER
Answered 2018-Feb-20 at 14:56
The problem was solved by using the hostname in spark://spark-master:7077
So inside the Spark Master is something like this:
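The original snippet is not reproduced on this page. Purely as an illustration (not the answer's actual code), the idea is to address the master by its Compose service name rather than a container IP, which can change between restarts:

```python
# Hypothetical sketch: build the master URL from the Compose service name
# ("spark-master" is an assumption) instead of a hard-coded container IP.
host, port = "spark-master", 7077
master_url = f"spark://{host}:{port}"
print(master_url)  # spark://spark-master:7077
# Used as, e.g.:  spark-submit --master spark://spark-master:7077 app.py
```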
QUESTION
I have installed https://github.com/big-data-europe/docker-hadoop-spark-workbench
Then started it up with docker-compose up
. I navigated to the various urls mentioned in the git readme and all appears to be up.
I then started a local apache zeppelin with:
...ANSWER
Answered 2017-Dec-26 at 03:46
The reason for the exception is that the sparkSession object is null for some reason in Zeppelin.
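A defensive check along these lines (a hypothetical helper; Zeppelin normally injects the session into the interpreter namespace itself) makes the failure mode explicit instead of surfacing later as a NullPointerException:

```python
def require_spark(ns):
    """Return the injected SparkSession from namespace ns, or fail with a clear message."""
    spark = ns.get("spark")  # Zeppelin's Spark interpreter injects "spark" when healthy
    if spark is None:
        raise RuntimeError(
            "sparkSession is not initialized; restart the Spark interpreter "
            "and check its configuration before running Spark cells"
        )
    return spark
```

Calling require_spark(globals()) at the top of a notebook cell turns the silent null into an actionable error.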
QUESTION
I have installed apache zeppelin by downloading and extracting the binary with all interpreters
I then started it up with:
...ANSWER
Answered 2017-Dec-16 at 13:15
It does, but you are using the shell interpreter. Make sure that the file interpreter is installed:
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Install docker-hadoop
Build the current image locally: ./build-docker-image.sh
Pull from DockerHub: docker pull segence/hadoop:latest
This will set up a local Hadoop cluster using bridged networking with one namenode and one datanode. You can log into the namenode (master) by issuing docker exec -it hadoop-namenode bash and to the datanode (slave) by docker exec -it hadoop-datanode1 bash. By default, the HDFS replication factor is set to 1, because it is assumed that a local Docker cluster will be started with a single datanode. To override the replication setting simply change the HDFS_REPLICATION_FACTOR environment variable in the docker-compose.yml file (and also add more datanodes). Adding more data nodes adds the complexity of exposing all datanode UI ports to localhost. In this scenario, no UI ports should be exposed to avoid the conflict.
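The override described above might look like this in docker-compose.yml (a sketch showing only the relevant keys; the exact service names are assumptions based on the container names mentioned above):

```yaml
services:
  hadoop-namenode:
    environment:
      - HDFS_REPLICATION_FACTOR=2   # raised from the single-datanode default of 1
  hadoop-datanode1:
    environment:
      - HDFS_REPLICATION_FACTOR=2
  hadoop-datanode2:                 # a second datanode added to match the new factor
    environment:
      - HDFS_REPLICATION_FACTOR=2
```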
Go into the cluster-setup/local-cluster directory: cd cluster-setup/local-cluster
Edit the slaves-config/slaves file if you want to add more slaves (datanodes) other than the default one slave node. If you add more slaves then also edit the docker-compose.yml file by adding more slave node configurations.
Launch the new cluster: docker-compose up -d
Create the hadoop user on the host system, e.g. useradd hadoop. If you encounter the following error message when running a Docker container: WARNING: IPv4 forwarding is disabled. Networking will not work. then turn on packet forwarding (RHEL 7): /sbin/sysctl -w net.ipv4.ip_forward=1. You can use the script cluster-setup/standalone-cluster/setup-rhel.sh to achieve the above, as well as to create the required directories and change their ownership (as in points 1 and 2 below, so you can skip them if you used the RHEL setup script). The cluster setup runs with host networking, so the Hadoop nodes will get their hostname and DNS settings directly from the host machine. Make sure IP addresses, DNS names, and DNS resolution are correctly set up on the host machines.
Create the following directories on the host: Directory for the HDFS data: /hadoop/data Directory for MapReduce/Spark deployments: /hadoop/deployments
Make the hadoop user own the directories: chown -R hadoop:hadoop /hadoop
Create the file /hadoop/slaves-config/slaves listing all slave node host names, each on a separate line
Copy the start-namenode.sh file onto the system (e.g. into /hadoop/start-namenode.sh)
Launch the new namenode: /hadoop/start-namenode.sh [HDFS REPLICATION FACTOR], where HDFS REPLICATION FACTOR is the replication factor for HDFS blocks (defaults to 2).
Create the directory for the HDFS data: /hadoop/data
Make the hadoop user own the directories: chown -R hadoop:hadoop /hadoop
Create the file /hadoop/slaves-config/slaves listing all slave node host names, each on a separate line
Copy the start-datanode.sh file onto the system (e.g. into /hadoop/start-datanode.sh)
Launch the new datanode with its ID: /hadoop/start-datanode.sh <NAMENODE HOST NAME> [HDFS REPLICATION FACTOR], where NAMENODE HOST NAME is the host name of the namenode and HDFS REPLICATION FACTOR is the replication factor for HDFS blocks (defaults to 2, and has to be consistent throughout all cluster nodes).
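The optional [HDFS REPLICATION FACTOR] argument with its default of 2 can be handled with ordinary POSIX parameter expansion. A sketch of how start-datanode.sh presumably reads its arguments (the variable names, and falling back to a default host instead of failing, are assumptions):

```shell
# $1 = NAMENODE HOST NAME (required), $2 = replication factor (optional)
NAMENODE_HOST="${1:-namenode}"       # hedged: the real script may hard-fail when unset
HDFS_REPLICATION_FACTOR="${2:-2}"    # defaults to 2, as the README states
echo "namenode=${NAMENODE_HOST} replication=${HDFS_REPLICATION_FACTOR}"
```

Because the replication factor has to be consistent throughout the cluster, the same default must be used by start-namenode.sh as well.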