hadoop | Public hadoop release repository

by naver Java Version: Current License: Apache-2.0

X-Ray Key Features Code Snippets Community Discussions(10)Vulnerabilities Install Support

kandi X-RAY | hadoop Summary

hadoop is a Java library typically used in Big Data, Docker, Spark, Hadoop applications. hadoop has no bugs, it has build file available, it has a Permissive License and it has high support. However hadoop has 8 vulnerabilities. You can download it from GitHub.

This repository is based on Apache Hadoop 2.7.1 source code. It is used to make Naver's large scale multi-tenant hadoop cluster, which is called C3. The C3 users can execute several data processing jobs with MapReduce, Spark and Hive on CPU, and execute Deep Learning algorithms on GPU. And also they can run long-live applications on docker container. Recently hadoop's new features is adding to Hadoop 2.8 or Hadoop 3.0. However if you are using hadoop cluster for years in production, your hadoop version maybe is not hadoop 2.8 or 3.0, because these versions is not recommended for production cluster yet. Thus you can't use very useful new features: GPU scheduler, docker container, several resource isolations(e.g. network outbound, disk). We're applying and developing new features to this repository. You can see histories in commit logs.

Support

Quality

Security

License

Reuse

Support

hadoop has a highly active ecosystem.

It has 27 star(s) with 11 fork(s). There are 10 watchers for this library.

It had no major release in the last 6 months.

There are 1 open issues and 0 have been closed. There are 1 open pull requests and 0 closed requests.

It has a negative sentiment in the developer community.

The latest version of hadoop is current.

Quality

hadoop has no bugs reported.

Security

hadoop has 8 vulnerability issues reported (1 critical, 7 high, 0 medium, 0 low).

License

hadoop is licensed under the Apache-2.0 License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

hadoop releases are not available. You will need to build from source code and install.

Build file is available. You can build the component from source.

Installation instructions, examples and code snippets are available.

Top functions reviewed by kandi - BETA

kandi has reviewed hadoop and discovered the below as its top functions. This is intended to give you an instant insight into hadoop implemented functionality, and help decide if they suit your requirements.

Process a timeline event
Receives a packet from the stream .
Generate the startup shutdown message .
Creates an application submission context .
Generate splits for the given node .
Process a line line .
Reload the allocation configuration file .
Dumps properties of the TFile
Converts a path to a byte array .
Compute replication work for blocks .

Get all kandi verified functions for this library.

hadoop Key Features

No Key Features are available at this moment for hadoop.

hadoop Examples and Code Snippets

No Code Snippets are available at this moment for hadoop.

Community Discussions

Trending Discussions on hadoop

I can't pass parameters to foreach loop while implementing Structured Streaming + Kafka in Spark SQL

Getting java.lang.ClassNotFoundException when I try to do spark-submit, referred other similar queries online but couldnt get it to work

Indexing of Spark 3 Dataframe into Apache Solr 8

Update to mapred-default.xml not visible in web UI configuration

Import org.apache statement cannot be resolved in GCP Shell

Hadoop NameNode Web Interface

RDD in Spark: where and how are they stored?

Hive: Query executing from hours

Cannot Allocate Memory in Delta Lake

Webapp fails with "JBAS011232: Only one JAX-RS Application Class allowed" after adding a maven dependency to hadoop-azure

QUESTION

I can't pass parameters to foreach loop while implementing Structured Streaming + Kafka in Spark SQL

Asked 2021-Jun-15 at 04:42

I followed the instructions at Structured Streaming + Kafka and built a program that receives data streams sent from kafka as input, when I receive the data stream I want to pass it to SparkSession variable to do some query work with Spark SQL, so I extend the ForeachWriter class again as follows:

...

ANSWER

Answered 2021-Jun-15 at 04:42

do some query work with Spark SQL

You wouldn't use a ForEachWriter for that

Source https://stackoverflow.com/questions/67972167

QUESTION

Getting java.lang.ClassNotFoundException when I try to do spark-submit, referred other similar queries online but couldnt get it to work

Asked 2021-Jun-14 at 09:36

I am new to Spark and am trying to run on a hadoop cluster a simple spark jar file built through maven in intellij. But I am getting classnotfoundexception in all the ways I tried to submit the application through spark-submit.

My pom.xml:

...

ANSWER

Answered 2021-Jun-14 at 09:36

You need to add scala-compiler configuration to your pom.xml. The problem is without that there is nothing to compile your SparkTrans.scala file into java classes.

Add:

Source https://stackoverflow.com/questions/67934425

QUESTION

Indexing of Spark 3 Dataframe into Apache Solr 8

Asked 2021-Jun-14 at 07:42

I have setup a small size Hadoop Yarn cluster where Apache Spark is running. I have some data (JSON, CSV) that I upload to Spark (data-frame) for some analysis. Later, I have to index all data-frame data into Apache SOlr. I am using Spark 3 and Solr 8.8 version.

In my search, I have found a solution here but it is for different version of Spark. Hence, I have decided to ask someone for this.

Is there any builtin option for this task. I am open to use SolrJ and pySpark (not scal shell).

...

ANSWER

Answered 2021-Jun-14 at 07:42

I found a solution myself. Till now Lucidword spark-solr module does not support these versions of Spark (3.0.2) and Solr (8.8). I have first installed PySolr module and then use following example code to finish my job:

Source https://stackoverflow.com/questions/66311948

QUESTION

Update to mapred-default.xml not visible in web UI configuration

Asked 2021-Jun-12 at 07:08

I have an Apache Kylin container running in docker. I was getting a Java heap space error in map reduce phase so I tried updating some parameters in Hadoop mapred-default.xml file. After making the changes, I restarted the container but, when I go to Yarn ResourceManager Web UI and then to Configuration:

An xml file is opened, looking like this:

However my new values for the properties that I set inside the mapred-default.xml are not here, it is showing the old values for those properties... Does anyone have any idea why that is happening and what I should do to make it register the new values? I tried restarting the container, but it didn't help...

...

ANSWER

Answered 2021-Jun-12 at 07:08

To override a default value for a property, specify the new value within the tags, inside mapred-site.xml not mapred-default.xml, using the following format:

Source https://stackoverflow.com/questions/67935665

QUESTION

Import org.apache statement cannot be resolved in GCP Shell

Asked 2021-Jun-10 at 21:48

I had used the below command in GCP Shell terminal to create a project wordcount

...

ANSWER

Answered 2021-Jun-10 at 21:48

I'd suggest finding an archetype for creating MapReduce applications, otherwise, you need to add hadoop-client as a dependency in your pom.xml

Source https://stackoverflow.com/questions/67916362

QUESTION

Hadoop NameNode Web Interface

Asked 2021-Jun-09 at 14:18

I have 3 remote computers (servers):

computer 1 has internal IP: 10.1.7.245
computer 2 has internal IP: 10.1.7.246
computer 3 has internal IP: 10.1.7.247

(The 3 computers above are in the same network, these 3 computers are all using Ubuntu 18.04.5 LTS Operating System)

(My personal laptop is in another different network, my laptop also uses Ubuntu 18.04.5 LTS Operating System)

I use my personal laptop to connect to the 3 remote computers using SSH protocol and using user root : (Below ABC is a name)

computer 1: ssh root@ABC.University.edu.vn -p 12001
computer 2: ssh root@ABC.University.edu.vn -p 12002
computer 3: ssh root@ABC.University.edu.vn -p 12003

I have successfully set up a Hadoop Cluster which contains 3 above computer:

computer 1 is the Hadoop Master
computer 2 is the Hadoop Slave 1
computer 3 is the Hadoop Slave 2

======================================================

I starts HDFS of the Hadoop Cluster by using the below command on Computer 1: start-dfs.sh

Everything is successful:

computer 1 (the Master) is running the NameNode
computer 2 (the Slave 1) is running the DataNode
computer 3 (the Slave 2) is running the DataNode

I know that the the Web Interface for the NameNode is running on Computer 1, on IP 0.0.0.0 and on port 9870 . Therefore, if I open the web browser on computer 1 (or on computer 2, or on computer 3), I will enter the 10.1.7.245:9870 on the URL bar (address bar) of the web browser to see the Web Interface of the NameNode.

======================================================

Now, I am using the web browser of my personal laptop.

How could I access to the Web Interface of the NameNode ?

...

ANSWER

Answered 2021-Jun-08 at 17:56

Unless you expose port 9870, your personal laptop on another network will not be able to access the web interface.

You can check to see if it is exposed by trying :9870 to see if it is exposed. IP-address here has to be the global IP-address, not the local (10.* ) address.

To get the NameNode's IP address, ssh into the NameNode server, and type ifconfig (sudo apt install ifconfig if not already installed - I'm assuming Ubuntu/Linux here). ifconfig should give you a global IP address (not the 255.* - that is a mask).

Source https://stackoverflow.com/questions/67891388

QUESTION

RDD in Spark: where and how are they stored?

Asked 2021-Jun-09 at 09:45

I've always heard that Spark is 100x faster than classic Map Reduce frameworks like Hadoop. But recently I'm reading that this is only true if RDDs are cached, which I thought was always done but instead requires the explicit cache () method.

I would like to understand how all produced RDDs are stored throughout the work. Suppose we have this workflow:

I read a file -> I get the RDD_ONE
I use the map on the RDD_ONE -> I get the RDD_TWO
I use any other transformation on the RDD_TWO

QUESTIONS:

if I don't use cache () or persist () is every RDD stored in memory, in cache or on disk (local file system or HDFS)?

if RDD_THREE depends on RDD_TWO and this in turn depends on RDD_ONE (lineage) if I didn't use the cache () method on RDD_THREE Spark should recalculate RDD_ONE (reread it from disk) and then RDD_TWO to get RDD_THREE?

Thanks in advance.

...

ANSWER

Answered 2021-Jun-09 at 06:13

In spark there are two types of operations: transformations and actions. A transformation on a dataframe will return another dataframe and an action on a dataframe will return a value.

Transformations are lazy, so when a transformation is performed spark will add it to the DAG and execute it when an action is called.

Suppose, you read a file into a dataframe, then perform a filter, join, aggregate, and then count. The count operation which is an action will actually kick all the previous transformation.

If we call another action(like show) the whole operation is executed again which can be time consuming. So, if we want not to run the whole set of operation again and again we can cache the dataframe.

Few pointers you can consider while caching:

Cache only when the resulting dataframe is generated from significant transformation. If spark can regenerate the cached dataframe in few seconds then caching is not required.
Cache should be performed when the dataframe is used for multiple actions. If there are only 1-2 actions on the dataframe then it is not worth saving that dataframe in memory.

Source https://stackoverflow.com/questions/67894971

QUESTION

Hive: Query executing from hours

Asked 2021-Jun-08 at 23:08

I'm try to execute the below hive query on Azure HDInsight cluster but it's taking unprecedented amount of time to finish. Did implemented hive settings but of no use. Below are the details:

Table

...

ANSWER

Answered 2021-Jun-07 at 03:19

if you don't have index on your fk columns , you should add them for sure , here is my suggestion:

Source https://stackoverflow.com/questions/67864692

QUESTION

Cannot Allocate Memory in Delta Lake

Asked 2021-Jun-08 at 11:11

Problem

The goal is to have a Spark Streaming application that read data from Kafka and use Delta Lake to create store data. The granularity of the delta table is pretty granular, the first partition is the organization_id (there are more than 5000 organizations) and the second partition is the date.

The application has a expected latency, but it does not last more than one day up. The error is always about memory as I'll show below.

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006f8000000, 671088640, 0) failed; error='Cannot allocate memory' (errno=12)

There is no persistence and the memory is already high for the whole application.

What I've tried

Increasing memory and workes were the first things I've tried, but the number of partitions were changed as well, from 4 to 16.

Script of Execution ...

ANSWER

Answered 2021-Jun-08 at 11:11

Just upgraded the version to Delta.io 1.0.0 and it stopped happening.

Source https://stackoverflow.com/questions/67519651

QUESTION

Webapp fails with "JBAS011232: Only one JAX-RS Application Class allowed" after adding a maven dependency to hadoop-azure

Asked 2021-Jun-03 at 20:31

I have a webapp that runs fine in JBoss EAP 6.4. I want to add some functionality to my webapp so that it can process Parquet files that reside in AzureBlob storage. I add a single dependency to my pom.xml:

...

ANSWER

Answered 2021-Jun-03 at 20:31

hadoop-azure pulls in hadoop-common, which pulls in Jersey. In the version of hadoop-azure you're using, hadoop-common is in compile . In new version, it is in provided scope. So you can just upgrade the hadoop-azure dependency to the latest one. If you need hadoop-common to compile, then you can redeclare hadoop-common and put it in provided scope.

Source https://stackoverflow.com/questions/67807156

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

CVE-2020-9492 HIGH

In Apache Hadoop 3.2.0 to 3.2.1, 3.0.0-alpha1 to 3.1.3, and 2.0.0-alpha to 2.10.0, WebHDFS client might send SPNEGO authorization header to remote URL without proper verification.

https://lists.apache.org/thread.html/rca4516b00b55b347905df45e5d0432186248223f30497db87aba8710@%3Cannounce.apache.org%3E

https://lists.apache.org/thread.html/r513758942356ccd0d14538ba18a09903fc72716d74be1cb727ea91ff%40%3Cgeneral.hadoop.apache.org%3E

https://lists.apache.org/thread.html/r941e9be04efe0f455d20aeac88516c0848decd7e7b1d93d5687060f4@%3Ccommits.druid.apache.org%3E

https://lists.apache.org/thread.html/r79323adac584edab99fd5e4b52a013844b784a5d4b600da0662b33d6@%3Ccommits.druid.apache.org%3E

https://lists.apache.org/thread.html/rc0057ebf32b646ab47f7f5744a8948332e015c39044cbb9d87ea76cd@%3Ccommits.druid.apache.org%3E

https://lists.apache.org/thread.html/r0a534f1cde7555f7208e9f9b791c1ab396d215eaaef283b3a9153429@%3Ccommits.druid.apache.org%3E

CVE-2018-11764 HIGH

Web endpoint authentication check is broken in Apache Hadoop 3.0.0-alpha4, 3.0.0-beta1, and 3.0.0. Authenticated users may impersonate any user even if no proxy user is configured.

https://lists.apache.org/thread.html/r790ad0a049cde713b93589ecfd4dd2766fda0fc6807eedb6cf69f5c1%40%3Cgeneral.hadoop.apache.org%3E

https://security.netapp.com/advisory/ntap-20201103-0003/

CVE-2018-11765 HIGH

In Apache Hadoop versions 3.0.0-alpha2 to 3.0.0, 2.9.0 to 2.9.2, 2.8.0 to 2.8.5, any users can access some servlets without authentication when Kerberos authentication is enabled and SPNEGO through HTTP is not enabled.

https://lists.apache.org/thread.html/r2c7f899911a04164ed1707083fcd4135f8427e04778c87d83509b0da%40%3Cgeneral.hadoop.apache.org%3E

https://lists.apache.org/thread.html/rf9dfa8b77585c9227db9637552eebb2ab029255a0db4eb76c2b6c4cf@%3Cdev.druid.apache.org%3E

https://lists.apache.org/thread.html/rbe25cac0f499374f8ae17a4a44a8404927b56de28d4c41940d82b7a4@%3Ccommits.druid.apache.org%3E

https://lists.apache.org/thread.html/reea5eb8622afbfbfca46bc758f79db83d90a3263a906c4d1acba4971@%3Ccommits.druid.apache.org%3E

https://lists.apache.org/thread.html/r4dddf1705dbedfa94392913b2dad1cd2d1d89040facd389eea0b3510@%3Ccommits.druid.apache.org%3E

https://lists.apache.org/thread.html/r79b15c5b66c6df175d01d7560adf0cd5c369129b9a161905e0339927@%3Ccommits.druid.apache.org%3E

https://lists.apache.org/thread.html/rb21df54a4e39732ce653d2aa5672e36a792b59eb6717f2a06bb8d02a@%3Ccommits.druid.apache.org%3E

https://lists.apache.org/thread.html/r46447f38ea8c89421614e9efd7de5e656186d35e10fc97cf88477a01@%3Ccommits.druid.apache.org%3E

https://lists.apache.org/thread.html/r17d94d132b207dad221595fd8b8b18628f5f5ec7e3f5be939ecd8928@%3Ccommits.druid.apache.org%3E

https://lists.apache.org/thread.html/r74825601e93582167eb7cdc2f764c74c9c6d8006fa90018562fda60f@%3Ccommits.druid.apache.org%3E

https://lists.apache.org/thread.html/rb241464d83baa3749b08cd3dabc8dba70a9a9027edcef3b5d4c24ef4@%3Ccommits.druid.apache.org%3E

https://security.netapp.com/advisory/ntap-20201016-0005/

CVE-2018-8029 HIGH

In Apache Hadoop versions 3.0.0-alpha1 to 3.1.0, 2.9.0 to 2.9.1, and 2.2.0 to 2.8.4, a user who can escalate to yarn user can possibly run arbitrary commands as root user.

https://lists.apache.org/thread.html/17084c09e6dedf60efe08028b429c92ffd28aacc28454e4fa924578a@%3Cgeneral.hadoop.apache.org%3E

https://lists.apache.org/thread.html/a97c53a81e639ca2fc7b8f61a4fcd1842c2a78544041244a7c624727@%3Cissues.hbase.apache.org%3E

http://www.securityfocus.com/bid/108518

https://lists.apache.org/thread.html/a0164b87660223a2d491f83c88f905fe1a9fa8dc795148d9b0d968c8@%3Cdev.hbase.apache.org%3E

https://lists.apache.org/thread.html/0b8d58e02dbd0fb8bf7320c514fe58da1d6728bdc150f1ba04e0d9fc@%3Cissues.hbase.apache.org%3E

https://security.netapp.com/advisory/ntap-20190617-0001/

https://lists.apache.org/thread.html/r4dddf1705dbedfa94392913b2dad1cd2d1d89040facd389eea0b3510@%3Ccommits.druid.apache.org%3E

https://lists.apache.org/thread.html/rb21df54a4e39732ce653d2aa5672e36a792b59eb6717f2a06bb8d02a@%3Ccommits.druid.apache.org%3E

CVE-2018-8009 HIGH

Apache Hadoop 3.1.0, 3.0.0-alpha to 3.0.2, 2.9.0 to 2.9.1, 2.8.0 to 2.8.4, 2.0.0-alpha to 2.7.6, 0.23.0 to 0.23.11 is exploitable via the zip slip vulnerability in places that accept a zip file.

https://hadoop.apache.org/cve_list.html#cve-2018-8009-http-cve-mitre-org-cgi-bin-cvename-cgi-name-cve-2018-8009-zip-slip-impact-on-apache-hadoop

http://www.securityfocus.com/bid/105927

https://snyk.io/research/zip-slip-vulnerability

https://lists.apache.org/thread.html/a1c227745ce30acbcf388c5b0cc8423e8bf495d619cd0fa973f7f38d@%3Cuser.hadoop.apache.org%3E

https://lists.apache.org/thread.html/708d94141126eac03011144a971a6411fcac16d9c248d1d535a39451@%3Csolr-user.lucene.apache.org%3E

https://access.redhat.com/errata/RHSA-2019:3892

https://lists.apache.org/thread.html/r4dddf1705dbedfa94392913b2dad1cd2d1d89040facd389eea0b3510@%3Ccommits.druid.apache.org%3E

https://lists.apache.org/thread.html/rb21df54a4e39732ce653d2aa5672e36a792b59eb6717f2a06bb8d02a@%3Ccommits.druid.apache.org%3E

CVE-2017-3166 HIGH

In Apache Hadoop versions 2.6.1 to 2.6.5, 2.7.0 to 2.7.3, and 3.0.0-alpha1, if a file in an encryption zone with access permissions that make it world readable is localized via YARN's localization mechanism, that file will be stored in a world-readable location and can be shared freely with any application that requests to localize that file.

https://lists.apache.org/thread.html/2e16689b44bdd1976b6368c143a4017fc7159d1f2d02a5d54fe9310f@%3Cgeneral.hadoop.apache.org%3E

https://lists.apache.org/thread.html/9317fd092b257a0815434b116a8af8daea6e920b6673f4fd5583d5fe@%3Ccommits.druid.apache.org%3E

CVE-2018-11768 HIGH

In Apache Hadoop 3.1.0 to 3.1.1, 3.0.0-alpha1 to 3.0.3, 2.9.0 to 2.9.1, and 2.0.0-alpha to 2.8.4, the user/group information can be corrupted across storing in fsimage and reading back from fsimage.

https://lists.apache.org/thread.html/2067a797b330530a6932f4b08f703b3173253d0a2b7c8c524e54adaf@%3Cgeneral.hadoop.apache.org%3E

https://lists.apache.org/thread.html/caacbbba2dcc1105163f76f3dfee5fbd22e0417e0783212787086378@%3Cgeneral.hadoop.apache.org%3E

https://lists.apache.org/thread.html/ea6d2dfbefab8ebe46be18b05136b83ae53b7866f1bc60c680a2b600@%3Chdfs-dev.hadoop.apache.org%3E

https://lists.apache.org/thread.html/2c9cc65864be0058a5d5ed2025dfb9c700bf23d352b0c826c36ff96a@%3Chdfs-dev.hadoop.apache.org%3E

https://lists.apache.org/thread.html/f20bb4e055d8394fc525cc7772fb84096f706389043e76220c8a29a4@%3Chdfs-dev.hadoop.apache.org%3E

https://lists.apache.org/thread.html/9b609d4392d886711e694cf40d86f770022baf42a1b1aa97e8244c87@%3Cdev.lucene.apache.org%3E

https://lists.apache.org/thread.html/ceb16af9139ab0fea24aef935b6321581976887df7ad632e9a515dda@%3Cdev.lucene.apache.org%3E

https://lists.apache.org/thread.html/72ca514e01cd5f08151e74f9929799b4cbe1b6e9e6cd24faa72ffcc6@%3Cdev.lucene.apache.org%3E

https://lists.apache.org/thread.html/r02e39d7beb32eebcdbb4b516e95f67d71c90d5d462b26f4078d21eeb@%3Cdev.flink.apache.org%3E

https://lists.apache.org/thread.html/r02e39d7beb32eebcdbb4b516e95f67d71c90d5d462b26f4078d21eeb@%3Cuser.flink.apache.org%3E

CVE-2017-3162 HIGH

HDFS clients interact with a servlet on the DataNode to browse the HDFS namespace. The NameNode is provided as a query parameter that is not validated in Apache Hadoop before 2.7.0.

https://s.apache.org/k2ss

http://www.securityfocus.com/bid/98017

https://lists.apache.org/thread.html/r127f75748fcabc63bc5a1bec6885753eb9b2bed803b6ed7bd46f965b@%3Cuser.hadoop.apache.org%3E

CVE-2017-3161 MEDIUM

The HDFS web UI in Apache Hadoop before 2.7.0 is vulnerable to a cross-site scripting (XSS) attack through an unescaped query parameter.

https://s.apache.org/4MQm

http://www.securityfocus.com/bid/98025

https://lists.apache.org/thread.html/r127f75748fcabc63bc5a1bec6885753eb9b2bed803b6ed7bd46f965b@%3Cuser.hadoop.apache.org%3E

Install hadoop

See BUILDING.txt to check required packages and run install_requirements.sh to install. If you installed required packages, there is source.me file which declare several environment variables. If you want to build the package without native libraries, remove -Pnative option. Currently if you build the hadoop package, we recommend running maven with -DskipTests. Depending on the build environment, maven tests can sometimes fail and some our improvements need to be fix.

Support

For any new features, suggestions and bugs create an issue on GitHub. If you have any questions check and ask questions on community page Stack Overflow .

Find more information at: