hadoop

by apache | Java | Updated: 2 days ago | Current License: Apache-2.0



kandi X-RAY | hadoop REVIEW AND RATINGS

For the latest information about Hadoop, please visit the Apache Hadoop website.

Support

  • hadoop has a highly active ecosystem.
  • It has 12283 star(s) with 7628 fork(s).
  • It had no major release in the last 12 months.
  • It has a negative sentiment in the developer community.

Quality

  • hadoop has 9285 bugs (360 blocker, 185 critical, 1521 major, 7219 minor) and 44440 code smells.

Security

  • hadoop has 5 vulnerability issues reported (1 critical, 2 high, 2 medium, 0 low).
  • hadoop code analysis shows 92 unresolved vulnerabilities (74 blocker, 5 critical, 1 major, 12 minor).
  • There are 1156 security hotspots that need review.

License

  • hadoop is licensed under the Apache-2.0 License. This license is Permissive.
  • Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

  • hadoop releases are not available. You will need to build from source code and install.
  • Build file is available. You can build the component from source.
Top functions reviewed by kandi - BETA

Coming Soon for all Libraries!

Currently covering the most popular Java, JavaScript, and Python libraries.
kandi's functional review helps you automatically verify the functionality of the libraries and avoid rework.

hadoop Key Features

Apache Hadoop

hadoop examples and code snippets

  • I can't pass parameters to foreach loop while implementing Structured Streaming + Kafka in Spark SQL
  • Getting java.lang.ClassNotFoundException when I try to do spark-submit, referred other similar queries online but couldn't get it to work
  • Indexing of Spark 3 Dataframe into Apache Solr 8
  • Update to mapred-default.xml not visible in web UI configuration
  • Hive: Query executing from hours
  • Set spark context configuration prioritizing spark-submit
  • Error when building docker image for jupyter spark notebook
  • Map-reduce functional outline
  • Python subprocess with apostrophes, removes them
  • Exception in thread "JobGenerator" java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps scala.Predef$.refArrayOps(java.lang.Object[])'

I can't pass parameters to foreach loop while implementing Structured Streaming + Kafka in Spark SQL

.selectExpr("CAST(value AS STRING)")
          .as(Encoders.STRING());  // or parse your JSON here using a schema 

data.select(...)  // or move this to a method / class that takes the Dataset as a parameter

// await termination 

Getting java.lang.ClassNotFoundException when I try to do spark-submit, referred other similar queries online but couldn't get it to work

<project>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>4.5.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

Indexing of Spark 3 Dataframe into Apache Solr 8

import pysolr
import json

def solrIndexer(row):
    solr = pysolr.Solr('http://localhost:8983/solr/spark-test')
    obj = json.loads(row)
    solr.add(obj)

#load data to dataframe from HDFS
csvDF = spark.read.load("hdfs://hms/data/*.csv", format="csv", sep=",", inferSchema="true", header="true")

csvDF.toJSON().map(solrIndexer).count()

Update to mapred-default.xml not visible in web UI configuration

<property>
 <name>mapreduce.map.memory.mb</name>
 <value>1024</value>
</property>

Hive: Query executing from hours

create index idx_TABLE2 on table DB_MYDB.TABLE2 (SDNT_ID,CLSS_CD,BRNCH_CD,SECT_CD,GRP_CD,GRP_NM) AS 'COMPACT' WITH DEFERRED REBUILD;

create index idx_TABLE3 on table DB_MYDB.TABLE3(SDNT_ID,CLSS_CD,BRNCH_CD,SECT_CD,GRP_CD,GRP_NM) AS 'COMPACT' WITH DEFERRED REBUILD;
-----------------------
set hive.exec.reducers.bytes.per.reducer=67108864; --example only, check your current settings 
                                                   --and reduce accordingly to get twice more reducers on Reducer 2 vertex

Set spark context configuration prioritizing spark-submit

from pyspark import SparkContext  # assumes my_config_dict is already defined with your default options

config = my_config_dict
sc = SparkContext()
conf = sc.getConf()
# setIfMissing keeps any value that was already supplied via spark-submit
for option in my_config_dict.keys():
    conf.setIfMissing(option, my_config_dict[option])

Error when building docker image for jupyter spark notebook

docker build --rm --force-rm \
  --build-arg spark_version=3.0.2 \
  -t jupyter/pyspark-notebook:3.0.2 .
docker build --rm --force-rm \
  --build-arg spark_version=3.1.1 \
  --build-arg hadoop_version=2.7 \
  -t jupyter/pyspark-notebook:3.1.1 .  

Map-reduce functional outline

    xA           xB
     |           |
  xform(xA)   xform(xB)
       \       /
aggregator(xform(xA), xform(xB))
           |
         value

    xA           xB               xC
     |           |                |
  xform(xA)   xform(xB)         xform(xC)
     |           |                |
     yA          yB               yC
       \       /                  |
aggregator(yA, yB)                |
           |                     /
         value                  /
           |                   /
          aggregator(value, yC)
                   |
              next_value
import functools

# Combiner
def add(a, b):
    return a + b

# Transformer
def square(a):
    return a * a

one_to_ten = range(1, 11)

functools.reduce(add, map(square, one_to_ten), 0)
(require '[clojure.core.reducers :as r])

(defn square [x] (* x x))

(r/fold + (pmap square (range 1 11)))
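
For comparison, a minimal Java sketch of the same outline (square as the transformer, addition as the aggregator, applied to 1..10) using the streams API; the class name is only a placeholder:

import java.util.stream.IntStream;

public class MapReduceOutline {
    public static void main(String[] args) {
        // Transformer: square each element; aggregator: add the transformed values.
        int total = IntStream.rangeClosed(1, 10)
                .map(x -> x * x)          // xform
                .reduce(0, Integer::sum); // aggregator, starting from identity 0

        System.out.println(total); // 385, the same result as the Python and Clojure snippets
    }
}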

Python subprocess with apostrophes, removes them

subprocess.run([
    'docker', 'exec', 'hbase', 'bash', '-c',
    '''echo 'create "myTable", "R"' | hbase shell'''])

Exception in thread "JobGenerator" java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps scala.Predef$.refArrayOps(java.lang.Object[])'

<properties>
    <scala.minor.version>2.11</scala.minor.version>
    <spark.version>2.4.2</spark.version>
</properties>
         <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${scala.minor.version}</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_${scala.minor.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.minor.version}.8</version>
        </dependency>
-----------------------
<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>TikiData</groupId>
    <artifactId>TikiData</artifactId>
    <version>V1</version>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.8.6</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.3.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.1.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.1.1</version>
            <scope>provided</scope>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
            <version>3.1.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.12</artifactId>
            <version>3.1.1</version>
            <scope>provided</scope>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.2</version>
        </dependency>

    </dependencies>
    <build>
        <sourceDirectory>src</sourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <release>11</release>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                        <configuration>
                            <archive>
                                <manifest>
                                    <mainClass>
                                        demo.KafkaDemo
                                    </mainClass>
                                </manifest>
                            </archive>
                            <descriptorRefs>
                                <descriptorRef>jar-with-dependencies</descriptorRef>
                            </descriptorRefs>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

COMMUNITY DISCUSSIONS

Top Trending Discussions on hadoop
  • I can't pass parameters to foreach loop while implementing Structured Streaming + Kafka in Spark SQL
  • Getting java.lang.ClassNotFoundException when I try to do spark-submit, referred other similar queries online but couldn't get it to work
  • Indexing of Spark 3 Dataframe into Apache Solr 8
  • Update to mapred-default.xml not visible in web UI configuration
  • Import org.apache statement cannot be resolved in GCP Shell
  • Hadoop NameNode Web Interface
  • RDD in Spark: where and how are they stored?
  • Hive: Query executing from hours
  • Cannot Allocate Memory in Delta Lake
  • Webapp fails with "JBAS011232: Only one JAX-RS Application Class allowed" after adding a maven dependency to hadoop-azure
Top Trending Discussions on hadoop

QUESTION

I can't pass parameters to foreach loop while implementing Structured Streaming + Kafka in Spark SQL

Asked 2021-Jun-15 at 04:42

I followed the instructions at Structured Streaming + Kafka and built a program that receives data streams sent from Kafka as input. When I receive the data stream, I want to pass it to a SparkSession variable to do some query work with Spark SQL, so I extended the ForeachWriter class as follows:

package stream;

import java.io.FileNotFoundException;
import java.io.PrintWriter;

import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.SparkSession;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

import dataservices.OrderDataServices;
import models.SuccessEvent;

public class MapEventWriter extends ForeachWriter<String>{
private SparkSession spark;

public MapEventWriter(SparkSession spark) {
    this.spark = spark;
}

private static final long serialVersionUID = 1L;

@Override
public void close(Throwable errorOrNull) {
    // TODO Auto-generated method stub
    
}

@Override
public boolean open(long partitionId, long epochId) {
    // TODO Auto-generated method stub
    return true;
}

@Override
public void process(String input) {     
    OrderDataServices services = new OrderDataServices(this.spark);
}
}

However, in the process function, the program gives an error if I use the spark variable. The program that passes in my SparkSession is as follows:

package demo;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeoutException;

import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.json.simple.parser.ParseException;

import dataservices.OrderDataServices;
import models.MapperEvent;
import models.OrderEvent;
import models.SuccessEvent;
import stream.MapEventWriter;
import stream.MapEventWriter1;

public class Demo {
    public static void main(String[] args) throws TimeoutException, StreamingQueryException, ParseException, IOException {
        try (SparkSession spark = SparkSession.builder().appName("Read kafka").getOrCreate()) {
            Dataset<String> data = spark
                    .readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "localhost:9092")
                    .option("subscribe", "tiki-1")
                    .load()
                    .selectExpr("CAST(value AS STRING)")
                    .as(Encoders.STRING());
            
            MapEventWriter eventWriter = new MapEventWriter(spark);
            
            StreamingQuery query = data
                    .writeStream()
                    .foreach(eventWriter)
                    .start();
            
            query.awaitTermination();
            
        }
    }
    
    
}

The error is a NullPointerException at the location where spark is called; that is, no spark variable is initialized. I hope someone can help me; I really appreciate it.

Caused by: java.lang.NullPointerException
    at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:151)
    at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:149)
    at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:998)
    at org.apache.spark.sql.SparkSession.read(SparkSession.scala:655)
    at dataservices.OrderDataServices.<init>(OrderDataServices.java:18)
    at stream.MapEventWriter.process(MapEventWriter.java:38)
    at stream.MapEventWriter.process(MapEventWriter.java:15)

ANSWER

Answered 2021-Jun-15 at 04:42

do some query work with Spark SQL

You wouldn't use a ForEachWriter for that

.selectExpr("CAST(value AS STRING)")
          .as(Encoders.STRING());  // or parse your JSON here using a schema 

data.select(...)  // or move this to a method / class that takes the Dataset as a parameter

// await termination 
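
For illustration, a minimal sketch of that approach, reusing the Kafka options from the question and substituting a simple count-per-value aggregation as a stand-in for the OrderDataServices query work (the aggregation and class name are placeholders, not part of the original answer):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;

public class QueryOnStreamDemo {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("Read kafka").getOrCreate();

        Dataset<String> data = spark
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "tiki-1")
                .load()
                .selectExpr("CAST(value AS STRING)")
                .as(Encoders.STRING());

        // Do the query work on the Dataset itself (here a simple count per value)
        // instead of creating services inside ForeachWriter.process().
        Dataset<Row> counts = data.groupBy("value").count();

        StreamingQuery query = counts
                .writeStream()
                .outputMode(OutputMode.Complete())
                .format("console")
                .start();

        query.awaitTermination();
    }
}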

Source https://stackoverflow.com/questions/67972167

QUESTION

Getting java.lang.ClassNotFoundException when I try to do spark-submit, referred other similar queries online but couldn't get it to work

Asked 2021-Jun-14 at 09:36

I am new to Spark and am trying to run a simple Spark jar file, built through Maven in IntelliJ, on a Hadoop cluster. But I am getting a ClassNotFoundException in every way I have tried to submit the application through spark-submit.

My pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>org.example</groupId>
<artifactId>SparkTrans</artifactId>
<version>1.0-SNAPSHOT</version>

<dependencies>
<!--https://mvnrepository.com/artifact/org.apache.spark/spark-core-->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.3</version>
</dependency>
<!--https://mvnrepository.com/artifact/org.apache.spark/spark-sql-->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.3</version>
</dependency>

<!--https://mvnrepository.com/artifact/org.apache.spark/spark-hive-->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.4.3</version>
<scope>compile</scope>
</dependency>
<!--https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-slf4j-impl-->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>2.8</version>
<scope>test</scope>
</dependency>
<!--https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-api-->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<version>2.8</version>
</dependency>
<!--https://mvnrepository.com/artifact/com.typesafe/config-->
<dependency>
<groupId>com.typesafe</groupId>
<artifactId>config</artifactId>
<version>1.3.4</version>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_2.11</artifactId>
<version>3.1.1</version>
<scope>test</scope>
</dependency>
</dependencies>


<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<id>shade-libs</id>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
<exclude>resources/*</exclude>
</excludes>
</filter>
</filters>
<shadedClassifierName>fat</shadedClassifierName>
<shadedArtifactAttached>true</shadedArtifactAttached>
<relocations>
<relocation>
<pattern>org.apache.http.client</pattern>
<shadedPattern>shaded.org.apache.http.client</shadedPattern>
</relocation>
</relocations>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>


</project>

My main scala object (SparkTrans.scala):

import common.InputConfig
import org.apache.spark.sql.{DataFrame,SparkSession}
import org.slf4j.LoggerFactory

object SparkTrans{

private val logger=LoggerFactory.getLogger(getClass.getName)

def main(args:Array[String]):Unit={
try{
logger.info("main method started")
logger.warn("This is a warning")

val arg_length=args.length

if(arg_length==0){
logger.warn("No Argument passed")
System.exit(1)
}

val inputConfig:InputConfig=InputConfig(env=args(0),targetDB=args(1))
println("The first argument passed is" + inputConfig.env)
println("The second argument passed is" + inputConfig.targetDB)

val spark=SparkSession
.builder()
.appName("SparkPOCinside")
.config("spark.master","yarn")
.enableHiveSupport()
.getOrCreate()

println("Created Spark Session")

val sampleSeq=Seq((1,"Spark"),(2,"BigData"))

val df1=spark.createDataFrame(sampleSeq).toDF("courseid","coursename")
df1.show()


logger.warn("sql_test_a method started")
val courseDF=spark.sql("select * from MYINSTANCE.sql_test_a")
logger.warn("sql_test_a method ended")
courseDF.show()


}
catch{
case e:Exception=>
logger.error("An error has occurred in the main method" + e.printStackTrace())
}


}

}

I tried the below commands to spark-submit, but all of them give a ClassNotFoundException. I tried to switch the arguments around, putting --class right after --deploy-mode, but in vain:

spark-submit --master yarn --deploy-mode cluster --queue ABCD --conf spark.yarn.security.tokens.hive.enabled=false --files hdfs://nameservice1/user/XMLs/hive-site.xml --keytab hdfs://nameservice1/user/MYINSTANCE/landing/workflow/wf_data/lib/MYKEY.keytab --num-executors 1 --executor-cores 1 --executor-memory 2g --conf spark.yarn.executor.memoryOverhead=3072 --class org.example.SparkTrans hdfs://nameservice1/user/MYINSTANCE/landing/workflow/wf_data/SparkTrans-1.0-SNAPSHOT-fat.jar dev somedb


spark-submit --master yarn --deploy-mode cluster --queue ABCD --conf spark.yarn.security.tokens.hive.enabled=false --files hdfs://nameservice1/user/XMLs/hive-site.xml --keytab hdfs://nameservice1/user/MYINSTANCE/landing/workflow/wf_data/lib/MYKEY.keytab --num-executors 1 --executor-cores 1 --executor-memory 2g --conf spark.yarn.executor.memoryOverhead=3072 --class org.example.SparkTrans --name org.example.SparkTrans hdfs://nameservice1/user/MYINSTANCE/landing/workflow/wf_data/SparkTrans-1.0-SNAPSHOT-fat.jar dev somedb


spark-submit --master yarn --deploy-mode cluster --queue ABCD --conf spark.yarn.security.tokens.hive.enabled=false --files hdfs://nameservice1/user/XMLs/hive-site.xml --keytab hdfs://nameservice1/user/MYINSTANCE/landing/workflow/wf_data/lib/MYKEY.keytab --num-executors 1 --executor-cores 1 --executor-memory 2g --conf spark.yarn.executor.memoryOverhead=3072 --class SparkTrans hdfs://nameservice1/user/MYINSTANCE/landing/workflow/wf_data/SparkTrans-1.0-SNAPSHOT-fat.jar dev somedb

Exact error I am getting:

btrace WARNING: No output stream. DataCommand output is ignored.
[main] INFO ResourceCollector - Unravel Sensor 4.6.1.8rc0013/2.0.3 initializing.
21/06/11 10:09:27 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
21/06/11 10:09:28 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1614625006458_6646161_000001
21/06/11 10:09:30 INFO spark.SecurityManager: Changing view acls to: MYKEY
21/06/11 10:09:30 INFO spark.SecurityManager: Changing modify acls to: MYKEY
21/06/11 10:09:30 INFO spark.SecurityManager: Changing view acls groups to: 
21/06/11 10:09:30 INFO spark.SecurityManager: Changing modify acls groups to: 
21/06/11 10:09:30 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(MYKEY); groups with view permissions: Set(); users  with modify permissions: Set(MYKEY); groups with modify permissions: Set()
21/06/11 10:09:30 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
21/06/11 10:09:30 ERROR yarn.ApplicationMaster: Uncaught exception: 
java.lang.ClassNotFoundException: SparkTrans
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.deploy.yarn.ApplicationMaster.startUserApplication(ApplicationMaster.scala:561)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:347)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:197)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:695)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:693)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
21/06/11 10:09:30 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.lang.ClassNotFoundException: SparkTrans)
21/06/11 10:09:30 INFO util.ShutdownHookManager: Shutdown hook called

Can any of you let me know what I am doing wrong? I have checked and can see that the hive-site.xml and my jar are in the correct locations in HDFS, as mentioned in my commands.

ANSWER

Answered 2021-Jun-14 at 09:36

You need to add the Scala compiler configuration (scala-maven-plugin) to your pom.xml. The problem is that without it there is nothing to compile your SparkTrans.scala file into Java classes.

Add:

<project>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>4.5.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

to your pom.xml and ensure your scala file is in src/main/scala

Then it should be compiled and added to your jar. Here's the documentation for the scala plugin.

You can check what's in your jar with jar tf jar-file.

Source https://stackoverflow.com/questions/67934425

QUESTION

Indexing of Spark 3 Dataframe into Apache Solr 8

Asked 2021-Jun-14 at 07:42

I have set up a small Hadoop YARN cluster where Apache Spark is running. I have some data (JSON, CSV) that I upload to Spark (data-frame) for some analysis. Later, I have to index all the data-frame data into Apache Solr. I am using Spark 3 and Solr 8.8.

In my search, I found a solution here, but it is for a different version of Spark. Hence, I have decided to ask someone about this.

Is there any built-in option for this task? I am open to using SolrJ and PySpark (not the Scala shell).

ANSWER

Answered 2021-Jun-14 at 07:42

I found a solution myself. As of now, the Lucidworks spark-solr module does not support these versions of Spark (3.0.2) and Solr (8.8). I first installed the PySolr module and then used the following example code to finish my job:

import pysolr
import json

def solrIndexer(row):
    solr = pysolr.Solr('http://localhost:8983/solr/spark-test')
    obj = json.loads(row)
    solr.add(obj)

#load data to dataframe from HDFS
csvDF = spark.read.load("hdfs://hms/data/*.csv", format="csv", sep=",", inferSchema="true", header="true")

csvDF.toJSON().map(solrIndexer).count()

If there is some better option or improvement to the above code, you are welcome to answer.
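
For reference, since the question mentions that SolrJ is also acceptable, here is a minimal Java sketch of adding a single document with SolrJ; the core URL matches the pysolr example above, and the document fields are placeholders:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrJIndexExample {
    public static void main(String[] args) throws Exception {
        // Same placeholder core URL as the pysolr example above.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/spark-test").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");          // placeholder fields
            doc.addField("title", "hello solr");
            solr.add(doc);
            solr.commit(); // make the document searchable
        }
    }
}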

Source https://stackoverflow.com/questions/66311948

QUESTION

Update to mapred-default.xml not visible in web UI configuration

Asked 2021-Jun-12 at 07:08

I have an Apache Kylin container running in Docker. I was getting a Java heap space error in the map reduce phase, so I tried updating some parameters in the Hadoop mapred-default.xml file. After making the changes, I restarted the container, but when I go to the Yarn ResourceManager Web UI and then to Configuration:

Yarn ResourceManager Web UI screenshot

An XML file is opened showing the configuration.


However, my new values for the properties that I set inside mapred-default.xml are not there; it is showing the old values for those properties. Does anyone have any idea why that is happening and what I should do to make it register the new values? I tried restarting the container, but it didn't help.

ANSWER

Answered 2021-Jun-12 at 07:08

To override a default value for a property, specify the new value within property tags inside mapred-site.xml, not mapred-default.xml, using the following format:

<property>
 <name>mapreduce.map.memory.mb</name>
 <value>1024</value>
</property>

Be sure to restart YARN after reconfiguring, using stop-yarn.sh and start-yarn.sh.

Source https://stackoverflow.com/questions/67935665

QUESTION

Import org.apache statement cannot be resolved in GCP Shell

Asked 2021-Jun-10 at 21:48

I had used the below command in the GCP Shell terminal to create a project named wordcount:

 mvn archetype:generate 
-DarchetypeGroupId=org.apache.maen.archetypes 
-DgroupId=com.wordcount 
-DartifactId=wordcount

and then added a Map java file in the below path /wordcount/src/main/java/com/wordcount.

When I use the below import statements, it throws an error.

import org.apache.hadoop.io.Intwritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

I'm unable to find the path in GCP Shell. Am I missing the apache/hadoop classes that should be added so that I can resolve this issue?

ANSWER

Answered 2021-Jun-10 at 21:48

I'd suggest finding an archetype for creating MapReduce applications; otherwise, you need to add hadoop-client as a dependency in your pom.xml.

Source https://stackoverflow.com/questions/67916362

QUESTION

Hadoop NameNode Web Interface

Asked 2021-Jun-09 at 14:18

I have 3 remote computers (servers):

  • computer 1 has internal IP: 10.1.7.245
  • computer 2 has internal IP: 10.1.7.246
  • computer 3 has internal IP: 10.1.7.247

(The 3 computers above are in the same network, these 3 computers are all using Ubuntu 18.04.5 LTS Operating System)

(My personal laptop is in another different network, my laptop also uses Ubuntu 18.04.5 LTS Operating System)

I use my personal laptop to connect to the 3 remote computers using the SSH protocol as the root user (below, ABC is a placeholder name):

  • computer 1: ssh root@ABC.University.edu.vn -p 12001
  • computer 2: ssh root@ABC.University.edu.vn -p 12002
  • computer 3: ssh root@ABC.University.edu.vn -p 12003

I have successfully set up a Hadoop Cluster which contains 3 above computer:

  • computer 1 is the Hadoop Master
  • computer 2 is the Hadoop Slave 1
  • computer 3 is the Hadoop Slave 2

======================================================

I start HDFS on the Hadoop cluster by using the below command on Computer 1: start-dfs.sh

Everything is successful:

  • computer 1 (the Master) is running the NameNode
  • computer 2 (the Slave 1) is running the DataNode
  • computer 3 (the Slave 2) is running the DataNode

I know that the Web Interface for the NameNode is running on Computer 1, on IP 0.0.0.0 and port 9870. Therefore, if I open the web browser on computer 1 (or on computer 2, or on computer 3), I will enter 10.1.7.245:9870 in the URL bar (address bar) of the web browser to see the Web Interface of the NameNode.

======================================================

Now, I am using the web browser of my personal laptop.

How can I access the Web Interface of the NameNode?

ANSWER

Answered 2021-Jun-08 at 17:56

Unless you expose port 9870, your personal laptop on another network will not be able to access the web interface.

You can check whether it is exposed by trying IP-address:9870. The IP-address here has to be the global IP address, not the local (10.*) address.

To get the NameNode's IP address, ssh into the NameNode server and type ifconfig (sudo apt install net-tools if it is not already installed - I'm assuming Ubuntu/Linux here). ifconfig should give you a global IP address (not the 255.* one - that is a mask).

Source https://stackoverflow.com/questions/67891388

QUESTION

RDD in Spark: where and how are they stored?

Asked 2021-Jun-09 at 09:45

I've always heard that Spark is 100x faster than classic MapReduce frameworks like Hadoop. But recently I've been reading that this is only true if RDDs are cached, which I thought was always done automatically but instead requires the explicit cache() method.

I would like to understand how all produced RDDs are stored throughout the work. Suppose we have this workflow:

  1. I read a file -> I get the RDD_ONE
  2. I use the map on the RDD_ONE -> I get the RDD_TWO
  3. I use any other transformation on the RDD_TWO

QUESTIONS:

If I don't use cache() or persist(), is every RDD stored in memory, in cache, or on disk (local file system or HDFS)?

If RDD_THREE depends on RDD_TWO, and this in turn depends on RDD_ONE (lineage), and I didn't use the cache() method on RDD_THREE, should Spark recalculate RDD_ONE (rereading it from disk) and then RDD_TWO to get RDD_THREE?

Thanks in advance.

ANSWER

Answered 2021-Jun-09 at 06:13

In Spark there are two types of operations: transformations and actions. A transformation on a dataframe returns another dataframe, and an action on a dataframe returns a value.

Transformations are lazy, so when a transformation is performed Spark adds it to the DAG and executes it only when an action is called.

Suppose you read a file into a dataframe, then perform a filter, a join, and an aggregation, and then count. The count operation, which is an action, will actually kick off all the previous transformations.

If we call another action (like show), the whole chain of operations is executed again, which can be time consuming. So, if we do not want to run the whole set of operations again and again, we can cache the dataframe.

A few pointers to consider while caching (a minimal sketch follows the list):

  1. Cache only when the resulting dataframe is generated from significant transformations. If Spark can regenerate the cached dataframe in a few seconds, then caching is not required.
  2. Cache should be performed when the dataframe is used for multiple actions. If there are only 1-2 actions on the dataframe, then it is not worth saving that dataframe in memory.
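
A minimal Java sketch of these pointers, assuming a hypothetical people.json input; the filtered dataframe is cached because it feeds two actions (count and show):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CacheDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("CacheDemo").getOrCreate();

        // Transformations are lazy: nothing executes until an action is called.
        Dataset<Row> people = spark.read().json("hdfs:///data/people.json"); // hypothetical path
        Dataset<Row> adults = people.filter("age >= 18");

        // Cache because the filtered result feeds two actions below; without cache()
        // the read and the filter would be recomputed for each action.
        adults.cache();

        System.out.println(adults.count()); // first action: runs the read + filter and populates the cache
        adults.show();                      // second action: served from the cached data

        spark.stop();
    }
}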

Source https://stackoverflow.com/questions/67894971

QUESTION

Hive: Query executing from hours

Asked 2021-Jun-08 at 23:08

I'm trying to execute the below Hive query on an Azure HDInsight cluster, but it's taking an unprecedented amount of time to finish. I did implement some Hive settings, but to no avail. Below are the details:

Table

CREATE TABLE DB_MYDB.TABLE1(
  MSTR_KEY STRING,
  SDNT_ID STRING,
  CLSS_CD STRING,
  BRNCH_CD STRING,
  SECT_CD STRING,
  GRP_CD STRING,
  GRP_NM STRING,
  SUBJ_DES STRING,
  GRP_DESC STRING,
  DTL_DESC STRING,
  ACTV_FLAG STRING,
  CMP_NM STRING)
STORED AS ORC
TBLPROPERTIES ('ORC.COMPRESS'='SNAPPY');

Hive Query

INSERT OVERWRITE TABLE DB_MYDB.TABLE1
SELECT
CURR.MSTR_KEY,
CURR.SDNT_ID,
CURR.CLSS_CD,
CURR.BRNCH_CD,
CURR.SECT_CD,
CURR.GRP_CD,
CURR.GRP_NM,
CURR.SUBJ_DES,
CURR.GRP_DESC,
CURR.DTL_DESC,
'Y',
CURR.CMP_NM
FROM DB_MYDB.TABLE2 CURR
LEFT OUTER JOIN DB_MYDB.TABLE3 PREV
ON (CURR.SDNT_ID=PREV.SDNT_ID
AND CURR.CLSS_CD=PREV.CLSS_CD
AND CURR.BRNCH_CD=PREV.BRNCH_CD
AND CURR.SECT_CD=PREV.SECT_CD
AND CURR.GRP_CD=PREV.GRP_CD
AND CURR.GRP_NM=PREV.GRP_NM)
WHERE PREV.SDNT_ID IS NULL;

But the query is running for hours. Below is the detail:

--------------------------------------------------------------------------------
    VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     46         46        0        0       0       0
Map 3 ..........   SUCCEEDED    169        169        0        0       0       0
Reducer 2 ....       RUNNING   1009        825      184        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/03  [======================>>----] 84%   ELAPSED TIME: 13622.73 s  
--------------------------------------------------------------------------------

I did set some hive properties

SET hive.execution.engine=tez;
SET hive.tez.container.size=10240;
SET tez.am.resource.memory.mb=10240;
SET tez.task.resource.memory.mb=10240;
SET hive.auto.convert.join.noconditionaltask.size=3470;
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.vectorized.execution.reduce.groupby.enabled=true;
SET hive.cbo.enable=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
SET hive.compute.query.using.stats=true;
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.tezfiles = true;
SET hive.merge.size.per.task=268435456;
SET hive.merge.smallfiles.avgsize=16777216;
SET hive.merge.orcfile.stripe.level=true;

Records in Tables:

DB_MYDB.TABLE2= 337319653

DB_MYDB.TABLE3= 1946526625

There doesn't seem to be any impact on the query. Can anyone help me to:

  1. Understand why this query is not completing and is taking an indefinite amount of time?
  2. How can I optimize it to work faster and complete?

Using the versions:

Hadoop 2.7.3.2.6.5.3033-1
Hive 1.2.1000.2.6.5.3033-1
Azure HDInsight 3.6

Attempt_1:

As suggested by @leftjoin, I tried set hive.exec.reducers.bytes.per.reducer=32000000;. This worked until the second-to-last step of the Hive script, but the last step failed with Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch failures and insufficient progress!

Last Query:

INSERT OVERWRITE TABLE DB_MYDB.TABLE3
SELECT
 CURR_FULL.MSTR_KEY,
 CURR_FULL.SDNT_ID,
 CURR_FULL.CLSS_CD,
 CURR_FULL.BRNCH_CD,
 CURR_FULL.GRP_CD,
 CURR_FULL.CHNL_CD,
 CURR_FULL.GRP_NM,
 CURR_FULL.GRP_DESC,
 CURR_FULL.SUBJ_DES,
 CURR_FULL.DTL_DESC,
 (CASE WHEN CURR_FULL.SDNT_ID = SND_DELTA.SDNT_ID THEN 'Y' ELSE 
 CURR_FULL.SDNT_ID_FLAG END) AS SDNT_ID_FLAG,
 CURR_FULL.CMP_NM
 FROM
   DB_MYDB.TABLE2 CURR_FULL
   LEFT OUTER JOIN DB_MYDB.TABLE1 SND_DELTA
   ON (CURR_FULL.SDNT_ID = SND_DELTA.SDNT_ID);


----------------------------------------------------------------- 
VERTICES    STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED KILLED
-----------------------------------------------------------------
Map 1 .........  RUNNING  1066    1060     6     0     0    0
Map 4 .......... SUCCEEDED   3     3       0     0     0    0
Reducer 2        RUNNING   1009    0       22    987   0    0
Reducer 3        INITED      1     0       0     1     0    0
-----------------------------------------------------------------
VERTICES: 01/04  [================>>--] 99%   ELAPSED TIME: 18187.78 s   

Error:

Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=8, pendingInputs=1058, fetcherHealthy=false, reducerProgressedEnough=false, reducerStalled=false

ANSWER

Answered 2021-Jun-07 at 03:19

If you don't have indexes on your FK columns, you should add them for sure; here is my suggestion:

create index idx_TABLE2 on table DB_MYDB.TABLE2 (SDNT_ID,CLSS_CD,BRNCH_CD,SECT_CD,GRP_CD,GRP_NM) AS 'COMPACT' WITH DEFERRED REBUILD;

create index idx_TABLE3 on table DB_MYDB.TABLE3(SDNT_ID,CLSS_CD,BRNCH_CD,SECT_CD,GRP_CD,GRP_NM) AS 'COMPACT' WITH DEFERRED REBUILD;

Be aware that from Hive version 3.0, indexing has been removed from Hive; alternatively, you can use materialized views (supported from Hive 2.3.0 and above), which give you the same performance.

Source https://stackoverflow.com/questions/67864692

QUESTION

Cannot Allocate Memory in Delta Lake

Asked 2021-Jun-08 at 11:11

Problem

The goal is to have a Spark Streaming application that reads data from Kafka and uses Delta Lake to store the data. The partitioning of the delta table is quite fine-grained: the first partition column is organization_id (there are more than 5000 organizations) and the second is the date.

The application has the expected latency, but it does not stay up for more than one day. The error is always about memory, as I'll show below.

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006f8000000, 671088640, 0) failed; error='Cannot allocate memory' (errno=12)

There is no persistence and the memory is already high for the whole application.

What I've tried

Increasing memory and workers was the first thing I tried, but the number of partitions was changed as well, from 4 to 16.

Script of Execution

spark-submit \
  --verbose \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2G \
  --executor-memory 4G \
  --executor-cores 2 \
  --num-executors 4 \
  --files s3://my-bucket/log4j-driver.properties,s3://my-bucket/log4j-executor.properties \
  --jars /home/hadoop/delta-core_2.12-0.8.0.jar,/usr/lib/spark/external/lib/spark-sql-kafka-0-10.jar \
  --class my.package.app \
  --conf spark.driver.memoryOverhead=512 \
  --conf spark.executor.memoryOverhead=1024 \
  --conf spark.memory.fraction=0.8 \
  --conf spark.memory.storageFraction=0.3 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.rdd.compress=true \
  --conf spark.yarn.max.executor.failures=100 \
  --conf spark.yarn.maxAppAttempts=100 \
  --conf spark.task.maxFailures=100 \
  --conf spark.executor.heartbeatInterval=20s \
  --conf spark.network.timeout=300s \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.driver.extraJavaOptions="-XX:-PrintGCDetails -XX:-PrintGCDateStamps -XX:-UseParallelGC -XX:+UseG1GC -XX:-UseConcMarkSweepGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dump-driver.hprof -Dlog4j.configuration=log4j-driver.properties -Dvm.logging.level=ERROR -Dvm.logging.name=UsageFact -Duser.timezone=UTC" \
  --conf spark.executor.extraJavaOptions="-XX:-PrintGCDetails -XX:-PrintGCDateStamps -XX:-UseParallelGC -XX:+UseG1GC -XX:-UseConcMarkSweepGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dump-executor.hprof -Dlog4j.configuration=log4j-executor.properties -Dvm.logging.level=ERROR -Dvm.logging.name=UsageFact -Duser.timezone=UTC" \
  --conf spark.sql.session.timeZone=UTC \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
  --conf spark.databricks.delta.retentionDurationCheck.enabled=false \
  --conf spark.databricks.delta.vacuum.parallelDelete.enabled=true \
  --conf spark.sql.shuffle.partitions=16 \
  --name "UsageFactProcessor" \
  application.jar

Code

    val source = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", broker)
      .option("subscribe", topic)
      .option("startingOffsets", "latest")
      .option("failOnDataLoss", value = false)
      .option("fetchOffset.numRetries", 10)
      .option("fetchOffset.retryIntervalMs", 1000)
      .option("maxOffsetsPerTrigger", 50000L)
      .option("kafkaConsumer.pollTimeoutMs", 300000L)
      .load()

    val transformed = source
      .transform(applySchema)

    val query = transformed
      .coalesce(16)
      .writeStream
      .trigger(Trigger.ProcessingTime("1 minute"))
      .outputMode(OutputMode.Append)
      .format("delta")
      .partitionBy("organization_id", "date")
      .option("path", table)
      .option("checkpointLocation", checkpoint)
      .option("mergeSchema", "true")
      .start()

    spark.catalog.clearCache()
    query.awaitTermination()

Versions

Spark: 3.0.1

Delta: 0.8.0

Question

What do you think may be causing this problem?

ANSWER

Answered 2021-Jun-08 at 11:11

Just upgraded the version to Delta.io 1.0.0 and it stopped happening.

Source https://stackoverflow.com/questions/67519651

QUESTION

Webapp fails with "JBAS011232: Only one JAX-RS Application Class allowed" after adding a maven dependency to hadoop-azure

Asked 2021-Jun-03 at 20:31

I have a webapp that runs fine in JBoss EAP 6.4. I want to add some functionality to my webapp so that it can process Parquet files that reside in Azure Blob storage. I added a single dependency to my pom.xml:

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-azure</artifactId>
        <version>3.1.0</version>
    </dependency>

If I now try to start my webapp, it fails at start up:

09:29:43,703 ERROR [org.jboss.msc.service.fail] (MSC service thread 1-10) MSC000001: Failed to start service jboss.deployment.unit."myApp-0.0.1-SNAPSHOT.war".POST_MODULE: org.jboss.msc.service.StartException in service jboss.deployment.unit."myApp-0.0.1-SNAPSHOT.war".POST_MODULE: JBAS018733: Failed to process phase POST_MODULE of deployment "myApp-0.0.1-SNAPSHOT.war" at org.jboss.as.server.deployment.DeploymentUnitPhaseService.start(DeploymentUnitPhaseService.java:166) [jboss-as-server-7.5.0.Final-redhat-21.jar:7.5.0.Final-redhat-21] at org.jboss.msc.service.ServiceControllerImpl$StartTask.startService(ServiceControllerImpl.java:1980) [jboss-msc-1.1.5.Final-redhat-1.jar:1.1.5.Final-redhat-1] ... Caused by: org.jboss.as.server.deployment.DeploymentUnitProcessingException: JBAS011232: Only one JAX-RS Application Class allowed. com.sun.jersey.api.core.ResourceConfig com.sun.jersey.api.core.DefaultResourceConfig com.sun.jersey.api.core.PackagesResourceConfig com.mycompany.myapp.rest.RestApplication com.sun.jersey.api.core.ClassNamesResourceConfig com.sun.jersey.api.core.ScanningResourceConfig com.sun.jersey.api.core.servlet.WebAppResourceConfig com.sun.jersey.api.core.ApplicationAdapter com.sun.jersey.server.impl.application.DeferredResourceConfig com.sun.jersey.api.core.ClasspathResourceConfig at org.jboss.as.jaxrs.deployment.JaxrsScanningProcessor.scan(JaxrsScanningProcessor.java:206) at org.jboss.as.jaxrs.deployment.JaxrsScanningProcessor.deploy(JaxrsScanningProcessor.java:104) at org.jboss.as.server.deployment.DeploymentUnitPhaseService.start(DeploymentUnitPhaseService.java:159) [jboss-as-server-7.5.0.Final-redhat-21.jar:7.5.0.Final-redhat-21] ... 5 more

09:29:43,709 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) JBAS014612: Operation ("deploy") failed - address: ([("deployment" => "myApp-0.0.1-SNAPSHOT.war")]) - failure description: {"JBAS014671: Failed services" => {"jboss.deployment.unit."myApp-0.0.1-SNAPSHOT.war".POST_MODULE" => "org.jboss.msc.service.StartException in service jboss.deployment.unit."myApp-0.0.1-SNAPSHOT.war".POST_MODULE: JBAS018733: Failed to process phase POST_MODULE of deployment "myApp-0.0.1-SNAPSHOT.war" Caused by: org.jboss.as.server.deployment.DeploymentUnitProcessingException: JBAS011232: Only one JAX-RS Application Class allowed. com.sun.jersey.api.core.ResourceConfig com.sun.jersey.api.core.DefaultResourceConfig com.sun.jersey.api.core.PackagesResourceConfig com.mycompany.myapp.rest.RestApplication com.sun.jersey.api.core.ClassNamesResourceConfig com.sun.jersey.api.core.ScanningResourceConfig com.sun.jersey.api.core.servlet.WebAppResourceConfig com.sun.jersey.api.core.ApplicationAdapter com.sun.jersey.server.impl.application.DeferredResourceConfig com.sun.jersey.api.core.ClasspathResourceConfig"}}

The message "JBAS011232: Only one JAX-RS Application Class allowed" seems to be caused by my webapp trying to use both RestEasy and Jersey. JBoss uses RestEasy by default. Apparently, hadoop-azure must have a Jersey application class. How can I eliminate this problem by indicating that I don't want to use the Jersey-based application class?

ANSWER

Answered 2021-Jun-03 at 20:31

hadoop-azure pulls in hadoop-common, which pulls in Jersey. In the version of hadoop-azure you're using, hadoop-common is in compile scope. In newer versions, it is in provided scope. So you can just upgrade the hadoop-azure dependency to the latest one. If you need hadoop-common to compile, then you can redeclare hadoop-common and put it in provided scope.

Source https://stackoverflow.com/questions/67807156

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

VULNERABILITIES

In Apache Hadoop 3.2.0 to 3.2.1, 3.0.0-alpha1 to 3.1.3, and 2.0.0-alpha to 2.10.0, WebHDFS client might send SPNEGO authorization header to remote URL without proper verification.
Web endpoint authentication check is broken in Apache Hadoop 3.0.0-alpha4, 3.0.0-beta1, and 3.0.0. Authenticated users may impersonate any user even if no proxy user is configured.
In Apache Hadoop versions 3.0.0-alpha2 to 3.0.0, 2.9.0 to 2.9.2, 2.8.0 to 2.8.5, any users can access some servlets without authentication when Kerberos authentication is enabled and SPNEGO through HTTP is not enabled.
In Apache Hadoop versions 3.0.0-alpha1 to 3.1.0, 2.9.0 to 2.9.1, and 2.2.0 to 2.8.4, a user who can escalate to yarn user can possibly run arbitrary commands as root user.
Apache Hadoop 3.1.0, 3.0.0-alpha to 3.0.2, 2.9.0 to 2.9.1, 2.8.0 to 2.8.4, 2.0.0-alpha to 2.7.6, 0.23.0 to 0.23.11 is exploitable via the zip slip vulnerability in places that accept a zip file.
In Apache Hadoop versions 2.6.1 to 2.6.5, 2.7.0 to 2.7.3, and 3.0.0-alpha1, if a file in an encryption zone with access permissions that make it world readable is localized via YARN's localization mechanism, that file will be stored in a world-readable location and can be shared freely with any application that requests to localize that file.
In Apache Hadoop 3.1.0 to 3.1.1, 3.0.0-alpha1 to 3.0.3, 2.9.0 to 2.9.1, and 2.0.0-alpha to 2.8.4, the user/group information can be corrupted across storing in fsimage and reading back from fsimage.
HDFS clients interact with a servlet on the DataNode to browse the HDFS namespace. The NameNode is provided as a query parameter that is not validated in Apache Hadoop before 2.7.0.
The HDFS web UI in Apache Hadoop before 2.7.0 is vulnerable to a cross-site scripting (XSS) attack through an unescaped query parameter.

INSTALL hadoop

You can use hadoop like any standard Java library. Please include the jar files in your classpath. You can also use any IDE, and you can run and debug the hadoop component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.
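
For illustration, a minimal sketch of calling the Hadoop FileSystem API once the jars are on the classpath; the namenode URI and directory below are placeholders, not values taken from this page:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode URI; point this at your cluster (or use "file:///" to test locally).
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf)) {
            // List the entries under a placeholder directory and print their paths and sizes.
            for (FileStatus status : fs.listStatus(new Path("/data"))) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }
        }
    }
}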

SUPPORT

For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask questions on the community page at Stack Overflow.

Implement hadoop faster with kandi.

  • Use the support, quality, security, license, reuse scores and reviewed functions to confirm the fit for your project.
  • Use the Q&A, Installation, and Support guides to implement faster.

Discover millions of libraries and pre-built use cases on kandi.