spark-hbase | Integration utilities for using Spark with Apache HBase data
kandi X-RAY | spark-hbase Summary
Integration utilities for using Spark with Apache HBase data
Community Discussions
Trending Discussions on spark-hbase
QUESTION
I'm trying to test the Spark-HBase connector in the GCP context. I tried to follow [1], which asks to locally package the connector [2] using Maven (I tried Maven 3.6.3) for Spark 2.4, and this leads to the following issue.
Error "branch-2.4":
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project shc-core: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.: NullPointerException -> [Help 1]
References
[1] https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc
[2] https://github.com/hortonworks-spark/shc/tree/branch-2.4
...
ANSWER
Answered 2020-Dec-27 at 13:58
As suggested in the comments (thanks @Ismail!), using Java 8 works to build the connector:
sdk use java 8.0.275-zulu
mvn clean package -DskipTests
One can then import the jar in Dependencies.scala of the GCP template as follows.
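The original snippet is not shown in this extract; as a hedged sketch (the object name, artifact coordinates, version, and jar path below are assumptions, not the template's actual contents), referencing the locally built jar from an sbt Dependencies.scala could look like this:

// Dependencies.scala -- hypothetical sketch; adjust the coordinates and the path to the jar
// produced by `mvn clean package` in the shc checkout.
import sbt._

object Dependencies {
  // Point sbt at the locally built shc-core jar instead of a remote repository.
  lazy val shcCore: ModuleID =
    "com.hortonworks" % "shc-core" % "1.1.3-2.4-s_2.11" from
      "file:///path/to/shc/core/target/shc-core-1.1.3-2.4-s_2.11.jar"
}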
QUESTION
I'm trying to run the Dataproc Bigtable Spark-HBase Connector Example, and I get the following error when submitting the job.
Any idea?
Thanks for your support
Command
(base) gcloud dataproc jobs submit spark --cluster $SPARK_CLUSTER --class com.example.bigtable.spark.shc.BigtableSource --jars target/scala-2.11/cloud-bigtable-dataproc-spark-shc-assembly-0.1.jar --region us-east1 -- $BIGTABLE_TABLE
Error
Job [d3b9107ae5e2462fa71689cb0f5909bd] submitted. Waiting for job output... 20/12/27 12:50:10 INFO org.spark_project.jetty.util.log: Logging initialized @2475ms 20/12/27 12:50:10 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown 20/12/27 12:50:10 INFO org.spark_project.jetty.server.Server: Started @2576ms 20/12/27 12:50:10 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@3e6cb045{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} 20/12/27 12:50:10 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration. 20/12/27 12:50:11 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at spark-cluster-m/10.142.0.10:8032 20/12/27 12:50:11 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at spark-cluster-m/10.142.0.10:10200 20/12/27 12:50:13 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1609071162129_0002 Exception in thread "main" java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:262) at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.(HBaseRelation.scala:84) at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:61) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267) at com.example.bigtable.spark.shc.BigtableSource$.delayedEndpoint$com$example$bigtable$spark$shc$BigtableSource$1(BigtableSource.scala:56) at com.example.bigtable.spark.shc.BigtableSource$delayedInit$body.apply(BigtableSource.scala:19) at scala.Function0$class.apply$mcV$sp(Function0.scala:34) at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at scala.App$$anonfun$main$1.apply(App.scala:76) at scala.App$$anonfun$main$1.apply(App.scala:76) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35) at scala.App$class.main(App.scala:76) at com.example.bigtable.spark.shc.BigtableSource$.main(BigtableSource.scala:19) at com.example.bigtable.spark.shc.BigtableSource.main(BigtableSource.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 20/12/27 12:50:20 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@3e6cb045{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
ANSWER
Answered 2020-Dec-27 at 13:47
Consider reading these related SO questions: 1 and 2.
Under the hood, the tutorial you followed, as well as one of the questions indicated, uses the Apache Spark - Apache HBase Connector provided by Hortonworks.
The problem seems to be related to an incompatibility with the version of the json4s library: in both cases, it seems that using version 3.2.10 or 3.2.11 in the build process will solve the issue.
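As an illustrative sketch (the build tool, setting placement, and exact module list are assumptions, not the tutorial's actual build files), pinning json4s to a Spark-compatible version in an sbt build could look like this:

// build.sbt -- hypothetical sketch: force the transitive json4s version to 3.2.11 so the
// assembled jar matches the json4s shipped with the Spark runtime on the cluster.
dependencyOverrides ++= Seq(
  "org.json4s" %% "json4s-jackson" % "3.2.11",
  "org.json4s" %% "json4s-core"    % "3.2.11",
  "org.json4s" %% "json4s-ast"     % "3.2.11"
)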
QUESTION
Here is the simple code to write a Spark DataFrame to HBase using the Spark-HBase connector. I am hit with a NullPointerException on the df.write operation, which stops the DataFrame from being written to HBase. However, I am able to read from HBase using the Spark-HBase connector. This issue has been discussed in the following links, but the solutions suggested did not help.
...
ANSWER
Answered 2020-May-19 at 07:24
As discussed over here, I made additional configuration changes to the SparkSession builder and the exception is gone. However, I am not clear on the cause or the fix. Hope someone can explain.
QUESTION
I am trying to read HBase data with the Spark API.
The code:
...
ANSWER
Answered 2017-Feb-03 at 09:25
You have to use an InputFormat for newAPIHadoopRDD.
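A minimal sketch of that approach, assuming TableInputFormat as the InputFormat (the ZooKeeper quorum and table name below are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-read"))

    // Point the HBase configuration at the cluster and the table to scan.
    val hConf = HBaseConfiguration.create()
    hConf.set("hbase.zookeeper.quorum", "zk-host:2181")  // placeholder quorum
    hConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table name

    // newAPIHadoopRDD needs the InputFormat class plus the key/value classes it emits.
    val rdd = sc.newAPIHadoopRDD(
      hConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"Row count: ${rdd.count()}")
    sc.stop()
  }
}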
QUESTION
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

def bulkWriteToHBase(sparkSession: SparkSession, sparkContext: SparkContext, jobContext: Map[String, String], sinkTableName: String, outRDD: RDD[(ImmutableBytesWritable, Put)]): Unit = {
  // Build an HBase configuration from the job context (ZooKeeper quorum and znode parent).
  val hConf = HBaseConfiguration.create()
  hConf.set("hbase.zookeeper.quorum", jobContext("hbase.zookeeper.quorum"))
  hConf.set("zookeeper.znode.parent", jobContext("zookeeper.znode.parent"))
  hConf.set(TableInputFormat.INPUT_TABLE, sinkTableName)

  // Configure a Hadoop job whose output format writes the Puts to the sink table.
  val hJob = Job.getInstance(hConf)
  hJob.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, sinkTableName)
  hJob.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

  // Write the (rowkey, Put) pairs through the new Hadoop output API.
  outRDD.saveAsNewAPIHadoopDataset(hJob.getConfiguration())
}
...
ANSWER
Answered 2017-Feb-03 at 20:32
Though you have not provided example data or enough explanation, this is most likely not due to your code or configuration. It is happening because of non-optimal rowkey design: the keys (HBase rowkeys) of the data you are writing are improperly structured (perhaps monotonically increasing), so all writes go to a single region. You can prevent that in various ways (recommended rowkey design practices such as salting, inverting, and other techniques); see the sketch below. For reference, see http://hbase.apache.org/book.html#rowkey.design
In case you are wondering whether the write is done in parallel across all regions or one by one (it is not clear from the question), look at this: http://hbase.apache.org/book.html#_bulk_load.
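As a minimal, hypothetical sketch of the salting technique (the bucket count, column family, and key format are assumptions), prefixing each rowkey with a deterministic bucket id spreads consecutive keys across regions:

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

// Number of salt buckets; ideally this matches the number of pre-split regions (assumption).
val saltBuckets = 16

// Prefix the original key with a bucket id derived from its hash, so monotonically
// increasing keys no longer all land in the same region.
def saltedKey(originalKey: String): Array[Byte] = {
  val bucket = math.abs(originalKey.hashCode % saltBuckets)
  Bytes.toBytes(f"$bucket%02d-$originalKey")
}

def toPut(originalKey: String, value: String): (ImmutableBytesWritable, Put) = {
  val rowKey = saltedKey(originalKey)
  val put = new Put(rowKey)
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value)) // placeholder family/qualifier
  (new ImmutableBytesWritable(rowKey), put)
}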
QUESTION
In Spark, a pushdown query is processed by the SQL engine of the database, and the DataFrame is constructed from its result; so Spark is querying the results of that query.
...
ANSWER
Answered 2018-Aug-23 at 02:47
To achieve the best performance, I would recommend starting your Spark job with --num-executors 4 and --executor-cores 1, as the JDBC connection is single-threaded and one task runs on one core per query. With this configuration change, you can observe the tasks running in parallel while the job is running, i.e. the core in each executor is in use.
Use the function below instead:
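The referenced function is truncated in this extract; as a hedged stand-in (the table name, partition column, and bound values are placeholders, not the original author's code), a partitioned JDBC read that lets several tasks query the database in parallel could look like this:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Read the pushdown query with multiple JDBC partitions so one task per partition
// can run concurrently (pairs with --num-executors 4 --executor-cores 1 above).
def parallelJdbcRead(spark: SparkSession, url: String, query: String): DataFrame = {
  spark.read
    .format("jdbc")
    .option("url", url)
    .option("dbtable", s"($query) AS pushdown")  // the pushdown query, wrapped as a subquery
    .option("partitionColumn", "id")             // placeholder numeric column
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "4")
    .load()
}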
QUESTION
I am trying to use the Spark-HBase Connector to get data from HBase.
...
ANSWER
Answered 2018-Jun-08 at 11:39
The following is clearly mentioned on the SHC site:
Users can use the Spark-on-HBase connector as a standard Spark package. To include the package in your Spark application use:
Note: com.hortonworks:shc-core:1.1.1-2.1-s_2.11 has not been uploaded to spark-packages.org, but will be there soon.
spark-shell, pyspark, or spark-submit:
$SPARK_HOME/bin/spark-shell --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11
Users can include the package as the dependency in your SBT file as well. The format is the spark-package-name:version in the build.sbt file.
libraryDependencies += "com.hortonworks/shc-core:1.1.1-2.1-s_2.11"
So if you are using Maven, you will have to download the jar and include it manually in your project for testing purposes.
Or you can try an SHC build that has been uploaded to Maven.
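As a hedged sketch (the repository URL and artifact version are assumptions and may no longer be served), resolving shc-core from the Hortonworks repository in sbt could look like this:

// build.sbt -- hypothetical sketch for pulling shc-core, since it is not on spark-packages.org.
resolvers += "Hortonworks Repository" at "https://repo.hortonworks.com/content/groups/public/"

libraryDependencies += "com.hortonworks" % "shc-core" % "1.1.1-2.1-s_2.11"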
QUESTION
I have written a program which accesses HBase using Spark 1.6 with the spark-hbase-connector (sbt dependency: "it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3"). But it doesn't work when using Spark 2.*. I've searched around this question and reached some conclusions:
There are several connectors used to connect to HBase from Spark:
hbase-spark. hbase-spark is provided by the official HBase project, but I found it is developed against Scala 2.10 and Spark 1.6. The properties in the pom.xml of the project are as below:
...
ANSWER
Answered 2018-May-31 at 11:27
I chose to use newAPIHadoopRDD to access HBase from Spark.
QUESTION
I am using Cloudera's HBase-Spark connector to do intensive HBase or Bigtable scans. It works OK, but looking at Spark's detailed logs, it looks like the code tries to re-establish a connection to HBase with every call to process the results of a Scan(), which I do via JavaHBaseContext.foreachPartition().
Am I right to think that this code re-establishes a connection to HBase every time? If so, how can I rewrite it to make sure I reuse the already established connection?
Here's the full sample code that produces this behavior:
...
ANSWER
Answered 2018-Mar-27 at 13:14
This is a common problem. The cost of creating a connection can dwarf the actual work you're doing.
In Cloud Bigtable, you can set google.bigtable.use.cached.data.channel.pool to true in your configuration settings. That would significantly improve performance. Cloud Bigtable ultimately uses a single HTTP/2 endpoint for all of your Cloud Bigtable instances.
I don't know of a similar construct in HBase, but one way to do this would be to create an implementation of Connection that creates a single cached Connection under the covers. You would have to set hbase.client.connection.impl to your new class.
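A minimal sketch of applying both suggestions through the HBase configuration (the custom connection class name is a placeholder, not an existing class):

import org.apache.hadoop.hbase.HBaseConfiguration

val hConf = HBaseConfiguration.create()

// Cloud Bigtable: reuse the cached data channel pool instead of reconnecting per partition.
hConf.set("google.bigtable.use.cached.data.channel.pool", "true")

// Plain HBase alternative (per the answer above): plug in a Connection implementation that
// caches a single underlying connection. "com.example.CachedConnection" is a placeholder.
hConf.set("hbase.client.connection.impl", "com.example.CachedConnection")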
QUESTION
sc.newAPIHadoopRDD is continuously giving me the error.
ANSWER
Answered 2018-Mar-21 at 09:26
I figured out my problem after searching and exploring other jars.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported