spark-hbase | Integration utilities for using Spark with Apache HBase data
kandi X-RAY | spark-hbase Summary
Integration utilities for using Spark with Apache HBase data
Community Discussions
Trending Discussions on spark-hbase
QUESTION
I'm trying to test the Spark-HBase connector in the GCP context. I tried to follow [1], which asks to locally package the connector [2] using Maven (I tried Maven 3.6.3) for Spark 2.4, and this leads to the following issue.
Error "branch-2.4":
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project shc-core: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.: NullPointerException -> [Help 1]
References
[1] https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc
[2] https://github.com/hortonworks-spark/shc/tree/branch-2.4
...
ANSWER
Answered 2020-Dec-27 at 13:58
As suggested in the comments (thanks @Ismail!), using Java 8 works to build the connector:
sdk use java 8.0.275-zulu
mvn clean package -DskipTests
One can then import the jar in Dependencies.scala of the GCP template as follows.
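The original snippet is not shown in this extract; as a hedged sketch (the object name, artifact coordinates, version, and jar path below are assumptions, not the template's actual contents), referencing the locally built jar from an sbt Dependencies.scala could look like this:

// Dependencies.scala -- hypothetical sketch; adjust the coordinates and the path to the jar
// produced by `mvn clean package` in the shc checkout.
import sbt._

object Dependencies {
  // Point sbt at the locally built shc-core jar instead of a remote repository.
  lazy val shcCore: ModuleID =
    "com.hortonworks" % "shc-core" % "1.1.3-2.4-s_2.11" from
      "file:///path/to/shc/core/target/shc-core-1.1.3-2.4-s_2.11.jar"
}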
QUESTION
I'm trying to run the Dataproc Bigtable Spark-HBase Connector Example, and I get the following error when submitting the job.
Any idea?
Thanks for your support
Command
(base) gcloud dataproc jobs submit spark --cluster $SPARK_CLUSTER --class com.example.bigtable.spark.shc.BigtableSource --jars target/scala-2.11/cloud-bigtable-dataproc-spark-shc-assembly-0.1.jar --region us-east1 -- $BIGTABLE_TABLE
Error
Job [d3b9107ae5e2462fa71689cb0f5909bd] submitted. Waiting for job output... 20/12/27 12:50:10 INFO org.spark_project.jetty.util.log: Logging initialized @2475ms 20/12/27 12:50:10 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown 20/12/27 12:50:10 INFO org.spark_project.jetty.server.Server: Started @2576ms 20/12/27 12:50:10 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@3e6cb045{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} 20/12/27 12:50:10 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration. 20/12/27 12:50:11 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at spark-cluster-m/10.142.0.10:8032 20/12/27 12:50:11 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at spark-cluster-m/10.142.0.10:10200 20/12/27 12:50:13 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1609071162129_0002 Exception in thread "main" java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:262) at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.(HBaseRelation.scala:84) at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:61) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267) at com.example.bigtable.spark.shc.BigtableSource$.delayedEndpoint$com$example$bigtable$spark$shc$BigtableSource$1(BigtableSource.scala:56) at com.example.bigtable.spark.shc.BigtableSource$delayedInit$body.apply(BigtableSource.scala:19) at scala.Function0$class.apply$mcV$sp(Function0.scala:34) at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at scala.App$$anonfun$main$1.apply(App.scala:76) at scala.App$$anonfun$main$1.apply(App.scala:76) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35) at scala.App$class.main(App.scala:76) at com.example.bigtable.spark.shc.BigtableSource$.main(BigtableSource.scala:19) at com.example.bigtable.spark.shc.BigtableSource.main(BigtableSource.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 20/12/27 12:50:20 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@3e6cb045{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
ANSWER
Answered 2020-Dec-27 at 13:47
Consider reading these related SO questions: 1 and 2.
Under the hood, the tutorial you followed, as well as one of the questions indicated, uses the Apache Spark - Apache HBase Connector provided by Hortonworks.
The problem seems to be related to an incompatibility with the version of the json4s library: in both cases, it seems that using version 3.2.10 or 3.2.11 in the build process will solve the issue.
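As an illustrative sketch (the build tool, setting placement, and exact module list are assumptions, not the tutorial's actual build files), pinning json4s to a Spark-compatible version in an sbt build could look like this:

// build.sbt -- hypothetical sketch: force the transitive json4s version to 3.2.11 so the
// assembled jar matches the json4s shipped with the Spark runtime on the cluster.
dependencyOverrides ++= Seq(
  "org.json4s" %% "json4s-jackson" % "3.2.11",
  "org.json4s" %% "json4s-core"    % "3.2.11",
  "org.json4s" %% "json4s-ast"     % "3.2.11"
)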
QUESTION
Here is the simple code to write a Spark DataFrame to HBase using the Spark-HBase connector. I am hit with a NullPointerException on the df.write operation, which stops the DataFrame from being written to HBase. However, I am able to read from HBase using the Spark-HBase connector. This issue has been discussed in the following links, but the solutions suggested did not help.
...
ANSWER
Answered 2020-May-19 at 07:24
As discussed over here, I made additional configuration changes to the SparkSession builder and the exception is gone. However, I am not clear on the cause or the fix. Hope someone can explain.
QUESTION
I am trying to read HBase data with the Spark API.
The code:
...
ANSWER
Answered 2017-Feb-03 at 09:25
You have to use an InputFormat for newAPIHadoopRDD.
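A minimal sketch of that approach, assuming TableInputFormat as the InputFormat (the ZooKeeper quorum and table name below are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-read"))

    // Point the HBase configuration at the cluster and the table to scan.
    val hConf = HBaseConfiguration.create()
    hConf.set("hbase.zookeeper.quorum", "zk-host:2181")  // placeholder quorum
    hConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table name

    // newAPIHadoopRDD needs the InputFormat class plus the key/value classes it emits.
    val rdd = sc.newAPIHadoopRDD(
      hConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"Row count: ${rdd.count()}")
    sc.stop()
  }
}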
QUESTION
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

def bulkWriteToHBase(sparkSession: SparkSession, sparkContext: SparkContext, jobContext: Map[String, String], sinkTableName: String, outRDD: RDD[(ImmutableBytesWritable, Put)]): Unit = {
  // Build an HBase configuration from the job context (ZooKeeper quorum and znode parent).
  val hConf = HBaseConfiguration.create()
  hConf.set("hbase.zookeeper.quorum", jobContext("hbase.zookeeper.quorum"))
  hConf.set("zookeeper.znode.parent", jobContext("zookeeper.znode.parent"))
  hConf.set(TableInputFormat.INPUT_TABLE, sinkTableName)

  // Configure a Hadoop job whose output format writes the Puts to the sink table.
  val hJob = Job.getInstance(hConf)
  hJob.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, sinkTableName)
  hJob.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

  // Write the (rowkey, Put) pairs through the new Hadoop output API.
  outRDD.saveAsNewAPIHadoopDataset(hJob.getConfiguration())
}
...
ANSWER
Answered 2017-Feb-03 at 20:32
Though you have not provided example data or enough explanation, this is most likely not due to your code or configuration. It is happening because of non-optimal rowkey design: the keys (HBase rowkeys) of the data you are writing are improperly structured (perhaps monotonically increasing), so all writes go to a single region. You can prevent that in various ways (recommended rowkey design practices such as salting, inverting, and other techniques); see the sketch below. For reference, see http://hbase.apache.org/book.html#rowkey.design
In case you are wondering whether the write is done in parallel across all regions or one by one (it is not clear from the question), look at this: http://hbase.apache.org/book.html#_bulk_load.
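As a minimal, hypothetical sketch of the salting technique (the bucket count, column family, and key format are assumptions), prefixing each rowkey with a deterministic bucket id spreads consecutive keys across regions:

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

// Number of salt buckets; ideally this matches the number of pre-split regions (assumption).
val saltBuckets = 16

// Prefix the original key with a bucket id derived from its hash, so monotonically
// increasing keys no longer all land in the same region.
def saltedKey(originalKey: String): Array[Byte] = {
  val bucket = math.abs(originalKey.hashCode % saltBuckets)
  Bytes.toBytes(f"$bucket%02d-$originalKey")
}

def toPut(originalKey: String, value: String): (ImmutableBytesWritable, Put) = {
  val rowKey = saltedKey(originalKey)
  val put = new Put(rowKey)
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value)) // placeholder family/qualifier
  (new ImmutableBytesWritable(rowKey), put)
}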
QUESTION
In Spark, a pushdown query is processed by the SQL engine of the database, and the DataFrame is constructed from its result; so Spark is querying the results of that query.
...
ANSWER
Answered 2018-Aug-23 at 02:47
To achieve the best performance, I would recommend starting your Spark job with --num-executors 4 and --executor-cores 1, as the JDBC connection is single-threaded and one task runs on one core per query. With this configuration change, you can observe the tasks running in parallel while the job is running, i.e. the core in each executor is in use.
Use the function below instead:
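The referenced function is truncated in this extract; as a hedged stand-in (the table name, partition column, and bound values are placeholders, not the original author's code), a partitioned JDBC read that lets several tasks query the database in parallel could look like this:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Read the pushdown query with multiple JDBC partitions so one task per partition
// can run concurrently (pairs with --num-executors 4 --executor-cores 1 above).
def parallelJdbcRead(spark: SparkSession, url: String, query: String): DataFrame = {
  spark.read
    .format("jdbc")
    .option("url", url)
    .option("dbtable", s"($query) AS pushdown")  // the pushdown query, wrapped as a subquery
    .option("partitionColumn", "id")             // placeholder numeric column
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "4")
    .load()
}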
QUESTION
I am trying to use the Spark-HBase Connector to get data from HBase.
...
ANSWER
Answered 2018-Jun-08 at 11:39
The following is clearly mentioned on the SHC site:
Users can use the Spark-on-HBase connector as a standard Spark package. To include the package in your Spark application use:
Note: com.hortonworks:shc-core:1.1.1-2.1-s_2.11 has not been uploaded to spark-packages.org, but will be there soon.
spark-shell, pyspark, or spark-submit:
$SPARK_HOME/bin/spark-shell --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11
Users can include the package as the dependency in your SBT file as well. The format is the spark-package-name:version in the build.sbt file.
libraryDependencies += "com.hortonworks/shc-core:1.1.1-2.1-s_2.11"
So if you are using Maven, you will have to download the jar and include it manually in your project for testing purposes.
Or you can try an SHC build that has been uploaded to Maven.
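As a hedged sketch (the repository URL and artifact version are assumptions and may no longer be served), resolving shc-core from the Hortonworks repository in sbt could look like this:

// build.sbt -- hypothetical sketch for pulling shc-core, since it is not on spark-packages.org.
resolvers += "Hortonworks Repository" at "https://repo.hortonworks.com/content/groups/public/"

libraryDependencies += "com.hortonworks" % "shc-core" % "1.1.1-2.1-s_2.11"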
QUESTION
I have written a program which accesses HBase using Spark 1.6 with the spark-hbase-connector (sbt dependency: "it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3"). But it doesn't work when using Spark 2.*. I've searched around this question and reached some conclusions:
There are several connectors used to connect to HBase from Spark:
hbase-spark. hbase-spark is provided by the official HBase project, but I found it is developed against Scala 2.10 and Spark 1.6. The properties in the pom.xml of the project are as below:
...
ANSWER
Answered 2018-May-31 at 11:27
I chose to use newAPIHadoopRDD to access HBase from Spark.
QUESTION
I am using Cloudera's HBase-Spark connector to do intensive HBase or Bigtable scans. It works OK, but looking at Spark's detailed logs, it looks like the code tries to re-establish a connection to HBase with every call to process the results of a Scan(), which I do via JavaHBaseContext.foreachPartition().
Am I right to think that this code re-establishes a connection to HBase every time? If so, how can I rewrite it to make sure I reuse the already established connection?
Here's the full sample code that produces this behavior:
...
ANSWER
Answered 2018-Mar-27 at 13:14
This is a common problem. The cost of creating a connection can dwarf the actual work you're doing.
In Cloud Bigtable, you can set google.bigtable.use.cached.data.channel.pool to true in your configuration settings. That would significantly improve performance. Cloud Bigtable ultimately uses a single HTTP/2 endpoint for all of your Cloud Bigtable instances.
I don't know of a similar construct in HBase, but one way to do this would be to create an implementation of Connection that creates a single cached Connection under the covers. You would have to set hbase.client.connection.impl to your new class.
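A minimal sketch of applying both suggestions through the HBase configuration (the custom connection class name is a placeholder, not an existing class):

import org.apache.hadoop.hbase.HBaseConfiguration

val hConf = HBaseConfiguration.create()

// Cloud Bigtable: reuse the cached data channel pool instead of reconnecting per partition.
hConf.set("google.bigtable.use.cached.data.channel.pool", "true")

// Plain HBase alternative (per the answer above): plug in a Connection implementation that
// caches a single underlying connection. "com.example.CachedConnection" is a placeholder.
hConf.set("hbase.client.connection.impl", "com.example.CachedConnection")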
QUESTION
sc.newAPIHadoopRDD is continuously giving me the error.
ANSWER
Answered 2018-Mar-21 at 09:26
I figured out my problem after searching and exploring other jars.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported