spark-hbase | Integration utilities for using Spark with Apache HBase data

 by haosdent | Scala | Version: Current | License: Apache-2.0

kandi X-RAY | spark-hbase Summary

spark-hbase is a Scala library typically used in Big Data, Kafka, and Spark applications. spark-hbase has no reported bugs or vulnerabilities, it has a Permissive License, and it has low support. You can download it from GitHub.

Integration utilities for using Spark with Apache HBase data

            kandi-support Support

              spark-hbase has a low active ecosystem.
              It has 6 star(s) with 5 fork(s). There are 2 watchers for this library.
              It had no major release in the last 6 months.
              There are 3 open issues and 3 have been closed. On average, issues are closed in 1 day. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of spark-hbase is current.

            kandi-Quality Quality

              spark-hbase has no bugs reported.

            kandi-Security Security

              spark-hbase has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            kandi-License License

              spark-hbase is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              spark-hbase releases are not available. You will need to build from source code and install.


            spark-hbase Key Features

            No Key Features are available at this moment for spark-hbase.

            spark-hbase Examples and Code Snippets

            No Code Snippets are available at this moment for spark-hbase.

            Community Discussions

            QUESTION

            Spark-HBase - GCP template - How to locally package the connector?
            Asked 2020-Dec-27 at 13:58

            I'm trying to test the Spark-HBase connector in the GCP context and tried to follow [1], which asks you to locally package the connector [2] using Maven (I tried Maven 3.6.3) for Spark 2.4, and this leads to the following issue.

            Error "branch-2.4":

            [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project shc-core: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.: NullPointerException -> [Help 1]

            References

            [1] https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc

            [2] https://github.com/hortonworks-spark/shc/tree/branch-2.4

            ...

            ANSWER

            Answered 2020-Dec-27 at 13:58

            As suggested in the comments (thanks @Ismail !), using Java 8 works to build the connector:

            sdk use java 8.0.275-zulu

            mvn clean package -DskipTests

            One can then import the jar in Dependencies.scala of the GCP template as follows.
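            The answer's Dependencies.scala snippet is not included above. Purely as a hypothetical sketch (not the template's actual code), one way to make a locally built shc-core jar visible to an sbt build is to install it into the local Maven repository first (mvn clean install, an extra step assumed here) and then resolve it from there; the version string below is illustrative:

            // Hypothetical build.sbt / Dependencies.scala sketch, not the template's actual code.
            // Assumes the connector jar was installed to the local Maven repository via `mvn install`.
            resolvers += Resolver.mavenLocal
            libraryDependencies += "com.hortonworks" % "shc-core" % "1.1.3-2.4-s_2.11"  // illustrative version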

            Source https://stackoverflow.com/questions/65429730

            QUESTION

            Spark-HBase - GCP template - Parsing catalogue error?
            Asked 2020-Dec-27 at 13:47

            I'm trying to run the Dataproc Bigtable Spark-HBase Connector Example, and get the following error when submitting the job.

            Any idea?

            Thanks for your support.

            Command

            (base) gcloud dataproc jobs submit spark --cluster $SPARK_CLUSTER --class com.example.bigtable.spark.shc.BigtableSource --jars target/scala-2.11/cloud-bigtable-dataproc-spark-shc-assembly-0.1.jar --region us-east1 -- $BIGTABLE_TABLE

            Error

            Job [d3b9107ae5e2462fa71689cb0f5909bd] submitted. Waiting for job output... 20/12/27 12:50:10 INFO org.spark_project.jetty.util.log: Logging initialized @2475ms 20/12/27 12:50:10 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown 20/12/27 12:50:10 INFO org.spark_project.jetty.server.Server: Started @2576ms 20/12/27 12:50:10 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@3e6cb045{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} 20/12/27 12:50:10 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration. 20/12/27 12:50:11 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at spark-cluster-m/10.142.0.10:8032 20/12/27 12:50:11 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at spark-cluster-m/10.142.0.10:10200 20/12/27 12:50:13 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1609071162129_0002 Exception in thread "main" java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:262) at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.(HBaseRelation.scala:84) at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:61) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267) at com.example.bigtable.spark.shc.BigtableSource$.delayedEndpoint$com$example$bigtable$spark$shc$BigtableSource$1(BigtableSource.scala:56) at com.example.bigtable.spark.shc.BigtableSource$delayedInit$body.apply(BigtableSource.scala:19) at scala.Function0$class.apply$mcV$sp(Function0.scala:34) at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at scala.App$$anonfun$main$1.apply(App.scala:76) at scala.App$$anonfun$main$1.apply(App.scala:76) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35) at scala.App$class.main(App.scala:76) at com.example.bigtable.spark.shc.BigtableSource$.main(BigtableSource.scala:19) at com.example.bigtable.spark.shc.BigtableSource.main(BigtableSource.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 20/12/27 12:50:20 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@3e6cb045{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}

            ...

            ANSWER

            Answered 2020-Dec-27 at 13:47

            Consider reading these related SO questions: 1 and 2.

            Under the hood, the tutorial you followed, as well as one of the questions indicated, uses the Apache Spark - Apache HBase Connector provided by Hortonworks.

            The problem seems to be related to an incompatibility with the version of the json4s library: in both cases, using version 3.2.10 or 3.2.11 in the build process appears to solve the issue.
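            As a minimal sketch of that kind of fix in an sbt build (assuming sbt is used; the exact module and version to pin should match your Spark distribution):

            // Hypothetical build.sbt sketch: force a json4s version compatible with the cluster's Spark.
            dependencyOverrides += "org.json4s" %% "json4s-jackson" % "3.2.11"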

            Source https://stackoverflow.com/questions/65466253

            QUESTION

            Hortonworks Spark Hbase Connector(SHC) - Null pointer exception while writing Dataframe to Hbase
            Asked 2020-May-20 at 00:52

            Here is the simple code to write a Spark DataFrame to HBase using the Spark-HBase connector. I am hit with a NullPointerException on the df.write operation, which stops the DataFrame from being written to HBase. However, I am able to read from HBase using the Spark-HBase connector. This issue has been discussed in the following links, but the solutions suggested did not help.

            https://github.com/hortonworks-spark/shc/issues/278

            https://github.com/hortonworks-spark/shc/issues/46

            ...

            ANSWER

            Answered 2020-May-19 at 07:24

            As discussed over here, I made additional configuration changes to the SparkSession builder and the exception is gone. However, I am not clear on the cause or the fix. Hopefully someone can explain.

            Source https://stackoverflow.com/questions/61864349

            QUESTION

            How to use newAPIHadoopRDD (spark) in Java to read Hbase data
            Asked 2020-May-15 at 21:08

            I am trying to read HBase data with the Spark API.

            The code :

            ...

            ANSWER

            Answered 2017-Feb-03 at 09:25

            You have to use an InputFormat for newAPIHadoopRDD.
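            For illustration, a minimal sketch of passing TableInputFormat to newAPIHadoopRDD (shown in Scala rather than Java for consistency with the rest of this page; the application name and table name are placeholders):

            import org.apache.hadoop.hbase.HBaseConfiguration
            import org.apache.hadoop.hbase.client.Result
            import org.apache.hadoop.hbase.io.ImmutableBytesWritable
            import org.apache.hadoop.hbase.mapreduce.TableInputFormat
            import org.apache.spark.sql.SparkSession

            val spark = SparkSession.builder().appName("hbase-newAPIHadoopRDD").getOrCreate()
            val sc = spark.sparkContext

            // Point the InputFormat at the table to scan.
            val hConf = HBaseConfiguration.create()
            hConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table name

            // newAPIHadoopRDD takes the Hadoop configuration plus the InputFormat, key and value classes.
            val hbaseRDD = sc.newAPIHadoopRDD(
              hConf,
              classOf[TableInputFormat],
              classOf[ImmutableBytesWritable],
              classOf[Result])

            hbaseRDD is then an RDD[(ImmutableBytesWritable, Result)] that can be mapped into whatever row representation you need.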

            Source https://stackoverflow.com/questions/42019905

            QUESTION

            Spark write only to one hbase region server
            Asked 2020-May-15 at 21:05
            import org.apache.hadoop.hbase.HBaseConfiguration
            import org.apache.hadoop.hbase.client.Put
            import org.apache.hadoop.hbase.io.ImmutableBytesWritable
            import org.apache.hadoop.hbase.mapreduce.TableInputFormat
            import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
            import org.apache.hadoop.mapreduce.Job
            import org.apache.spark.SparkContext
            import org.apache.spark.rdd.RDD
            import org.apache.spark.sql.SparkSession

            def bulkWriteToHBase(sparkSession: SparkSession, sparkContext: SparkContext, jobContext: Map[String, String], sinkTableName: String, outRDD: RDD[(ImmutableBytesWritable, Put)]): Unit = {
              // Build the HBase configuration from the job context.
              val hConf = HBaseConfiguration.create()
              hConf.set("hbase.zookeeper.quorum", jobContext("hbase.zookeeper.quorum"))
              hConf.set("zookeeper.znode.parent", jobContext("zookeeper.znode.parent"))
              hConf.set(TableInputFormat.INPUT_TABLE, sinkTableName)

              // Configure the output format and target table, then write the RDD of Puts.
              val hJob = Job.getInstance(hConf)
              hJob.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, sinkTableName)
              hJob.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

              outRDD.saveAsNewAPIHadoopDataset(hJob.getConfiguration())
            }
            
            ...

            ANSWER

            Answered 2017-Feb-03 at 20:32

            Though you have not provided example data or enough explanation, this is most likely not due to your code or configuration. It is happening because of a non-optimal rowkey design. The data you are writing has improperly structured keys (HBase rowkeys), perhaps monotonically increasing or something similar, so all writes go to one of the regions. You can prevent that in various ways (recommended practices for rowkey design such as salting, inverting, and other techniques). For reference, see http://hbase.apache.org/book.html#rowkey.design

            In case you are wondering whether the write is done in parallel for all regions or one by one (not clear from the question), look at this: http://hbase.apache.org/book.html#_bulk_load.
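            As an illustration of the salting idea mentioned above (not code from the question or answer), a minimal Scala sketch; the bucket count, column family, and qualifier are illustrative:

            import org.apache.hadoop.hbase.client.Put
            import org.apache.hadoop.hbase.io.ImmutableBytesWritable
            import org.apache.hadoop.hbase.util.Bytes

            // Spread monotonically increasing keys across several regions by prefixing a salt bucket.
            val saltBuckets = 16  // illustrative bucket count

            def saltedKey(key: String): Array[Byte] = {
              val bucket = math.abs(key.hashCode) % saltBuckets
              Bytes.toBytes(s"$bucket-$key")
            }

            def toPut(key: String, value: String): (ImmutableBytesWritable, Put) = {
              val rowKey = saltedKey(key)
              val put = new Put(rowKey)
              put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
              (new ImmutableBytesWritable(rowKey), put)
            }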

            Source https://stackoverflow.com/questions/42030653

            QUESTION

            With Pushdown query in spark, how to get parallelism in spark-HBASE (BIGSQL as SQL engine)?
            Asked 2018-Aug-23 at 02:47

            In Spark, a pushdown query is processed by the SQL engine of the database, and the DataFrame is constructed from its result; so Spark ends up querying the results of that query.

            ...

            ANSWER

            Answered 2018-Aug-23 at 02:47

            To achieve the best performance, I would recommend starting your Spark job with --num-executors 4 and --executor-cores 1, since a JDBC connection is single-threaded and one task runs on one core per query. With this configuration change, you can observe while the job is running that the tasks run in parallel, i.e. the cores in each executor are in use.

            Use the below function instead:
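            (The function referred to above is not included in this excerpt. Purely as a generic illustration of a parallel JDBC read in Spark, and not the answer's code, here is a sketch with placeholder connection details, query, and bounds:)

            import org.apache.spark.sql.SparkSession

            // Generic sketch only: Spark's JDBC reader can split a pushdown query into
            // parallel tasks when given a numeric partition column and bounds.
            val spark = SparkSession.builder().appName("jdbc-parallel-read").getOrCreate()
            val df = spark.read
              .format("jdbc")
              .option("url", "jdbc:db2://bigsql-host:51000/BLUDB")          // placeholder URL
              .option("dbtable", "(SELECT id, value FROM my_table) AS t")   // placeholder pushdown query
              .option("partitionColumn", "id")
              .option("lowerBound", "1")
              .option("upperBound", "1000000")
              .option("numPartitions", "4")
              .load()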

            Source https://stackoverflow.com/questions/51977471

            QUESTION

            object hbase is not a member of package org.apache.spark.sql.execution.datasources
            Asked 2018-Jun-08 at 11:39

            I am trying to use the Spark-HBase Connector to get data from HBase.

            ...

            ANSWER

            Answered 2018-Jun-08 at 11:39

            The following is clearly mentioned on the shc site:

            Users can use the Spark-on-HBase connector as a standard Spark package. To include the package in your Spark application use:
            Note: com.hortonworks:shc-core:1.1.1-2.1-s_2.11 has not been uploaded to spark-packages.org, but will be there soon.
            spark-shell, pyspark, or spark-submit:
            $SPARK_HOME/bin/spark-shell --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11
            Users can include the package as a dependency in your SBT file as well. The format is spark-package-name:version in the build.sbt file.
            libraryDependencies += "com.hortonworks/shc-core:1.1.1-2.1-s_2.11"

            So you will have to download the jar and include it manually in your project for testing purposes if you are using Maven.

            Or you can try the shc artifacts that have been uploaded to a Maven repository.
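            As a hypothetical sketch of the standard sbt coordinates (resolution still requires a repository that actually hosts shc, your local Maven repository, or a manually downloaded jar placed under lib/):

            // Hypothetical build.sbt sketch; where the shc artifact is resolved from depends on your setup.
            libraryDependencies += "com.hortonworks" % "shc-core" % "1.1.1-2.1-s_2.11"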

            Source https://stackoverflow.com/questions/50758216

            QUESTION

            how to visit hbase using spark 2.*
            Asked 2018-May-31 at 11:27

            I have written a program which accesses HBase using Spark 1.6 with the spark-hbase-connector (sbt dependency: "it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3"). But it doesn't work when using Spark 2.*. I've searched around this question and reached some conclusions:

            1. There are several connectors used to connect to HBase from Spark:

              • hbase-spark: provided by the official HBase project. But I found it is developed against Scala 2.10 and Spark 1.6. The properties in the project's pom.xml are as below:

                ...

            ANSWER

            Answered 2018-May-31 at 11:27

            I chose to use newAPIHadoopRDD to access HBase from Spark.

            Source https://stackoverflow.com/questions/42217513

            QUESTION

            HBase-Spark Connector: connection to HBase established for every scan?
            Asked 2018-Mar-27 at 13:14

            I am using Cloudera's HBase-Spark connector to do intensive HBase or BigTable scans. It works OK, but looking at Spark's detailed logs, it looks like the code tries to re-establish a connection to HBase with every call to process the results of a Scan() which I do via the JavaHBaseContext.foreachPartition().

            Am I right to think that this code re-establishes a connection to HBase every time? If so, how can I re-write it to make sure I reuse the already established connection?

            Here's the full sample code that produces this behavior:

            ...

            ANSWER

            Answered 2018-Mar-27 at 13:14

            This is a common problem. The cost of creating a connection can dwarf the actual work you're doing.

            In Cloud Bigtable, you can set google.bigtable.use.cached.data.channel.pool to true in your configuration settings. That would significantly improve performance. Cloud Bigtable ultimately uses a single HTTP/2 end point for all of your Cloud Bigtable instances.

            I don't know of a similar construct in HBase, but one way to do this would be to create an implementation of Connection that creates a single cached Connection under the covers. You would have to set hbase.client.connection.impl to your new class.
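            A minimal sketch of applying the two settings from the answer (the custom Connection class name is hypothetical):

            import org.apache.hadoop.hbase.HBaseConfiguration

            val conf = HBaseConfiguration.create()
            // Cloud Bigtable: reuse a cached data channel pool across scans.
            conf.setBoolean("google.bigtable.use.cached.data.channel.pool", true)
            // Plain HBase alternative suggested above: point the client at a custom Connection
            // implementation that caches a single underlying connection.
            // conf.set("hbase.client.connection.impl", "com.example.CachedConnection")  // hypothetical class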

            Source https://stackoverflow.com/questions/49494483

            QUESTION

            NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
            Asked 2018-Mar-21 at 09:26

            sc.newAPIHadoopRDD keeps giving me this error.

            ...

            ANSWER

            Answered 2018-Mar-21 at 09:26

            I found the cause of my problem after searching and exploring other jars.
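            The answer does not say which jars were involved; as one commonly seen remedy for this class of NoSuchMethodError (mismatched Jackson versions on the classpath), here is a hypothetical build.sbt sketch with illustrative version numbers:

            // Hypothetical sketch: pin Jackson to one consistent version matching your Spark build.
            dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.7"
            dependencyOverrides += "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.6.7.1"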

            Source https://stackoverflow.com/questions/49321203

            Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install spark-hbase

            You can download it from GitHub.

            Support

            [x] HBase read based scan
            [ ] HBase write based batchPut
            [ ] HBase read based analyze HFile
            [ ] HBase write based bulkload
            CLONE
          • HTTPS

            https://github.com/haosdent/spark-hbase.git

          • CLI

            gh repo clone haosdent/spark-hbase

          • sshUrl

            git@github.com:haosdent/spark-hbase.git
