gobblin | A distributed data integration framework that simplifies common aspects of big data integration, such as data ingestion, replication, organization, and lifecycle management, for both streaming and batch data ecosystems

by apache | Java | Version: gobblin_0.11.0 | License: Apache-2.0

kandi X-RAY | gobblin Summary

gobblin is a Java library typically used in Big Data, Kafka, and Spark applications. gobblin has no bugs or vulnerabilities reported, has a build file available, has a Permissive License, and has medium support. You can download it from GitHub.

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

kandi-Support Support

              gobblin has a medium active ecosystem.
              It has 2129 star(s) with 734 fork(s). There are 168 watchers for this library.
              It had no major release in the last 12 months.
gobblin has no open issues reported. There are 108 open pull requests and 0 closed requests.
              It has a neutral sentiment in the developer community.
The latest version of gobblin is gobblin_0.11.0.

            kandi-Quality Quality

              gobblin has no bugs reported.

            kandi-Security Security

              gobblin has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            kandi-License License

              gobblin is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              gobblin releases are available to install and integrate.
              Build file is available. You can build the component from source.
              Installation instructions are available. Examples and code snippets are not available.

            Top functions reviewed by kandi - BETA

kandi has reviewed gobblin and discovered the following as its top functions. This is intended to give you an instant insight into the functionality gobblin implements, and to help you decide if it suits your requirements.
• Generates a table DDL.
• Returns a collection of files that can be copied from the source.
• Gets work units.
• Executes the asynchronous model.
• Instantiates the given specification.
• Computes the full path diff set.
• Called when a Compaction job is complete.
• Processes a query by job id.
• Runs the job.
• Helper method to generate copy entities.

            gobblin Key Features

            No Key Features are available at this moment for gobblin.

            gobblin Examples and Code Snippets

            No Code Snippets are available at this moment for gobblin.

            Community Discussions

            QUESTION

            How to limit the amount of files produced by apache gobblin's output?
            Asked 2021-May-20 at 20:25

I am currently using Apache Gobblin to read from a Kafka topic. I went over the docs to check if there is a config to limit the number of files produced by Gobblin but couldn't find it.

            Is it possible to limit this?

            Thanks!

            ...

            ANSWER

            Answered 2021-May-20 at 20:25

There is no config to directly control the number of files produced by Gobblin for Kafka -> data lake ingestion. A few factors determine the number of files output: 1. the number of workunits created, and 2. whether your pipeline is using a PartitionedDataWriter. In the case of partitioned writes, the number of files is ultimately determined by the input data stream. For instance, if your pipeline is configured with a TimeBasedAvroWriterPartitioner (commonly used to write out files in YYYY/MM/DD/HH format) with the event time of the Kafka messages as the partitioning key, you will end up with lots of small files in your destination system if your input Kafka stream has a lot of late data.

            However, you do have a few configurations to limit the number of workunits created by the Kafka source in a given run. In the case of Kafka, each workunit corresponds to a subset of topic partitions of a single topic assigned to a single Gobblin task.

            1. mr.job.max.mappers: which limits how many mappers (or Gobblin tasks) are created in each run (and thus, limits the total number of workunits), and
            2. mr.target.mapper.size: which intuitively maps to the maximum number of records each Gobblin task will pull in a single run.

You can reduce the first config and set the second config to a larger value, which will have the desired effect of reducing the number of workunits and, hence, the number of output files.
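For example, a job's .pull file might combine the two like this (a minimal sketch; the values are illustrative, not recommendations):

# Limit the number of Gobblin tasks (and hence workunits) created per run
mr.job.max.mappers=8
# Per the description above, roughly the number of records each task pulls per run
mr.target.mapper.size=100000000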

In addition to the above configs, Gobblin also has a compaction utility (a MapReduce job) that coalesces the small files produced by the data ingestion pipeline into a small number of large files. A common production setup is to run the compaction on an hourly/daily cadence to limit the number of files in the data lake. See https://gobblin.readthedocs.io/en/latest/user-guide/Compaction/ for more details.

            Source https://stackoverflow.com/questions/67609837

            QUESTION

            Converting from DescribeSObjectResult to JsonArray (or to HttpEntity)
            Asked 2020-Aug-29 at 09:48

A couple of weeks ago, I asked a question about reading Salesforce data using the SOAP API instead of the REST API (see Trying to use Apache Gobblin to read Salesforce data using SOAP API(s) instead of REST API), but unfortunately nobody answered it, so I am trying to implement the solution (with a little help) directly.

            Using the REST API, the existing code that reads in a table's definition (by making a call to the REST API) looks like this:

            ...

            ANSWER

            Answered 2020-Aug-29 at 00:05

You have consumed the WSDL, right? The DescribeSObjectResult is a normal class in your project. So... my Java is rusty, but it seems the question is simply "how to convert a Java object to JSON"?

There are libraries for this, right? Jackson, for example. Does this help? Converting Java objects to JSON with Jackson

I'm not sure if you'll end up with an identical result, but it should be close enough.
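A minimal sketch of that conversion, assuming the Jackson databind library is on the classpath (the toJson helper and its Object parameter are illustrative, not from the original answer):

import com.fasterxml.jackson.databind.ObjectMapper;

public class DescribeToJson {
    // 'result' would be your WSDL-generated DescribeSObjectResult instance
    static String toJson(Object result) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // writeValueAsString walks the object's getters and emits a JSON string
        return mapper.writeValueAsString(result);
    }
}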

            Source https://stackoverflow.com/questions/63641909

            QUESTION

            Gobblin job metrics not publishing data to InfluxDB
            Asked 2020-May-21 at 11:50

I have configured the .pull file to produce and send metrics to InfluxDB for the source, extractor, and converter jobs. I tried it with the example Wikipedia job.

            ...

            ANSWER

            Answered 2020-May-21 at 11:50

I found the problem here. Gobblin uses the config file as the source for metrics configuration. Instead of adding the properties to the *.pull or *.job file, I had to add them to the *.conf file. Once added, Gobblin will send metrics to whichever platform is configured for the application.
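For illustration, the reporter properties would live in the *.conf file along these lines (a sketch only; the exact key names are assumptions, so verify them against the Gobblin configuration glossary):

# enable the metrics system and the InfluxDB reporter (key names assumed)
metrics.enabled=true
metrics.reporting.influxdb.enabled=true
metrics.reporting.influxdb.url=http://localhost:8086
metrics.reporting.influxdb.database=gobblin_metrics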

            Source https://stackoverflow.com/questions/61815761

            QUESTION

            Apache gobblin build failed
            Asked 2020-May-06 at 17:11

I'm new to Gobblin. I am trying to build a distribution using the master branch of the project. I'm getting the below error while following the instructions.

            ...

            ANSWER

            Answered 2020-May-06 at 17:11

            Current Gobblin build scripts use features that are present in JDK 8, but were removed in newer JDK versions. Gradle can use the latest JDK installed on your machine, e.g. JDK 13. As a result, the build process can fail.

            As a workaround, you can tell Gradle to use JDK 8.

            For example, on Windows, this can be achieved by making a change in gradle.properties (given that you have jre1.8.0_202 installed):
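A minimal sketch of that change (org.gradle.java.home is a standard Gradle property; the install path below is taken from the answer's assumption, so point it at your own Java 8 home):

# gradle.properties: force Gradle to build with Java 8
org.gradle.java.home=C:/Program Files/Java/jre1.8.0_202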

            Source https://stackoverflow.com/questions/61634373

            QUESTION

            Error: Could not find or load main class org.apache.gobblin.runtime.cli.GobblinCli
            Asked 2020-Mar-29 at 19:09

I am new to Gobblin. I built Gobblin from the incubator-gobblin GitHub master branch. Now I am trying the Wikipedia example from the getting started guide but am getting the following error.

            WARN: HADOOP_HOME is not defined. Gobblin Hadoop libs will be used in classpath. Error: Could not find or load main class org.apache.gobblin.runtime.cli.GobblinCli

With --show-classpath it gives /mnt/c/users/name/incubator-gobblin/conf/classpath:: How can I solve this? Please let me know if anyone knows the solution.

            ...

            ANSWER

            Answered 2020-Feb-04 at 18:18

Make sure that you run this command in incubator-gobblin/build/gobblin-distribution/distributions/gobblin-dist and not in incubator-gobblin/gobblin-distribution.
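In other words (a sketch; substitute the command you were actually running):

cd incubator-gobblin/build/gobblin-distribution/distributions/gobblin-dist
bin/gobblin.sh ...   # re-run the failing command from the extracted dist root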

            Source https://stackoverflow.com/questions/59998664

            QUESTION

            Gobblin: java.lang.ClassNotFoundException: org.apache.gobblin.source.extractor.extract.jdbc.MysqlSource
            Asked 2020-Mar-06 at 12:23

I am trying MySQL-to-HDFS data ingestion using Gobblin. While running mysql-to-gobblin.pull using the steps below:

            1) start hadoop:
            sbin\start-all.cmd

            2) start mysql service:
            sudo service mysql start

            3) set GOBBLIN_WORK_DIR:
            export GOBBLIN_WORK_DIR=/mnt/c/users/name/incubator-gobblin/GOBBLIN_WORK_DIR

            4) set GOBBLIN_JOB_CONFIG_DIR
            export GOBBLIN_JOB_CONFIG_DIR=/mnt/c/users/name/incubator-gobblin/GOBBLIN_JOB_CONFIG_DIR

            5) Start standalone
            bin/gobblin.sh service standalone start --jars /mnt/C/Users/name/incubator-gobblin/build/gobblin-sql/libs/gobblin-sql-0.15.0.jar

gives the below error

            ...

            ANSWER

            Answered 2020-Mar-06 at 12:23

The solution is to add the jar (or dependency) that provides this class, to get rid of: Caused by: java.lang.ClassNotFoundException: org.apache.gobblin.source.extractor.extract.jdbc.MysqlSource

            Source https://stackoverflow.com/questions/60350951

            QUESTION

            Gobblin ERROR: Unable to convert field:derivedwatermarkcolumn for value:"abc" for record:
            Asked 2020-Feb-29 at 17:11

I am trying to ingest data from a MySQL table to HDFS, but it is giving me the below error

            ...

            ANSWER

            Answered 2020-Feb-28 at 22:02

Looks like the name of the watermark column comes from the extract.delta.fields property. In your example, it's set to "name,password", so the name column is treated as a watermark. Try setting it to "derivedwatermarkcolumn".
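A minimal sketch of the change in the job's .pull file (the property name comes from the answer; the value matches the watermark column in the question):

# use the derived watermark column, not the data columns, as the delta field
extract.delta.fields=derivedwatermarkcolumn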

How I found this: I looked through the code of the MysqlSource class to find where the watermark was mentioned, and then used IntelliJ's inspector to find out where the data was coming from. You can get it through the context menu -> Analyze -> Analyze data flow to here.

            Source https://stackoverflow.com/questions/60452758

            QUESTION

            Spark as Data Ingestion/Onboarding to HDFS
            Asked 2020-Jan-15 at 22:19

While exploring various tools like [Nifi, Gobblin etc.], I have observed that Databricks is now promoting the use of Spark for data ingestion/on-boarding.

We have a Spark [Scala] based application running on YARN. So far we have been working on a Hadoop and Spark cluster where we manually place the required data files in HDFS first and then run our Spark jobs later. Now that we are planning to make our application available to clients, we are expecting any type and number of files [mainly CSV, JSON, XML, etc.] from any data source [FTP, SFTP, any relational or NoSQL database] of huge size [ranging from GBs to PBs].

Keeping this in mind, we are looking for options which could be used for data on-boarding and data sanity checks before pushing data into HDFS.

Options which we are looking at, based on priority: 1) Spark for data ingestion and sanity: As our application is written on and running on a Spark cluster, we are planning to use the same for data ingestion and sanity tasks as well. We are a bit worried about Spark's support for many data sources/file types/etc. Also, if we try to copy data from, say, any FTP/SFTP, we are not sure whether all workers will write data to HDFS in parallel. Is there any limitation while using it? Is there any audit trail maintained by Spark during this data copy?

2) Nifi in clustered mode: How good would Nifi be for this purpose? Can it be used for any data source and for any size of file? Will it maintain the audit trail? Would Nifi be able to handle such large files? How large a cluster would be required in case we try to copy GBs to PBs of data and perform certain sanity checks on that data before pushing it to HDFS?

3) Gobblin in clustered mode: I would like to hear answers similar to those for Nifi.

4) Is there any other good option available for this purpose with less infra/cost involved and better performance?

Any guidance/pointers/comparisons for the above mentioned tools and technologies would be appreciated.

            Best Regards, Bhupesh

            ...

            ANSWER

            Answered 2017-Jun-29 at 09:22

After doing some R&D, and considering the fact that using NiFi or Gobblin would demand more infrastructure cost, I started testing Spark for data on-boarding.

So far I have tried using a Spark job for importing data [present at a remote staging area/node] into my HDFS, and I am able to do that by mounting that remote location on all my Spark cluster worker nodes. Doing this made that location local to those workers, hence the Spark job ran properly and the data was on-boarded to my HDFS.

Since my whole project is going to be on Spark, keeping the data on-boarding part on Spark does not cost me anything extra, and so far it is going well. Hence I would suggest to others as well: if you already have a Spark cluster and a Hadoop cluster up and running, then instead of adding extra cost [where cost could be a major constraint], go for a Spark job for data on-boarding.

            Source https://stackoverflow.com/questions/44305031

            QUESTION

            Issue with custom service systemd when start Apache Gobblin
            Asked 2020-Jan-15 at 22:09

Running /opt/gobblin/bin/gobblin-standalone.sh start directly, everything works; the output in the logs is fine.

Running it through a systemd service does not work. Nothing is output to the logs.

            ...

            ANSWER

            Answered 2019-Jan-20 at 17:23

The trick is to use Type=oneshot and RemainAfterExit=true, and to set the environment variables:
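A minimal sketch of such a unit file (the paths and environment values are assumptions; adjust them to your install):

[Unit]
Description=Apache Gobblin standalone
After=network.target

[Service]
# oneshot + RemainAfterExit: the start script forks and exits quickly,
# but the service should still be considered active afterwards
Type=oneshot
RemainAfterExit=true
# assumptions: whatever JAVA_HOME / work dir your install needs
Environment=JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Environment=GOBBLIN_WORK_DIR=/opt/gobblin/work-dir
ExecStart=/opt/gobblin/bin/gobblin-standalone.sh start
ExecStop=/opt/gobblin/bin/gobblin-standalone.sh stop

[Install]
WantedBy=multi-user.target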

            Source https://stackoverflow.com/questions/54278704

            QUESTION

            I'm trying to install Apache Gobblin. How can I install it using Gradle?
            Asked 2020-Jan-15 at 22:08

I want to install Apache Gobblin on macOS. For this, I downloaded version 0.14.0 and followed the steps here.

            Install Gobblin

            The first thing I did was this:

            ...

            ANSWER

            Answered 2018-Dec-14 at 09:59

I just dug a little into the code. Are you sure that Java 9 is supported by their build scripts?

Look at the line you have an issue with: globalDependencies.gradle:44. It calls ToolProvider.getSystemToolClassLoader(). Now let's look at its docs for Java 9:

            Deprecated. This method is subject to removal in a future version of Java SE. Use the system tool provider or service loader mechanisms to locate system tools as well as user-installed tools. Returns a class loader that may be used to load system tools, or null if no such special loader is provided.

            Implementation Requirements:

            This implementation always returns null.

            See that? It always returns null!

            Things were different in Java 8, though:

            Returns the class loader for tools provided with this platform. This does not include user-installed tools. Use the service provider mechanism for locating user installed tools.

So the script is calling getURLs on a null object, which obviously throws an NPE. It probably needs to be fixed!
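You can see the difference yourself with a few lines (javax.tools.ToolProvider is part of the JDK, so no extra dependencies are needed):

import javax.tools.ToolProvider;

public class ToolLoaderCheck {
    public static void main(String[] args) {
        // Prints a class loader on Java 8 but "null" on Java 9+,
        // which is why the build script's subsequent getURLs() call NPEs
        System.out.println(ToolProvider.getSystemToolClassLoader());
    }
}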

            Source https://stackoverflow.com/questions/53769473

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install gobblin

Extract the archive file to your local directory. Then either:

• Skip tests and build the distribution: run ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain. The distribution will be created in the build/gobblin-distribution/distributions directory.
• Run tests and build the distribution (requires Maven): run ./gradlew build. The distribution will be created in the build/gobblin-distribution/distributions directory.

            Support

• Gobblin documentation
• Running Gobblin on Docker from your laptop
• Getting started guide
• Gobblin architecture
• Community Slack: Get your invite
• List of companies known to use Gobblin
• Sample project
• How to build Gobblin from source code
• Issue tracker - Apache Jira