gobblin | A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems
kandi X-RAY | gobblin Summary
Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.
Top functions reviewed by kandi - BETA
- Generates a table DDL.
- Return a collection of files that can be copied from the source.
- Get work units.
- Executes the asynchronous model.
- Instantiates the given specification.
- Computes the full path diff set.
- Called when a Compaction job is complete.
- Processes a query by job id.
- Run the job.
- Helper method to generate copy entities.
Community Discussions
Trending Discussions on gobblin
QUESTION
I am currently using Apache Gobblin to read from a Kafka topic. I went over the docs to check whether there is a config to limit the number of files produced by Gobblin, but couldn't find one.
Is it possible to limit this?
Thanks!
...ANSWER
Answered 2021-May-20 at 20:25
There is no config that directly controls the number of files Gobblin produces for Kafka-to-data-lake ingestion. Two factors determine the number of output files: (1) the number of workunits created, and (2) whether your pipeline uses a PartitionedDataWriter. In the case of partitioned writes, the number of files is ultimately determined by the input data stream. For instance, if your pipeline is configured with a TimeBasedAvroWriterPartitioner (commonly used to write out files in YYYY/MM/DD/HH format) keyed on the event time of the Kafka messages, you will end up with many small files in your destination system if your input Kafka stream has a lot of late data.
However, you do have a few configurations to limit the number of workunits created by the Kafka source in a given run. In the case of Kafka, each workunit corresponds to a subset of topic partitions of a single topic assigned to a single Gobblin task.
- mr.job.max.mappers: limits how many mappers (Gobblin tasks) are created in each run, and thus caps the total number of workunits.
- mr.target.mapper.size: intuitively, the maximum number of records each Gobblin task will pull in a single run.
You can reduce the first config and set the second config to a larger value, which will have the desired effect of reducing the number of workunits and hence the number of output files.
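A rough sketch of how these two properties might look in the job's .pull/.job file; the values below are purely illustrative and should be tuned to your topic volume and desired file sizes:
# Illustrative values only; tune to your topic volume and desired file sizes.
mr.job.max.mappers=8
# Interpreted (per the answer above) as the approximate number of records pulled per task per run.
mr.target.mapper.size=100000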
In addition to the above configs, Gobblin also has a compaction utility (a MapReduce job) that coalesces the small files produced by the data ingestion pipeline into a small number of large files. A common production setup is to run the compaction on an hourly/daily cadence to limit the number of files in the data lake. See https://gobblin.readthedocs.io/en/latest/user-guide/Compaction/ for more details.
QUESTION
A couple of weeks ago, I asked a question about reading Salesforce data using the SOAP API instead of the REST API (see Trying to use Apache Gobblin to read Salesforce data using SOAP API(s) instead of REST API), but unfortunately nobody answered it, so I am trying to implement the solution (with a little help) directly.
Using the REST API, the existing code that reads in a table's definition (by making a call to the REST API) looks like this:
...ANSWER
Answered 2020-Aug-29 at 00:05
You have consumed the WSDL, right? So DescribeSObjectResult is a normal class in your project. My Java is rusty, but it seems the question is simply "how do I convert a Java object to JSON?"
There are libraries for this, right? Jackson, for example. Does this help? Converting Java objects to JSON with Jackson
I'm not sure you'll end up with an identical result, but it should be close enough.
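A minimal sketch of what that could look like, assuming the WSDL-generated DescribeSObjectResult class and Jackson are on the classpath (the package name shown is the usual partner-WSDL one and may differ in your project):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.sforce.soap.partner.DescribeSObjectResult; // assumption: package of the WSDL-generated stub

public class DescribeToJson {
    // Serializes the WSDL-generated object graph (fields, picklists, etc.) to a JSON string.
    public static String toJson(DescribeSObjectResult describeResult) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        return mapper.writerWithDefaultPrettyPrinter().writeValueAsString(describeResult);
    }
}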
QUESTION
I have configured a .pull file to produce and send metrics to InfluxDB for the source, extractor, and converter jobs. I tried it with the example Wikipedia job.
...ANSWER
Answered 2020-May-21 at 11:50
I found the problem. Gobblin uses the config file as the source for metrics configuration. Instead of adding the properties to the *.pull or *.job file, I had to add them to the *.conf file. Once added, Gobblin sends metrics to whichever platform is configured for the application.
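As a rough illustration, these are the kind of settings that belong in the *.conf file rather than the *.pull file; the InfluxDB reporter property names below are assumptions and should be checked against the Gobblin metrics documentation:
metrics.enabled=true
# Assumed InfluxDB reporter properties; verify the exact names in the Gobblin metrics docs.
metrics.reporting.influxdb.metrics.enabled=true
metrics.reporting.influxdb.url=http://localhost:8086
metrics.reporting.influxdb.database=gobblin_metrics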
QUESTION
I'm new to Gobblin. I'm trying to build a distribution using the master branch of the project, and I'm getting the error below while following the instructions.
...ANSWER
Answered 2020-May-06 at 17:11
Current Gobblin build scripts use features that are present in JDK 8 but were removed in newer JDK versions. Gradle can use the latest JDK installed on your machine, e.g. JDK 13. As a result, the build process can fail.
As a workaround, you can tell Gradle to use JDK 8.
For example, on Windows, this can be achieved by making a change in gradle.properties (given that you have jre1.8.0_202 installed):
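For instance, a minimal gradle.properties entry along these lines (the path is an example and should point at your local JDK/JRE 8 installation):
org.gradle.java.home=C:/Program Files/Java/jre1.8.0_202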
QUESTION
I am new to Gobblin. I built Gobblin from the incubator-gobblin GitHub master branch. Now I am trying the Wikipedia example from the getting started guide, but I am getting the following error.
WARN: HADOOP_HOME is not defined. Gobblin Hadoop libs will be used in classpath.
Error: Could not find or load main class org.apache.gobblin.runtime.cli.GobblinCli
With --show-classpath it gives /mnt/c/users/name/incubator-gobblin/conf/classpath::
How can I solve it? Please let me know if anyone knows the solution.
ANSWER
Answered 2020-Feb-04 at 18:18
Make sure that you run this command in incubator-gobblin/build/gobblin-distribution/distributions/gobblin-dist and not in incubator-gobblin/gobblin-distribution.
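In other words, something along these lines (a sketch; the final command is whatever you ran before):
cd incubator-gobblin/build/gobblin-distribution/distributions/gobblin-dist
ls lib/            # sanity check: the Gobblin jars (including gobblin-runtime) should be listed here
bin/gobblin.sh ... # re-run the same command from this directory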
QUESTION
I am trying MySQL-to-HDFS data ingestion using Gobblin. I am running mysql-to-gobblin.pull with the steps below:
1) start hadoop:
sbin\start-all.cmd
2) start mysql service:
sudo service mysql start
3) set GOBBLIN_WORK_DIR:
export GOBBLIN_WORK_DIR=/mnt/c/users/name/incubator-gobblin/GOBBLIN_WORK_DIR
4) set GOBBLIN_JOB_CONFIG_DIR
export GOBBLIN_JOB_CONFIG_DIR=/mnt/c/users/name/incubator-gobblin/GOBBLIN_JOB_CONFIG_DIR
5) Start standalone
bin/gobblin.sh service standalone start --jars /mnt/C/Users/name/incubator-gobblin/build/gobblin-sql/libs/gobblin-sql-0.15.0.jar
This gives the error below.
...ANSWER
Answered 2020-Mar-06 at 12:23
The solution is to add the jar or dependency containing this class to get rid of: Caused by: java.lang.ClassNotFoundException: org.apache.gobblin.source.extractor.extract.jdbc.MysqlSource
QUESTION
I am trying to ingest data from a MySQL table to HDFS, but it is giving me the error below.
...ANSWER
Answered 2020-Feb-28 at 22:02
It looks like the name of the watermark column comes from the extract.delta.fields property. In your example, it is set to "name,password", so name is treated as the watermark. Try setting it to "derivedwatermarkcolumn".
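In .pull terms, a minimal sketch of the change (only the relevant property is shown; "derivedwatermarkcolumn" is the column name used in the question):
# Use the actual watermark column, not the extracted data fields, as the delta field.
extract.delta.fields=derivedwatermarkcolumn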
How I found this: I looked through the code of the MysqlSource class to find where the watermark is mentioned, and then used IntelliJ's inspector to find out where the data comes from. You can get to it through the context menu -> Analyze -> Analyze data flow to here.
QUESTION
While exploring various tools (NiFi, Gobblin, etc.), I have observed that Databricks is now promoting the use of Spark for data ingestion/on-boarding.
We have a Spark (Scala) based application running on YARN. So far we have been working on a Hadoop and Spark cluster where we manually place the required data files in HDFS first and then run our Spark jobs. Now that we are planning to make our application available to clients, we expect any type and number of files (mainly CSV, JSON, XML, etc.) from any data source (FTP, SFTP, any relational or NoSQL database), of huge size (ranging from GB to PB).
Keeping this in mind, we are looking for options that could be used for data on-boarding and data sanity checks before pushing the data into HDFS.
Options we are looking at, in order of priority:
1) Spark for data ingestion and sanity checks: As our application is written for and runs on a Spark cluster, we plan to use the same for the data ingestion and sanity tasks as well. We are a bit worried about Spark's support for the many data sources/file types/etc. Also, we are not sure whether, if we try to copy data from, say, an FTP/SFTP server, all workers will write data to HDFS in parallel. Is there any limitation in using it this way? Does Spark maintain any audit trail during this data copy?
2) NiFi in clustered mode: How good would NiFi be for this purpose? Can it be used with any data source and any file size? Will it maintain an audit trail? Would NiFi be able to handle such large files? How large a cluster would be required if we try to copy GB to PB of data and perform certain sanity checks on that data before pushing it to HDFS?
3) Gobblin in clustered mode: I would like to hear answers similar to those for NiFi.
4) Is there any other good option available for this purpose with less infrastructure/cost involved and better performance?
Any guidance/pointers/comparisons for the above-mentioned tools and technologies would be appreciated.
Best Regards, Bhupesh
...ANSWER
Answered 2017-Jun-29 at 09:22
After doing some R&D, and considering that using NiFi or Gobblin would demand more infrastructure cost, I started testing Spark for data on-boarding.
So far I have tried using a Spark job to import data (present at a remote staging area/node) into HDFS, and I was able to do that by mounting the remote location on all of my Spark cluster worker nodes. Doing this made the location local to those workers, so the Spark job ran properly and the data was on-boarded to my HDFS.
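A minimal Java sketch of that approach, with assumed paths (/mnt/staging for the mount visible on every worker, an HDFS landing directory) and an assumed CSV input with a header row:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class OnboardStagedFiles {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("staging-to-hdfs-onboarding")
                .getOrCreate();

        // "file://" forces the local filesystem; the mount must exist on all workers.
        Dataset<Row> input = spark.read()
                .option("header", "true")
                .csv("file:///mnt/staging/incoming/*.csv");

        // Placeholder sanity check before landing the data: drop rows with null values.
        Dataset<Row> valid = input.na().drop();

        valid.write()
                .mode(SaveMode.Overwrite)
                .parquet("hdfs:///data/landing/incoming");

        spark.stop();
    }
}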
Since my whole project is going to be on Spark, keeping the data on-boarding part on Spark does not cost me anything extra, and so far it is going well. Hence I would suggest to others as well: if you already have a Spark cluster and a Hadoop cluster up and running, then instead of adding extra cost (where cost could be a major constraint), go with a Spark job for data on-boarding.
QUESTION
Running /opt/gobblin/bin/gobblin-standalone.sh start directly, everything works and the output in the logs is fine.
Running it through a systemd service does not work; nothing is written to the logs.
...ANSWER
Answered 2019-Jan-20 at 17:23
The trick is to use Type=oneshot and RemainAfterExit=true, and to set the environment variables:
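A minimal sketch of such a unit file; the paths and environment values are assumptions and must match whatever your shell exports when the script works when run directly:
[Unit]
Description=Apache Gobblin standalone
After=network.target

[Service]
Type=oneshot
RemainAfterExit=true
# Assumed environment; mirror the variables you set when running the script by hand.
Environment=JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Environment=GOBBLIN_WORK_DIR=/opt/gobblin/work-dir
Environment=GOBBLIN_JOB_CONFIG_DIR=/opt/gobblin/job-conf
ExecStart=/opt/gobblin/bin/gobblin-standalone.sh start
ExecStop=/opt/gobblin/bin/gobblin-standalone.sh stop

[Install]
WantedBy=multi-user.target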
QUESTION
I want to install Apache Gobblin on my macOS machine. For this, I downloaded version 0.14.0 and followed the steps here.
The first thing I did was this:
...ANSWER
Answered 2018-Dec-14 at 09:59
I just dug a little into the code. Are you sure that Java 9 is supported by their build scripts?
Look at the line you have an issue with: globalDependencies.gradle:44. See ToolProvider.getSystemToolClassLoader(). Now let's look at its docs for Java 9:
Deprecated. This method is subject to removal in a future version of Java SE. Use the system tool provider or service loader mechanisms to locate system tools as well as user-installed tools. Returns a class loader that may be used to load system tools, or null if no such special loader is provided.
Implementation Requirements:
This implementation always returns null.
See that? It always returns null!
Things were different in Java 8, though:
Returns the class loader for tools provided with this platform. This does not include user-installed tools. Use the service provider mechanism for locating user installed tools.
So the script is calling getURLs on a null object and obviously throws an NPE. It probably needs to be fixed!
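A small Java sketch of the difference: on Java 8 the call returns a usable URLClassLoader, while on Java 9+ it returns null, so the getURLs() call the build script makes fails with a NullPointerException:

import java.net.URL;
import java.net.URLClassLoader;
import java.util.Arrays;
import javax.tools.ToolProvider;

public class ToolClassLoaderCheck {
    public static void main(String[] args) {
        // Deprecated in Java 9; the Java 9+ implementation always returns null.
        ClassLoader cl = ToolProvider.getSystemToolClassLoader();
        System.out.println("system tool class loader: " + cl);

        // This mirrors what globalDependencies.gradle does: on Java 9+ 'cl' is null,
        // so calling getURLs() on it throws a NullPointerException.
        URL[] urls = ((URLClassLoader) cl).getURLs();
        System.out.println(Arrays.toString(urls));
    }
}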
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install gobblin
Skip tests and build the distribution: run ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain. The distribution will be created in the build/gobblin-distribution/distributions directory.
Or, run tests and build the distribution (requires Maven): run ./gradlew build. The distribution will be created in the same build/gobblin-distribution/distributions directory.