kafka-connect-hdfs | Kafka Connect HDFS connector
kandi X-RAY | kafka-connect-hdfs Summary
kafka-connect-hdfs is a Kafka Connector for copying data between Kafka and Hadoop HDFS.
Top functions reviewed by kandi - BETA
- Writes data to sink
- Performs the recovery process
- Commit file
- Reads the offset of the file from HDFS
- Start the sink
- Find fileStatus with max offset
- Synchronize all the topics in the hive table
- Closes the writer
- Initialize Hive services
- Return Avro schema for a given path
- Append WAL file
- Returns a new Partitioner instance based on the configuration
- Returns the list of configs for this task
- Determines whether this path belongs to the committed file
- Get Avro schema from disk
- Creates a list of task configurations
- Create a RecordWriter
- Returns the latest offset and file path from the WAL file
- Return the schema for the given path
- Initializes the connection
- Create a record writer
- Configure kerberos authentication
- Create a ParquetWriter that wraps the Avro schema
- Apply the WAL file to the storage
- Polls a source record
- Create a record writer
kafka-connect-hdfs Key Features
kafka-connect-hdfs Examples and Code Snippets
Community Discussions
Trending Discussions on kafka-connect-hdfs
QUESTION
I'm using an HDFS Kafka Connect cluster in distributed mode.
I set rotate.interval.ms to 1 hour and offset.flush.interval.ms to 1 minute.
In my case, I assumed a file would be committed when a new record arrived whose timestamp was at least an hour after the timestamp of the first record, and that offsets would be flushed every minute.
However, I wondered what happens when I restart the cluster while a file is still open. In other words, what happens in the case below?
- The file was opened starting with a record whose timestamp is '15:37' (offset 10).
- After 10 minutes, the Kafka Connect cluster restarted.
- (I assumed the file from step 1 would be discarded from memory and not committed to HDFS.)
- When the new worker starts, will the newly opened file start tracking records from offset 10?
Does kafka-connect/kafka-connect-hdfs keep us from losing uncommitted records?
According to the official documentation, I thought __consumer_offsets would help me in this case, but I'm not sure.
Any documentation or comments would be very helpful!
ANSWER
Answered 2021-Dec-22 at 20:41
The consumer offsets topic is used for sink connectors, yes, and, if possible, the consumer will reset to the last non-committed offsets.
I think the behavior might have changed some time ago, but the HDFS Connector used to use a write-ahead log (WAL) to temporarily preserve the data it was writing to a temporary HDFS location before the final file was created.
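For reference, a minimal sketch of the connector-side settings involved, assuming a hypothetical topic and HDFS URL (values are illustrative, not recommendations):
# Sketch of an HDFS sink connector config (hdfs-sink.properties)
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=my-topic
hdfs.url=hdfs://namenode:8020
# A commit is also triggered once this many records have been buffered:
flush.size=100000
# 1 hour, as in the question; rotation is driven by the timestamp extractor:
rotate.interval.ms=3600000
timestamp.extractor=Record

# offset.flush.interval.ms (1 minute here) is a worker-level setting that belongs in
# connect-distributed.properties, not in the connector config:
offset.flush.interval.ms=60000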
QUESTION
I have a Flink job which reads data from Kafka topics and writes it to HDFS. There are some problems with checkpoints; for example, after stopping the Flink job some files stay in a pending state, and there are other checkpoint-related problems with writing to HDFS as well. I want to try Kafka Streams for the same kind of Kafka-to-HDFS pipeline. I found the following problem - https://github.com/confluentinc/kafka-connect-hdfs/issues/365 Could you tell me how to resolve it? Could you also tell me where Kafka Streams keeps files for recovery?
ANSWER
Answered 2021-Jun-03 at 01:27
Kafka Streams only interacts between topics of the same cluster, not with external systems.
The Kafka Connect HDFS 2 connector maintains offsets in an internal offsets topic. Older versions maintained offsets in the filenames and used a write-ahead log to ensure file delivery.
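For context, the committed files encode the offset range they cover in their names, along the lines of the following illustrative example (topic name and offsets are hypothetical):
mytopic+0+0000000000+0000000999.avro
Here 0 is the Kafka partition and the two numbers are the start and end offsets of the records contained in the file.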
QUESTION
I am trying Kafka Connect for the first time and I want to connect SAP S/4HANA to Hive. I have created the SAP S/4 source Kafka connector using this:
https://github.com/SAP/kafka-connect-sap
But I am not able to create an HDFS sink connector. The issue is related to the pom file.
I have tried mvn clean package, but I got this error:
ANSWER
Answered 2020-Aug-28 at 18:34
I suggest you download the existing Confluent Platform, which already includes the HDFS connector.
Otherwise, check out a release version rather than only the master branch to build the project.
QUESTION
My HDFS was installed via Ambari (HDP). I'm currently trying to load Kafka topics into an HDFS sink. Kafka and HDFS are installed on the same machine, x.x.x.x. I didn't change much from the default settings, except some ports according to my needs.
Here is how I execute Kafka:
ANSWER
Answered 2020-Jun-23 at 08:23
Here's the error:
Missing required configuration "confluent.topic.bootstrap.servers" which has no default value.
The problem is that you've taken the config for the HDFS Sink connector and changed the connector to a different one (HDFS 3 Sink), which has different configuration requirements.
You can follow the quickstart for the HDFS 3 Sink connector, or fix your existing configuration by adding the missing property (see the sketch below).
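Based on the error above, the missing key is confluent.topic.bootstrap.servers. A minimal sketch of the additions, assuming the broker listens on x.x.x.x:9092:
# Additions to the HDFS 3 Sink connector properties (broker address is an assumption)
confluent.topic.bootstrap.servers=x.x.x.x:9092
# Only appropriate for a single-broker development cluster:
confluent.topic.replication.factor=1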
QUESTION
I am building a data synchronizer which captures data changes from a MySQL source and exports the data to Hive.
I chose Kafka Connect to implement this. I use Debezium as the source connector and the Confluent HDFS connector as the sink.
The problem is that Debezium's naming convention for Kafka topics looks like:
serverName.databaseName.tableName
In the Confluent HDFS sink properties, I have to configure the topics exactly as Debezium generated them:
"topics": "serverName.databaseName.tableName"
The Confluent HDFS sink connector will then generate a path in HDFS like:
/topics/serverName.databaseName.tableName/partition=0
This will definitely cause problems in HDFS/Hive, since the path contains dots. In fact, the external table auto-generated by the Confluent HDFS sink connector failed due to the path problem.
ANSWER
Answered 2020-May-08 at 04:02
The HDFS connector will replace dots (and dashes) with underscores when creating Hive tables.
HDFS itself doesn't care about dots in paths. The problem is that you cannot have a dot after the port, and you have /null in there somehow:
hdfs://localhost:9000./null
Is there any way that I can change the Debezium default naming convention for topics?
The solution has nothing to do with Debezium. You can use RegexRouter, which is part of the base Apache Kafka Connect library, in a transforms config for your source or sink connector, depending on how early you want to "fix" the problem (see the sketch below).
You could also write your own transform and put it on Connect's plugin.path.
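A minimal sketch of such a transform on the HDFS sink connector, replacing the dots in serverName.databaseName.tableName with underscores before the connector derives HDFS paths and Hive table names (the transform alias "route" is arbitrary):
# RegexRouter transform in the sink connector properties
transforms=route
transforms.route.type=org.apache.kafka.connect.transforms.RegexRouter
# Double backslashes because backslash is an escape character in .properties files;
# the regex captures the three dot-separated parts of the topic name.
transforms.route.regex=([^.]+)\\.([^.]+)\\.([^.]+)
transforms.route.replacement=$1_$2_$3
Applied on the sink, this only changes the topic name the connector sees, so the directory and table names lose the dots while the actual Kafka topics keep Debezium's original names.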
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install kafka-connect-hdfs
You can use kafka-connect-hdfs like any standard Java library. Please include the jar files in your classpath. You can also use any IDE to run and debug the kafka-connect-hdfs component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org; for Gradle installation, please refer to gradle.org.
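If you deploy the connector into a Kafka Connect worker instead (as in the discussions above), the usual approach is to place the jars in a directory listed on the worker's plugin.path. A minimal sketch, assuming the jars were unpacked under /usr/local/share/kafka/plugins:
# In the worker config, e.g. connect-distributed.properties
# Directory containing the kafka-connect-hdfs plugin jars (path is an assumption):
plugin.path=/usr/local/share/kafka/plugins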