reair | tools for replicating tables | Storage library

 by airbnb | Java | Version: Current | License: Apache-2.0

kandi X-RAY | reair Summary

reair is a Java library typically used in Storage applications. It has no reported vulnerabilities, a permissive Apache-2.0 license, a build file available, and low support activity. However, static analysis flags 96 bugs. You can download it from GitHub.

The replication features in ReAir are useful for the following use cases:
• Warehouse migration: when migrating a Hive warehouse, ReAir can copy existing data over to the new warehouse. Because ReAir copies both data and metadata, datasets are ready to query as soon as the copy completes.
• Warehouse isolation: many organizations start out with a single Hive warehouse, but often want better isolation between production and ad hoc workloads. Two isolated Hive warehouses accommodate this need well, and with two warehouses there is a need to replicate evolving datasets. ReAir can replicate data from one warehouse to another and propagate updates incrementally as they occur.
• Disaster recovery: ReAir can replicate datasets to a hot-standby warehouse for fast failover.
To accommodate these use cases, ReAir includes both batch and incremental replication tools. Batch replication executes a one-time copy of a list of tables. Incremental replication is a long-running process that copies objects as they are created or changed on the source warehouse.
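The incremental replication flow described above can be sketched as follows. This is a simplified Python illustration, not ReAir's actual implementation (ReAir is Java); the audit-log entries, destination store, and checkpoint structure here are assumptions made for the sketch:

```python
# Minimal sketch of an incremental replication loop: entries appended to an
# audit log on the source are applied to the destination in ID order, and a
# checkpoint records the last applied ID so the process can resume after a
# restart. Names and structures are illustrative, not ReAir's actual API.

def replicate_incrementally(audit_log, destination, checkpoint):
    """Apply all audit-log entries newer than the checkpoint."""
    for entry in audit_log:
        if entry["id"] <= checkpoint["last_id"]:
            continue  # already replicated before a restart
        # Copy the changed object (ReAir copies both data and metadata).
        destination[entry["table"]] = entry["data"]
        # Persist progress so a killed process resumes where it left off.
        checkpoint["last_id"] = entry["id"]
    return checkpoint["last_id"]

audit_log = [
    {"id": 1, "table": "events", "data": "v1"},
    {"id": 2, "table": "users", "data": "v1"},
    {"id": 3, "table": "events", "data": "v2"},
]
destination = {}
checkpoint = {"last_id": 1}  # entry 1 was replicated before a restart

replicate_incrementally(audit_log, destination, checkpoint)
print(destination)            # {'users': 'v1', 'events': 'v2'}
print(checkpoint["last_id"])  # 3
```

The checkpoint is what makes the real process safe to kill and restart, as described in the installation notes below.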

            Support

              reair has a low active ecosystem.
              It has 261 star(s) with 95 fork(s). There are 42 watchers for this library.
              It had no major release in the last 6 months.
              There are 5 open issues and 15 have been closed. On average, issues are closed in 118 days. There are 6 open pull requests and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of reair is current.

            Quality

              reair has 96 bugs (19 blocker, 2 critical, 62 major, 13 minor) and 1847 code smells.

            Security

              reair has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              reair code analysis shows 0 unresolved vulnerabilities.
              There are 9 security hotspots that need review.

            License

              reair is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              reair releases are not available. You will need to build from source code and install.
              Build file is available. You can build the component from source.
              Installation instructions, examples and code snippets are available.
              reair saves you 14429 person hours of effort in developing the same functionality from scratch.
              It has 28874 lines of code, 2235 functions and 174 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed reair and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality reair implements and to help you decide if it suits your requirements.
            • Main entry point for the copy
            • Retrieves a single partition
            • Returns the ThreadLocal metastore client
            • Copy a table
            • Returns true if the field with the specified field ID is set
            • Run the replication task
            • Run a single MR copy job
            • Run a batch replication
            • Runs a commit change job
            • Returns true if the field is set
            • Resets this record so that it can be reused
            • Compares this job info with the specified value
            • Insert a query entry into the audit log
            • Fetch the replication job data from the thrift server
            • Runs the copy partition task
            • Launches the audit log entry
            • Retrieves a job information from the database
            • Splits the input splits into chunks
            • Get the input splits
            • Run the rename task
            • Perform a mapping operation
            • Main entry point
            • Rename the destination
            • Sets the field value
            • Returns a string representation of this TReplicationJob
            • Ordered by id

            reair Key Features

            No Key Features are available at this moment for reair.

            reair Examples and Code Snippets

            No Code Snippets are available at this moment for reair.

            Community Discussions

            QUESTION

            Sync files on HDFS that are the same size but differ in contents
            Asked 2019-May-08 at 08:21

            I am trying to sync files from one Hadoop cluster to another using DistCp and Airbnb's ReAir utility, but neither works as expected.

            If a file is the same size on the source and the destination, both tools fail to update it even when the contents have changed (the checksums also differ), unless the overwrite option is used.

            I need to keep around 30 TB of data in sync, so loading the complete dataset every time is not feasible.

            Could anyone please suggest how I can bring the two datasets into sync when the file sizes are the same (but the source contents have changed) and the checksums differ?

            ...

            ANSWER

            Answered 2018-Jan-24 at 04:17

            The way DistCp handles syncing between files that are the same size but have different contents is by comparing their so-called FileChecksum. The FileChecksum was first introduced in HADOOP-3981, mostly for the purpose of being used in DistCp. Unfortunately, this has the known shortcoming of being incompatible between different storage implementations, and even between HDFS instances that have different internal block/chunk settings. Specifically, the FileChecksum bakes in the structure of having, for example, 512 bytes per chunk and 128 MB per block.
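The layout dependence can be illustrated with a toy model of a composite checksum. This is a simplified Python sketch, not HDFS's actual FileChecksum algorithm (which involves per-chunk CRCs under the MD5 layers), but it shows why identical bytes can yield different checksums under different block settings:

```python
import hashlib

def block_checksum(data: bytes, block_size: int) -> str:
    """Toy FileChecksum: MD5 over the MD5s of fixed-size blocks.

    Like HDFS's MD5-of-MD5s style checksum, the result depends on the
    block layout, not just on the bytes of the file.
    """
    digest = hashlib.md5()
    for offset in range(0, len(data), block_size):
        digest.update(hashlib.md5(data[offset:offset + block_size]).digest())
    return digest.hexdigest()

data = b"x" * 1024

# Same bytes always give the same plain, whole-file MD5 ...
assert hashlib.md5(data).hexdigest() == hashlib.md5(data).hexdigest()

# ... but different block sizes give different composite checksums, so two
# stores with different internal block settings disagree on identical files.
assert block_checksum(data, 128) != block_checksum(data, 256)
```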

            Since GCS doesn't have the same notions of "chunks" or "blocks", there's no way for it to have any similar definition of a FileChecksum. The same is also true of all other object stores commonly used with Hadoop; the DistCp documentation appendix discusses this fact under "DistCp and Object Stores".

            That said, there's a neat trick that can be done to define a nice standardized representation of a composite CRC for HDFS files that is mostly in-place compatible with existing HDFS deployments; I've filed HDFS-13056 with a proof of concept to try to get this added upstream, after which it should be possible to make it work out-of-the-box against GCS, since GCS also supports file-level CRC32C.
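The appeal of a CRC here is that, unlike MD5-of-MD5s, a file-level CRC can be made independent of the block layout. Python's `zlib.crc32` makes the idea easy to see because it accepts a running value; this is only a conceptual sketch (HDFS-13056 combines independently computed chunk CRCs mathematically rather than streaming them):

```python
import zlib

def chunked_crc32(data: bytes, block_size: int) -> int:
    """CRC32 computed block by block, carrying the running value forward."""
    crc = 0
    for offset in range(0, len(data), block_size):
        crc = zlib.crc32(data[offset:offset + block_size], crc)
    return crc

data = b"some file contents replicated between warehouses" * 100

whole = zlib.crc32(data)
# Any block layout yields the same file-level CRC, which is why a
# composite CRC can be compared across stores with different block sizes.
assert chunked_crc32(data, 512) == whole
assert chunked_crc32(data, 4096) == whole
```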

            Source https://stackoverflow.com/questions/48289719

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install reair

          • If the MySQL tables for incremental replication were not set up while setting up the audit log, create the state tables for incremental replication on the desired MySQL instance by running the create table commands listed here.
          • Read through and fill out the configuration from the template. You might want to deploy the file to a widely accessible location.
          • Switch to the repo directory and build the JAR. You can skip the unit tests if no changes have been made (via the '-x test' flag). Once the build finishes, the JAR to run the incremental replication process can be found under main/build/libs/airbnb-reair-main-1.0.0-all.jar.
          • To start replicating, set options to point to the appropriate logging configuration and kick off the replication launcher using the hadoop jar command on the destination cluster. An example log4j.properties file is provided here. Be sure to specify the configuration file that was filled out in the prior step. As with batch replication, you may need to run the process as a different user. If you use the recommended log4j.properties file that is shipped with the tool, messages at the INFO level will be printed to stderr, while more detailed messages at the DEBUG level will be recorded to a log file in the current working directory.
          • Verify that entries are replicated properly by creating a test table on the source warehouse and checking that it appears on the destination warehouse.

            When the incremental replication process is launched for the first time, it starts replicating entries after the highest-numbered ID in the audit log. Because the process periodically checkpoints its progress to the DB, it can be killed and will resume from where it left off when restarted. To override this behavior, see the additional options section. For production deployment, an external process should monitor and restart the replication process if it exits: the process exits if the number of consecutive failures while making RPCs or DB queries exceeds the configured number of retries.
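The split logging behavior described above (INFO to stderr, DEBUG detail to a file) could come from a log4j configuration along these lines. This is a hypothetical sketch, not the log4j.properties file shipped with ReAir; the appender names and log file name are assumptions:

```properties
# Everything flows from the root logger; appender thresholds do the filtering.
log4j.rootLogger=DEBUG, console, file

# Console appender: INFO and above to stderr.
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Target=System.err
log4j.appender.console.Threshold=INFO
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d %p [%t] %c{2}: %m%n

# File appender: DEBUG detail to a log file in the current working directory.
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=replication.log
log4j.appender.file.Threshold=DEBUG
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d %p [%t] %c: %m%n
```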

            Support

          • Blog Post
          • FAQ
          • Known Issues
          • Large HDFS Directory Copy
            Find more information at:

            CLONE
          • HTTPS: https://github.com/airbnb/reair.git
          • CLI: gh repo clone airbnb/reair
          • SSH: git@github.com:airbnb/reair.git



            Consider Popular Storage Libraries
          • localForage by localForage
          • seaweedfs by chrislusf
          • Cloudreve by cloudreve
          • store.js by marcuswestin
          • go-ipfs by ipfs

            Try Top Libraries by airbnb
          • javascript by airbnb (JavaScript)
          • lottie-android by airbnb (Java)
          • lottie-web by airbnb (JavaScript)
          • lottie-ios by airbnb (Swift)
          • visx by airbnb (TypeScript)