reair | Tools for replicating tables | Storage library
kandi X-RAY | reair Summary
The replication features in ReAir are useful for the following use cases:
- Warehouse migration: when migrating a Hive warehouse, ReAir can be used to copy existing data to the new warehouse. Because ReAir copies both data and metadata, datasets are ready to query as soon as the copy completes.
- Workload isolation: while many organizations start out with a single Hive warehouse, they often want better isolation between production and ad hoc workloads. Two isolated Hive warehouses accommodate this need well, and with two warehouses there is a need to replicate evolving datasets. ReAir can be used to replicate data from one warehouse to another and propagate updates incrementally as they occur.
- Disaster recovery: ReAir can be used to replicate datasets to a hot-standby warehouse for fast failover.
To accommodate these use cases, ReAir includes both batch and incremental replication tools. Batch replication executes a one-time copy of a list of tables, while incremental replication is a long-running process that copies objects as they are created or changed on the source warehouse. A minimal launch sketch for each mode is shown below.
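A minimal sketch of launching each mode, assuming the conventions from the install section further down: the JAR name, config file name, and entry-point class names (com.airbnb.reair.batch.hive.MetastoreReplicationJob for batch, com.airbnb.reair.incremental.deploy.ReplicationLauncher for incremental) are assumptions and may differ in your build of ReAir.

```bash
# Batch replication: one-time copy of a configured list of tables.
# Class and JAR names are assumptions; check your ReAir build.
hadoop jar airbnb-reair-main-1.0.0-all.jar \
  com.airbnb.reair.batch.hive.MetastoreReplicationJob \
  --config-files my_config_file.xml

# Incremental replication: long-running process that follows changes
# on the source warehouse and copies objects as they appear.
hadoop jar airbnb-reair-main-1.0.0-all.jar \
  com.airbnb.reair.incremental.deploy.ReplicationLauncher \
  --config-files my_config_file.xml
```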
Top functions reviewed by kandi - BETA
- Main entry point for the copy
- Retrieves a single partition
- Returns the ThreadLocal metastore client
- Copy a table
- Returns true if the field with the specified field ID is set
- Run the replication task
- Run a single MR copy job
- Run a batch replication
- Runs a commit change job
- Returns true if the field is set
- Resets this record so that it can be reused
- Compares this job info with the specified value
- Insert a query entry into the audit log
- Fetch the replication job data from the thrift server
- Runs the copy partition task
- Launches the audit log entry
- Retrieves a job information from the database
- Splits the input splits into chunks
- Get the input splits
- Run the rename task
- Perform a mapping operation
- Main entry point
- Rename the destination
- Sets the field value
- Returns a string representation of this TReplicationJob
- Ordered by id
Community Discussions
Trending Discussions on reair
QUESTION
I am trying to sync files from one Hadoop cluster to another using DistCp and Airbnb's ReAir utility, but neither of them is working as expected.
If the file size is the same on the source and the destination, both tools fail to update the file even when the file contents have changed (the checksums also differ), unless the overwrite option is used.
I need to keep around 30 TB of data in sync, so reloading the complete dataset every time is not feasible.
Could anyone please suggest how I can bring the two datasets in sync when the file sizes are the same (the record count in the source has changed) but the checksums differ?
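For context, the behavior described above corresponds to DistCp's incremental (-update) mode, which skips files it considers unchanged, while -overwrite forces a full recopy. A sketch with placeholder cluster URIs and paths:

```bash
# Incremental sync: copies only files DistCp considers changed
# (same length, and same checksum where checksums are comparable, means "skip").
hadoop distcp -update \
  hdfs://source-cluster:8020/warehouse/db1 \
  hdfs://dest-cluster:8020/warehouse/db1

# Forcing a recopy works but is impractical for ~30 TB, which is the asker's constraint.
hadoop distcp -overwrite \
  hdfs://source-cluster:8020/warehouse/db1 \
  hdfs://dest-cluster:8020/warehouse/db1
```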
ANSWER
Answered 2018-Jan-24 at 04:17
The way DistCp handles syncing between files that are the same size but have different contents is by comparing their so-called FileChecksum. The FileChecksum was first introduced in HADOOP-3981, mostly for the purpose of being used in DistCp. Unfortunately, it has the known shortcoming of being incompatible between different storage implementations, and even between HDFS instances that use different internal block/chunk settings. Specifically, the FileChecksum bakes in the structure of having, for example, 512 bytes per chunk and 128 MB per block.
Since GCS doesn't have the same notions of "chunks" or "blocks", there's no way for it to have any similar definition of a FileChecksum. The same is also true of all other object stores commonly used with Hadoop; the DistCp documentation appendix discusses this fact under "DistCp and Object Stores".
That said, there's a neat trick that can be done to define a nice standardized representation of a composite CRC for HDFS files that is mostly in-place compatible with existing HDFS deployments; I've filed HDFS-13056 with a proof of concept to try to get this added upstream, after which it should be possible to make it work out-of-the-box against GCS, since GCS also supports file-level CRC32C.
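To see the checksum comparison the answer describes, the FileChecksum can be inspected directly on each cluster. The sketch below uses standard HDFS shell commands with placeholder paths; the COMPOSITE_CRC property comes from HDFS-13056 and is only available on Hadoop releases that shipped it, so treat it as version-dependent.

```bash
# Print the FileChecksum that DistCp compares. Different block/chunk settings
# can make these values incomparable even when the file bytes are identical.
hdfs dfs -checksum hdfs://source-cluster:8020/warehouse/db1/t1/part-00000
hdfs dfs -checksum hdfs://dest-cluster:8020/warehouse/db1/t1/part-00000

# On Hadoop versions that include HDFS-13056, a block-size-independent
# composite CRC can be requested instead (verify the property on your version).
hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC \
  -checksum hdfs://source-cluster:8020/warehouse/db1/t1/part-00000
```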
Community Discussions and Code Snippets include sources from the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install reair
If the MySQL tables for incremental replication were not set up while setting up the audit log, create the state tables for incremental replication on the desired MySQL instance by running the create table commands listed here.
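A minimal sketch of that step, assuming the CREATE TABLE statements from the ReAir documentation have been saved locally; the file name, host, user, and database below are placeholders:

```bash
# Create the incremental-replication state tables on the chosen MySQL instance.
# reair_state_tables.sql is assumed to hold the DDL referenced above.
mysql -h mysql-host.example.com -u reair -p reair_db < reair_state_tables.sql
```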
Read through and fill out the configuration from the template. You might want to deploy the file to a widely accessible location.
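One way to make the filled-out configuration widely accessible is to keep a copy in a well-known local path and optionally publish it to HDFS; a sketch with placeholder paths (whether the launcher reads configuration from HDFS or only from the local filesystem depends on your deployment):

```bash
# Keep the filled-out configuration in a predictable location on the launch host.
sudo mkdir -p /etc/reair
sudo cp my_config_file.xml /etc/reair/my_config_file.xml

# Optionally publish it to HDFS so other hosts can fetch the same copy.
hadoop fs -mkdir -p /deploy/reair
hadoop fs -put -f my_config_file.xml /deploy/reair/
```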
Switch to the repo directory and build the JAR. You can skip the unit tests if no changes have been made (via the '-x test' flag).
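A sketch of the build step, assuming the upstream Gradle wrapper and a shadow-JAR task as in the airbnb/reair repository; the task name and output path are assumptions, so check the repository's build instructions if they differ:

```bash
cd reair
# Build the deployable JAR; '-x test' skips the unit tests.
./gradlew shadowJar -x test
# The runnable JAR typically lands under a build/libs directory (path is an assumption).
ls main/build/libs/
```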
To start replicating, set options to point to the appropriate logging configuration and kick off the replication launcher by using the hadoop jar command on the destination cluster. An example log4j.properties file is provided here. Be sure to specify the configuration file that was filled out in the prior step. As with batch replication, you may need to run the process as a different user.
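Putting those pieces together, a hedged launch sketch for the destination cluster; the entry-point class and JAR name are the same assumptions as in the earlier sketch, and the log4j and configuration paths are placeholders:

```bash
# Point the client JVM at the logging configuration.
export HADOOP_OPTS="-Dlog4j.configuration=file:///etc/reair/log4j.properties"

# Launch incremental replication; run as a different user (e.g. via sudo -u) if required.
hadoop jar airbnb-reair-main-1.0.0-all.jar \
  com.airbnb.reair.incremental.deploy.ReplicationLauncher \
  --config-files /etc/reair/my_config_file.xml
```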
Verify that entries are replicated properly by creating a test table on the source warehouse and checking to see if it appears on the destination warehouse.
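A quick end-to-end check along those lines, assuming the Hive CLI is available on both warehouses; database and table names are placeholders:

```bash
# On the source warehouse: create a small test table and write a row.
hive -e "CREATE TABLE IF NOT EXISTS reair_test (id INT); INSERT INTO reair_test VALUES (1);"

# On the destination warehouse: once replication has caught up,
# the table and its row should be visible.
hive -e "SHOW TABLES LIKE 'reair_test'; SELECT * FROM reair_test;"
```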