kamu-cli | New generation decentralized data warehouse and streaming
kandi X-RAY | kamu-cli Summary
kandi X-RAY | kamu-cli Summary
kamu (pronounced kaˈmju) is an easy-to-use command-line tool for managing, transforming, and collaborating on structured data.
Support
Quality
Security
License
Reuse
Top functions reviewed by kandi - BETA
Currently covering the most popular Java, JavaScript and Python libraries. See a Sample of kamu-cli
kamu-cli Key Features
kamu-cli Examples and Code Snippets
Community Discussions
Trending Discussions on kamu-cli
QUESTION
I'm seeking advice from people deeply familiar with the binary layout of Apache Parquet:
Having a data transformation F(a) = b
where F
is fully deterministic, and same exact versions of the entire software stack (framework, arrow & parquet libraries) are used - how likely am I to get an identical binary representation of dataframe b
on different hosts every time b
is saved into Parquet?
In other words how reproducible Parquet is on binary level? When data is logically the same what can cause binary differences?
- Can there be some uninit memory in between values due to alignment?
- Assuming all serialization settings (compression, chunking, use of dictionaries etc.) are the same, can result still drift?
I'm working on a system for fully reproducible and deterministic data processing and computing dataset hashes to assert these guarantees.
My key goal has been to ensure that dataset b
contains an idendital set of records as dataset b'
- this is of course very different from hashing a binary representation of Arrow/Parquet. Not wanting to deal with the reproducibility of storage formats I've been computing logical data hashes in memory. This is slow but flexible, e.g. my hash stays the same even if records are re-ordered (which I consider an equivalent dataset).
But when thinking about integrating with IPFS
and other content-addressable storages that rely on hashes of files - it would simplify the design a lot to have just one hash (physical) instead of two (logical + physical), but this means I have to guarantee that Parquet files are reproducible.
I decided to continue using logical hashing for now.
I've created a new Rust crate arrow-digest that implements the stable hashing for Arrow arrays and record batches and tries hard to hide the encoding-related differences. The crate's README describes the hashing algorithm if someone finds it useful and wants to implement it in another language.
I'll continue to expand the set of supported types as I'm integrating it into the decentralized data processing tool I'm working on.
In the long term, I'm not sure logical hashing is the best way forward - a subset of Parquet that makes some efficiency sacrifices just to make file layout deterministic might be a better choice for content-addressability.
...ANSWER
Answered 2021-Dec-05 at 04:30At least in arrow's implementation I would expect, but haven't verified the exact same input (including identical metadata) in the same order to yield deterministic outputs (we try not to leave uninitialized values for security reasons) with the same configuration (assuming the compression algorithm chosen also makes the deterministic guarantee). It is possible there is some hash-map iteration for metadata or elsewhere that might also break this assumption.
As @Pace pointed out I would not rely on this and recommend against relying on it). There is nothing in the spec that guarantees this and since the writer version is persisted when writing a file you are guaranteed a breakage if you ever decided to upgrade. Things will also break if additional metadata is added or removed ( I believe in the past there have been some big fixes for round tripping data sets that would have caused non-determinism).
So in summary this might or might not work today but even if it does I would expect this would be very brittle.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install kamu-cli
Learn how to use kamu with this self-serve demo without needing to install anything.
Then follow the "Getting Started" section of our documentation to install the tool and try a bunch of examples.
Support
Reuse Trending Solutions
Find, review, and download reusable Libraries, Code Snippets, Cloud APIs from over 650 million Knowledge Items
Find more librariesStay Updated
Subscribe to our newsletter for trending solutions and developer bootcamps
Share this Page