kamu-cli | New generation decentralized data warehouse and streaming

by kamu-data | Rust | Version: v0.128.0 | License: Non-SPDX

kandi X-RAY | kamu-cli Summary

kamu-cli is a Rust library typically used in Data Science applications. kamu-cli has no reported bugs or vulnerabilities, but it has low support and a Non-SPDX license. You can download it from GitHub.

kamu (pronounced kaˈmju) is an easy-to-use command-line tool for managing, transforming, and collaborating on structured data.

Support

kamu-cli has a low active ecosystem.
It has 205 stars, 8 forks, and 15 watchers.
There were 10 major releases in the last 12 months.
There are 6 open issues and 29 closed issues. On average, issues are closed in 222 days. There are 10 open pull requests and 0 closed pull requests.
It has a neutral sentiment in the developer community.
The latest version of kamu-cli is v0.128.0.

Quality

              kamu-cli has 0 bugs and 0 code smells.

Security

              kamu-cli has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              kamu-cli code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

License

              kamu-cli has a Non-SPDX License.
A Non-SPDX license may be an open-source license that simply is not SPDX-compliant, or it may be a non-open-source license, so review it closely before use.

Reuse

              kamu-cli releases are available to install and integrate.
              Installation instructions are available. Examples and code snippets are not available.

            kamu-cli Key Features

            No Key Features are available at this moment for kamu-cli.

            kamu-cli Examples and Code Snippets

            No Code Snippets are available at this moment for kamu-cli.

            Community Discussions

            QUESTION

            How reproducible / deterministic is Parquet format?
            Asked 2021-Dec-09 at 03:55

            I'm seeking advice from people deeply familiar with the binary layout of Apache Parquet:

Given a data transformation F(a) = b, where F is fully deterministic and the exact same versions of the entire software stack (framework, arrow & parquet libraries) are used - how likely am I to get an identical binary representation of dataframe b on different hosts every time b is saved into Parquet?

In other words, how reproducible is Parquet at the binary level? When data is logically the same, what can cause binary differences?

• Can there be some uninitialized memory in between values due to alignment?
• Assuming all serialization settings (compression, chunking, use of dictionaries, etc.) are the same, can the result still drift?
            Context

            I'm working on a system for fully reproducible and deterministic data processing and computing dataset hashes to assert these guarantees.

My key goal has been to ensure that dataset b contains an identical set of records as dataset b' - this is of course very different from hashing a binary representation of Arrow/Parquet. Not wanting to deal with the reproducibility of storage formats, I've been computing logical data hashes in memory. This is slow but flexible, e.g. my hash stays the same even if records are re-ordered (which I consider an equivalent dataset).

            But when thinking about integrating with IPFS and other content-addressable storages that rely on hashes of files - it would simplify the design a lot to have just one hash (physical) instead of two (logical + physical), but this means I have to guarantee that Parquet files are reproducible.
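
To make the byte-level question concrete, here is a minimal sketch (my addition, not part of the original post) of the kind of check being discussed, assuming the Rust arrow, parquet, sha2, and anyhow crates: write the same record batch to an in-memory Parquet buffer twice and compare digests. Equal digests only demonstrate reproducibility on one host with one library version; they are not a cross-version or cross-platform guarantee.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use sha2::{Digest, Sha256};

/// Serialize a batch to an in-memory Parquet file and return its SHA-256.
fn parquet_digest(batch: &RecordBatch) -> anyhow::Result<[u8; 32]> {
    let mut buf = Vec::new();
    // Default writer properties; for a fair comparison, compression,
    // row-group size, dictionary settings, etc. must be held fixed.
    let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), None)?;
    writer.write(batch)?;
    writer.close()?;
    Ok(Sha256::digest(&buf).into())
}

fn main() -> anyhow::Result<()> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, false)]));
    let col: ArrayRef = Arc::new(Int64Array::from(vec![1, 2, 3]));
    let batch = RecordBatch::try_new(schema, vec![col])?;

    // Writing the same logical data twice on the same host and library version.
    let same = parquet_digest(&batch)? == parquet_digest(&batch)?;
    println!("identical bytes across two writes: {same}");
    Ok(())
}
```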

            Update

            I decided to continue using logical hashing for now.

I've created a new Rust crate, arrow-digest, that implements stable hashing for Arrow arrays and record batches and tries hard to hide the encoding-related differences. The crate's README describes the hashing algorithm, in case someone finds it useful and wants to implement it in another language.

            I'll continue to expand the set of supported types as I'm integrating it into the decentralized data processing tool I'm working on.

            In the long term, I'm not sure logical hashing is the best way forward - a subset of Parquet that makes some efficiency sacrifices just to make file layout deterministic might be a better choice for content-addressability.
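
For contrast, here is a toy sketch (my addition, not the arrow-digest algorithm) of the logical-hashing idea described above: hash each decoded value on its own and fold the per-row digests with a commutative operation, so re-ordering records does not change the result. A real implementation would also have to fold in the data type, null bitmap, and multiple columns.

```rust
use arrow::array::{Array, Int64Array};
use sha2::{Digest, Sha256};

/// Order-insensitive "logical" hash of an i64 column: each value is hashed
/// independently and the per-row digests are folded with wrapping addition,
/// so permuting the rows leaves the result unchanged.
fn logical_hash_i64(col: &Int64Array) -> u64 {
    let mut acc: u64 = 0;
    for i in 0..col.len() {
        // A real implementation must also account for nulls and the type.
        let digest = Sha256::digest(col.value(i).to_le_bytes());
        let word = u64::from_le_bytes(digest[..8].try_into().unwrap());
        acc = acc.wrapping_add(word);
    }
    acc
}

fn main() {
    let a = Int64Array::from(vec![1, 2, 3]);
    let b = Int64Array::from(vec![3, 1, 2]); // same records, different order
    assert_eq!(logical_hash_i64(&a), logical_hash_i64(&b));
    println!("logical hashes match despite different row order");
}
```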

            ...

            ANSWER

            Answered 2021-Dec-05 at 04:30

At least in Arrow's implementation, I would expect (but haven't verified) that the exact same input (including identical metadata), written in the same order and with the same configuration, yields deterministic output (we try not to leave uninitialized values, for security reasons), assuming the chosen compression algorithm also makes that determinism guarantee. It is possible there is some hash-map iteration for metadata or elsewhere that might also break this assumption.

As @Pace pointed out, I would not rely on this and recommend against relying on it. There is nothing in the spec that guarantees this, and since the writer version is persisted when writing a file, you are guaranteed a breakage if you ever decide to upgrade. Things will also break if additional metadata is added or removed (I believe in the past there have been some bug fixes for round-tripping data sets that would have caused non-determinism).

So, in summary, this might or might not work today, but even if it does, I would expect it to be very brittle.
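
One of the failure modes mentioned above is easy to reproduce in isolation (my own illustration, not from the answer): Rust's standard HashMap uses a randomized hasher, so if key/value metadata were serialized by iterating such a map, two runs could emit the entries in different orders and therefore different bytes.

```rust
use std::collections::HashMap;

fn main() {
    // HashMap's default RandomState seeds the hasher per process, so the
    // iteration order below can change from one run of the program to the
    // next even though the contents are identical.
    let mut metadata = HashMap::new();
    metadata.insert("writer", "parquet-rs");
    metadata.insert("ARROW:schema", "<base64 payload>");
    metadata.insert("created_by", "example");

    // If a file format serialized metadata in this order, logically equal
    // files could differ at the byte level.
    for (key, value) in &metadata {
        println!("{key} = {value}");
    }
}
```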

            Source https://stackoverflow.com/questions/70220970

Community Discussions and Code Snippets contain sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install kamu-cli

            Watch this introductory video to see kamu in action.
            Learn how to use kamu with this self-serve demo without needing to install anything.
            Then follow the "Getting Started" section of our documentation to install the tool and try a bunch of examples.

            Support

If you like what we're doing, support us by starring the repo - this helps us a lot! Subscribe to our YouTube channel to get fresh tech talks and deep dives. Stop by and say "hi" in our Discord server - we're always happy to chat about data.
            Find more information at:
