marisa-trie | Static memory-efficient Trie

by pytries | Python | Version: Current | License: MIT

kandi X-RAY | marisa-trie Summary

Static memory-efficient Trie-like structures for Python (2.x and 3.x) based on the marisa-trie C++ library.

            Community Discussions

            Trending Discussions on marisa-trie

            QUESTION

            marisa trie suffix compression?
            Asked 2017-Jul-12 at 20:47

            I'm using a custom Cython wrapper of this marisa trie library as a key-value multimap.

            My trie entries look like key 0xff data1 0xff data2 to map key to the tuple (data1, data2). data1 is a string of variable length but data2 is always a 4-byte unsigned int. The 0xff is a delimiter byte.
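
The entry layout described above can be sketched in plain Python. This is only an illustration, not the asker's Cython wrapper: the function names `encode_entry` and `decode_entry` are hypothetical, and the little-endian byte order for data2 is an assumption, since the question does not state it.

```python
import struct

DELIM = b"\xff"  # delimiter byte named in the question


def encode_entry(key: bytes, data1: bytes, data2: int) -> bytes:
    """Build `key 0xff data1 0xff data2`, with data2 as a 4-byte unsigned int.

    Assumption: little-endian packing ("<I"); the question does not say which.
    """
    if DELIM in key or DELIM in data1:
        raise ValueError("key and data1 must not contain the delimiter byte")
    return key + DELIM + data1 + DELIM + struct.pack("<I", data2)


def decode_entry(entry: bytes):
    """Split an entry back into (key, data1, data2)."""
    # data2 is fixed width and may itself contain 0xff, so peel it off first.
    body, packed = entry[:-4], entry[-4:]
    if not body.endswith(DELIM):
        raise ValueError("missing delimiter before data2")
    key, data1 = body[:-1].split(DELIM, 1)
    return key, data1, struct.unpack("<I", packed)[0]
```

Decoding from the end matters here: the 4-byte integer can contain 0xff bytes, so splitting the whole entry on the delimiter would be ambiguous.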

            I know a trie is not the optimal data structure for this from a theoretical point of view, but various practical considerations make it the best available choice.

            In this use case, I have about 10-20 million keys, each with 10 data points on average. data2 is redundant for many entries (in some cases, data2 is the same for all data points of a given key), so I had the idea of taking the most frequent data2 entry and adding a ("", base_data2) data point to each key.
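
A rough sketch of that factoring step, under stated assumptions: the most frequent data2 is chosen per key (the question leaves open whether it is per key or global), real data points never have an empty data1, and `None` stands in for "data2 omitted, use the base". The helper names `factor_out_base` and `restore` are hypothetical.

```python
from collections import Counter


def factor_out_base(points):
    """points: list of (data1, data2) pairs for a single key.

    Returns a list that starts with ("", base_data2), followed by the
    original points with data2 replaced by None wherever it equals the base.
    """
    base, _count = Counter(d2 for _d1, d2 in points).most_common(1)[0]
    factored = [("", base)]
    for d1, d2 in points:
        factored.append((d1, None if d2 == base else d2))
    return factored


def restore(factored):
    """Inverse of factor_out_base: fill the base value back in."""
    base = factored[0][1]
    return [(d1, base if d2 is None else d2) for d1, d2 in factored[1:]]
```

For example, the points `[("a", 5), ("b", 5), ("c", 9)]` become `[("", 5), ("a", None), ("b", None), ("c", 9)]`: two 4-byte values are dropped and one is added.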

            Since a MARISA trie, to my knowledge, does not have suffix compression and for a given key each data1 is unique, I assumed that this would save 4 bytes per data tuple that uses a redundant key (plus adding in a single 4-byte "value" for each key). Having rebuilt the trie, I checked that the redundant data was no longer being stored. I expected a sizable decrease in both serialized and in-memory size, but in fact the on-disk trie went from 566MB to 557MB (and a similar reduction in RAM usage for a loaded trie).

            From this I concluded that I must be wrong about there being no suffix compression. I was now storing the entries with a redundant data2 number as key 0xff data1 0xff, so to test this theory I removed the trailing 0xff and adjusted the code that uses the trie to cope. The new trie went down from 557MB to 535MB.

            Removing a single redundant trailing byte gave more than twice the improvement of removing the same number of 4-byte sequences, so either the suffix compression theory is dead wrong, or it's implemented in some very convoluted way.
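
The comparison can be checked with the sizes quoted in the question. The per-byte figure below takes the question's statement that the same number of entries was affected in both steps at face value.

```python
# On-disk sizes in MB, as reported in the question.
original = 566
without_redundant_data2 = 557   # 4 bytes removed per affected entry
without_trailing_delim = 535    # 1 further byte removed per affected entry

saved_by_data2 = original - without_redundant_data2
saved_by_delim = without_redundant_data2 - without_trailing_delim

# How much more the 1-byte change saved than the 4-byte change.
ratio = saved_by_delim / saved_by_data2

# Savings per removed byte, assuming the same number of affected entries.
per_byte_ratio = (saved_by_delim / 1) / (saved_by_data2 / 4)

print(saved_by_data2, saved_by_delim, round(ratio, 2), round(per_byte_ratio, 1))
```

Per byte removed, the delimiter change saved roughly ten times as much as dropping data2, which is the discrepancy the question is asking about.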

            My remaining theory is that adding in the ("", base_data2) entry at a higher point in the trie somehow throws off the compression in some terrible way, but it should just be adding in 4 more bytes when I've removed many more than that from lower down in the trie.

            I'm not optimistic for a fix, but I'd dearly like to know why I'm seeing this behavior! Thank you for your attention.

            ...

            ANSWER

            Answered 2017-Jul-12 at 20:47

            As I suspected, it's caused by padding.

            In lib/marisa/grimoire/vector/vector.h, there is the following function:

            Source https://stackoverflow.com/questions/44895094

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install marisa-trie

            No installation instructions are available at this moment for marisa-trie. Refer to the component home page for details.

            Support

            For feature suggestions or bugs, create an issue on GitHub.
            If you have any questions, visit the community on GitHub or Stack Overflow.
            Find more information at:

            CLONE
          • sshUrl

            git@github.com:pytries/marisa-trie.git
