marisa-trie | Static memory-efficient Trie
kandi X-RAY | marisa-trie Summary
Static memory-efficient Trie-like structures for Python (2.x and 3.x) based on the marisa-trie C++ library.
Community Discussions
Trending Discussions on marisa-trie
QUESTION
I'm using a custom Cython wrapper of this marisa trie library as a key-value multimap.
My trie entries look like key 0xff data1 0xff data2 to map key to the tuple (data1, data2). data1 is a string of variable length but data2 is always a 4-byte unsigned int. The 0xff is a delimiter byte.
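The entry layout described above can be sketched in a few lines of Python. This is only an illustration of the format, not the asker's Cython wrapper: the helper names `encode_entry` and `decode_entry` are hypothetical, and the little-endian byte order for data2 is an assumption, since the question does not say how the 4-byte integer is packed.

```python
import struct

DELIM = b"\xff"  # delimiter byte named in the question


def encode_entry(key: bytes, data1: bytes, data2: int) -> bytes:
    """Build one entry as key 0xff data1 0xff data2 (hypothetical helper)."""
    if DELIM in key or DELIM in data1:
        raise ValueError("key and data1 must not contain the delimiter byte")
    # Assumption: data2 is packed as a little-endian 4-byte unsigned int.
    return key + DELIM + data1 + DELIM + struct.pack("<I", data2)


def decode_entry(entry: bytes):
    """Split an encoded entry back into (key, data1, data2)."""
    # Cut off the fixed-width integer first, because its bytes may be 0xff.
    body, packed = entry[:-4], entry[-4:]
    key, data1, _ = body.split(DELIM)  # body ends with the second delimiter
    return key, data1, struct.unpack("<I", packed)[0]


entry = encode_entry(b"cat", b"noun", 7)
print(entry)
print(decode_entry(entry))
```

Slicing off the last four bytes before splitting matters because data2 itself can contain 0xff bytes, whereas key and data1 are assumed never to.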
I know a trie is not the most optimal data structure for this from a theoretical point of view, but various practical considerations make it the best available choice.
In this use case, I have about 10-20 million keys, each with about 10 data points on average. data2 is redundant for many entries (in some cases, data2 is the same for all data points of a given key), so I had the idea of taking the most frequent data2 entry and adding a ("", base_data2) data point to each key.
Since a MARISA trie, to my knowledge, does not have suffix compression, and since each data1 is unique for a given key, I assumed that this would save 4 bytes per data tuple whose data2 is redundant (at the cost of adding a single 4-byte "value" for each key). Having rebuilt the trie, I checked that the redundant data was no longer being stored. I expected a sizable decrease in both serialized and in-memory size, but in fact the on-disk trie only went from 566MB to 557MB (with a similar reduction in RAM usage for a loaded trie).
From this I concluded that I must be wrong about there being no suffix compression. I was now storing the entries with a redundant data2 number as key 0xff data1 0xff, so to test this theory I removed the trailing 0xff and adjusted the code that uses the trie to cope. The new trie went down from 557MB to 535MB.
Removing a single redundant trailing byte therefore gave an improvement more than twice as large as removing the same number of 4-byte sequences, so either the suffix compression theory is dead wrong, or it's implemented in some very convoluted way.
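The comparison can be made explicit with the file sizes quoted above. This is plain arithmetic on the reported numbers; the variable names are illustrative.

```python
# Sizes reported in the question, in MB.
first_saving_mb = 566 - 557   # dropping the 4-byte data2 from redundant entries
second_saving_mb = 557 - 535  # dropping the 1-byte trailing 0xff from the same entries

# How much larger the second saving was overall.
ratio = second_saving_mb / first_saving_mb

# Saving per byte removed from each affected entry.
per_byte_first = first_saving_mb / 4
per_byte_second = second_saving_mb / 1

print(f"second saving is {ratio:.2f}x the first")
print(f"per removed byte: {per_byte_first:.2f} MB vs {per_byte_second:.2f} MB")
```

If stored size scaled with the raw bytes removed, the first change should have saved about four times as much as the second, which is why the measured result looks so odd.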
My remaining theory is that adding in the ("", base_data2) entry at a higher point in the trie somehow throws off the compression in some terrible way, but it should just be adding in 4 more bytes when I've removed many more than that from lower down in the trie.
I'm not optimistic for a fix, but I'd dearly like to know why I'm seeing this behavior! Thank you for your attention.
ANSWER
Answered 2017-Jul-12 at 20:47
As I suspected, it's caused by padding.
In lib/marisa/grimoire/vector/vector.h, there is the following function:
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install marisa-trie
No installation instructions are available at this moment for marisa-trie. Refer to the component home page for details.
Support
If you have any questions, visit the community on GitHub or Stack Overflow.