marisa-trie | Static memory-efficient Trie | Natural Language Processing library
kandi X-RAY | marisa-trie Summary
Static memory-efficient Trie-like structures for Python (2.x and 3.x) based on marisa-trie C++ library.
Top functions reviewed by kandi - BETA
- Benchmark
- Benchmark a benchmark
- Format a result
- Run profiling
- Create a Trie
- Generate random words
- Returns a list of words 100kk
- Split a list of words
- Truncate a list of words
Community Discussions
Trending Discussions on marisa-trie
QUESTION
I'm using a custom Cython wrapper of this marisa trie library as a key-value multimap.
My trie entries look like key 0xff data1 0xff data2 to map key to the tuple (data1, data2). data1 is a string of variable length, but data2 is always a 4-byte unsigned int. The 0xff is a delimiter byte.
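The entry layout described above can be sketched in Python as follows. This is only an illustration, not the asker's Cython wrapper: the function names are hypothetical, and it assumes that key and data1 are UTF-8 strings (valid UTF-8 never contains the byte 0xff) and that data2 is packed big-endian, neither of which the question states.

```python
import struct
from typing import Tuple

DELIM = b"\xff"  # delimiter byte from the question

def encode_entry(key: str, data1: str, data2: int) -> bytes:
    """Pack one (key, data1, data2) data point into a single trie entry."""
    return (
        key.encode("utf-8") + DELIM
        + data1.encode("utf-8") + DELIM
        + struct.pack(">I", data2)  # 4-byte unsigned int; byte order is an assumption
    )

def decode_entry(entry: bytes) -> Tuple[str, str, int]:
    """Inverse of encode_entry."""
    # data2 is fixed-width and may itself contain 0xff, so peel it off the end first.
    body, packed = entry[:-4], entry[-4:]
    # body is "key 0xff data1 0xff"; drop the trailing delimiter, then split once.
    key, data1 = body[:-1].split(DELIM, 1)
    return key.decode("utf-8"), data1.decode("utf-8"), struct.unpack(">I", packed)[0]
```

All entries for one key share the prefix key 0xff, which is what lets a prefix lookup on the trie act as a multimap lookup.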
I know a trie is not the optimal data structure for this from a theoretical point of view, but various practical considerations make it the best available choice.
In this use case, I have about 10-20 million keys, each with an average of 10 data points. data2 is redundant for many entries (in some cases, data2 is the same for all data points for a given key), so I had the idea of taking the most frequent data2 entry and adding a ("", base_data2) data point to each key.
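A minimal sketch of that idea, under stated assumptions: build_entries is a hypothetical helper, data2 is packed big-endian, key and data1 are UTF-8 strings, and an empty data1 is reserved for the ("", base_data2) entry. Entries whose data2 equals the base value are stored as key 0xff data1 0xff with no trailing integer, which matches the form the question describes.

```python
import struct
from collections import Counter
from typing import List, Tuple

DELIM = b"\xff"  # delimiter byte from the question

def build_entries(key: str, points: List[Tuple[str, int]]) -> List[bytes]:
    """Encode all (data1, data2) points of one key, factoring out the most common data2."""
    # Most frequent data2 for this key becomes the per-key default.
    base, _count = Counter(d2 for _d1, d2 in points).most_common(1)[0]
    kb = key.encode("utf-8")
    # The ("", base_data2) data point: empty data1, then the default value.
    entries = [kb + DELIM + DELIM + struct.pack(">I", base)]
    for data1, data2 in points:
        prefix = kb + DELIM + data1.encode("utf-8") + DELIM
        # Only store data2 explicitly when it differs from the default.
        entries.append(prefix if data2 == base else prefix + struct.pack(">I", data2))
    return entries
```

Counting raw bytes, this drops 4 bytes from every entry that matches the default and adds one short extra entry per key. Whether the serialized trie shrinks by the same amount depends on how the library lays out its storage.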
Since a MARISA trie, to my knowledge, does not have suffix compression and for a given key each data1 is unique, I assumed that this would save 4 bytes per data tuple that uses a redundant key (plus adding in a single 4-byte "value" for each key). Having rebuilt the trie, I checked that the redundant data was no longer being stored. I expected a sizable decrease in both serialized and in-memory size, but in fact the on-disk trie went from 566MB to 557MB (and a similar reduction in RAM usage for a loaded trie).
From this I concluded that I must be wrong about there being no suffix compression. I was now storing the entries with a redundant data2 number as key 0xff data1 0xff, so to test this theory I removed the trailing 0xff and adjusted the code that uses the trie to cope. The new trie went down from 557MB to 535MB.
Removing a single redundant trailing byte therefore gave an improvement more than twice as large as removing the same number of 4-byte sequences, so either the suffix compression theory is dead wrong, or it's implemented in some very convoluted way.
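For reference, the comparison can be checked with a few lines of arithmetic using only the three sizes quoted in the question (variable names are illustrative):

```python
# On-disk trie sizes in MB, as reported in the question.
size_original = 566       # every entry carries its 4-byte data2
size_without_data2 = 557  # redundant data2 values removed
size_without_delim = 535  # trailing 0xff also removed from those entries

saved_by_data2 = size_original - size_without_data2       # 4 bytes removed per entry
saved_by_delim = size_without_data2 - size_without_delim  # 1 byte removed per entry
ratio = saved_by_delim / saved_by_data2

print(saved_by_data2, saved_by_delim, round(ratio, 2))
```

The one-byte change saved 22 MB against 9 MB for the four-byte change, a ratio of roughly 2.4.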
My remaining theory is that adding in the ("", base_data2) entry at a higher point in the trie somehow throws off the compression in some terrible way, but it should just be adding in 4 more bytes when I've removed many more than that from lower down in the trie.
I'm not optimistic for a fix, but I'd dearly like to know why I'm seeing this behavior! Thank you for your attention.
ANSWER
Answered 2017-Jul-12 at 20:47
As I suspected, it's caused by padding.
In lib/marisa/grimoire/vector/vector.h, there is the following function:
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install marisa-trie
You can use marisa-trie like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.