marisa-trie | Static memory-efficient Trie | Natural Language Processing library

by pytries | Python | Version: 1.2.0 | License: MIT

kandi X-RAY | marisa-trie Summary

marisa-trie is a Python library typically used in Artificial Intelligence and Natural Language Processing applications. It has no reported vulnerabilities, a build file, a permissive license, and medium support; however, it has 10 reported bugs. You can install it with 'pip install marisa-trie' or download it from GitHub or PyPI.

Static memory-efficient Trie-like structures for Python (2.x and 3.x) based on marisa-trie C++ library.

            Support

              marisa-trie has a medium active ecosystem.
              It has 971 stars and 89 forks. There are 29 watchers for this library.
              There was 1 major release in the last 6 months.
              There are 16 open issues and 43 have been closed. On average issues are closed in 355 days. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of marisa-trie is 1.2.0

            Quality

              marisa-trie has 10 bugs (0 blocker, 0 critical, 10 major, 0 minor) and 8 code smells.

            Security

              marisa-trie has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              marisa-trie code analysis shows 0 unresolved vulnerabilities.
              There are 3 security hotspots that need review.

            License

              marisa-trie is licensed under the MIT License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              marisa-trie releases are available to install and integrate.
              Deployable package is available in PyPI.
              Build file is available. You can build the component from source.
              marisa-trie saves you 317 person hours of effort in developing the same functionality from scratch.
              It has 761 lines of code, 79 functions and 10 files.
              It has low code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed marisa-trie and discovered the below as its top functions. This is intended to give you an instant insight into marisa-trie implemented functionality, and help decide if they suit your requirements.
            • Benchmark
            • Benchmark a benchmark
            • Format a result
            • Run profiling
            • Create a Trie
            • Generate random words
            • Returns a list of words 100kk
            • Split a list of words
            • Truncate a list of words

            Community Discussions

            Trending Discussions on marisa-trie

            QUESTION

            marisa trie suffix compression?
            Asked 2017-Jul-12 at 20:47

            I'm using a custom Cython wrapper of this marisa trie library as a key-value multimap.

            My trie entries look like key 0xff data1 0xff data2 to map key to the tuple (data1, data2). data1 is a string of variable length but data2 is always a 4-byte unsigned int. The 0xff is a delimiter byte.
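A minimal sketch of the entry layout described above. The little-endian byte order, the UTF-8 text fields, and the helper names `encode_entry` and `decode_entry` are assumptions for illustration; the question does not specify them. Because a packed integer can itself contain 0xff bytes, the decoder reads data2 from the last four bytes instead of splitting the whole entry on the delimiter.

```python
import struct
from typing import Tuple

# 0xff never occurs in valid UTF-8, so it is safe as a separator between text fields.
DELIM = b"\xff"


def encode_entry(key: str, data1: str, data2: int) -> bytes:
    """Build 'key 0xff data1 0xff data2' with data2 as a 4-byte unsigned int.

    Little-endian packing is an assumption; the question does not state the byte order.
    """
    return (
        key.encode("utf-8")
        + DELIM
        + data1.encode("utf-8")
        + DELIM
        + struct.pack("<I", data2)
    )


def decode_entry(entry: bytes) -> Tuple[str, str, int]:
    """Invert encode_entry. data2 is fixed-width, so slice it off the end first."""
    body, packed = entry[:-4], entry[-4:]
    (data2,) = struct.unpack("<I", packed)
    # body is 'key 0xff data1 0xff', so splitting leaves an empty trailing piece.
    key, data1, trailing = body.split(DELIM)
    if trailing:
        raise ValueError("malformed entry")
    return key.decode("utf-8"), data1.decode("utf-8"), data2
```

Storing such byte strings as trie keys turns the trie into a multimap: all entries for one key share the prefix `key 0xff`, so a prefix query on that prefix enumerates the key's data points.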

            I know a trie is not the optimal data structure for this from a theoretical point of view, but various practical considerations make it the best available choice.

            In this use case, I have about 10-20 million keys, each with 10 data points on average. data2 is redundant for many entries (in some cases, data2 is the same for all data points of a given key), so I had the idea of taking the most frequent data2 entry and adding a ("", base_data2) data point to each key.
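The factoring idea can be sketched as follows. This is an illustration under assumptions, not the asker's code: the function names are hypothetical, `None` stands for "data2 omitted, use the base value", and the savings estimate is the naive one (4 bytes per omitted value, minus 4 bytes for the added base entry) that ignores how the trie actually stores its tails.

```python
from collections import Counter
from typing import List, Optional, Tuple

Point = Tuple[str, int]                      # (data1, data2)
FactoredPoint = Tuple[str, Optional[int]]    # data2 is None when it equals the base


def factor_out_base(points: List[Point]) -> List[FactoredPoint]:
    """Add a ("", base_data2) point and drop data2 wherever it equals the base."""
    base, _count = Counter(d2 for _d1, d2 in points).most_common(1)[0]
    factored: List[FactoredPoint] = [("", base)]
    for d1, d2 in points:
        factored.append((d1, None if d2 == base else d2))
    return factored


def naive_bytes_saved(points: List[Point]) -> int:
    """Naive estimate: 4 bytes per omitted data2, minus 4 for the base entry."""
    _base, count = Counter(d2 for _d1, d2 in points).most_common(1)[0]
    return 4 * count - 4
```

For a key whose points are `[("a", 5), ("b", 5), ("c", 9)]`, the base is 5, two values are omitted, and the naive estimate is a saving of 4 bytes for that key.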

            Since a MARISA trie, to my knowledge, does not have suffix compression and for a given key each data1 is unique, I assumed that this would save 4 bytes per data tuple that uses a redundant key (plus adding in a single 4-byte "value" for each key). Having rebuilt the trie, I checked that the redundant data was no longer being stored. I expected a sizable decrease in both serialized and in-memory size, but in fact the on-disk trie went from 566MB to 557MB (and a similar reduction in RAM usage for a loaded trie).

            From this I concluded that I must be wrong about there being no suffix compression. I was now storing the entries with a redundant data2 number as key 0xff data1 0xff, so to test this theory I removed the trailing 0xff and adjusted the code that uses the trie to cope. The new trie went down from 557MB to 535MB.

            So removing a single redundant trailing byte made a 2x larger improvement than removing the same number of 4-byte sequences, so either the suffix compression theory is dead wrong, or it's implemented in some very convoluted way.
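The comparison can be checked with the sizes quoted above; the variable names are only for illustration.

```python
# On-disk trie sizes in MB, as reported in the question.
size_original = 566
size_data2_factored = 557      # redundant 4-byte data2 values removed
size_delimiter_dropped = 535   # trailing 0xff additionally removed

saved_by_factoring = size_original - size_data2_factored
saved_by_delimiter = size_data2_factored - size_delimiter_dropped
ratio = saved_by_delimiter / saved_by_factoring

print(saved_by_factoring, saved_by_delimiter, round(ratio, 2))
```

Dropping one byte per affected entry saved 22 MB, while dropping four bytes per affected entry saved only 9 MB, a ratio of roughly 2.4 in the opposite direction from what the byte counts alone would predict.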

            My remaining theory is that adding in the ("", base_data2) entry at a higher point in the trie somehow throws off the compression in some terrible way, but it should just be adding in 4 more bytes when I've removed many more than that from lower down in the trie.

            I'm not optimistic for a fix, but I'd dearly like to know why I'm seeing this behavior! Thank you for your attention.

            ...

            ANSWER

            Answered 2017-Jul-12 at 20:47

            As I suspected, it's caused by padding.

            In lib/marisa/grimoire/vector/vector.h, there is the following function:

            Source https://stackoverflow.com/questions/44895094

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install marisa-trie

            You can install it with 'pip install marisa-trie' or download it from GitHub or PyPI.
            You can use marisa-trie like any standard Python library. To build it from source you need a development environment with a Python distribution that includes header files, a compiler, pip, and git. Make sure that pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changing the system installation.

            Support

            To request new features or report bugs, create an issue on GitHub. If you have questions, search for answers or ask on Stack Overflow.

            Install
          • PyPI

            pip install marisa-trie

          • Clone via HTTPS

            https://github.com/pytries/marisa-trie.git

          • Clone via GitHub CLI

            gh repo clone pytries/marisa-trie

          • Clone via SSH

            git@github.com:pytries/marisa-trie.git

            Consider Popular Natural Language Processing Libraries

            transformers

            by huggingface

            funNLP

            by fighting41love

            bert

            by google-research

            jieba

            by fxsjy

            Python

            by geekcomputers

            Try Top Libraries by pytries

            datrie

            by pytries (Python)

            DAWG

            by pytries (C++)

            hat-trie

            by pytries (C)

            DAWG-Python

            by pytries (Python)