bk-tree | BK-tree Java library | Dataset library

by gtri | Java | Version: 1.0 | License: Apache-2.0

kandi X-RAY | bk-tree Summary

bk-tree is a Java library typically used in Artificial Intelligence, Dataset, Example Codes applications. bk-tree has no bugs, it has no vulnerabilities, it has build file available, it has a Permissive License and it has high support. You can download it from GitHub, Maven.

A Java [BK-tree] library. BK-trees offer a simple index of elements in a [metric space] that allows for searching the tree for elements within a certain distance of the search query with sub-linear efficiency. For example, a BK-tree with string elements and a metric like the [Damerau–Levenshtein distance] can serve as a [fuzzy search] index.

            kandi-support Support

              bk-tree has a highly active ecosystem.
              It has 26 star(s) with 11 fork(s). There are 3 watchers for this library.
              It had no major release in the last 12 months.
              There is 1 open issue and none have been closed. On average issues are closed in 2545 days. There are no pull requests.
              It has a positive sentiment in the developer community.
              The latest version of bk-tree is 1.0

            kandi-Quality Quality

              bk-tree has 0 bugs and 39 code smells.

            kandi-Security Security

              bk-tree has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              bk-tree code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            kandi-License License

              bk-tree is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              bk-tree releases are not available. You will need to build from source code and install.
              Deployable package is available in Maven.
              Build file is available. You can build the component from source.
              Installation instructions are not available. Examples and code snippets are available.
              It has 543 lines of code, 54 functions and 12 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed bk-tree and discovered the below as its top functions. This is intended to give you an instant insight into bk-tree implemented functionality, and help decide if they suit your requirements.
            • Searches the tree for elements that match the given query
            • Compares two BK-trees for equality
            • String representation of immutable BkTree
            • Returns a hashCode of this metric
            • Creates a metric for CharSequence elements

            bk-tree Key Features

            No Key Features are available at this moment for bk-tree.

            bk-tree Examples and Code Snippets

            Example usage
            Java | Lines of code: 39 | License: Permissive (Apache-2.0)
            import edu.gatech.gtri.bktree.*;
            import edu.gatech.gtri.bktree.BkTreeSearcher.Match;
            import java.util.Set;
            // The Hamming distance is a simple metric that counts the number
            // of positions on which the strings (of equal length) differ.
            Metric hamming  
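
The usage snippet above is cut off in this listing. As a rough, library-independent sketch of the idea it introduces (a BK-tree indexed by the Hamming distance), the following assumes a hypothetical `SimpleBkTree` class. It is not the bk-tree library's API; the names `SimpleBkTree`, `add`, `search`, and `hamming` are illustrative only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntBiFunction;

/** Hypothetical minimal BK-tree for illustration; NOT the bk-tree library's API. */
final class SimpleBkTree<E> {

    private static final class Node<E> {
        final E element;
        // Children keyed by their distance to this node's element.
        final Map<Integer, Node<E>> children = new HashMap<>();

        Node(E element) {
            this.element = element;
        }
    }

    private final ToIntBiFunction<E, E> metric;
    private Node<E> root;

    SimpleBkTree(ToIntBiFunction<E, E> metric) {
        this.metric = metric;
    }

    /** Inserts an element, descending along the edge labeled with its distance. */
    void add(E element) {
        if (root == null) {
            root = new Node<>(element);
            return;
        }
        Node<E> node = root;
        while (true) {
            int d = metric.applyAsInt(element, node.element);
            if (d == 0) {
                return; // element already present
            }
            Node<E> child = node.children.get(d);
            if (child == null) {
                node.children.put(d, new Node<>(element));
                return;
            }
            node = child;
        }
    }

    /** Returns all elements within maxDistance of the query. */
    List<E> search(E query, int maxDistance) {
        List<E> matches = new ArrayList<>();
        if (root == null) {
            return matches;
        }
        Deque<Node<E>> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node<E> node = stack.pop();
            int d = metric.applyAsInt(query, node.element);
            if (d <= maxDistance) {
                matches.add(node.element);
            }
            // Triangle inequality: only edges in [d - max, d + max] can hold matches.
            for (Map.Entry<Integer, Node<E>> edge : node.children.entrySet()) {
                int label = edge.getKey();
                if (label >= d - maxDistance && label <= d + maxDistance) {
                    stack.push(edge.getValue());
                }
            }
        }
        return matches;
    }

    /** Hamming distance: number of positions at which equal-length strings differ. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }
}
```

With this sketch, a tree built via `new SimpleBkTree<String>(SimpleBkTree::hamming)` and filled with equal-length words can be queried with `search(query, maxDistance)`. The pruning step in `search` is what lets a BK-tree skip whole subtrees instead of comparing the query against every element.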

            Community Discussions

            QUESTION

            How to implement a fast fuzzy-search engine using BK-tree when the corpus has 10 billion unique DNA sequences?
            Asked 2021-May-05 at 05:52

            I am trying to use the BK-tree data structure in python to store a corpus with ~10 billion entries (1e10) in order to implement a fast fuzzy search engine.

            Once I add over ~10 million (1e7) values to a single BK-tree, I start to see a significant degradation in the performance of querying.

            I was thinking to store the corpus into a forest of a thousand BK-trees and to query them in parallel.

            Does this idea sound feasible? Should I create and query 1,000 BK-trees simultaneously? What else can I do in order to use BK-tree for this corpus.

            I use pybktree.py and my queries are intended to find all entries within an edit distance d.

            Is there some architecture or database which will allow me to store those trees?

            Note: I don’t run out of memory, rather the tree begins to be inefficient (presumably each node has too many children).

            ...

            ANSWER

            Answered 2021-Jan-18 at 12:18
            A few thoughts

            BK-trees
            Kudos to Ben Hoyt and his link to the issue, which I will draw from. The first observation from that issue is that a BK-tree lookup isn't exactly logarithmic. From what you told us, your usual d is ~6, which is 3/10 of your string length. Unfortunately, going by the tables in the issue, that puts the complexity somewhere between O(N^0.8) and O(N). In the optimistic case of the exponent being 0.8 (it will likely be slightly worse), you get an improvement factor of ~100 on your 10B entries. So if you have a reasonably fast BK-tree implementation, it can still be worth using, either directly or as a basis for further optimization.

            The downside is that even if you use 1000 trees in parallel, you will only get the improvement from the parallelization, as the performance of the trees depends on d rather than on the number of nodes in the tree. However, even if you run all 1000 trees at once on a massive machine, we are at the ~10M nodes per tree that you reported as slow. Still, computation-wise, this seems doable.
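
The ~100 improvement factor follows from comparing a brute-force scan over N entries with the optimistic O(N^0.8) estimate, taking N = 10^10 as in the question:

```latex
\frac{N}{N^{0.8}} = N^{0.2} = \left(10^{10}\right)^{0.2} = 10^{2} = 100
```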

            A brute force approach
            If you don't mind paying a little, I would look into something like Google Cloud BigQuery, provided that doesn't clash with any data confidentiality requirements. It will brute-force the solution for you, for a fee. The current rate is $5 per TB queried. Your dataset is ~10B rows * 20 chars. At one byte per char, one query would scan 200 GB, so ~$1 per query if you went the lazy way.
            However, since the charge is per byte of data in a column and not per complexity of the query, you could improve on this by storing your strings as bits. At 2 bits per letter, this would save you 75% of the expense.
            Improving further, you can write your query so that it asks for a dozen strings at once. You might need to be careful to batch similar strings together, though, to avoid clogging the result with too many one-offs.
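
The 2-bit encoding and the cost arithmetic above can be sketched as follows. This is an illustration only: the class `DnaPacking` and its methods are hypothetical names, the alphabet is assumed to be exactly A, C, G, T, and the cost helper assumes billing proportional to the raw packed size (1 TB taken as 10^12 bytes), which is the answer's simplification rather than a statement about actual billing rules.

```java
/** Hypothetical helper illustrating 2-bit packing of DNA strings and query-cost arithmetic. */
final class DnaPacking {

    /** Packs up to 32 bases (A, C, G, T) into a long, 2 bits per base, first base in the high bits. */
    static long pack(String sequence) {
        if (sequence.length() > 32) {
            throw new IllegalArgumentException("at most 32 bases fit in a 64-bit long");
        }
        long packed = 0L;
        for (int i = 0; i < sequence.length(); i++) {
            packed = (packed << 2) | code(sequence.charAt(i));
        }
        return packed;
    }

    /** Maps a base to its 2-bit code; the mapping itself is an arbitrary assumption. */
    private static long code(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("unexpected base: " + base);
        }
    }

    /** Estimated cost in dollars of scanning rows * bytesPerRow bytes at a per-TB price. */
    static double queryCost(double rows, double bytesPerRow, double dollarsPerTb) {
        return rows * bytesPerRow / 1e12 * dollarsPerTb;
    }
}
```

Under these assumptions, `queryCost(1e10, 20, 5)` gives about $1 for one byte per character, while 20 bases at 2 bits each occupy 5 bytes, so `queryCost(1e10, 5, 5)` gives about $0.25, the 75% saving mentioned above. Because leading `A` bases encode as zero bits, this packing is only unambiguous when all sequences have the same known length.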

            Brute forcing of the BK-trees
            If you go with the route above, you pay by volume, so the ~100-fold decrease in the computation needed becomes a ~100-fold decrease in price. That might be useful, especially if you have a lot of queries to run.
            However, you would need to figure out a way to store the tree in several layers of databases to query recursively, as BigQuery pricing depends on the volume of data in the queried table.
            Building a smart batch engine that processes the queries recursively to minimize cost could be a fun optimization exercise.

            A choice of language
            One more thing. While I think that Python is a good language for fast prototyping, analysis, and thinking about code in general, you are past that stage. You are now looking for a way to do a specific, well-defined, and well-thought-out operation as fast as possible. Python is not a great language for this, as this example shows. While I used all the tricks I could think of in Python, the Java and C solutions were still several times faster. (Not to mention the Rust one that beat us all, though its author beat us on the algorithm as well, so it's hard to compare.) So if you move from Python to a faster language, you might gain another factor of ten, or maybe even more, in performance. This could be another fun optimization exercise.
            Note: I am being rather conservative with the estimate, as fuzzywuzzy already offers to use a C library in the background, so I'm not sure how much of the work still depends on Python. My experience in similar cases is that the performance gain can be a factor of 100 going from pure Python (or worse, pure R) to a compiled language.

            Source https://stackoverflow.com/questions/65588433

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install bk-tree

            You can download it from GitHub, Maven.
            You can use bk-tree like any standard Java library. Include the jar files in your classpath. You can also use any IDE, and you can run and debug the bk-tree component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, refer to maven.apache.org. For Gradle installation, refer to gradle.org.

            Support

            For new features, suggestions, and bugs, create an issue on GitHub. If you have questions, search for existing answers or ask on Stack Overflow.

            CLONE
          • HTTPS

            https://github.com/gtri/bk-tree.git

          • CLI

            gh repo clone gtri/bk-tree

          • sshUrl

            git@github.com:gtri/bk-tree.git
