ann-benchmark | artificial neural network library for Spark MLlib | Machine Learning library
kandi X-RAY | ann-benchmark Summary
The goal is to benchmark the library, compare it with other tools, and test how it scales with the number of nodes in the cluster. The intention is to test a large model; the dataset is small, so the time needed to read the data can be ignored.
Community Discussions
Trending Discussions on ann-benchmark
QUESTION
I'm looking for an algorithm with the fastest time per query for a problem similar to nearest-neighbor search, but with two differences:
- I only need to approximately confirm (tolerating Type I and Type II error) the existence of a neighbor within some distance k, or return the approximate distance of the nearest neighbor.
- I can query many points at once.
I'd like better throughput than the approximate nearest neighbor libraries out there (https://github.com/erikbern/ann-benchmarks), which seem better designed for single queries. In particular, the algorithmic relaxation in the first criterion seems like it should leave room for an algorithmic shortcut, but I can't find any solutions in the literature, nor can I figure out how to design one.
Here's my current best solution, which operates at about 10k queries / sec per CPU. I'm looking for something close to an order-of-magnitude speedup if possible.
...
ANSWER
Answered 2020-Sep-21 at 04:54
I'm a bit skeptical of benchmarks such as the one you have linked, as in my experience I have found that the definition of the problem at hand far outweighs in importance the merits of any one algorithm across a set of other (possibly similar-looking) problems.
More simply put, an algorithm being a high performer on a given benchmark does not imply it will be a higher performer on the problem you care about. Even small or apparently trivial changes to the formulation of your problem can significantly change the performance of any fixed set of algorithms.
That said, given the specifics of the problem you care about I would recommend the following:
- use the cascading approach described in the paper [1]
- use SIMD operations (e.g. SSE on Intel chips) or GPUs to accelerate the search; nearest-neighbour search is a problem where operations closer to the metal and parallelism can really shine
- tune the parameters of the algorithm to maximize your objective; in particular, the algorithm of [1] has a few easy-to-tune parameters that dramatically trade performance for accuracy, so perform a grid search over these parameters to set them to the sweet spot for your problem
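To make the tuning step concrete, here is a minimal sketch of a grid search that trades accuracy for throughput. It does not implement the algorithm of [1]. As a stand-in it uses a hypothetical approximate check, `approx_has_neighbor`, that scans only the first `n_probe` points, so the parameter name, the toy data, the radius, and the 0.9 recall target are all assumptions made for illustration.

```python
import random
import time


def exact_has_neighbor(points, query, radius):
    """Exact check: is any point within `radius` of `query` (Euclidean)?"""
    r2 = radius * radius
    return any(
        sum((p - q) ** 2 for p, q in zip(pt, query)) <= r2 for pt in points
    )


def approx_has_neighbor(points, query, radius, n_probe):
    """Hypothetical approximate check: scan only the first `n_probe` points.

    Smaller n_probe is faster but can miss neighbors (false negatives).
    """
    return exact_has_neighbor(points[:n_probe], query, radius)


def grid_search(points, queries, radius, n_probe_grid, min_recall):
    """Measure recall and throughput per setting; pick the fastest acceptable one."""
    truth = [exact_has_neighbor(points, q, radius) for q in queries]
    positives = sum(truth)
    results = []
    for n_probe in n_probe_grid:
        start = time.perf_counter()
        preds = [approx_has_neighbor(points, q, radius, n_probe) for q in queries]
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
        hits = sum(1 for t, p in zip(truth, preds) if t and p)
        recall = hits / positives if positives else 1.0
        results.append(
            {"n_probe": n_probe, "recall": recall, "qps": len(queries) / elapsed}
        )
    acceptable = [r for r in results if r["recall"] >= min_recall]
    best = max(acceptable, key=lambda r: r["qps"]) if acceptable else None
    return results, best


# Toy data (assumed): random 2-D points in the unit square, fixed seed.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1000)]
queries = [(rng.random(), rng.random()) for _ in range(100)]

results, best = grid_search(
    points, queries, radius=0.03, n_probe_grid=[125, 250, 500, 1000], min_recall=0.9
)
for r in results:
    print(r)
print("chosen setting:", best)
```

The same loop applies to any index with tunable parameters: replace `approx_has_neighbor` with the real query call, widen the grid to cover every parameter, and score each setting on a held-out set of queries that resembles your actual workload.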
Note: I have recommended the paper [1] because I have tried many of the algorithms listed in the benchmark you linked and found them all inferior (for the task of image reconstruction) to the approach listed in [1] while at the same time being much more complicated than [1], both undesirable properties. YMMV depending on your problem definition.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported