faiss | efficient similarity search and clustering of dense vectors | Machine Learning library

 by   facebookresearch C++ Version: 1.5.3 License: MIT

kandi X-RAY | faiss Summary

faiss is a C++ library typically used in Artificial Intelligence, Machine Learning, Deep Learning, Tensorflow, Bert applications. faiss has no bugs, it has no vulnerabilities, it has a Permissive License and it has medium support. You can download it from GitHub.

Faiss contains several methods for similarity search. It assumes that the instances are represented as vectors and are identified by an integer, and that the vectors can be compared with L2 (Euclidean) distances or dot products. Vectors that are similar to a query vector are those that have the lowest L2 distance or the highest dot product with the query vector. It also supports cosine similarity, since this is a dot product on normalized vectors. Most of the methods, like those based on binary vectors and compact quantization codes, use only a compressed representation of the vectors and do not require keeping the original vectors. This generally comes at the cost of a less precise search, but these methods can scale to billions of vectors in main memory on a single server. The GPU implementation can accept input from either CPU or GPU memory. On a server with GPUs, the GPU indexes can be used as a drop-in replacement for the CPU indexes (e.g., replace IndexFlatL2 with GpuIndexFlatL2), and copies to/from GPU memory are handled automatically. Results will be faster, however, if both input and output remain resident on the GPU. Both single and multi-GPU usage are supported.
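
To make the two comparison modes concrete, here is a minimal pure-Python sketch of what an exhaustive ("flat") search computes: the ids of the vectors with the lowest squared L2 distance, or with the highest dot product, relative to a query. It does not call Faiss and is far slower than any real index; the names `search_l2` and `search_ip` are hypothetical and exist only for this illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    """Dot (inner) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search_l2(database, query, k):
    """Return the ids of the k vectors with the lowest squared L2 distance."""
    ids = sorted(range(len(database)), key=lambda i: squared_l2(database[i], query))
    return ids[:k]


def search_ip(database, query, k):
    """Return the ids of the k vectors with the highest dot product."""
    ids = sorted(range(len(database)), key=lambda i: dot(database[i], query), reverse=True)
    return ids[:k]


# Vectors are identified by their integer position in the list.
database = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
query = [1.0, 1.0]

nearest_l2 = search_l2(database, query, k=2)
best_ip = search_ip(database, query, k=1)
```

The two criteria can disagree: in this toy data the vector with the largest dot product is the one farthest away in L2 terms, which is why the choice of metric matters.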

            Support

              faiss has a medium active ecosystem.
              It has 22571 star(s) with 3004 fork(s). There are 450 watchers for this library.
              It had no major release in the last 12 months.
              There are 298 open issues and 1795 have been closed. On average issues are closed in 510 days. There are 19 open pull requests and 0 closed requests.
              It has a neutral sentiment in the developer community.
              The latest version of faiss is 1.5.3

            Quality

              faiss has 0 bugs and 0 code smells.

            Security

              faiss has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              faiss code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            License

              faiss is licensed under the MIT License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              faiss releases are available to install and integrate.
              Installation instructions are not available. Examples and code snippets are available.
              It has 13801 lines of code, 1021 functions and 92 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.

            faiss Key Features

            No Key Features are available at this moment for faiss.

            faiss Examples and Code Snippets

            PyPI | Lines of Code: 6 | License: No License
            pip install pytorch-metric-learning
            pip install pytorch-metric-learning --pre
            pip install torch===1.6.0 torchvision===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
            pip install pytorch-metric-learning
            pip install pytorch-metric-lea  
            PyPI | Lines of Code: 1 | License: No License
            conda install pytorch-metric-learning -c metric-learning -c pytorch
            faiss - demo auto tune
            Python | Lines of Code: 94 | License: Permissive (MIT License)
            #!/usr/bin/env python2
            # Copyright (c) Facebook, Inc. and its affiliates.
            # This source code is licensed under the MIT license found in the
            # LICENSE file in the root directory of this source tree.
            from __future__ import print_function
            import os
            faiss - demo client server ivf
            Python | Lines of Code: 51 | License: Permissive (MIT License)
            #!/usr/bin/env python3
            # Copyright (c) Facebook, Inc. and its affiliates.
            # This source code is licensed under the MIT license found in the
            # LICENSE file in the root directory of this source tree.
            import sys
            import numpy as np
            import faiss
            faiss - demo ondisk ivf
            Python | Lines of Code: 47 | License: Permissive (MIT License)
            #!/usr/bin/env python3
            # Copyright (c) Facebook, Inc. and its affiliates.
            # This source code is licensed under the MIT license found in the
            # LICENSE file in the root directory of this source tree.
            import sys
            import numpy as np
            import faiss
            faiss: How to retrieve vector by id from python
            Python | Lines of Code: 21 | License: Strong Copyleft (CC BY-SA 4.0)
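            import faiss
            import numpy as np
            # Three random 10-dimensional float32 vectors
            a = np.random.uniform(size=30)
            a = a.reshape(-1,10).astype(np.float32)
            d = 10
            index = faiss.index_factory(d,'Flat', faiss.METRIC_L2)
            # Add the vectors; without this step the index stores nothing
            index.add(a)
            # xb is the raw vector storage as exposed through SWIG in the version listed here; newer releases may differ
            xb = index.xb
            print(xb.at(0) == a[0][0])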
            A weird requirements.txt format
            Python | Lines of Code: 2 | License: Strong Copyleft (CC BY-SA 4.0)
            conda env create --file requirements.txt
            What to do if you need packages from both conda and pip?
            Python | Lines of Code: 4 | License: Strong Copyleft (CC BY-SA 4.0)
            conda create --name alex python=3.8 pip django requests bla-bla-bla
            conda activate alex
            pip install drm foo bar yada-yada
            Pairwise similarity/similarity matrix calculation optimization
            Python | Lines of Code: 81 | License: Strong Copyleft (CC BY-SA 4.0)
            import pandas as pd
            import numpy as np
            import faiss 
            df_1 = pd.DataFrame({'object_id_1': range(10),
                               'feature_0': np.random.uniform(0,1,10),
            Pairwise similarity/similarity matrix calculation optimization
            Python | Lines of Code: 14 | License: Strong Copyleft (CC BY-SA 4.0)
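            import faiss
            import numpy as np
            dimension = 100
            n = 1000  # placeholder: number of database vectors (not defined in the excerpt)
            value1 = np.random.random((n, dimension)).astype('float32')
            value2 = np.random.random((5, dimension)).astype('float32')  # placeholder query vectors
            index = faiss.IndexFlatL2(dimension)  # exact L2 index over `dimension`-dimensional vectors
            index.add(value1)  # vectors must be added before searching
            xq = value2
            k = len(value1)  # return every database vector, ranked by distance
            D, I = index.search(xq, k)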
            pip install faiss

            Community Discussions


            Writing to a file parallely while processing in a loop in python
            Asked 2022-Feb-23 at 19:25

            I have a CSV data of 65K. I need to do some processing for each csv line which generates a string at the end. I have to write/append that string in a file.

            Pseudo Code:



            Answered 2022-Feb-23 at 19:25

            Q : " Writing to a file parallely while processing in a loop in python ... "

            A :
            Frankly speaking, the file-I/O is not your performance-related enemy.

            "With all due respect to the colleagues, Python (since ever) used GIL-lock to avoid any level of concurrent execution ( actually re-SERIAL-ising the code-execution flow into dancing among any amount of threads, lending about 100 [ms] of code-interpretation time to one-AFTER-another-AFTER-another, thus only increasing the interpreter's overhead times ( and devastating all pre-fetches into CPU-core caches on each turn ... paying the full mem-I/O costs on each next re-fetch(es) ). So threading is ANTI-pattern in python (except, I may accept, for network-(long)-transport latency masking ) – user3666197 44 mins ago "

            Given that the roughly 65k files listed in the CSV ought to get processed ASAP, performance-tuned orchestration is the goal, file-I/O being just a negligible ( and by design well latency-maskable ) part of it ( which does not mean we cannot make it even worse by organising it in another performance-devastating ANTI-pattern ).

            Tip #1 : avoid & resist using any low-hanging-fruit SLOCs if The Performance is the goal

            If the code starts with the cheapest-ever iterator clause,
            be it a mock-up for aRow in aCsvDataSET: ...
            or the real code for i in range( len( queries ) ): ..., these look nice in "structured-programming" evangelisation, as they form a syntax-compliant separation of a deeper-level part of the code, yet they do so at an awfully high cost, because the overhead is paid again on every iteration and accumulates. ( Such loops have been known for ages to be an awfully slow part of Python code interpretation, the second one being even an iterator over a range() iterator in Py3 and a silent RAM-killer in the Py2 ecosystem for any larger range. ) A late-injected need to "coordinate" unordered concurrent file-I/O operations, not necessary in principle at all if done smartly, is one example of the adverse performance impact of such trivial SLOCs ( and similarly poor design decisions ).

            Better way?

            • a ) avoid the top-level (slow & overhead-expensive) looping
            • b ) "split" the 65k-parameter space into not much more blocks than how many memory-I/O-channels are present on your physical device ( the scoring process, I can guess from the posted text, is memory-I/O intensive, as some model has to go through all the texts for scoring to happen )
            • c ) spawn n_jobs-many process workers via joblib.Parallel( n_jobs = ... )( delayed( <_scoring_fun_> )( block_start, block_end, ...<_params_>... ) ), so that each worker runs the scoring_fun(...) for its own block-part of the 65k-long parameter space.
            • d ) having computed the scores and related outputs, each worker-process can and shall file-I/O its own results into its private, exclusively owned, conflict-free output file
            • e ) having finished all partial block-parts' processing, the main Python process can just join the already stored outputs ( created [CONCURRENTLY], smoothly and without blocking, buffered / interleaved by the O/S and deposited on the real hardware ), if such a need exists,
              finito - we are done ( knowing there is no faster way to compute the same block of tasks, which are in principle embarrassingly independent, besides the need to orchestrate them collision-free with minimised add-on costs ).
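
            Steps a ) to e ) can be sketched as follows. This is a minimal illustration under stated assumptions, not the answerer's code: it swaps joblib for the standard-library concurrent.futures.ProcessPoolExecutor so that it has no third-party dependency, and score_item is a hypothetical stand-in for the real per-row processing. All function names and the file-naming scheme are invented for the example.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def split_blocks(n_items, n_blocks):
    """Split range(n_items) into at most n_blocks contiguous (start, end) blocks."""
    n_blocks = max(1, min(n_blocks, n_items))
    base, extra = divmod(n_items, n_blocks)
    blocks, start = [], 0
    for b in range(n_blocks):
        end = start + base + (1 if b < extra else 0)
        blocks.append((start, end))
        start = end
    return blocks


def score_item(i):
    """Hypothetical stand-in for the real per-row processing."""
    return f"row-{i}:{i * i}"


def process_block(start, end, out_dir):
    """Process one block and write its results to a file owned by this worker only."""
    path = os.path.join(out_dir, f"block_{start:08d}_{end:08d}.txt")
    with open(path, "w", encoding="utf-8") as fh:
        for i in range(start, end):
            fh.write(score_item(i) + "\n")
    return path


def merge_outputs(paths, merged_path):
    """Concatenate the per-block files in block order."""
    with open(merged_path, "w", encoding="utf-8") as out:
        for path in sorted(paths):
            with open(path, encoding="utf-8") as fh:
                out.write(fh.read())
    return merged_path


def run_parallel(n_items, n_jobs, out_dir):
    """One process per block; no worker ever shares an output file with another."""
    blocks = split_blocks(n_items, n_jobs)
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        futures = [pool.submit(process_block, s, e, out_dir) for s, e in blocks]
        paths = [f.result() for f in futures]
    return merge_outputs(paths, os.path.join(out_dir, "merged.txt"))
```

            A call such as run_parallel(65_000, n_jobs=4, out_dir="results") should sit under an if __name__ == "__main__": guard and the output directory must already exist. The zero-padded block names keep the merged file in the original row order.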

            If interested in tweaking a real-system End-to-End processing-performance,
            start with lstopo-map
            next verify the number of physical memory-I/O-channels
            then experiment a bit with Python joblib.Parallel() process instantiation, under-subscribing or over-subscribing n_jobs a bit below or a bit above the number of physical memory-I/O-channels. If the actual processing has some maskable latencies hidden to us, there may be a chance to spawn more n_jobs workers for as long as the End-to-End processing performance keeps steadily growing, until system noise hides any further performance-tweaking effects

            A Bonus part - why un-managed sources of latency kill The Performance

            Source https://stackoverflow.com/questions/71233138


            ModuleNotFoundError: No module named 'milvus'
            Asked 2022-Feb-15 at 19:23

            Goal: to run this Auto Labelling Notebook on AWS SageMaker Jupyter Labs.

            Kernels tried: conda_pytorch_p36, conda_python3, conda_amazonei_mxnet_p27.



            Answered 2022-Feb-03 at 09:29

            I would recommend downgrading milvus to a version before the 2.0 release, which came out just a week ago. Here is a discussion on that topic: https://github.com/deepset-ai/haystack/issues/2081

            Source https://stackoverflow.com/questions/70954157


            What is the equivalent of python's faiss.normalize_L2() in C++?
            Asked 2022-Jan-31 at 11:15

            I want to perform similarity search using FAISS for 100k facial embeddings in C++. For the distance calculator I would like to use cosine similarity. For this purpose, I chose faiss::IndexFlatIP. But according to the documentation we need to normalize the vectors prior to adding them to the index. The documentation suggested the following code in python:



            Answered 2022-Jan-31 at 11:15

            You can build and use the C++ interface of Faiss library (see this).

            If you just want L2 normalization of a vector in C++:
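
            The C++ snippet from the answer is not reproduced on this page. Independent of language, L2 normalization divides each vector by its Euclidean norm, after which an inner product equals the cosine similarity of the original vectors. The arithmetic is sketched below in Python as an illustration only; it is not the answer's C++ code, the function names are hypothetical, and leaving all-zero rows unchanged is a choice made for this sketch.

```python
import math


def inner_product(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize_rows(rows):
    """Return a copy of rows in which every non-zero row has unit Euclidean norm."""
    normalized = []
    for row in rows:
        norm = math.sqrt(inner_product(row, row))
        # An all-zero row has no direction, so it is copied unchanged.
        normalized.append([x / norm for x in row] if norm > 0.0 else list(row))
    return normalized


def cosine_similarity(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


embeddings = [[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]]
unit = l2_normalize_rows(embeddings)
```

            In C++ the same loop applies to each row of the float array before the rows are added to the inner-product index.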

            Source https://stackoverflow.com/questions/70924232


            faiss: How to retrieve vector by id from python
            Asked 2022-Jan-10 at 11:57

            I have a faiss index and want to use some of the embeddings in my python script. Selection of Embeddings should be done by id. As faiss is written in C++, swig is used as an API.

            I guess the function I need is reconstruct :



            Answered 2022-Jan-10 at 11:57

            This is the only way I found manually.

            Source https://stackoverflow.com/questions/70624600


            Flask app.route always giving 404 errors except /
            Asked 2022-Jan-08 at 00:11

            I have a Flask app that kept giving 404 error on any route other than /



            Answered 2022-Jan-08 at 00:11

            I just figured out the root cause. I was running it with docker-compose up, which only attaches the existing images rather than rebuilding them after updates to the service code.

            Source https://stackoverflow.com/questions/70612942


            A weird requirements.txt format
            Asked 2021-Oct-17 at 09:29

            I downloaded a requirements.txt file from a GitHub repository, but it appears to be a little different from the normal format of a requirements.txt file.

            1. Can you tell me how the author generated this kind of requirements.txt file? Which tools did they use?
            2. How can I use this particular file format to instantiate the Python environment? I have tried executing the commands conda install --file requirements.txt and pip install -r requirements.txt on a Windows machine, but to no avail.




            Answered 2021-Oct-17 at 01:46

            This looks like a conda environment.yml file. It can be used to create a conda environment, like so

            Source https://stackoverflow.com/questions/69600759


            k-mean clustering - inertia only gets larger
            Asked 2021-May-20 at 16:46

            I am trying to use the KMeans clustering from faiss on a human pose dataset of body joints. I have 16 body parts so a dimension of 32. The joints are scaled in a range between 0 and 1. My dataset consists of ~ 900.000 instances. As mentioned by faiss (faiss_FAQ):

            As a rule of thumb there is no consistent improvement of the k-means quantizer beyond 20 iterations and 1000 * k training points

            Applying this to my problem, I randomly select 50000 instances for training, as I want to check a number of clusters k between 1 and 30.

            Now to my "problem":

            The inertia is increasing directly as the number of cluster increases (n_cluster on the x-axis):

            I tried varying the number of iterations, the number of redos, verbose and spherical, but the results stay the same or get worse. I do not think that it is a problem of my implementation; I tested it on a small example with 2D data and very clear clusters and it worked.

            Is it that the data just clusters badly, or is there another problem/mistake I have missed? Maybe the scaling of the values between 0 and 1? Should I try another approach?



            Answered 2021-May-20 at 16:46

            I found my mistake. I had to increase the parameter max_points_per_centroid. As I have so many data points it sampled a sub-batch for the fit. For a larger number of clusters this sub-batch is larger. See FAQ of faiss:

            max_points_per_centroid * k: there are too many points, making k-means unnecessarily slow. Then the training set is sampled

            The larger subbatch of course has a larger inertia as there are more points in total.
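
            The effect can be reproduced with simple arithmetic. The sketch below assumes, as the FAQ excerpt describes, that training is capped at max_points_per_centroid * k points, and it assumes a cap of 256 points per centroid (believed to be the Faiss default, but treat that value as an assumption). The function name is hypothetical. With 50000 training points, the number of points actually used grows with k, so an objective that sums squared distances is summed over more points for larger k.

```python
def effective_training_size(n_points, k, max_points_per_centroid=256):
    """Number of points used for training once the per-centroid cap is applied."""
    return min(n_points, k * max_points_per_centroid)


n_points = 50_000

# With the assumed cap, the training subset grows linearly with k.
sizes = {k: effective_training_size(n_points, k) for k in (1, 10, 30)}

# Raising the cap far enough makes every k train on the full set.
full = effective_training_size(n_points, 30, max_points_per_centroid=n_points)
```

            Raising max_points_per_centroid so that all points are used for every k, as the answer did, makes the objective values comparable across k.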

            Source https://stackoverflow.com/questions/67550997


            What is the best approach to measure a similarity between texts in multiple languages in python?
            Asked 2021-Mar-23 at 08:38

            So, I have a task where I need to measure the similarity between two texts. These texts are short descriptions of products from a grocery store. They always include a name of a product (for example, milk), and they may include a producer and/or size, and maybe some other characteristics of a product.

            I have a whole set of such texts, and then, when a new one arrives, I need to determine whether there are similar products in my database and measure how similar they are (on a scale from 0 to 100%).

            The thing is: the texts may be in two different languages: Ukrainian and Russian. Also, if there is a foreign brand (like, Coca Cola), it will be written in English.

            My initial idea on solving this task was to get multilingual word embeddings (where similar words in different languages are located nearby) and find the distance between those texts. However, I am not sure how efficient this will be, and if it is ok, what to start with.

            Because each text I have is just a set of product characteristics, some word embeddings based on a context may not work (I'm not sure in this statement, it is just my assumption).

            So far, I have tried to get familiar with the MUSE framework, but I encountered an issue with faiss installation.

            Hence, my questions are:

            • Is my idea with word embeddings worth trying?
            • Is there maybe a better approach?
            • If the idea with word embeddings is okay, which ones should I use?

            Note: I have Windows 10 (in case some libraries don't work on Windows), and I need the library to work with Ukrainian and Russian languages.

            Thanks in advance for any help! Any advice would be highly appreciated!



            Answered 2021-Mar-12 at 23:32

            Word embeddings are meaningful within a language but are not transferable to other languages. One observation behind this statement: if two words co-occur a lot inside sentences, their embeddings can be near each other. Hence, as there is no one-to-one mapping between two general languages, you cannot compare their word embeddings.

            However, if two languages are similar enough for a one-to-one mapping of words, you may count on your idea.

            In sum, without translation, your idea is not applicable to two general languages anymore.

            Source https://stackoverflow.com/questions/66608244


            What to do if you need packages from both conda and pip?
            Asked 2021-Jan-31 at 00:28

            I've been developing my project with pip (django, drm etc). Now I need to use faiss, which only has an unofficial package on pip (the official one is on conda). What should I do in this situation? Can I combine them somehow? Or should I migrate to conda?



            Answered 2021-Jan-31 at 00:28

            If you're using a non-conda environment, then you're limited to using pip only. That is, pip does not know how to install conda packages.

            But if you switch to using conda, then you can use either. The general recommendation is to install everything with conda if possible, but use pip when you have no other choice.

            I recommend installing Miniconda, then creating a new environment for all of your dependencies. If necessary, add pip-only dependencies (if you have any).

            Source https://stackoverflow.com/questions/65952928


            Unable to import faiss
            Asked 2020-Nov-14 at 15:24

            I've installed faiss. However when I try to import, it throws the following error.



            Answered 2020-Nov-14 at 15:24

            If you installed faiss on CPU with pip install faiss try this:

            Source https://stackoverflow.com/questions/64250476

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network


            Install faiss

            You can download it from GitHub.


          • PyPI

            pip install faiss

          • CLI

            gh repo clone facebookresearch/faiss
