faiss | efficient similarity search and clustering of dense vectors | Machine Learning library
kandi X-RAY | faiss Summary
Faiss contains several methods for similarity search. It assumes that the instances are represented as vectors and are identified by an integer, and that the vectors can be compared with L2 (Euclidean) distances or dot products. Vectors that are similar to a query vector are those that have the lowest L2 distance or the highest dot product with the query vector. It also supports cosine similarity, since this is a dot product on normalized vectors. Most of the methods, like those based on binary vectors and compact quantization codes, use only a compressed representation of the vectors and do not require keeping the original vectors. This generally comes at the cost of a less precise search, but these methods can scale to billions of vectors in main memory on a single server. The GPU implementation can accept input from either CPU or GPU memory. On a server with GPUs, the GPU indexes can be used as a drop-in replacement for the CPU indexes (e.g., replace IndexFlatL2 with GpuIndexFlatL2) and copies to/from GPU memory are handled automatically. Results will be faster, however, if both input and output remain resident on the GPU. Both single- and multi-GPU usage is supported.
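For example, a minimal sketch of the CPU-to-GPU drop-in pattern described above (assuming a CUDA-enabled faiss build with one GPU; the data here is random placeholder data):

import numpy as np
import faiss

d = 64                                                 # vector dimensionality
xb = np.random.random((10000, d)).astype('float32')    # database vectors
xq = np.random.random((5, d)).astype('float32')        # query vectors

cpu_index = faiss.IndexFlatL2(d)                       # exact L2 search on the CPU
cpu_index.add(xb)

res = faiss.StandardGpuResources()                     # GPU resources (CUDA build only)
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # move the index to GPU 0

D, I = gpu_index.search(xq, 4)                         # distances and ids of the 4 nearest neighbors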
faiss Key Features
faiss Examples and Code Snippets
pip install pytorch-metric-learning
pip install pytorch-metric-learning --pre
pip install torch===1.6.0 torchvision===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
#!/usr/bin/env python2
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
from __future__ import print_function
import os
#!/usr/bin/env python3
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import sys
import numpy as np
import faiss
from
import faiss
import numpy as np

a = np.random.uniform(size=30)                  # 30 random values
a = a.reshape(-1, 10).astype(np.float32)        # 3 vectors of dimension 10
d = 10
index = faiss.index_factory(d, 'Flat', faiss.METRIC_L2)
index.add(a)
xb = index.xb                                   # raw stored vectors (SWIG-wrapped std::vector; exposed this way on older faiss versions)
print(xb.at(0) == a[0][0])                      # the first stored component equals the first input component
conda create --name alex python=3.8 pip django requests bla-bla-bla
conda activate alex
pip install drm foo bar yada-yada
python==3.6
pandas==0.25.0
numpy==1.17.1
faiss==1.5.3
import pandas as pd
import numpy as np
import faiss
df_1 = pd.DataFrame({'object_id_1': range(10),
'feature_0': np.random.uniform(0,1,10),
import faiss
import numpy as np

n = 1000                                                       # number of database vectors (undefined in the original snippet; example value)
dimension = 100
value1 = np.random.random((n, dimension)).astype('float32')    # database vectors
value2 = np.random.random((5, dimension)).astype('float32')    # query vectors (undefined in the original snippet; example value)

index = faiss.IndexFlatL2(dimension)                           # exact L2 index over `dimension`-dimensional vectors
index.add(value1)

xq = value2
k = len(value1)                                                # retrieve every database vector for each query
D, I = index.search(xq, k)                                     # D: distances, I: ids, each of shape (len(xq), k)
pip install faiss
Community Discussions
Trending Discussions on faiss
QUESTION
I have a CSV with 65K rows. I need to do some processing for each CSV line, which generates a string at the end. I have to write/append that string to a file.
Pseudo code:
ANSWER
Answered 2022-Feb-23 at 19:25

Q : " Writing to a file parallely while processing in a loop in python ... "
A :
Frankly speaking, the file-I/O is not your performance-related enemy.
"With all due respect to the colleagues, Python (since ever) used GIL-lock to avoid any level of concurrent execution ( actually re-SERIAL-ising the code-execution flow into dancing among any amount of threads, lending about 100 [ms] of code-interpretation time to one-AFTER-another-AFTER-another, thus only increasing the interpreter's overhead times ( and devastating all pre-fetches into CPU-core caches on each turn ... paying the full mem-I/O costs on each next re-fetch(es) ). So threading is ANTI-pattern in python (except, I may accept, for network-(long)-transport latency masking ) – user3666197 44 mins ago "
Given that the roughly 65k items listed in the CSV ought to be processed ASAP, performance-tuned orchestration is the goal; file I/O is just a negligible (and, by design, easily latency-maskable) part of it (which does not mean we cannot make it worse by organising it into yet another performance-devastating anti-pattern).
Tip #1: avoid and resist any low-hanging-fruit SLOCs if performance is the goal
If the code starts with the cheapest-ever iterator clause,
be it a mock-up for aRow in aCsvDataSET: ...
or the real code for i in range( len( queries ) ): ...
these constructs (besides being known for ages as awfully slow parts of Python's code-interpretation capabilities, the second one even being an iterator-on-a-range()-iterator in Py3 and a silent RAM killer in the Py2 ecosystem for any larger range) look nice in "structured programming" evangelisation, as they form a syntax-compliant separation of a deeper-level part of the code, yet they do so at awfully high cost due to the accumulation of repetitively paid overheads. A finally injected need to "coordinate" unordered concurrent file-I/O operations, which is not necessary in principle at all if done smartly, is one example of the adverse performance impact of using such trivial SLOCs (and similarly poor design decisions).
Better way?
- a) avoid the top-level (slow and overhead-expensive) looping
- b) "split" the 65k-parameter space into not many more blocks than there are memory-I/O channels on your physical device (the scoring process, I can guess from the posted text, is memory-I/O intensive, as some model has to go through all the texts for scoring to happen)
- c) spawn n_jobs-many process workers via joblib.Parallel( n_jobs = ... )( delayed( <_scoring_fun_> )( block_start, block_end, ...<_params_>... ) ) and run the scoring_fun(...) on each such distributed block-part of the 65k-long parameter space (see the sketch after this list)
- d) having computed the scores and related outputs, each worker process can and shall file-I/O its own results into its private, exclusively owned, conflict-free output file
- e) having finished all partial block-parts' processing, the main Python process can just join the already ([CONCURRENTLY] created, smoothly and non-blockingly O/S-buffered / interleaved-flow, real-hardware-deposited) stored outputs, if such a need exists at all,
and finito - we are done (knowing there is no faster way to compute the same block of tasks, which are principally embarrassingly independent, besides the need to orchestrate them collision-free with minimised add-on costs).
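A minimal sketch of the per-worker, per-file pattern from steps b) through e) (function names, file names, and the block count are illustrative assumptions, not taken from the question):

import numpy as np
from joblib import Parallel, delayed

def score_block(block_id, block_start, block_end, rows):
    # hypothetical scoring function: processes a contiguous block of rows and
    # writes results to a worker-private file, so no coordination is needed
    out_path = f"scores_block_{block_id}.txt"
    with open(out_path, "w") as f:
        for row in rows[block_start:block_end]:
            result = str(len(row))            # placeholder for the real per-row scoring
            f.write(result + "\n")
    return out_path

rows = [f"row-{i}" for i in range(65000)]     # stand-in for the 65k CSV lines
n_jobs = 4                                    # roughly the number of memory-I/O channels / physical cores
bounds = np.linspace(0, len(rows), n_jobs + 1, dtype=int)

paths = Parallel(n_jobs=n_jobs)(
    delayed(score_block)(i, bounds[i], bounds[i + 1], rows)
    for i in range(n_jobs)
)

# the main process only joins the already written partial outputs
with open("scores_all.txt", "w") as out:
    for p in paths:
        with open(p) as part:
            out.write(part.read())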
If interested in tweaking real-system end-to-end processing performance:
- start with an lstopo map,
- next verify the number of physical memory-I/O channels,
- then experiment a bit with the Python joblib.Parallel()-process instantiation, under-subscribing or over-subscribing n_jobs a bit below or a bit above the number of physical memory-I/O channels. If the actual processing has some maskable latencies hidden from us, there may be a chance to spawn more n_jobs workers, until the end-to-end processing performance keeps steadily growing, or until system noise hides any further performance-tweaking effects.
A Bonus part - why un-managed sources of latency kill The Performance
QUESTION
Goal: to run this Auto Labelling Notebook on AWS SageMaker Jupyter Labs.
Kernels tried: conda_pytorch_p36, conda_python3, conda_amazonei_mxnet_p27.
ANSWER
Answered 2022-Feb-03 at 09:29
I would recommend downgrading your milvus version to one before the 2.0 release, which came out just a week ago. Here is a discussion on that topic: https://github.com/deepset-ai/haystack/issues/2081
QUESTION
I want to perform similarity search using FAISS for 100k facial embeddings in C++.
For the distance metric I would like to use cosine similarity. For this purpose, I chose faiss::IndexFlatIP. But according to the documentation, we need to normalize the vectors prior to adding them to the index. The documentation suggested the following code in Python:
ANSWER
Answered 2022-Jan-31 at 11:15
You can build and use the C++ interface of the Faiss library (see this).
If you just want L2 normalization of a vector in C++:
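The original C++ snippet is not preserved on this page. As a rough equivalent, below is the normalization step using the Python API (faiss.normalize_L2 normalizes rows in place); in C++ the counterpart is simply dividing each vector by its L2 norm before adding it:

import numpy as np
import faiss

d = 128                                                # embedding dimensionality (example value)
xb = np.random.random((100000, d)).astype('float32')   # random stand-ins for the 100k facial embeddings

faiss.normalize_L2(xb)                                 # in-place row-wise L2 normalization

index = faiss.IndexFlatIP(d)                           # inner product on unit vectors == cosine similarity
index.add(xb)

xq = np.random.random((1, d)).astype('float32')
faiss.normalize_L2(xq)                                 # queries must be normalized the same way
D, I = index.search(xq, 5)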
QUESTION
I have a faiss index and want to use some of the embeddings in my Python script. Selection of embeddings should be done by id. As faiss is written in C++, SWIG is used for the Python API.
I guess the function I need is reconstruct:
ANSWER
Answered 2022-Jan-10 at 11:57
This is the only way I found to do it manually.
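The answer's snippet is not shown on this page; a minimal sketch of retrieving a stored vector by id with reconstruct (assuming a flat index, where ids are the sequential insertion positions):

import numpy as np
import faiss

d = 64
xb = np.random.random((1000, d)).astype('float32')

index = faiss.IndexFlatL2(d)
index.add(xb)                       # ids 0..999 are assigned in insertion order

vec = index.reconstruct(5)          # returns the stored vector with id 5 as a numpy array
print(np.allclose(vec, xb[5]))      # True for a flat (uncompressed) index

# for IVF indexes, a direct map must be enabled first:
# index_ivf.make_direct_map()
# vec = index_ivf.reconstruct(5)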
QUESTION
I have a Flask app that kept giving a 404 error on any route other than /.
ANSWER
Answered 2022-Jan-08 at 00:11
I just figured out the root cause. I was running it with docker-compose up, which only attaches the existing images rather than rebuilding them to pick up updates to the service code.
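A likely fix, given that root cause (the exact command is not quoted in the original answer), is to force a rebuild of the image before starting the services:

docker-compose up --build
# or, equivalently
docker-compose build && docker-compose up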
QUESTION
I downloaded a requirements.txt file from a GitHub repository, but it appears to be a little different from the normal format of a requirements.txt file.
- Can you tell me how the author generated this kind of requirements.txt file? Which tools did they use?
- How can I use this particular file format to instantiate the Python environment? I have tried executing the commands conda install --file requirements.txt and pip install -r requirements.txt on a Windows machine, but to no avail.
https://github.com/wvangansbeke/Unsupervised-Classification/blob/master/requirements.txt
ANSWER
Answered 2021-Oct-17 at 01:46
This looks like a conda environment.yml file. It can be used to create a conda environment, like so:
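The command from the original answer is not preserved here; the usual way to create an environment from such a file (conda reads the environment name from the file itself, and -f accepts any filename) would be:

conda env create -f requirements.txt
conda activate <env-name-from-the-file>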
QUESTION
I am trying to use the KMeans clustering from faiss on a human pose dataset of body joints. I have 16 body parts, so a dimension of 32. The joints are scaled to a range between 0 and 1. My dataset consists of ~900,000 instances. As mentioned by faiss (faiss_FAQ):
As a rule of thumb there is no consistent improvement of the k-means quantizer beyond 20 iterations and 1000 * k training points
Applying this to my problem, I randomly select 50,000 instances for training, as I want to check numbers of clusters k between 1 and 30.
Now to my "problem":
The inertia increases directly as the number of clusters increases (n_cluster on the x-axis):
I tried varying the number of iterations, the number of redos, verbose and spherical, but the results stay the same or get worse. I do not think it is a problem with my implementation; I tested it on a small example with 2D data and very clear clusters, and it worked.
Is the data just clustering badly, or is there another problem/mistake I have missed? Maybe the scaling of the values between 0 and 1? Should I try another approach?
ANSWER
Answered 2021-May-20 at 16:46
I found my mistake. I had to increase the parameter max_points_per_centroid. As I have so many data points, it sampled a sub-batch for the fit. For a larger number of clusters this sub-batch is larger. See the faiss FAQ:
max_points_per_centroid * k: there are too many points, making k-means unnecessarily slow. Then the training set is sampled
The larger sub-batch of course has a larger inertia, as there are more points in total.
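A minimal sketch of raising this parameter when clustering with the faiss Python wrapper (the value 1000000 is illustrative, chosen so that no sub-sampling happens for ~900k points; the data is a random stand-in):

import numpy as np
import faiss

d = 32                                                  # 16 body joints * 2 coordinates
x = np.random.random((900000, d)).astype('float32')    # stand-in for the pose dataset

k = 20
kmeans = faiss.Kmeans(
    d, k,
    niter=20,
    verbose=True,
    max_points_per_centroid=1000000,    # large enough that the full training set is used
)
kmeans.train(x)

print(kmeans.obj[-1])                   # final k-means objective (inertia) after training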
QUESTION
So, I have a task where I need to measure the similarity between two texts. These texts are short descriptions of products from a grocery store. They always include a name of a product (for example, milk), and they may include a producer and/or size, and maybe some other characteristics of a product.
I have a whole set of such texts, and then, when a new one arrives, I need to determine whether there are similar products in my database and measure how similar they are (on a scale from 0 to 100%).
The thing is: the texts may be in two different languages: Ukrainian and Russian. Also, if there is a foreign brand (like Coca Cola), it will be written in English.
My initial idea for solving this task was to get multilingual word embeddings (where similar words in different languages are located nearby) and find the distance between those texts. However, I am not sure how effective this will be, nor, if the idea is OK, what to start with.
Because each text I have is just a set of product characteristics, word embeddings based on context may not work (I'm not sure about this statement; it is just my assumption).
So far, I have tried to get familiar with the MUSE framework, but I encountered an issue with the faiss installation.
Hence, my questions are:
- Is my idea with word embeddings worth trying?
- Is there maybe a better approach?
- If the idea with word embeddings is okay, which ones should I use?
Note: I have Windows 10 (in case some libraries don't work on Windows), and I need the library to work with Ukrainian and Russian languages.
Thanks in advance for any help! Any advice would be highly appreciated!
ANSWER
Answered 2021-Mar-12 at 23:32
Word embeddings are meaningful inside a language but are not transferable to other languages. The intuition behind this statement: if two words co-occur a lot within sentences, their embeddings end up near each other. Hence, as there is no one-to-one mapping between two general languages, you cannot compare word embeddings across them.
However, if two languages are similar enough for a one-to-one mapping of words, your idea may work.
In sum, without translation, your idea is not applicable to two arbitrary languages.
QUESTION
I've been developing my project with pip (django, drm etc). Now I need to use faiss, which only has an unofficial package on pip (the official one is on conda). What should I do in this situation? Can I combine them somehow? Or should I migrate to conda?
ANSWER
Answered 2021-Jan-31 at 00:28
If you're using a non-conda environment, then you're limited to using pip only. That is, pip does not know how to install conda packages.
But if you switch to using conda, then you can use either. The general recommendation is to install everything with conda if possible, but use pip when you have no other choice.
I recommend installing Miniconda, then creating a new environment for all of your dependencies. If necessary, add pip-only dependencies (if you have any).
QUESTION
I've installed faiss. However, when I try to import it, it throws the following error.
ANSWER
Answered 2020-Nov-14 at 15:24
If you installed the CPU version of faiss with pip install faiss, try this:
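The commands from the original answer are not preserved here; a likely fix (the maintained community pip wheels are published as faiss-cpu and faiss-gpu rather than faiss) would be:

pip uninstall faiss
pip install faiss-cpu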
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install faiss
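The installation commands are not listed on this page; the standard options are the official conda packages and the community-maintained pip wheels:

# official conda packages (CPU-only or CUDA-enabled)
conda install -c pytorch faiss-cpu
conda install -c pytorch faiss-gpu

# community-maintained pip wheel (CPU-only)
pip install faiss-cpu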