
faiss | efficient similarity search and clustering of dense vectors | Machine Learning library

kandi X-RAY | faiss Summary

faiss is a C++ library typically used in Artificial Intelligence, Machine Learning, Deep Learning, Tensorflow and Bert applications. faiss has no bugs, no vulnerabilities, a Permissive License and medium support. You can download it from GitHub.
Faiss contains several methods for similarity search. It assumes that the instances are represented as vectors and identified by an integer, and that the vectors can be compared with L2 (Euclidean) distances or dot products. Vectors that are similar to a query vector are those that have the lowest L2 distance or the highest dot product with the query vector. Faiss also supports cosine similarity, since this is simply a dot product on normalized vectors. Most of the methods, such as those based on binary vectors and compact quantization codes, use only a compressed representation of the vectors and do not require keeping the original vectors. This generally comes at the cost of a less precise search, but these methods can scale to billions of vectors in main memory on a single server.

The GPU implementation can accept input from either CPU or GPU memory. On a server with GPUs, the GPU indexes can be used as a drop-in replacement for the CPU indexes (e.g., replace IndexFlatL2 with GpuIndexFlatL2) and copies to/from GPU memory are handled automatically. Results will be faster, however, if both input and output stay resident on the GPU. Both single-GPU and multi-GPU usage is supported.
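
As a minimal illustration of that workflow, here is a sketch in Python, assuming the faiss Python bindings and random demo data (it is not taken from the library's documentation): it builds an exact L2 index, performs cosine search via normalized vectors and an inner-product index, and shows the GPU drop-in commented out, since that requires a GPU build of faiss.

import numpy as np
import faiss

d = 64                                                  # vector dimensionality
xb = np.random.random((10000, d)).astype('float32')    # database vectors
xq = np.random.random((5, d)).astype('float32')        # query vectors

# Exact L2 (Euclidean) search on the CPU
index = faiss.IndexFlatL2(d)        # stores the full vectors, no compression
index.add(xb)                       # ids are assigned sequentially on add
D, I = index.search(xq, 4)          # distances and ids of the 4 nearest neighbours

# Cosine similarity = inner product on L2-normalized vectors
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)
ip_index = faiss.IndexFlatIP(d)
ip_index.add(xb)
D_cos, I_cos = ip_index.search(xq, 4)

# GPU drop-in (requires a GPU build of faiss):
# res = faiss.StandardGpuResources()
# gpu_index = faiss.index_cpu_to_gpu(res, 0, index)   # same API as the CPU index
# D_gpu, I_gpu = gpu_index.search(xq, 4)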

Support

  • faiss has a medium active ecosystem.
  • It has 14276 stars and 2299 forks. There are 439 watchers for this library.
  • There was 1 major release in the last 12 months.
  • There are 171 open issues and 1389 closed issues. On average, issues are closed in 42 days. There are 12 open pull requests and 0 closed pull requests.
  • It has a neutral sentiment in the developer community.
  • The latest version of faiss is v1.7.1.

Quality

  • faiss has 0 bugs and 0 code smells.

Security

  • faiss has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • faiss code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

License

  • faiss is licensed under the MIT License. This license is Permissive.
  • Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

  • faiss releases are available to install and integrate.
  • Installation instructions are not available. Examples and code snippets are available.
  • It has 13801 lines of code, 1021 functions and 92 files.
  • It has medium code complexity. Code complexity directly impacts maintainability of the code.

faiss Key Features

A library for efficient similarity search and clustering of dense vectors.

faiss Examples and Code Snippets

  • Reference
  • What is the equivalent of python's faiss.normalize_L2() in C++?
  • faiss: How to retrieve vector by id from python
  • A weird requirements.txt format
  • What to do if you need packages from both conda and pip?
  • Unable to import faiss
  • Numpy - referencing values in other array for multidimensional array
  • ValueError: None is only supported in the 1st dimension. Tensor 'flatbuffer_data' has invalid shape '[None, None, 1, 512]'

Reference

@article{JDH17,
  title={Billion-scale similarity search with GPUs},
  author={Johnson, Jeff and Douze, Matthijs and J{\'e}gou, Herv{\'e}},
  journal={arXiv preprint arXiv:1702.08734},
  year={2017}
}

Community Discussions

Trending Discussions on faiss
  • Writing to a file parallely while processing in a loop in python
  • ModuleNotFoundError: No module named 'milvus'
  • What is the equivalent of python's faiss.normalize_L2() in C++?
  • faiss: How to retrieve vector by id from python
  • Flask app.route always giving 404 errors except /
  • A weird requirements.txt format
  • k-mean clustering - inertia only gets larger
  • What is the best approach to measure a similarity between texts in multiple languages in python?
  • What to do if you need packages from both conda and pip?
  • Unable to import faiss

QUESTION

Writing to a file parallely while processing in a loop in python

Asked 2022-Feb-23 at 19:25

I have CSV data with 65K rows. I need to do some processing for each CSV line, which generates a string at the end. I have to write/append that string to a file.

Pseudocode:

for row in csv_data:
   processed_string = ...
   file_pointer.write(processed_string + '\n')

How can I make this write operation run in parallel, so that the main processing does not have to include the time taken for writing to the file? I tried batch writing (store n lines, then write them all at once). But it would be really great if you could suggest a method that can do this in parallel. Thanks!

Edit: There are 65K records in the CSV file. Processing each record generates a multiline string (about 10-12 lines); over the 65K records I get results of about 10-15 lines each, which I have to write to a file. Normally the code takes about 10 minutes to run, but adding these file operations increases that by another 2-3 minutes. Can I do the writing in parallel without affecting the rest of the execution?

Here is the code part.

for i in range(len(queries)): # 65K runs
    Logs.log_query(i, name, version)

    # processed_results = Some processing ...

    # Final Answer
    s = final_results(name, version, processed_results) # Returns a multiline string
    f.write(s + '\n')

"""
EXAMPLE OUTPUT:
-----------------
[0] NAME: Adobe Acrobat Reader DC | VERSION: 21.005
FAISS RESULTS (with cutoff 0.63)
     id                                               name                         version   eol_date extended_eol_date                   major_version minor_version    score
1486469                            Adobe Acrobat Reader DC                    21.005.20054 07-04-2020        07-07-2020                              21           005 0.966597
 327901                            Adobe Acrobat Reader DC                    21.005.20048 07-04-2020        07-07-2020                              21           005 0.961541
 327904                            Adobe Acrobat Reader DC                    21.007.20095 07-04-2020        07-07-2020                              21           007 0.960825
 327905                            Adobe Acrobat Reader DC                    21.007.20099 07-04-2020        07-07-2020                              21           007 0.960557
 327902                            Adobe Acrobat Reader DC                    21.005.20060 07-04-2020        07-07-2020                              21           005 0.958580
 327900                            Adobe Acrobat Reader DC                    21.001.20145 07-04-2020        07-07-2020                              21           001 0.956085
 327903                            Adobe Acrobat Reader DC                    21.007.20091 07-04-2020        07-07-2020                              21           007 0.954148
1486465                            Adobe Acrobat Reader DC                    20.006.20034 07-04-2020        07-07-2020                              20           006 0.941820
1486459                            Adobe Acrobat Reader DC                    19.012.20035 07-04-2020        07-07-2020                              19           012 0.928502
1486466                            Adobe Acrobat Reader DC                    20.012.20048 07-04-2020        07-07-2020                              20           012 0.928366
1486458                            Adobe Acrobat Reader DC                    19.012.20034 07-04-2020        07-07-2020                              19           012 0.925761
1486461                            Adobe Acrobat Reader DC                    19.021.20047 07-04-2020        07-07-2020                              19           021 0.922519
1486463                            Adobe Acrobat Reader DC                    19.021.20049 07-04-2020        07-07-2020                              19           021 0.919659
1486462                            Adobe Acrobat Reader DC                    19.021.20048 07-04-2020        07-07-2020                              19           021 0.917590
1486464                            Adobe Acrobat Reader DC                    19.021.20061 07-04-2020        07-07-2020                              19           021 0.912260
1486460                            Adobe Acrobat Reader DC                    19.012.20040 07-04-2020        07-07-2020                              19           012 0.909160
1486457                            Adobe Acrobat Reader DC                    15.008.20082 07-04-2020        07-07-2020                              15           008 0.902536
 327899                                   Adobe Acrobat DC                    21.007.20099 07-04-2020        07-07-2020                              21           007 0.895940
1277732                        Acrobat Reader DC (classic)                            2015 07-07-2020                 *                            2015           NaN 0.875471

OPEN SEARCH RESULTS (with cutoff 13)
{ "score": 67.98198, "id": 327901, "name": Adobe Acrobat Reader DC, "version": 21.005.20048, "eol_date": 2020-04-07, "extended_eol_date": 2020-07-07 }
{ "score": 66.63623, "id": 327902, "name": Adobe Acrobat Reader DC, "version": 21.005.20060, "eol_date": 2020-04-07, "extended_eol_date": 2020-07-07 }
{ "score": 65.96028, "id": 1486469, "name": Adobe Acrobat Reader DC, "version": 21.005.20054, "eol_date": 2020-04-07, "extended_eol_date": 2020-07-07 }
FINAL ANSWER [OPENSEARCH]
{ "score": 67.98198, "id": 327901, "name": Adobe Acrobat Reader DC, "version": 21.005.20048, "eol_date": 2020-04-07, "extended_eol_date": 2020-07-07 }
----------------------------------------------------------------------------------------------------

"""

ANSWER

Answered 2022-Feb-23 at 19:25

Q : " Writing to a file parallely while processing in a loop in python ... "

A :
Frankly speaking, the file-I/O is not your performance-related enemy.

"With all due respect to the colleagues, Python (since ever) used GIL-lock to avoid any level of concurrent execution ( actually re-SERIAL-ising the code-execution flow into dancing among any amount of threads, lending about 100 [ms] of code-interpretation time to one-AFTER-another-AFTER-another, thus only increasing the interpreter's overhead times ( and devastating all pre-fetches into CPU-core caches on each turn ... paying the full mem-I/O costs on each next re-fetch(es) ). So threading is ANTI-pattern in python (except, I may accept, for network-(long)-transport latency masking ) – user3666197 44 mins ago "

Given that the roughly 65k records listed in the CSV ought to get processed ASAP, performance-tuned orchestration is the goal, and file-I/O is just a negligible (and by-design well latency-maskable) part of it (which does not mean we cannot screw it up even more by organising it into another performance-devastating ANTI-pattern, can we?).


Tip #1 : avoid & resist using any low-hanging-fruit SLOCs if The Performance is the goal


If the code starts with a cheapest-ever iterator clause,
be it the mock-up for aRow in aCsvDataSET: ...
or the real code's for i in range( len( queries ) ): ... - these constructs (besides being known for ages as an awfully slow part of Python's code-interpretation capabilities, the second one even being an iterator-on-a-range()-iterator in Py3 and a silent RAM-killer in the Py2 ecosystem for any larger range) look nice in "structured-programming" evangelisation, as they form a syntax-compliant separation of a deeper-level part of the code, yet they do so at an awfully high cost due to the accumulation of repetitively paid overheads. The finally injected need to "coordinate" unordered concurrent file-I/O operations, which in principle is not necessary at all if done smart, is one example of the adverse performance impact of such trivial SLOCs (and similarly poor design decisions).

Better way?

  • a ) avoid the top-level (slow & overhead-expensive) looping
  • b ) "split" the 65k-parameter space into not many more blocks than there are memory-I/O channels present on your physical device (the scoring process, I can guess from the posted text, is memory-I/O intensive, as some model has to go through all the texts for the scoring to happen)
  • c ) spawn n_jobs-many process workers via joblib.Parallel( n_jobs = ... )( delayed( <_scoring_fun_> )( block_start, block_end, ...<_params_>... ) ) and run the scoring_fun(...) for each such distributed block-part of the 65k-long parameter space (see the sketch after this list)
  • d ) having computed the scores and related outputs, each worker-process can and shall file-I/O its own results into its private, exclusively owned, conflict-free output file
  • e ) having finished all partial block-parts' processing, the main Python process can just join the already ([CONCURRENTLY] created, smoothly & non-blockingly O/S-buffered / interleaved-flow, real-hardware-deposited) stored outputs, if such a need exists at all,
    and
    finito - we are done (knowing there is no faster way to compute the same block-of-tasks, which are principally embarrassingly independent, besides the need to orchestrate them collision-free with minimised add-on costs).
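
A minimal sketch of steps b)-e) in Python, assuming a hypothetical score_block() worker and results_*.txt file names that merely stand in for the real scoring code of the question (none of these names come from the original answer):

import os
from joblib import Parallel, delayed

def score_block(block_id, block_start, block_end, queries, out_dir):
    # Hypothetical worker: scores queries[block_start:block_end] and writes
    # its own results into a private file, so no cross-process coordination
    # of file-I/O is ever needed.
    out_path = os.path.join(out_dir, "results_%02d.txt" % block_id)
    with open(out_path, "w") as f:
        for i in range(block_start, block_end):
            s = "processed %s" % queries[i]     # stands in for final_results(...)
            f.write(s + "\n")
    return out_path

def run_all(queries, n_jobs=4, out_dir="results"):
    os.makedirs(out_dir, exist_ok=True)
    block = (len(queries) + n_jobs - 1) // n_jobs
    bounds = [(b, b * block, min((b + 1) * block, len(queries)))
              for b in range(n_jobs)]
    # One worker process per block; each writes its own private output file.
    partial_files = Parallel(n_jobs=n_jobs)(
        delayed(score_block)(b, lo, hi, queries, out_dir)
        for b, lo, hi in bounds)
    # The main process only joins (concatenates) the per-worker files at the end.
    with open(os.path.join(out_dir, "results_all.txt"), "w") as out:
        for p in partial_files:
            with open(p) as part:
                out.write(part.read())

With one private file per worker there is nothing to lock or serialise during the run; the only shared step is the cheap final concatenation.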

If interested in tweaking real-system End-to-End processing performance,
start with an lstopo map,
next verify the number of physical memory-I/O channels,
and
then experiment a bit with the Python joblib.Parallel()-process instantiation, under-subscribing or over-subscribing n_jobs a bit below or a bit above the number of physical memory-I/O channels. If the actual processing has some maskable latencies hidden from us, there may be a chance to spawn more n_jobs workers, as long as End-to-End processing performance keeps steadily growing, until system noise hides any further performance-tweaking effects.
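
A tiny sketch of that experiment, reusing the hypothetical run_all() helper sketched above (the candidate n_jobs values are illustrative only):

import time

def sweep_n_jobs(queries, candidates=(2, 4, 8, 16)):
    # Time the End-to-End run for a few n_jobs settings and return the fastest one.
    timings = {}
    for n in candidates:
        t0 = time.perf_counter()
        run_all(queries, n_jobs=n)
        timings[n] = time.perf_counter() - t0
        print("n_jobs=%3d  ->  %.1f s end-to-end" % (n, timings[n]))
    return min(timings, key=timings.get)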

A Bonus part - why un-managed sources of latency kill The Performance

Source https://stackoverflow.com/questions/71233138

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install faiss

You can download it from GitHub.

Support

The following are entry points for documentation:
