BM25 | A Python implementation of the BM25 ranking function | Search Engine library
kandi X-RAY | BM25 Summary
A Python implementation of the BM25 ranking function.
Top functions reviewed by kandi - BETA
- Run the query
- Calculate the score for a given query
- Compute the Bayesian Score Function
- Returns the average length of the table
- Return the length of a document
- Compute the KL coefficient for a given density
- Build data structures from a corpus
- Add a word to the index
- Add a document to the table
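The functions listed above map onto the standard BM25 formula. As a reference point, here is a minimal self-contained sketch of the scoring step in plain Python (the function name and parameters below are illustrative, not this library's API):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Minimal BM25: score one tokenized document against a tokenized query.

    corpus is a list of tokenized documents, used to derive the inverse
    document frequency and the average document length.
    """
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))   # smoothed IDF
        tf = doc_terms.count(term)                        # term frequency
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score
```

A real implementation precomputes document frequencies and the average document length once at index-build time rather than per query, which is what the corpus-building functions above are for.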
Community Discussions
Trending Discussions on BM25
QUESTION
After I added synonym analyzer to my_index, the index became case-sensitive
I have one property called nationality that has a synonym analyzer. But it seems that this property became case-sensitive because of the synonym analyzer.
Here is my /my_index/_mappings
ANSWER
Answered 2022-Mar-22 at 21:14
Did you apply the synonym filter after adding your data to the index? If so, the phrase "India COUNTRY" was probably indexed exactly as "India COUNTRY". When you later sent a match query, the query was analyzed and became "INDIA COUNTRY" because of your uppercase filter; it still matched, because a match query only needs one of the words to match, and "COUNTRY" provides that match.
But when you sent the one-word query "india", it was analyzed and converted to "INDIA" by your uppercase filter, and there is no matching token in your index; you only have a document containing "India COUNTRY".
My answer involves some assumptions, but I hope it helps you understand your problem.
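A common fix, sketched below, is to put a lowercase filter ahead of the synonym filter in the analyzer chain so that indexed and queried text are normalized identically. The analyzer and filter names and the synonym list here are illustrative, not taken from the question:

```python
# Index settings with "lowercase" running *before* the synonym filter,
# so case differences can no longer prevent matches.
settings = {
    "analysis": {
        "filter": {
            "my_synonyms": {
                "type": "synonym",
                "synonyms": ["india, bharat"],
            }
        },
        "analyzer": {
            "my_synonym_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "my_synonyms"],
            }
        },
    }
}
```

With the official Python client these settings would be passed when (re)creating the index, and existing documents must be reindexed for the change to take effect.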
QUESTION
Let's say there is an existing index with a customized BM25 similarity metric like this:
...ANSWER
Answered 2022-Feb-07 at 09:40
I believe you can't.
QUESTION
I want to implement Elasticsearch on a customized corpus. I have installed Elasticsearch version 7.5.1 and do all my work in Python using the official client.
Here I have a few questions:
- How do I customize the preprocessing pipeline? For example, I want to use a BertTokenizer to convert strings to tokens instead of n-grams.
- How do I customize the scoring function of each document with respect to the query? For example, I want to compare the effects of tf-idf with bm25, or even use some neural models for scoring.
If there is a good tutorial in Python, please share it with me. Thanks in advance.
...ANSWER
Answered 2021-Nov-04 at 10:53
You can customize the similarity function when creating an index. See the Similarity Module section of the documentation. You can find a good article that compares classical TF-IDF with BM25 on the OpenSource Connections site.
In general, Elasticsearch uses an inverted index to look up all documents that contain a specific word or token.
It sounds like you want to use vector fields for scoring, there is a good article on the elastic blog that explains how you can achieve that. Be aware that as of now Elasticsearch is not using vector fields for retrieval, only for scoring, if you want to use vector fields for retrieval you have to use a plugin, or the OpenSearch fork, or wait for version 8.
In my opinion, using ANN in real time during search is too slow and expensive, and I have yet to see improvements in relevancy over normal search requests.
I would do the preprocessing of your documents in your own python environment before indexing and not use any Elasticsearch pipelines or plugins. It is easier to debug and iterate outside of Elasticsearch.
You could also take a look at the Haystack project; it might have a lot of the functionality that you are looking for already built in.
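As the answer says, the similarity is fixed at index-creation time. A minimal sketch of what that looks like through the official Python client; the index name, field name, similarity name, and the k1/b values are all illustrative:

```python
# Custom BM25 similarity declared at index creation and attached to a field.
settings = {
    "index": {
        "similarity": {
            "my_bm25": {
                "type": "BM25",
                "k1": 1.2,   # term-frequency saturation
                "b": 0.75,   # document-length normalization
            }
        }
    }
}
mappings = {
    "properties": {
        "body": {"type": "text", "similarity": "my_bm25"},
    }
}
# es.indices.create(index="my_corpus", settings=settings, mappings=mappings)
```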
QUESTION
I have an index with 1 million phrases and I want to search it with query phrases in Italian (that is not the problem). The problem is the order in which the matches are retrieved: I want the exact matches first, so I changed the default similarity to "boolean". I thought it was a good idea, but sometimes it does not work. For example, searching my index for phrases containing the words "film cortometraggio", the first matches are:
- Distribuito dalla General Film Company, il film- un cortometraggio in due bobine
- Distribuito dalla General Film Company, il film - un cortometraggio di 150 metri - uscì nelle sale cinematografiche
But there are some better phrases that should be returned before those ones like:
- Robinet aviatore Robinet aviatore è un film cortometraggio del 1911 diretto da Luigi Maggi;
This last phrase should be returned first, in my opinion, because the two words I am searching for appear next to each other.
Using the BM25 algorithm, the first match that I get is "Pappi Corsicato Ha diretto film, cortometraggi, documentari e videoclip.". In this case too, the phrase "Robinet aviatore Robinet aviatore è un film cortometraggio del 1911 diretto da Luigi Maggi;" should be returned first, because it is an exact match, and I don't see why the algorithm gives the other phrase a higher score.
I am using the Java REST high-level client, and the search queries I run are simple match-phrase queries, like this: searchSourceBuilder.query(QueryBuilders.matchPhraseQuery(field, text).slop(5))
This is the structure of the documents in my index:
...ANSWER
Answered 2021-Nov-03 at 01:20
I have replicated your problem in my environment, same version, same analyzers, and I still received the same results. That is probably due to the BM25 algorithm: the other millions of documents influence the score.
I have some suggestions that could help you to solve the problem:
- Don't use the full stemming analyzers, because they are too intrusive; use the light version
- You could complement the light analyzer using the ngram tokenizer
- You could create a bool query that matches first against the fields without the analyzer, using a multi-field mapping. Example:
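The original example is not reproduced here, but the third suggestion might look roughly like the query body below. The field names (text and its less-analyzed sub-field text.raw) are illustrative; the Java client would build the same JSON:

```python
# Bool query: boost an exact phrase on a less-analyzed sub-field, and
# fall back to a sloppy phrase match on the analyzed field, so exact
# matches rank first.
query = {
    "bool": {
        "should": [
            {"match_phrase": {"text.raw": {"query": "film cortometraggio",
                                           "boost": 5}}},
            {"match_phrase": {"text": {"query": "film cortometraggio",
                                       "slop": 5}}},
        ],
        "minimum_should_match": 1,
    }
}
```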
QUESTION
I am deploying a simple text retrieval system with Vespa. However, I found when setting topk to some large number, e.g. 40, the response will include the error message "Summary data is incomplete: Timed out waiting for summary data." and also some unexpected ids. The system works fine for some small topk like 10. The response was as follows:
{'root': {'id': 'toplevel', 'relevance': 1.0, 'fields': {'totalCount': 1983140}, 'coverage': {'coverage': 19, 'documents': 4053984, 'degraded': {'match-phase': False, 'timeout': True, 'adaptive-timeout': False, 'non-ideal-state': False}, 'full': False, 'nodes': 1, 'results': 1, 'resultsFull': 0}, 'errors': [{'code': 12, 'summary': 'Timed out', 'message': 'Summary data is incomplete: Timed out waiting for summary data. 1 responses outstanding.'}], 'children': [{'id': 'index:square_datastore_content/0/34b46b2e96fc0aa18ed4941b', 'relevance': 44.44359956427316, 'source': 'square_datastore_content'}, {'id': 'index:square_datastore_content/0/16dbc34c5e77684cd6f554fd', 'relevance': 43.94371735208669, 'source': 'square_datastore_content'}, {'id': 'index:square_datastore_content/0/9f2fd93f6d74e88f96d7014f', 'relevance': 43.298002713993384, 'source': 'square_datastore_content'}, {'id': 'index:square_datastore_content/0/76c4e3ee15dc684a78938a9d', 'relevance': 40.908658368905485, 'source': 'square_datastore_content'}, {'id': 'index:square_datastore_content/0/c04ceee4b9085a4d041d8c81', 'relevance': 36.13561898237115, 'source': 'square_datastore_content'}, {'id': 'index:square_datastore_content/0/13806c518392ae7b80ab4e4c', 'relevance': 35.688377118163714, 'source': 'square_datastore_content'}, {'id': 'index:square_datastore_content/0/87e0f13fdef1a1c404d3c8c6', 'relevance': 34.74150232183567, 'source': 'square_datastore_content'}, ...]}}
I am using the schema:
...ANSWER
Answered 2021-Aug-29 at 19:27
The default Vespa timeout is 500 ms and can be adjusted with &timeout=x, where x is given in seconds; e.g. &timeout=2 would use an overall request timeout of 2 seconds.
A query is executed in two protocol phases:
- Find the top k matches given the query/ranking profile combination, each node returns up to k results
- The stateless container merges the results and finally asks for summary data (e.g. the contents of only the top k results)
See https://docs.vespa.ai/en/performance/sizing-search.html for an explanation of this.
In your case you are hit by two things:
- A soft timeout at the content node (coverage is reported to be only 19%): within the default timeout of 500 ms it could retrieve and rank only 19% of the available content. At 500 ms minus a safety factor, it timed out and returned what it had managed to retrieve and rank up to that point.
- When trying to use the time left, it also timed out waiting for the summary data of the documents it had managed to retrieve and rank within the soft timeout; this is the incomplete summary data response.
Generally, if you want cheap BM25 search, use WAND (https://docs.vespa.ai/en/using-wand-with-vespa.html). If you want to search using embeddings, use ANN instead of brute-force NN. We also have a complete sample application reproducing DPR (Dense Passage Retrieval) here: https://github.com/vespa-engine/sample-apps/tree/master/dense-passage-retrieval-with-ann
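Putting the two pieces of advice together, a request body with a longer timeout and WAND-style retrieval might look like the sketch below. The query terms and hit count are illustrative; check the Vespa query API documentation for the exact parameters your version supports:

```python
# Vespa HTTP query body: raise the 500 ms default timeout and retrieve
# candidates with weakAnd() for cheap BM25-style ranking.
request_body = {
    "yql": 'select * from sources * where weakAnd('
           'default contains "bm25", default contains "ranking")',
    "hits": 40,
    "timeout": 2,  # seconds
}
```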
QUESTION
I was trying to follow the tutorial - http://ethen8181.github.io/machine-learning/search/bm25_intro.html#ElasticSearch-BM25
I successfully started my Elasticsearch node by running it as a daemon, and it did respond to the query curl -X GET "localhost:9200/"
When I try running the following code here, it returns 400.
...ANSWER
Answered 2020-Nov-23 at 16:38
Calling response.json() or response.text will give you the response body, which may tell you exactly what is wrong with the request.
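A tiny helper along those lines, hypothetical and not part of any library, that works with any requests-style response object exposing status_code and text:

```python
def explain_http_error(response):
    """Return the body of a failed HTTP response, so a bare 400 status
    becomes the actual Elasticsearch error message."""
    if response.status_code >= 400:
        return f"HTTP {response.status_code}: {response.text}"
    return "OK"
```

For a 400 from Elasticsearch, the returned body typically names the offending part of the request (e.g. a mapping or query parsing error).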
QUESTION
I know that when we use the filter function, we can apply a LOWER()/UPPER() function to match our search criterion.
...ANSWER
Answered 2020-Nov-19 at 05:45
You can check the analyzer option. The en_text analyzer should already lowercase the input; if not, you can create another analyzer of type text.
You can check the analyzers docs here: https://www.arangodb.com/docs/stable/arangosearch-analyzers.html#text
QUESTION
I have a sample Vespa instance and I want to train a lightgbm model from the rank-profile. https://docs.vespa.ai/documentation/learning-to-rank.html
However, anytime I specify the recall with the docID, I get 0 hits. I'm using example code from here: https://github.com/vespa-engine/sample-apps/blob/master/text-search/src/python/collect_training_data.py
...ANSWER
Answered 2020-Oct-12 at 18:14
The collect script/function expects a field called id in your document schema. If you alter the script to use the uri field instead, you should be able to retrieve the documents.
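Concretely, the change amounts to pointing the recall parameter at the uri field. A sketch of the query body follows; the document URI and query text are illustrative, and the linked collect_training_data.py script shows the full request:

```python
# Force-recall a specific document via its 'uri' field instead of a
# nonexistent 'id' field. The '+' prefix requires the term to match.
body = {
    "yql": "select * from sources * where userQuery()",
    "query": "what is bm25",
    "recall": "+uri:https://example.com/doc-123",
}
```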
QUESTION
How can I improve recall in this setting? Any suggestions? I want to create an index with 39 million passages, each containing at least four sentences in English. My queries are short, interrogative sentences. I know that a language model with Dirichlet smoothing, stop-word removal, and a stemmer is best for this setting. How can I index with these conditions? (I have indexed with this config, but there is no difference in the results compared to the default BM25.)
My index:
...ANSWER
Answered 2020-Aug-10 at 09:01
You can try specifying the similarity in the query.
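The answer is terse; in Elasticsearch, the Dirichlet-smoothed language model the asker describes is exposed as the LMDirichlet similarity type, configured per field at index creation. The mu value and the names below are illustrative:

```python
# LMDirichlet similarity (language model with Dirichlet smoothing)
# attached to the passage field; mu is the smoothing parameter.
settings = {
    "index": {
        "similarity": {
            "my_lm": {"type": "LMDirichlet", "mu": 2000}
        }
    }
}
mappings = {
    "properties": {
        "passage": {"type": "text", "similarity": "my_lm"},
    }
}
```

Stop-word removal and stemming would be handled separately, in the field's analyzer, so both pieces of the desired setup live in the index settings rather than in the query.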
QUESTION
I am developing a search-engine-type application. My code looks like this:
...ANSWER
Answered 2020-Jul-12 at 17:58use url_for()
function to build the url
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install BM25
You can use BM25 like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.