Document-Similarity | Context-based similar documents with gensim | Topic Modeling library
kandi X-RAY | Document-Similarity Summary
In this repository, I implemented context-based similar documents following Rich Anchor's blog. Context-based similar documents is the problem of finding the documents most similar to a given document. The blog's approach uses an LDA (Latent Dirichlet Allocation) model to build generic topics from the documents in the database, vectorizes each document over those topics, and then uses LSH (Locality Sensitive Hashing) to find the most similar documents (nearest neighbors) of the given document. This software is written in Python 2.x, with the LDA model provided by gensim and LSHForest provided by scikit-learn.
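A minimal sketch of that pipeline, with an invented toy corpus and an illustrative topic count (note that sklearn.neighbors.LSHForest was deprecated in scikit-learn 0.19 and removed in 0.21, so this sketch needs an older scikit-learn release):

# A minimal sketch of the LDA + LSH pipeline, assuming a toy tokenized corpus.
from gensim import corpora, models
from sklearn.neighbors import LSHForest

docs = [["human", "machine", "interface"],
        ["graph", "trees", "minors"],
        ["machine", "learning", "interface"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Train LDA and represent each document as a dense topic-probability vector
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
def topic_vector(bow):
    return [p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]
vectors = [topic_vector(bow) for bow in corpus]

# Index the topic vectors with LSH and query nearest neighbors of document 0
lshf = LSHForest(n_estimators=10).fit(vectors)
distances, indices = lshf.kneighbors([vectors[0]], n_neighbors=2)
print(indices)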
Top functions reviewed by kandi - BETA
- Builds the corpus
- Reads the docs from the input file
Community Discussions
Trending Discussions on Document-Similarity
QUESTION
I have a question about word-representation algorithms: which of word2vec, doc2vec, and TF-IDF is more suitable for text classification tasks? The corpus used in my supervised classification is a list of sentences, both short and long. As discussed in this thread, the choice between doc2vec and word2vec is a matter of document length; as for TF-IDF vs. word embeddings, it's more a matter of text representation.
My other question is: what if, for the same corpus, I had more than one label to attach to a sentence? If I create multiple entries with different labels for the same sentence, does that affect the decisions of the final classification algorithm? How can I tell the model that every label counts equally for every sentence of the document?
Thank you in advance,
...ANSWER
Answered 2020-Mar-27 at 20:42
You should try multiple methods of turning your sentences into 'feature vectors'. There are no hard-and-fast rules; what works best for your project will depend a lot on your specific data, problem domain, and classification goals.
(Don't extrapolate guidelines from other answers – such as the one you've linked that's about document-similarity rather than classification – as best practices for your project.)
To get underway, you may want to focus first on some simple 'binary classification' aspect of your data. For example, pick a single label and train on all the texts, merely trying to predict whether that one label applies or not.
When you have that working, and you have an understanding of each step (corpus prep, text processing, feature vectorization, classification training, classification evaluation), then you can try extending/adapting those steps to either single-label classification (where each text should have exactly one label) or multi-label classification (where each text might have any number of combined labels).
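As a concrete starting point, a minimal sketch of that binary-classification setup, assuming scikit-learn and an invented two-sentence corpus with hypothetical labels:

# A minimal sketch: TF-IDF feature vectors plus a logistic-regression
# classifier predicting whether one chosen label applies to a text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["a short sentence", "a much longer sentence about the same topic"]
has_label = [0, 1]  # hypothetical: 1 if the chosen label applies, else 0

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, has_label)
print(clf.predict(["another sentence to classify"]))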
QUESTION
One-sentence backdrop: I have text data from auto-transcribed talks, and I want to compare the similarity of their content (e.g. what they are talking about) to do clustering and recommendation. I am quite new to NLP.
Data: The data I am using is available here. For all the lazy ones:
git clone https://github.com/TMorville/transcribed_data
and here is a snippet of code to put it in a df:
...ANSWER
Answered 2019-Jan-28 at 07:10
You can do most of that with spaCy and some regexes.
So, you have to take a look at the spaCy API documentation.
The basic steps in any NLP pipeline are the following:
Language detection (self-explanatory: if you're working with some dataset, you know what the language is and can adapt your pipeline to that). Once you know the language, you have to download the correct model from spaCy. The instructions are here. Let's use English for this example. In your command line just type
python -m spacy download en
and then import it into the preprocessing script like this:
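A minimal sketch of that import step, assuming spaCy 2.x (where the en shortcut name downloaded above is valid) and a made-up transcript sentence:

# A minimal sketch: load the English model and run the default spaCy
# pipeline on one (invented) transcript sentence.
import spacy

nlp = spacy.load("en")
doc = nlp("One sentence from an auto-transcribed talk.")

# Lemmas and stop-word flags are common preprocessing outputs
print([(t.text, t.lemma_, t.is_stop) for t in doc])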
QUESTION
I seem to be getting all the correct results until the very last step. My array of results keeps coming back empty.
I'm trying to follow this tutorial to compare 6 sets of notes:
https://www.oreilly.com/learning/how-do-i-compare-document-similarity-using-python
I have this so far:
...ANSWER
Answered 2018-Apr-17 at 12:19
Depending on the content of your raw_docs, this can be the correct behaviour.
Your code returns an empty tf_idf although your query words appear in your original documents and your dictionary. tf_idf is computed as term_frequency * inverse_document_frequency, and inverse_document_frequency is computed as log(N/d), where N is your total number of documents and d is the number of documents a specific term occurs in.
My guess is that your query terms ['client', 'is'] occur in each document of yours, resulting in an inverse_document_frequency of 0 and an empty tf_idf list. You can check this behaviour with the documents I took and modified from the tutorial you mentioned:
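A minimal sketch reproducing that behaviour, assuming gensim and an invented three-document corpus in which every query term occurs in every document:

# A minimal sketch: a term occurring in every document gets
# idf = log(N/N) = 0, so it drops out of the tf-idf vector entirely.
from gensim import corpora, models

raw_docs = ["the client is here", "the client is gone", "the client is back"]
texts = [doc.split() for doc in raw_docs]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)

query = dictionary.doc2bow("client is".split())
print(tfidf[query])  # [] -- every query term appears in all documents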
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported