doc2vec | an example of using doc2vec to calculate the similarity of docs | Machine Learning library
kandi X-RAY | doc2vec Summary
An example of using doc2vec to calculate the similarity of documents. 10.txt is an example input file.
Top functions reviewed by kandi - BETA
- Subtract elements from a, b.
doc2vec Key Features
doc2vec Examples and Code Snippets
Community Discussions
Trending Discussions on doc2vec
QUESTION
I'm working on a Django project where I have to use a Doc2Vec model to predict the most similar articles based on user input. I trained the model on articles in our database, and when I test it from a standalone .py file (right-clicking the file and selecting Run from the context menu) it works. The problem is that now I'm moving that working code into a Django view function that loads the model and predicts articles from a user-given abstract text, and I'm getting a FileNotFoundError.
I have searched how to load a model in Django and the approach seems to be right. Here is the complete exception:
FileNotFoundError at /searchresult
[Errno 2] No such file or directory: 'd2vmodel.model'
Request Method: GET
Request URL: http://127.0.0.1:8000/searchresult
Django Version: 3.1.5
Exception Type: FileNotFoundError
Exception Value:
[Errno 2] No such file or directory: 'd2vmodel.model'
Exception Location: C:\Users\INZIMAM_TARIQ\AppData\Roaming\Python\Python37\site-packages\smart_open\smart_open_lib.py, line 346, in _shortcut_open
Python Executable: C:\Program Files\Python37\python.exe
Python Version: 3.7.9
Python Path:
['D:\Web Work\doc2vec final submission',
'C:\Program Files\Python37\python37.zip',
'C:\Program Files\Python37\DLLs',
'C:\Program Files\Python37\lib',
'C:\Program Files\Python37',
'C:\Users\INZIMAM_TARIQ\AppData\Roaming\Python\Python37\site-packages',
'C:\Program Files\Python37\lib\site-packages']
Server time: Mon, 24 May 2021 12:44:47 +0000
D:\Web Work\doc2vec final submission\App\views.py, line 171, in searchresult
model = Doc2Vec.load("d2vmodel.model")
Here is my Django function where I'm loading the Doc2Vec model.
...ANSWER
Answered 2021-May-24 at 13:16
Move the models from App to the root directory of your project (I think it is 'doc2vec final submission'), or create a folder named 'models' inside 'doc2vec final submission'.
Then change the load call so the path points at where the model file actually lives (a sketch follows).
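A hedged sketch of that fix, assuming gensim 4.x (where the doc-vectors live on model.dv) and a hypothetical models/ folder under the project root; the view body is illustrative, not the asker's actual code:

import os
from django.conf import settings
from django.http import HttpResponse
from gensim.models.doc2vec import Doc2Vec

# Hypothetical layout: <project root>/models/d2vmodel.model
MODEL_PATH = os.path.join(settings.BASE_DIR, "models", "d2vmodel.model")

def searchresult(request):
    # Load by absolute path so lookup doesn't depend on the working directory.
    model = Doc2Vec.load(MODEL_PATH)
    abstract = request.GET.get("abstract", "")
    inferred = model.infer_vector(abstract.lower().split())
    similar = model.dv.most_similar([inferred], topn=5)
    return HttpResponse(str(similar))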
QUESTION
I have trained a gensim doc2vec model for an English news recommender system. The model was trained on 40K news items. I am using the code below to recommend the top 5 most similar news items for, e.g., news_1:
...ANSWER
Answered 2021-May-19 at 09:07
There's a bulk contiguous vector structure initially created by training, for the initial known set of vectors. It's amenable to the every-candidate bulk vector calculation at the heart of most_similar() - so that operation goes about as fast as it can, with the right vector libraries for your OS/processor.
But that structure wasn't originally designed with incremental expansion in mind. Indeed, if you have 1 million vectors in a dense array, then want to add 1 to the end, the straightforward approach requires you to allocate a new 1-million-and-1 long array, bulk copy over the 1 million, then add the last 1. That works, but what seems like a "tiny" operation then takes a while, and ever longer as the structure grows. And each add more than doubles the temporary memory usage, for the bulk copy. So the naive pattern of adding a whole bunch of new items individually in a loop can be really slow & memory-intensive.
So, Gensim hasn't yet focused on providing a set-of-vectors that's easy & efficient to incrementally grow with new vectors. But, it's still indirectly possible, if you understand the caveats.
Especially in gensim-4.0.0 & above, the .dv set of doc-vectors is an instance of KeyedVectors with all that class's standard functions. Those include the add_vector() and add_vectors() methods.
You can try these methods to add your new inferred vectors to the model.dv object - and then they'll also be included in follow-up most_similar() results (see the sketch after the caveats below).
But keep in mind:
- The above caveats about performance & memory usage - which may be minor concerns as long as your dataset isn't too large, or manageable if you do additions in occasional larger batches.
- The containing Doc2Vec model generally isn't expecting its internal .dv to be arbitrarily modified or expanded by other code. So, once you start doing that, parts of the model may not behave as expected. If you have problems with this, you could consider saving aside the full Doc2Vec model before any direct tampering with its .dv, and/or only expanding a completely separate instance of the doc-vectors, for example by saving them aside (e.g. model.dv.save(DOC_VECS_FILENAME)) & reloading them into a separate KeyedVectors (e.g. growing_docvecs = KeyedVectors.load(DOC_VECS_FILENAME)).
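A minimal sketch of that incremental-growth pattern, assuming gensim 4.x; the model file name, the new tag, and the token list are placeholders:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("my_doc2vec.model")          # previously trained model (placeholder name)

new_tokens = ["some", "new", "article", "text"]   # placeholder tokens for an unseen doc
new_vec = model.infer_vector(new_tokens)

# Grow the model's doc-vector KeyedVectors; each add re-allocates the backing
# array, so prefer occasional larger batches (add_vectors) where possible.
model.dv.add_vector("new_doc_1", new_vec)

# The added vector now participates in follow-up most_similar() lookups.
print(model.dv.most_similar("new_doc_1", topn=5))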
QUESTION
I have trained a doc2vec (PV-DM) model in gensim
on documents which fall into a few classes. I am working in a non-linguistic setting where both the number of documents and the number of unique words are small (~100 documents, ~100 words) for practical reasons. Each document has perhaps 10k tokens. My goal is to show that the doc2vec embeddings are more predictive of document class than simpler statistics and to explain which words (or perhaps word sequences, etc.) in each document are indicative of class.
I get good performance from a (cross-validated) classifier trained on the embeddings compared to one trained on the other statistics, but I am still unsure of how to connect the results of the classifier to any features of a given document. Is there a standard way to do this? My first inclination was to simply pass the co-learned word embeddings through the document classifier in order to see which words inhabited which classifier-partitioned regions of the embedding space. The document classes output on word embeddings are very consistent across cross-validation splits, which is encouraging, although I don't know how to turn these effective labels into a statement to the effect of "Document X got label Y because of such and such properties of words A, B and C in the document".
Another idea is to look at similarities between word vectors and document vectors. The ordering of similar word vectors is pretty stable across random seeds and hyperparameters, but the output of this sort of labeling does not correspond at all to the output from the previous method.
Thanks for help in advance.
Edit: Here are some clarifying points. The tokens in the "documents" are ordered, and they are measured from a discrete-valued process whose states, I suspect, get their "meaning" from context in the sequence, much like words. There are only a handful of classes, usually between 3 and 5. The documents are given unique tags and the classes are not used for learning the embedding. The embeddings have rather low dimension, always < 100, and are learned over many epochs, since I am only worried about overfitting when the classifier is learned, not the embeddings. For now, I'm using a multinomial logistic regressor for classification, but I'm not married to it. On that note, I've also tried using the normalized regressor coefficients as a vector in the embedding space to which I can compare words, documents, etc.
...ANSWER
Answered 2021-May-18 at 16:20
That's a very small dataset (100 docs) and vocabulary (100 words) compared to much published work with Doc2Vec, which has usually used tens-of-thousands to millions of distinct documents.
That each doc is thousands of words, and that you're using PV-DM mode (which mixes both doc-to-word and word-to-word contexts for training), helps a bit. I'd still expect you might need to use a smaller-than-default dimensionality (vector_size << 100) & more training epochs - but if it does seem to be working for you, great.
You don't mention how many classes you have, nor what classifier algorithm you're using, nor whether known classes are being mixed into the (often unsupervised) Doc2Vec
training mode.
If you're only using known classes as the doc-tags, and your "a few" classes is, say, only 3, then to some extent you only have 3 unique "documents", which you're training on in fragments. Using only "a few" unique doc-tags might be prematurely hiding variety in the data that could be useful to a downstream classifier.
On the other hand, if you're giving each doc a unique ID - the original 'Paragraph Vectors' paper approach, and then you're feeding those to a downstream classifier, that can be OK alone, but may also benefit from adding the known-classes as extra tags, in addition to the per-doc IDs. (And perhaps if you have many classes, those may be OK as the only doc-tags. It can be worth comparing each approach.)
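For illustration only (the token lists, class labels, and tag names are hypothetical), combining a per-doc ID tag with a known-class tag might look like:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder (tokens, class-label) pairs standing in for the real documents.
docs = [(["state_a", "state_b", "state_c"], 0), (["state_c", "state_d"], 1)]

# Each document gets its own unique ID tag plus a shared known-class tag.
tagged = [
    TaggedDocument(words=tokens, tags=["doc_%d" % i, "class_%d" % label])
    for i, (tokens, label) in enumerate(docs)
]

# Small vectors & more epochs, per the sizing advice above (values illustrative).
model = Doc2Vec(tagged, vector_size=16, epochs=100, min_count=1, workers=4)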
I haven't seen specific work on making Doc2Vec
models explainable, other than the observation that when you are using a mode which co-trains both doc- and word- vectors, the doc-vectors & word-vectors have the same sort of useful similarities/neighborhoods/orientations as word-vectors alone tend to have.
You could simply try creating synthetic documents, or tampering with real documents' words via targeted removal/addition of candidate words, or blended mixes of documents with strong/correct classifier predictions, to see how much that changes either (a) their doc-vector, & the nearest other doc-vectors or class-vectors; or (b) the predictions/relative-confidences of any downstream classifier.
(A wishlist feature for Doc2Vec
for a while has been to synthesize a pseudo-document from a doc-vector. See this issue for details, including a link to one partial implementation. While the mere ranked list of such words would be nonsense in natural language, it might give doc-vectors a certain "vividness".)
When you're not using real natural language, some useful things to keep in mind:
- If your 'texts' are really unordered bags-of-tokens, then window may not really be an interesting parameter. Setting it to a very large number can make sense (to essentially put all words in each other's windows), but may not be practical/appropriate given your large docs. Or, try PV-DBOW instead - potentially even mixing known classes & word-tokens in either tags or words.
- The default ns_exponent=0.75 is inherited from word2vec & natural-language corpora, & at least one research paper (linked from the class documentation) suggests that for other applications, especially recommender systems, very different values may help. (A parameter sketch follows this list.)
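A minimal sketch of those parameter choices, assuming gensim; the specific values (window=1000, ns_exponent=0.9) are illustrative guesses, not recommendations:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = [TaggedDocument(words=["state_a", "state_b", "state_c"], tags=["doc_0"])]  # placeholder

model = Doc2Vec(
    tagged,
    dm=0, dbow_words=1,   # PV-DBOW plus skip-gram word training
    window=1000,          # effectively puts all words in each other's windows
    ns_exponent=0.9,      # non-default negative-sampling exponent (illustrative)
    vector_size=16,
    epochs=100,
    min_count=1,
)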
QUESTION
I am implementing a simple doc2vec with gensim, not a word2vec. I need to remove stopwords, without losing the correct order, from a list of lists.
Each list is a document and, as I understood for doc2vec, the model will take as input a list of TaggedDocuments:
model = Doc2Vec(lst_tag_documents, vector_size=5, window=2, min_count=1, workers=4)
ANSWER
Answered 2021-Apr-25 at 12:30
lower is a list of one element, so word not in STOPWORDS will return False. Take the first item in the list by index and split it on blank space.
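For illustration (the stopword set and documents here are placeholders, not the asker's data), stopwords can be removed with list comprehensions that preserve token order, and each cleaned document wrapped in a TaggedDocument:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

STOPWORDS = {"the", "a", "of", "and"}                             # placeholder stopword set
docs = [["the", "cat", "sat"], ["a", "dog", "and", "a", "cat"]]   # placeholder list of lists

# The inner comprehension keeps the surviving tokens in their original order.
filtered = [[w for w in doc if w not in STOPWORDS] for doc in docs]

lst_tag_documents = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(filtered)]
model = Doc2Vec(lst_tag_documents, vector_size=5, window=2, min_count=1, workers=4)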
QUESTION
I have noticed that my gensim Doc2Vec (DBOW) model is sensitive to document tags. My understanding was that these tags are cosmetic and so they should not influence the learned embeddings. Am I misunderstanding something? Here is a minimal example:
...ANSWER
Answered 2021-Apr-20 at 14:02
Have you checked the magnitude of the differences?
Just running:
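A hedged sketch of one way to check the magnitude of differences between doc-vectors from two runs (the file names and keys are assumptions, not the answer's own snippet):

import numpy as np
from gensim.models.doc2vec import Doc2Vec

# Hypothetical: two models trained identically except for their document tags.
model_a = Doc2Vec.load("dbow_tags_a.model")   # placeholder file names
model_b = Doc2Vec.load("dbow_tags_b.model")

vec_a = model_a.dv[0]    # vector of the first document in each model
vec_b = model_b.dv[0]

print("max absolute difference:", np.abs(vec_a - vec_b).max())
cos = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print("cosine similarity:", cos)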
QUESTION
I preprocessed my docs, trained my model, and saved it by following the guidelines given here: https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html
After a period of time, I want to re-train my model with different parameters. However, I don't want to preprocess the docs and create the "train corpus" again because it takes nearly 3 days. Is there a way to easily load the saved model, change the parameters, and train the model with these new parameters, as in the following code:
...ANSWER
Answered 2021-Apr-20 at 18:24
First, note that this section of your current code does nothing with the loaded model, because it's immediately replaced by the new model created by the 2nd line's instantiation of a model from scratch:
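One hedged way to avoid repeating the slow preprocessing (not from the original answer; file names and parameter values are assumptions) is to persist the prepared corpus once, then reload it and train a brand-new model with the changed parameters:

import pickle
from gensim.models.doc2vec import Doc2Vec

# One-time, right after the (3-day) preprocessing step, where train_corpus
# is the list of TaggedDocument objects that preprocessing produced:
with open("train_corpus.pkl", "wb") as f:
    pickle.dump(train_corpus, f)

# Later sessions: reload the corpus instead of re-preprocessing the raw docs.
with open("train_corpus.pkl", "rb") as f:
    train_corpus = pickle.load(f)

# Train a fresh model with the new parameters on the saved corpus.
model = Doc2Vec(vector_size=100, min_count=2, epochs=40, dm=0)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.save("d2v_new_params.model")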
QUESTION
ANSWER
Answered 2021-Apr-16 at 07:51
The problem is the same as in my prior answer to a similar question:
https://stackoverflow.com/a/66976706/130288
Doc2Vec needs far more data to start working. 9 texts, with maybe 55 total words and perhaps around half that many unique words, are far too small to show any interesting results with this algorithm.
A few of Gensim's Doc2Vec-specific test cases & tutorials manage to squeeze some vaguely understandable similarities out of a test dataset (from a file lee_background.cor) that has 300 documents, each of a few hundred words - so tens of thousands of words, several thousand of which are unique. But it still needs to reduce the dimensionality & up the epochs, and the results are still very weak.
If you want to see meaningful results from Doc2Vec, you should be aiming for tens-of-thousands of documents, ideally with each document having dozens or hundreds of words.
Everything short of that is going to be disappointing and not-representative of what sort of tasks the algorithm was designed to work with.
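A minimal sketch of the "reduce the dimensionality & up the epochs" adjustment for a small corpus (the corpus here is a placeholder and the values are illustrative, not tuned):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# train_corpus stands in for a few hundred real TaggedDocuments
# (e.g. the ~300 docs of lee_background.cor used in the gensim tutorial).
train_corpus = [TaggedDocument(words=["sample", "words", "here"], tags=[str(i)]) for i in range(300)]

model = Doc2Vec(
    train_corpus,
    vector_size=50,   # reduced dimensionality for a small corpus
    min_count=2,
    epochs=40,        # many more passes than the default
)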
There's a tutorial using a larger movie-review dataset (100K documents) that was also used in the original 'Paragraph Vector' paper at:
There's a tutorial based on Wikipedia (millions of documents) that might need some fixup to work nowadays at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
QUESTION
I use gensim 4.0.1 and train doc2vec:
...ANSWER
Answered 2021-Apr-06 at 21:30
Five dimensions is still too many for a toy-sized dataset of just 6 words, 6 unique words, and 3 2-word texts.
None of the Word2Vec/Doc2Vec/FastText-type algorithms works well on tiny amounts of contrived data. They only learn their patterns from many, subtly contrasting usages of words in varied contexts.
Their real strengths only emerge with vectors that are 50, 100, or hundreds of dimensions wide - and training that many dimensions requires a unique vocabulary of (at least) many thousands of words – ideally tens or hundreds of thousands of words – with many usage examples of each. (For a variant like Doc2Vec, you'd similarly want many thousands of varied documents.)
You'll see improved correlations with expected results when using sufficient training data.
QUESTION
I've built a Doc2Vec model with around 3M documents, and now I want to compare it to another model I've previously built. The second model has been scaled to 0-1, so I now also want to scale the gensim model to the same range so that they are comparable. This is my first time using gensim, so I'm not sure how this is done. It's nothing fancy, but this is the code I have so far (model generation code omitted). I thought about scaling (min-max scaling with max/min over the union of vectors) the inferred vectors (v1 and v2), but I don't think this would be the correct approach. The idea here is to compare two documents (with tokens likely to be in the corpus) and output a similarity score between them. I've seen a few of Gensim's tutorials, and they often compare a single string to the corpus' documents, which is not really the idea here.
...ANSWER
Answered 2021-Apr-01 at 17:27
Note that 'cosine similarity' & 'cosine distance' are different things.
A cosine-similarity can range from -1.0 to 1.0 – but in some models, such as those based only on positive word counts, you might only practically see values from 0.0 to 1.0. But in both cases, items with similarities close to 1.0 are most-similar.
On the other hand, a cosine-distance can range from 0.0 to 2.0, and items with a distance of 0.0 are least-distant (or nearest). A cosine-distance can be larger than 1.0 - but you might only see such distances in models which use the dense coordinate space (like Doc2Vec), not in word-count models, which leave half the coordinate space (the negative coordinates) empty.
So: you shouldn't really be calling your function similarity if it's returning a distance, and if it's now returning surprising numbers over 1.0, there's nothing wrong: that's possible in some models, but not others.
You could naively rescale the 0.0 to 2.0 distances that your calculation will get with Doc2Vec vectors, with some crude hammer like:
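As a hedged sketch of that kind of crude rescaling (an illustration, not the answer's exact snippet), the 0.0-2.0 cosine distance can be mapped onto a 0-1 score:

import numpy as np

def scaled_similarity(vec1, vec2):
    # Cosine similarity lies in [-1.0, 1.0]; cosine distance = 1 - similarity, in [0.0, 2.0].
    cos_sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
    cos_dist = 1.0 - cos_sim
    # Crude rescale: distance 0.0 -> score 1.0, distance 2.0 -> score 0.0.
    return 1.0 - (cos_dist / 2.0)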
QUESTION
I am building a Doc2Vec model with 1000 documents using Gensim. Each document consists of several sentences, which include multiple words.
Example)
Doc1: [[word1, word2, word3], [word4, word5, word6, word7],[word8, word9, word10]]
Doc2: [[word7, word3, word1, word2], [word1, word5, word6, word10]]
Initially, to train the Doc2Vec, I first split the sentences and tagged each sentence with the same document tag using "TaggedDocument". As a result, I got the final training input for Doc2Vec as follows:
TaggedDocument(words=[word1, word2, word3], tags=['Doc1'])
TaggedDocument(words=[word4, word5, word6, word7], tags=['Doc1'])
TaggedDocument(words=[word8, word9, word10], tags=['Doc1'])
TaggedDocument(words=[word7, word3, word1, word2], tags=['Doc2'])
TaggedDocument(words=[word1, word5, word6, word10], tags=['Doc2'])
However, would it be okay to train the model with the document as a whole without splitting sentences?
TaggedDocument(words=[word1, word2, word3,word4, word5, word6, word7,word8, word9, word10], tags=['Doc1'])
TaggedDocument(words=[word4, word5, word6, word7,word1, word5, word6, word10], tags=['Doc2'])
Thank you in advance :)
...ANSWER
Answered 2021-Mar-17 at 04:11
Both approaches are going to be very similar in their effect.
The slight difference is that in PV-DM modes (dm=1), or PV-DBOW with added skip-gram training (dm=0, dbow_words=1), if you split by sentence, words in different sentences will never be within the same context window.
For example, your 'Doc1' words 'word3' and 'word4' would never be averaged together in the same PV-DM context-window average, nor be used to PV-DBOW skip-gram predict each other, if you split by sentences. If you just run the whole doc's words together into a single TaggedDocument example, they would interact more, via appearing in shared context windows.
Whether one or the other is better for your purposes is something you'd have to evaluate in your own analysis - it could depend a lot on the nature of the data & desired similarity results.
But, I can say that your second option, all the words in one TaggedDocument, is the more common/traditional approach.
(That is, as long as the document is still no more than 10,000 tokens long. If longer, splitting the doc's words into multiple TaggedDocument instances, each with the same tags, is a common workaround for an internal 10,000-token implementation limit.)
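A hedged sketch of that workaround for overlong documents (the chunking helper, tag name, and placeholder document are illustrative):

from gensim.models.doc2vec import TaggedDocument

MAX_TOKENS = 10000   # gensim's internal per-TaggedDocument token limit

def to_tagged_chunks(words, doc_tag):
    # One TaggedDocument per 10,000-token chunk, all sharing the same tag,
    # so every chunk contributes to training the same doc-vector.
    return [
        TaggedDocument(words=words[i:i + MAX_TOKENS], tags=[doc_tag])
        for i in range(0, len(words), MAX_TOKENS)
    ]

long_doc_words = ["word"] * 25000                        # placeholder long document
training_examples = to_tagged_chunks(long_doc_words, "Doc1")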
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install doc2vec
You can use doc2vec like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.