doc2vec | an example of using doc2vec to calculate the similarity of docs | Machine Learning library
kandi X-RAY | doc2vec Summary
An example of using doc2vec to calculate the similarity of documents. 10.txt is an example input file.
Top functions reviewed by kandi - BETA
- Subtract elements from a, b.
doc2vec Key Features
doc2vec Examples and Code Snippets
Community Discussions
Trending Discussions on doc2vec
QUESTION
I'm working on a Django project where I have to use a Doc2Vec model to predict the most similar articles based on user input. I trained the model on articles in our database, and when I test it from a standalone .py file (right-clicking the file and selecting Run from the context menu) it works. The problem is that now I'm moving that working code into a Django view function that loads the model and predicts articles from a user-given abstract text, and I'm getting a FileNotFoundError.
I have searched how to load a model in Django and the approach seems to be right. Here is the complete exception:
FileNotFoundError at /searchresult
[Errno 2] No such file or directory: 'd2vmodel.model'
Request Method: GET
Request URL: http://127.0.0.1:8000/searchresult
Django Version: 3.1.5
Exception Type: FileNotFoundError
Exception Value:
[Errno 2] No such file or directory: 'd2vmodel.model'
Exception Location: C:\Users\INZIMAM_TARIQ\AppData\Roaming\Python\Python37\site-packages\smart_open\smart_open_lib.py, line 346, in _shortcut_open
Python Executable: C:\Program Files\Python37\python.exe
Python Version: 3.7.9
Python Path:
['D:\Web Work\doc2vec final submission',
'C:\Program Files\Python37\python37.zip',
'C:\Program Files\Python37\DLLs',
'C:\Program Files\Python37\lib',
'C:\Program Files\Python37',
'C:\Users\INZIMAM_TARIQ\AppData\Roaming\Python\Python37\site-packages',
'C:\Program Files\Python37\lib\site-packages']
Server time: Mon, 24 May 2021 12:44:47 +0000
D:\Web Work\doc2vec final submission\App\views.py, line 171, in searchresult
model = Doc2Vec.load("d2vmodel.model")
Here is my Django function where I'm loading the Doc2Vec model.
...ANSWER
Answered 2021-May-24 at 13:16
Move the models from App to the root directory of your project (I think it is 'doc2vec final submission'), or create a folder named 'models' inside 'doc2vec final submission'.
Then change the load call so the path points at where the model file actually lives (a sketch follows).
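A hedged sketch of that fix, assuming gensim 4.x (where the doc-vectors live on model.dv) and a hypothetical models/ folder under the project root; the view body is illustrative, not the asker's actual code:

import os
from django.conf import settings
from django.http import HttpResponse
from gensim.models.doc2vec import Doc2Vec

# Hypothetical layout: <project root>/models/d2vmodel.model
MODEL_PATH = os.path.join(settings.BASE_DIR, "models", "d2vmodel.model")

def searchresult(request):
    # Load by absolute path so lookup doesn't depend on the working directory.
    model = Doc2Vec.load(MODEL_PATH)
    abstract = request.GET.get("abstract", "")
    inferred = model.infer_vector(abstract.lower().split())
    similar = model.dv.most_similar([inferred], topn=5)
    return HttpResponse(str(similar))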
QUESTION
I have trained a gensim doc2vec model for an English news recommender system. The model was trained on 40K news items. I am using the code below to recommend the top 5 most similar news items for, e.g., news_1:
...ANSWER
Answered 2021-May-19 at 09:07
There's a bulk contiguous vector structure initially created by training, for the initial known set of vectors. It's amenable to the every-candidate bulk vector calculation at the heart of most_similar() - so that operation goes about as fast as it can, with the right vector libraries for your OS/processor.
But that structure wasn't originally designed with incremental expansion in mind. Indeed, if you have 1 million vectors in a dense array, then want to add 1 to the end, the straightforward approach requires you to allocate a new 1-million-and-1 long array, bulk copy over the 1 million, then add the last 1. That works, but what seems like a "tiny" operation then takes a while, and ever longer as the structure grows. And each add more than doubles the temporary memory usage, for the bulk copy. So the naive pattern of adding a whole bunch of new items individually in a loop can be really slow & memory-intensive.
So, Gensim hasn't yet focused on providing a set-of-vectors that's easy & efficient to incrementally grow with new vectors. But, it's still indirectly possible, if you understand the caveats.
Especially in gensim-4.0.0 & above, the .dv set of doc-vectors is an instance of KeyedVectors with all that class's standard functions. Those include the add_vector() and add_vectors() methods.
You can try these methods to add your new inferred vectors to the model.dv object - and then they'll also be included in follow-up most_similar() results (see the sketch after the caveats below).
But keep in mind:
- The above caveats about performance & memory usage - which may be minor concerns as long as your dataset isn't too large, or manageable if you do additions in occasional larger batches.
- The containing Doc2Vec model generally isn't expecting its internal .dv to be arbitrarily modified or expanded by other code. So, once you start doing that, parts of the model may not behave as expected. If you have problems with this, you could consider saving aside the full Doc2Vec model before any direct tampering with its .dv, and/or only expanding a completely separate instance of the doc-vectors, for example by saving them aside (e.g. model.dv.save(DOC_VECS_FILENAME)) & reloading them into a separate KeyedVectors (e.g. growing_docvecs = KeyedVectors.load(DOC_VECS_FILENAME)).
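A minimal sketch of that incremental-growth pattern, assuming gensim 4.x; the model file name, the new tag, and the token list are placeholders:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("my_doc2vec.model")          # previously trained model (placeholder name)

new_tokens = ["some", "new", "article", "text"]   # placeholder tokens for an unseen doc
new_vec = model.infer_vector(new_tokens)

# Grow the model's doc-vector KeyedVectors; each add re-allocates the backing
# array, so prefer occasional larger batches (add_vectors) where possible.
model.dv.add_vector("new_doc_1", new_vec)

# The added vector now participates in follow-up most_similar() lookups.
print(model.dv.most_similar("new_doc_1", topn=5))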
QUESTION
I have trained a doc2vec (PV-DM) model in gensim
on documents which fall into a few classes. I am working in a non-linguistic setting where both the number of documents and the number of unique words are small (~100 documents, ~100 words) for practical reasons. Each document has perhaps 10k tokens. My goal is to show that the doc2vec embeddings are more predictive of document class than simpler statistics and to explain which words (or perhaps word sequences, etc.) in each document are indicative of class.
I get good performance from a (cross-validated) classifier trained on the embeddings compared to one trained on the other statistics, but I am still unsure of how to connect the results of the classifier to any features of a given document. Is there a standard way to do this? My first inclination was to simply pass the co-learned word embeddings through the document classifier in order to see which words inhabited which classifier-partitioned regions of the embedding space. The document classes output on word embeddings are very consistent across cross-validation splits, which is encouraging, although I don't know how to turn these effective labels into a statement to the effect of "Document X got label Y because of such and such properties of words A, B and C in the document".
Another idea is to look at similarities between word vectors and document vectors. The ordering of similar word vectors is pretty stable across random seeds and hyperparameters, but the output of this sort of labeling does not correspond at all to the output from the previous method.
Thanks for help in advance.
Edit: Here are some clarifying points. The tokens in the "documents" are ordered, and they are measured from a discrete-valued process whose states, I suspect, get their "meaning" from context in the sequence, much like words. There are only a handful of classes, usually between 3 and 5. The documents are given unique tags and the classes are not used for learning the embedding. The embeddings have rather low dimension, always < 100, and are learned over many epochs, since I am only worried about overfitting when the classifier is learned, not the embeddings. For now, I'm using a multinomial logistic regressor for classification, but I'm not married to it. On that note, I've also tried using the normalized regressor coefficients as a vector in the embedding space to which I can compare words, documents, etc.
...ANSWER
Answered 2021-May-18 at 16:20
That's a very small dataset (100 docs) and vocabulary (100 words) compared to much published work with Doc2Vec, which has usually used tens-of-thousands to millions of distinct documents.
That each doc is thousands of words, and that you're using PV-DM mode (which mixes both doc-to-word and word-to-word contexts for training), helps a bit. I'd still expect you might need to use a smaller-than-default dimensionality (vector_size << 100) & more training epochs - but if it does seem to be working for you, great.
You don't mention how many classes you have, nor what classifier algorithm you're using, nor whether known classes are being mixed into the (often unsupervised) Doc2Vec
training mode.
If you're only using known classes as the doc-tags, and your "a few" classes is, say, only 3, then to some extent you only have 3 unique "documents", which you're training on in fragments. Using only "a few" unique doc-tags might be prematurely hiding variety in the data that could be useful to a downstream classifier.
On the other hand, if you're giving each doc a unique ID - the original 'Paragraph Vectors' paper approach, and then you're feeding those to a downstream classifier, that can be OK alone, but may also benefit from adding the known-classes as extra tags, in addition to the per-doc IDs. (And perhaps if you have many classes, those may be OK as the only doc-tags. It can be worth comparing each approach.)
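For illustration only (the token lists, class labels, and tag names are hypothetical), combining a per-doc ID tag with a known-class tag might look like:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder (tokens, class-label) pairs standing in for the real documents.
docs = [(["state_a", "state_b", "state_c"], 0), (["state_c", "state_d"], 1)]

# Each document gets its own unique ID tag plus a shared known-class tag.
tagged = [
    TaggedDocument(words=tokens, tags=["doc_%d" % i, "class_%d" % label])
    for i, (tokens, label) in enumerate(docs)
]

# Small vectors & more epochs, per the sizing advice above (values illustrative).
model = Doc2Vec(tagged, vector_size=16, epochs=100, min_count=1, workers=4)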
I haven't seen specific work on making Doc2Vec
models explainable, other than the observation that when you are using a mode which co-trains both doc- and word- vectors, the doc-vectors & word-vectors have the same sort of useful similarities/neighborhoods/orientations as word-vectors alone tend to have.
You could simply try creating synthetic documents, or tampering with real documents' words via targeted removal/addition of candidate words, or blended mixes of documents with strong/correct classifier predictions, to see how much that changes either (a) their doc-vector, & the nearest other doc-vectors or class-vectors; or (b) the predictions/relative-confidences of any downstream classifier.
(A wishlist feature for Doc2Vec
for a while has been to synthesize a pseudo-document from a doc-vector. See this issue for details, including a link to one partial implementation. While the mere ranked list of such words would be nonsense in natural language, it might give doc-vectors a certain "vividness".)
When you're not using real natural language, some useful things to keep in mind:
- If your 'texts' are really unordered bags-of-tokens, then window may not really be an interesting parameter. Setting it to a very large number can make sense (to essentially put all words in each other's windows), but may not be practical/appropriate given your large docs. Or, try PV-DBOW instead - potentially even mixing known classes & word-tokens in either tags or words.
- The default ns_exponent=0.75 is inherited from word2vec & natural-language corpora, & at least one research paper (linked from the class documentation) suggests that for other applications, especially recommender systems, very different values may help. (A parameter sketch follows this list.)
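A minimal sketch of those parameter choices, assuming gensim; the specific values (window=1000, ns_exponent=0.9) are illustrative guesses, not recommendations:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = [TaggedDocument(words=["state_a", "state_b", "state_c"], tags=["doc_0"])]  # placeholder

model = Doc2Vec(
    tagged,
    dm=0, dbow_words=1,   # PV-DBOW plus skip-gram word training
    window=1000,          # effectively puts all words in each other's windows
    ns_exponent=0.9,      # non-default negative-sampling exponent (illustrative)
    vector_size=16,
    epochs=100,
    min_count=1,
)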
QUESTION
I am implementing a simple doc2vec with gensim, not a word2vec. I need to remove stopwords, without losing the correct order, from a list of lists.
Each list is a document and, as I understood for doc2vec, the model will take as input a list of TaggedDocuments:
model = Doc2Vec(lst_tag_documents, vector_size=5, window=2, min_count=1, workers=4)
ANSWER
Answered 2021-Apr-25 at 12:30
lower is a list of one element, so word not in STOPWORDS will return False. Take the first item in the list by index and split it on blank space.
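For illustration (the stopword set and documents here are placeholders, not the asker's data), stopwords can be removed with list comprehensions that preserve token order, and each cleaned document wrapped in a TaggedDocument:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

STOPWORDS = {"the", "a", "of", "and"}                             # placeholder stopword set
docs = [["the", "cat", "sat"], ["a", "dog", "and", "a", "cat"]]   # placeholder list of lists

# The inner comprehension keeps the surviving tokens in their original order.
filtered = [[w for w in doc if w not in STOPWORDS] for doc in docs]

lst_tag_documents = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(filtered)]
model = Doc2Vec(lst_tag_documents, vector_size=5, window=2, min_count=1, workers=4)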
QUESTION
I have noticed that my gensim Doc2Vec (DBOW) model is sensitive to document tags. My understanding was that these tags are cosmetic and so they should not influence the learned embeddings. Am I misunderstanding something? Here is a minimal example:
...ANSWER
Answered 2021-Apr-20 at 14:02
Have you checked the magnitude of the differences?
Just running:
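A hedged sketch of one way to check the magnitude of differences between doc-vectors from two runs (the file names and keys are assumptions, not the answer's own snippet):

import numpy as np
from gensim.models.doc2vec import Doc2Vec

# Hypothetical: two models trained identically except for their document tags.
model_a = Doc2Vec.load("dbow_tags_a.model")   # placeholder file names
model_b = Doc2Vec.load("dbow_tags_b.model")

vec_a = model_a.dv[0]    # vector of the first document in each model
vec_b = model_b.dv[0]

print("max absolute difference:", np.abs(vec_a - vec_b).max())
cos = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print("cosine similarity:", cos)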
QUESTION
I preprocessed my docs, trained my model, and saved it by following the guidelines given here: https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html
After a period of time, I want to re-train my model with different parameters. However, I don't want to preprocess the docs and create the "train corpus" again because it takes nearly 3 days. Is there a way to easily load the saved model, change the parameters, and train the model with these new parameters, as in the following code:
...ANSWER
Answered 2021-Apr-20 at 18:24
First, note that this section of your current code does nothing with the loaded model, because it's immediately replaced by the new model created by the 2nd line's instantiation of a model from scratch:
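One hedged way to avoid repeating the slow preprocessing (not from the original answer; file names and parameter values are assumptions) is to persist the prepared corpus once, then reload it and train a brand-new model with the changed parameters:

import pickle
from gensim.models.doc2vec import Doc2Vec

# One-time, right after the (3-day) preprocessing step, where train_corpus
# is the list of TaggedDocument objects that preprocessing produced:
with open("train_corpus.pkl", "wb") as f:
    pickle.dump(train_corpus, f)

# Later sessions: reload the corpus instead of re-preprocessing the raw docs.
with open("train_corpus.pkl", "rb") as f:
    train_corpus = pickle.load(f)

# Train a fresh model with the new parameters on the saved corpus.
model = Doc2Vec(vector_size=100, min_count=2, epochs=40, dm=0)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.save("d2v_new_params.model")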
QUESTION
ANSWER
Answered 2021-Apr-16 at 07:51
The problem is the same as in my prior answer to a similar question:
https://stackoverflow.com/a/66976706/130288
Doc2Vec needs far more data to start working. 9 texts, with maybe 55 total words and perhaps around half that many unique words, are far too small to show any interesting results with this algorithm.
A few of Gensim's Doc2Vec-specific test cases & tutorials manage to squeeze some vaguely understandable similarities out of a test dataset (from a file lee_background.cor) that has 300 documents, each of a few hundred words - so tens of thousands of words, several thousand of which are unique. But it still needs to reduce the dimensionality & up the epochs, and the results are still very weak.
If you want to see meaningful results from Doc2Vec, you should be aiming for tens-of-thousands of documents, ideally with each document having dozens or hundreds of words.
Everything short of that is going to be disappointing and not-representative of what sort of tasks the algorithm was designed to work with.
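A minimal sketch of the "reduce the dimensionality & up the epochs" adjustment for a small corpus (the corpus here is a placeholder and the values are illustrative, not tuned):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# train_corpus stands in for a few hundred real TaggedDocuments
# (e.g. the ~300 docs of lee_background.cor used in the gensim tutorial).
train_corpus = [TaggedDocument(words=["sample", "words", "here"], tags=[str(i)]) for i in range(300)]

model = Doc2Vec(
    train_corpus,
    vector_size=50,   # reduced dimensionality for a small corpus
    min_count=2,
    epochs=40,        # many more passes than the default
)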
There's a tutorial using a larger movie-review dataset (100K documents) that was also used in the original 'Paragraph Vector' paper at:
There's a tutorial based on Wikipedia (millions of documents) that might need some fixup to work nowadays at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
QUESTION
I use gensim 4.0.1 and train doc2vec:
...ANSWER
Answered 2021-Apr-06 at 21:30
Five dimensions is still too many for a toy-sized dataset of just 6 words, 6 unique words, and 3 2-word texts.
None of the Word2Vec/Doc2Vec/FastText-type algorithms works well on tiny amounts of contrived data. They only learn their patterns from many, subtly contrasting usages of words in varied contexts.
Their real strengths only emerge with vectors that are 50, 100, or hundreds of dimensions wide - and training that many dimensions requires a unique vocabulary of (at least) many thousands of words – ideally tens or hundreds of thousands of words – with many usage examples of each. (For a variant like Doc2Vec, you'd similarly want many thousands of varied documents.)
You'll see improved correlations with expected results when using sufficient training data.
QUESTION
I've built a Doc2Vec model with around 3M documents, and now I want to compare it to another model I've previously built. The second model has been scaled to 0-1, so I now also want to scale the gensim model to the same range so that they are comparable. This is my first time using gensim, so I'm not sure how this is done. It's nothing fancy, but this is the code I have so far (model generation code omitted). I thought about scaling (min-max scaling with max/min over the union of vectors) the inferred vectors (v1 and v2), but I don't think this would be the correct approach. The idea here is to compare two documents (with tokens likely to be in the corpus) and output a similarity score between them. I've seen a few of Gensim's tutorials, and they often compare a single string to the corpus' documents, which is not really the idea here.
...ANSWER
Answered 2021-Apr-01 at 17:27
Note that 'cosine similarity' & 'cosine distance' are different things.
A cosine-similarity can range from -1.0 to 1.0 – but in some models, such as those based only on positive word counts, you might only practically see values from 0.0 to 1.0. But in both cases, items with similarities close to 1.0 are most-similar.
On the other hand, a cosine-distance can range from 0.0 to 2.0, and items with a distance of 0.0 are least-distant (or nearest). A cosine-distance can be larger than 1.0 - but you might only see such distances in models which use the dense coordinate space (like Doc2Vec), not in word-count models, which leave half the coordinate space (the negative coordinates) empty.
So: you shouldn't really be calling your function similarity if it's returning a distance, and if it's now returning surprising numbers over 1.0, there's nothing wrong: that's possible in some models, but not others.
You could naively rescale the 0.0 to 2.0 distances that your calculation will get with Doc2Vec vectors, with some crude hammer like:
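As a hedged sketch of that kind of crude rescaling (an illustration, not the answer's exact snippet), the 0.0-2.0 cosine distance can be mapped onto a 0-1 score:

import numpy as np

def scaled_similarity(vec1, vec2):
    # Cosine similarity lies in [-1.0, 1.0]; cosine distance = 1 - similarity, in [0.0, 2.0].
    cos_sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
    cos_dist = 1.0 - cos_sim
    # Crude rescale: distance 0.0 -> score 1.0, distance 2.0 -> score 0.0.
    return 1.0 - (cos_dist / 2.0)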
QUESTION
I am building a Doc2Vec model with 1000 documents using Gensim. Each document consists of several sentences, which include multiple words.
Example)
Doc1: [[word1, word2, word3], [word4, word5, word6, word7],[word8, word9, word10]]
Doc2: [[word7, word3, word1, word2], [word1, word5, word6, word10]]
Initially, to train the Doc2Vec, I first split the sentences and tagged each sentence with the same document tag using "TaggedDocument". As a result, I got the final training input for Doc2Vec as follows:
TaggedDocument(words=[word1, word2, word3], tags=['Doc1'])
TaggedDocument(words=[word4, word5, word6, word7], tags=['Doc1'])
TaggedDocument(words=[word8, word9, word10], tags=['Doc1'])
TaggedDocument(words=[word7, word3, word1, word2], tags=['Doc2'])
TaggedDocument(words=[word1, word5, word6, word10], tags=['Doc2'])
However, would it be okay to train the model with the document as a whole without splitting sentences?
TaggedDocument(words=[word1, word2, word3,word4, word5, word6, word7,word8, word9, word10], tags=['Doc1'])
TaggedDocument(words=[word4, word5, word6, word7,word1, word5, word6, word10], tags=['Doc2'])
Thank you in advance :)
...ANSWER
Answered 2021-Mar-17 at 04:11
Both approaches are going to be very similar in their effect.
The slight difference is that in PV-DM modes (dm=1), or PV-DBOW with added skip-gram training (dm=0, dbow_words=1), if you split by sentence, words in different sentences will never be within the same context window.
For example, your 'Doc1' words 'word3' and 'word4' would never be averaged together in the same PV-DM context-window average, nor be used to PV-DBOW skip-gram predict each other, if you split by sentences. If you just run the whole doc's words together into a single TaggedDocument example, they would interact more, via appearing in shared context windows.
Whether one or the other is better for your purposes is something you'd have to evaluate in your own analysis - it could depend a lot on the nature of the data & desired similarity results.
But, I can say that your second option, all the words in one TaggedDocument, is the more common/traditional approach.
(That is, as long as the document is still no more than 10,000 tokens long. If longer, splitting the doc's words into multiple TaggedDocument instances, each with the same tags, is a common workaround for an internal 10,000-token implementation limit.)
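A hedged sketch of that workaround for overlong documents (the chunking helper, tag name, and placeholder document are illustrative):

from gensim.models.doc2vec import TaggedDocument

MAX_TOKENS = 10000   # gensim's internal per-TaggedDocument token limit

def to_tagged_chunks(words, doc_tag):
    # One TaggedDocument per 10,000-token chunk, all sharing the same tag,
    # so every chunk contributes to training the same doc-vector.
    return [
        TaggedDocument(words=words[i:i + MAX_TOKENS], tags=[doc_tag])
        for i in range(0, len(words), MAX_TOKENS)
    ]

long_doc_words = ["word"] * 25000                        # placeholder long document
training_examples = to_tagged_chunks(long_doc_words, "Doc1")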
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install doc2vec
You can use doc2vec like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.