doc2vec | document similarity based on gensim doc2vec | Topic Modeling library
kandi X-RAY | doc2vec Summary
document similarity based on gensim doc2vec.
Top functions reviewed by kandi - BETA
- Return a list of TaggedDocument objects.
- Predict the most similar documents for a sentence.
- Train the Doc2Vec model.
- Cut text into tokens, removing stop words.
doc2vec Key Features
doc2vec Examples and Code Snippets
Community Discussions
Trending Discussions on doc2vec
QUESTION
I have been trying to save a movie recommendation model from GitHub to then serve it using tf-serving. The code below will first create a list of tags from my corpus and then provide me with vectors based on those lists.
...ANSWER
Answered 2022-Feb-24 at 18:20
I wouldn't expect the tf.keras.models.save_model() function – which sounds from its naming to be specific to TensorFlow & Keras – to work on a Gensim Doc2Vec model, which is not part of, or related to, or built upon either TensorFlow or Keras.
Looking at the docs for save_model(), I see its declared functionality is:
Saves a model as a TensorFlow SavedModel or HDF5 file.
Neither "TensorFlow SavedModel" nor "HDF5 file" should be expected as sufficient formats to save another project's custom model (in this case a Gensim Doc2Vec
object), unless it specifically claimed that as a capability. So some sort of failure or error here is expected behavior.
If your real goal is to simply be able to re-load the model later, don't involve TensorFlow/Keras at all. You could either:
- use Python's internal pickle mechanism, or
- use the .save(fname) method native to model classes in the Gensim package, which uses its own pickle-and-numpy-based save format.
For example:
QUESTION
I'm evaluating Doc2Vec for a recommender API. I wasn't able to find a decent pre-trained model, so I trained a model on the corpus, which is about 8,000 small documents.
...ANSWER
Answered 2022-Feb-11 at 18:11
Without seeing your training code, there could easily be errors in text prep & training. Many online code examples are bonkers wrong in their Doc2Vec training technique!
Note that min_count=1 is essentially always a bad idea with this sort of algorithm: any example suggesting that was likely from a misguided author.
Is a mere .split() also the only tokenization applied for training? (The inference list-of-tokens should be prepped the same as the training lists-of-tokens.)
How was "not very good" and "oddly even worse" evaluated? For example, did the results seem arbitrary, or in-the-right-direction-but-just-weak?
"8,000 small documents" is a bit on the thin side for a training corpus, but it somewhat depends on "how small" – a few words, a sentence, a few sentences? Moving to smaller vectors, or more training epochs, can sometimes make the best of a smallish training set - but this sort of algorithm works best with lots of data, such that dense 100d-or-more vectors can be trained.
QUESTION
I have a large matrix of document similarity created with paragraph2vec_similarity in the doc2vec package. I converted it to a data frame and added a TITLE column to the beginning to later sort or group it.
Current Dummy Output:
Title  Header              DocName_1900.txt_1  DocName_1900.txt_2  DocName_1900.txt_3  DocName_1901.txt_1  DocName_1901.txt_2
Doc1   DocName_1900.txt_1  1.000000            0.7369358           0.6418045           0.6268959           0.6823404
Doc1   DocName_1900.txt_2  0.7369358           1.000000            0.6544884           0.7418507           0.5174367
Doc1   DocName_1900.txt_3  0.6418045           0.6544884           1.000000            0.6180578           0.5274650
Doc2   DocName_1901.txt_1  0.6268959           0.7418507           0.6180578           1.000000            0.5755243
Doc2   DocName_1901.txt_2  0.6823404           0.5174367           0.5274650           0.5755243           1.000000
What I want is a data frame giving similarity in consecutive order for each following document. That is, the score for Doc1.1 and Doc1.2; and Doc1.2 and Doc1.3. Because I am only interested in similarity scores inside each individual document -- in diagonal order as shown above.
Expected Output
Title  Similarity for 1-2  Similarity for 2-3  Similarity for 3-4
Doc1   0.7369358           0.6544884           NA
Doc2   0.5755243           NA                  NA
Doc3   0.6049844           0.5250659           0.5113757
I was able to produce one giving the similarity scores of one doc with all the remaining docs using x <- data.frame(col = colnames(m)[col(m)], row = rownames(m)[row(last)], similarity = c(m)). This is the closest I could get. Is there a better way? I am dealing with more than 500 titles of varying lengths. There is still the option of using diag, but it gets everything to the end of the matrix and I lose the document grouping.
ANSWER
Answered 2022-Feb-05 at 18:00
If I understood your problem correctly, one possible solution within the tidyverse is to make the data long, remove the leading letters from Title and Header, split them on the dot, and filter by comparing the results. Finally, a new column is generated to serve as column names; after this the data is made wide again:
QUESTION
As you might know, when you make a Doc2Vec model, you might do model.build_vocab(corpus_file='...') first, then model.train(corpus_file='...', total_examples=..., total_words=..., epochs=10).
I am making the model with a huge Wikipedia data file, so I have to specify 'total_examples' and 'total_words' as parameters of train(). Gensim's tutorial says that I can get the first one as total_examples=model.corpus_count. This is fine. But I don't know how to get the second one, total_words. I can see the number of total words in the last log line from model.build_vocab(), as below. So I directly put in the number, like total_words=1304592715, but I'd like to obtain it in the same manner as model.corpus_count.
Can someone tell me how to obtain the number?
Thank you,
ANSWER
Answered 2022-Jan-30 at 17:45
Similar to model.corpus_count, the tally of words from the last corpus provided to .build_vocab() should be cached in the model as model.corpus_total_words.
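For instance, a brief sketch (the corpus file name and other settings are hypothetical):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=200, workers=8)          # illustrative settings
model.build_vocab(corpus_file="wiki_corpus.txt")     # hypothetical file name

model.train(
    corpus_file="wiki_corpus.txt",
    total_examples=model.corpus_count,        # cached by build_vocab()
    total_words=model.corpus_total_words,     # also cached by build_vocab()
    epochs=10,
)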
QUESTION
I want to use an already trained Doc2Vec from a published paper.
Paper
Whalen, R., Lungeanu, A., DeChurch, L., & Contractor, N. (2020). Patent Similarity Data and Innovation Metrics. Journal of Empirical Legal Studies, 17(3), 615–639. https://doi.org/10.1111/jels.12261
Code
Data
However, when trying to load the model (patent_doc2v_10e.model) an error is raised. Edit: The file can be downloaded from the data repository (link above). I am not the author of the paper nor the creator of the model.
...ANSWER
Answered 2022-Jan-17 at 18:37
Where did the file patent_doc2v_10e.model come from?
If trying to load that file generates such an error about another file with the name patent_doc2v_10e.model.trainables.syn1neg.npy, then that other file is a necessary part of the full model that should have been created alongside patent_doc2v_10e.model when that file was first .save()-persisted to disk.
You'll need to go back to where patent_doc2v_10e.model was created, & find the extra missing patent_doc2v_10e.model.trainables.syn1neg.npy file (& possibly others also starting with patent_doc2v_10e.model…). All such files created by the same .save() must be kept/moved together, at the same filesystem path, for any future .load() to succeed.
(Additionally, if you are training these yourself from original data, I'd suggest being sure to use a current version of Gensim. Only older, pre-4.0 versions will create any save files with trainables in the name.)
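Assuming all the companion files are restored to one directory, loading would then simply be (a sketch, not code from the paper's repository):

from gensim.models.doc2vec import Doc2Vec

# These must sit side-by-side in the working directory:
#   patent_doc2v_10e.model
#   patent_doc2v_10e.model.trainables.syn1neg.npy
#   ...plus any other files whose names start with patent_doc2v_10e.model
model = Doc2Vec.load("patent_doc2v_10e.model")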
QUESTION
I've got a dataset of job listings with about 150,000 records. I extracted skills from the descriptions with NER, using a dictionary of 30,000 skills. Every skill is represented as a unique identifier.
My data example:
...ANSWER
Answered 2021-Dec-14 at 20:14
If your gold standard of what the model should report is skills that appeared in the training data, are you sure you don't want a simple count-based solution? For example, just provide a ranked list of the skills that appear most often in Director Of Commercial Operations listings?
On the other hand, the essence of compressing N job titles, and 30,000 skills, into a smaller (in this case vector_size=80) coordinate-space model is to force some non-intuitive (but perhaps real) relationships to be reflected in the model.
Might there be some real pattern in the model – even if, perhaps, just some idiosyncrasies in the appearance of less-common skills – that makes aeration necessarily slot near those other skills? (Maybe it's a rare skill whose few contextual appearances co-occur with other skills very much near 'capacity utilization', meaning that with the tiny amount of data available, & the tiny amount of overall attention given to this skill, there's no better place for it.)
Taking note of whether your 'anomalies' are often in low-frequency skills, or lower-frequency job-ids, might enable a closer look at the data causes, or some disclaimering/filtering of most_similar() results. (The most_similar() method can limit its returned rankings to the more frequent range of the known vocabulary, for cases when the long-tail of rare words, with their rougher vectors, intrudes on higher-quality results from better-represented words. See the restrict_vocab parameter.)
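As an illustration of that restrict_vocab idea (a hedged sketch: 'model' stands for the asker's trained Doc2Vec, and the skill identifier and cutoff are placeholders):

# Limit rankings to the 10,000 most-frequent skills, so rare skills with
# rougher vectors don't intrude on the results.
similar_skills = model.wv.most_similar("aeration", topn=10, restrict_vocab=10000)
for skill, score in similar_skills:
    print(skill, round(score, 3))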
That said, tinkering with training parameters may result in rankings that better reflect your intent. A larger min_count might remove more tokens that, lacking sufficient varied examples, mostly just inject noise into the rest of training. A different vector_size, smaller or larger, might better capture the relationships you're looking for. A more-aggressive (smaller) sample could discard more high-frequency words that might be starving more-interesting less-frequent words of a chance to influence the model.
Note that with dbow_words=1 & a large window, and records with (perhaps?) dozens of skills each, the words are having a much-more neighborly effect on each other, in the model, than the tag<->word correlations. That might be good or bad.
QUESTION
I am building my own vocabulary to measure document similarity. I also attached the log of the run.
...ANSWER
Answered 2021-Dec-01 at 22:11
Generally, due to threading-contention inherent to both the Python 'Global Interpreter Lock' ('GIL') and the default Gensim master-reader-thread, many-worker-thread approach, the training can't keep all cores mostly-busy with separate threads once you get past about 8-16 cores.
If you can accept that the only tag for each text will be its ordinal number in the corpus, the alternate corpus_file method of specifying the training data allows arbitrarily many threads to each open their own readers into the (whitespace-token-delimited) plain-text corpus file, achieving much higher core utilization when you have 16+ cores/workers.
See the Gensim docs for the corpus_file parameter:
https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec
Note though there are some unsolved bugs that hint this mode might mishandle or miss training data at some segmentation boundaries. (This may not be significant in large training data.)
Otherwise, other tweaks to parameters that help Word2Vec/Doc2Vec training run faster may be worth trying, such as altered window, vector_size, or negative values. (Though note that, counterintuitively, when the bottleneck is thread-contention in Gensim's default corpus-iterable mode, some values of these parameters that normally require more computation, and thus imply slower training, manage to mainly soak up previously-idle contention time, and are thus comparatively 'free'. So when suffering contention, trying more-expensive values for window/negative/vector_size may become more practical.)
Generally, a higher min_count (discarding more rare words), or a more-aggressive (smaller) sample value (discarding more of the overrepresented high-frequency words), can also reduce the amount of raw training happening and thus finish training faster, with minimal effect on quality. (Sometimes, more-aggressive sample values manage to both speed training & improve results on downstream evaluations, by letting the model spend relatively more time on rarer words that are still important downstream.)
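Putting those throughput-oriented suggestions together, a rough sketch (the file name and every value are illustrative only):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    corpus_file="corpus.txt",   # hypothetical: whitespace-tokenized plain text, one doc per line
    vector_size=100,
    min_count=10,               # discard rarer words lacking varied examples
    sample=1e-5,                # more-aggressive down-sampling of frequent words
    workers=32,                 # corpus_file mode can keep many workers busy
    epochs=10,
)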
QUESTION
I have a machine-learning classification task that trains from the concatenation of various fixed-length vector representations. How can I perform automatic feature selection, grid search, or any other established technique in scikit-learn to find the best combination of transformers for my data?
Take this text classification flow as an example:
...ANSWER
Answered 2021-Nov-05 at 18:31
While not quite able to "choose the best (all or nothing) transformer subset of features", we can use scikit's feature selection or dimensionality reduction modules to "choose/simplify the best feature subset across ALL transformers" as an extra step before classification:
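For instance, a minimal sketch (not the original answer's code; the transformers, selector, and parameter values are placeholders) of adding a feature-selection step after a FeatureUnion of transformers:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer()),
        ("counts", CountVectorizer(ngram_range=(1, 2))),
    ])),
    ("select", SelectKBest(chi2)),              # keep only the k strongest features
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid-search the number of kept features alongside any other hyperparameters.
search = GridSearchCV(pipe, {"select__k": [100, 500, 1000]}, cv=3)
# search.fit(texts, labels)   # texts: list of raw strings, labels: class labels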
QUESTION
I am new to NLP and Doc2Vec. I want to understand the parameters of Doc2Vec. Thank you.
...ANSWER
Answered 2021-Oct-29 at 02:30
As a beginner, only vector_size will be of initial interest.
Typical values are 100-1000, but larger dimensionalities require far more training data & more memory. There are no hard & fast rules – try different values, & see what works for your purposes.
Very vaguely, you'll want your count of unique vocabulary words to be much larger than the vector_size, at least the square of the vector_size: the gist of the algorithm is to force many words into a smaller number of dimensions. (If for some reason you're running experiments on tiny amounts of data with a tiny vocabulary – for which word2vec isn't really good anyway – you'll have to shrink the vector_size very low.)
The negative value controls a detail of how the internal neural network is adjusted: how many random 'noise' words the network is tuned away from predicting, for each target positive word it's tuned towards predicting. The default of 5 is good unless/until you have a repeatable way to rigorously score other values against it.
Similarly, sample controls how much (if at all) more-frequent words are sometimes randomly skipped (down-sampled). (So many redundant usage examples are overkill, wasting training time/effort that could better be spent on rarer words.) Again, you'd only want to tinker with this if you've got a way to compare the results of alternate values. Smaller values make the downsampling more aggressive (dropping more words); sample=0 would turn off such down-sampling completely, leaving all training-text words used.
Though you didn't ask:
dm=0 turns off the default PV-DM mode in favor of the PV-DBOW mode. That will train doc-vectors a bit faster, and often works very well on short texts, but won't train word-vectors at all (unless you turn on an extra dbow_words=1 mode to add back interleaved skip-gram word-vector training).
hs is an alternate mode to train the neural network that uses multi-node encodings of words, rather than one node per (positive or negative) word. If enabled via hs=1, you should disable the negative-sampling with negative=0. But negative-sampling mode is the default for a reason, & tends to get relatively better with larger amounts of training data – so it's rare to use this mode.
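Pulling those parameters together, a hedged sketch of how they appear when constructing the model (the values are purely illustrative, not recommendations):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    vector_size=100,   # dimensionality of the doc-vectors
    negative=5,        # 'noise' words per positive example (the default)
    sample=1e-4,       # down-sampling threshold for very frequent words
    dm=1,              # 1 = PV-DM (default); dm=0 switches to PV-DBOW
    dbow_words=0,      # with dm=0, set to 1 to also train skip-gram word-vectors
    hs=0,              # hierarchical softmax off; negative-sampling stays on
    epochs=20,
)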
QUESTION
I am new to NLP and Doc2Vec. I noted that some websites train Doc2Vec by shuffling the training data in each epoch (option 1), while other websites use option 2, where there is no shuffling of the training data.
What is the difference? Also, how do I select the optimal alpha? Thank you.
...ANSWER
Answered 2021-Oct-28 at 16:19
If your corpus might have some major difference-in-character between early & late documents – such as certain words/topics that are all front-loaded to early docs, or all back-loaded in later docs – then performing one shuffle up-front to eliminate any such pattern may help a little. It's not strictly necessary & its effects on end results will likely be small.
Re-shuffling between every training pass is not common & I wouldn't expect it to offer a detectable benefit justifying its cost/code-complexity.
Regarding your "Option 1" vs "Option 2": Don't call train() multiple times in your own loop unless you're an expert who knows exactly why you're doing that. (And: any online example suggesting that is often a poor/buggy one.)
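As a hedged sketch of the usual pattern (toy corpus, illustrative values): a single build_vocab()/train() pair manages all epochs and the alpha decay internally, with no manual loop or alpha adjustment.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["some", "example", "tokens"], tags=[0]),
          TaggedDocument(words=["other", "example", "tokens"], tags=[1])]

model = Doc2Vec(vector_size=50, min_count=1, epochs=20)   # min_count=1 only for this toy corpus
model.build_vocab(corpus)
# One train() call; the default alpha-to-min_alpha decay is handled internally.
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)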
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install doc2vec
You can use doc2vec like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.