doc2vec | document similarity based on gensim doc2vec | Topic Modeling library
kandi X-RAY | doc2vec Summary
document similarity based on gensim doc2vec.
Top functions reviewed by kandi - BETA
- Return a list of TaggedDocument objects.
- Predict the most similar documents for a sentence.
- Train the Doc2Vec model.
- Cut text into tokens, removing stop words.
doc2vec Key Features
doc2vec Examples and Code Snippets
Community Discussions
Trending Discussions on doc2vec
QUESTION
I have been trying to save a movie recommendation model from GitHub to then serve it using tf-serving. The code below will first create a list of tags from my corpus and then provide me with vectors based on those lists.
...ANSWER
Answered 2022-Feb-24 at 18:20
I wouldn't expect the tf.keras.models.save_model() function – which sounds from its naming to be specific to TensorFlow & Keras – to work on a Gensim Doc2Vec model, which is not part of, or related to, or built upon either TensorFlow or Keras.
Looking at the docs for save_model(), I see its declared functionality is:
Saves a model as a TensorFlow SavedModel or HDF5 file.
Neither "TensorFlow SavedModel" nor "HDF5 file" should be expected as sufficient formats to save another project's custom model (in this case a Gensim Doc2Vec
object), unless it specifically claimed that as a capability. So some sort of failure or error here is expected behavior.
If your real goal is to simply be able to re-load the model later, don't involve TensorFlow/Keras at all. You could either:
- use Python's internal pickle mechanism, or
- use the .save(fname) method native to model classes in the Gensim package, which uses its own pickle-and-numpy-based save format.
For example:
QUESTION
I'm evaluating Doc2Vec for a recommender API. I wasn't able to find a decent pre-trained model, so I trained a model on the corpus, which is about 8,000 small documents.
...ANSWER
Answered 2022-Feb-11 at 18:11
Without seeing your training code, there could easily be errors in text prep & training. Many online code examples are bonkers wrong in their Doc2Vec training technique!
Note that min_count=1 is essentially always a bad idea with this sort of algorithm: any example suggesting that was likely from a misguided author.
Is a mere .split() also the only tokenization applied for training? (The inference list-of-tokens should be prepped the same as the training lists-of-tokens.)
How was "not very good" and "oddly even worse" evaluated? For example, did the results seem arbitrary, or in-the-right-direction-but-just-weak?
"8,000 small documents" is a bit on the thin side for a training corpus, but it somewhat depends on "how small" – a few words, a sentence, a few sentences? Moving to smaller vectors, or more training epochs, can sometimes make the best of a smallish training set - but this sort of algorithm works best with lots of data, such that dense 100d-or-more vectors can be trained.
QUESTION
I have a large matrix of document similarity created with paragraph2vec_similarity in the doc2vec package. I converted it to a data frame and added a TITLE column to the beginning to later sort or group it.
Current Dummy Output:
Title  Header              DocName_1900.txt_1  DocName_1900.txt_2  DocName_1900.txt_3  DocName_1901.txt_1  DocName_1901.txt_2
Doc1   DocName_1900.txt_1  1.000000            0.7369358           0.6418045           0.6268959           0.6823404
Doc1   DocName_1900.txt_2  0.7369358           1.000000            0.6544884           0.7418507           0.5174367
Doc1   DocName_1900.txt_3  0.6418045           0.6544884           1.000000            0.6180578           0.5274650
Doc2   DocName_1901.txt_1  0.6268959           0.7418507           0.6180578           1.000000            0.5755243
Doc2   DocName_1901.txt_2  0.6823404           0.5174367           0.5274650           0.5755243           1.000000
What I want is a data frame giving similarity in consecutive order for each following document. That is, the score for Doc1.1 and Doc1.2; and Doc1.2 and Doc1.3. Because I am only interested in similarity scores inside each individual document -- in diagonal order as shown above.
Expected Output
Title  Similarity for 1-2  Similarity for 2-3  Similarity for 3-4
Doc1   0.7369358           0.6544884           NA
Doc2   0.5755243           NA                  NA
Doc3   0.6049844           0.5250659           0.5113757
I was able to produce one giving the similarity scores of one doc with all the remaining docs using x <- data.frame(col = colnames(m)[col(m)], row = rownames(m)[row(last)], similarity = c(m)). This is the closest I could get. Is there a better way? I am dealing with more than 500 titles of varying lengths. There is still the option of using diag, but it gets everything to the end of the matrix and I lose the document grouping.
ANSWER
Answered 2022-Feb-05 at 18:00
If I understood your problem correctly, one possible solution within the tidyverse is to make the data long, remove the leading letters from Title and Header, split them on the dot, and filter by comparing the results. Finally, a new column is generated to serve as column names; after this the data is made wide again:
QUESTION
As you might know, when you make a Doc2Vec model, you might do model.build_vocab(corpus_file='...') first, then model.train(corpus_file='...', total_examples=..., total_words=..., epochs=10).
I am making the model with a huge Wikipedia data file, so I have to specify 'total_examples' and 'total_words' as parameters of train(). Gensim's tutorial says that I can get the first one as total_examples=model.corpus_count. This is fine. But I don't know how to get the second one, total_words. I can see the number of total words in the last log line from model.build_vocab(), as below. So I directly put in the number, like total_words=1304592715, but I'd like to obtain it in the same manner as model.corpus_count.
Can someone tell me how to obtain the number?
Thank you,
ANSWER
Answered 2022-Jan-30 at 17:45
Similar to model.corpus_count, the tally of words from the last corpus provided to .build_vocab() should be cached in the model as model.corpus_total_words.
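For instance, a brief sketch (the corpus file name and other settings are hypothetical):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=200, workers=8)          # illustrative settings
model.build_vocab(corpus_file="wiki_corpus.txt")     # hypothetical file name

model.train(
    corpus_file="wiki_corpus.txt",
    total_examples=model.corpus_count,        # cached by build_vocab()
    total_words=model.corpus_total_words,     # also cached by build_vocab()
    epochs=10,
)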
QUESTION
I want to use an already trained Doc2Vec from a published paper.
Paper
Whalen, R., Lungeanu, A., DeChurch, L., & Contractor, N. (2020). Patent Similarity Data and Innovation Metrics. Journal of Empirical Legal Studies, 17(3), 615–639. https://doi.org/10.1111/jels.12261
Code
Data
However, when trying to load the model (patent_doc2v_10e.model) an error is raised. Edit: The file can be downloaded from the data repository (link above). I am not the author of the paper nor the creator of the model.
...ANSWER
Answered 2022-Jan-17 at 18:37
Where did the file patent_doc2v_10e.model come from?
If trying to load that file generates such an error about another file with the name patent_doc2v_10e.model.trainables.syn1neg.npy, then that other file is a necessary part of the full model that should have been created alongside patent_doc2v_10e.model when that file was first .save()-persisted to disk.
You'll need to go back to where patent_doc2v_10e.model was created, & find the extra missing patent_doc2v_10e.model.trainables.syn1neg.npy file (& possibly others also starting with patent_doc2v_10e.model…). All such files created by the same .save() must be kept/moved together, at the same filesystem path, for any future .load() to succeed.
(Additionally, if you are training these yourself from original data, I'd suggest being sure to use a current version of Gensim. Only older, pre-4.0 versions will create any save files with trainables in the name.)
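Assuming all the companion files are restored to one directory, loading would then simply be (a sketch, not code from the paper's repository):

from gensim.models.doc2vec import Doc2Vec

# These must sit side-by-side in the working directory:
#   patent_doc2v_10e.model
#   patent_doc2v_10e.model.trainables.syn1neg.npy
#   ...plus any other files whose names start with patent_doc2v_10e.model
model = Doc2Vec.load("patent_doc2v_10e.model")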
QUESTION
I've got a dataset of job listings with about 150,000 records. I extracted skills from the descriptions with NER, using a dictionary of 30,000 skills. Every skill is represented as a unique identifier.
My data example:
...ANSWER
Answered 2021-Dec-14 at 20:14
If your gold standard of what the model should report is skills that appeared in the training data, are you sure you don't want a simple count-based solution? For example, just provide a ranked list of the skills that appear most often in Director Of Commercial Operations listings?
On the other hand, the essence of compressing N job titles, and 30,000 skills, into a smaller (in this case vector_size=80) coordinate-space model is to force some non-intuitive (but perhaps real) relationships to be reflected in the model.
Might there be some real pattern in the model – even if, perhaps, just some idiosyncrasies in the appearance of less-common skills – that makes aeration necessarily slot near those other skills? (Maybe it's a rare skill whose few contextual appearances co-occur with other skills very much near 'capacity utilization', meaning that with the tiny amount of data available, & the tiny amount of overall attention given to this skill, there's no better place for it.)
Taking note of whether your 'anomalies' are often in low-frequency skills, or lower-frequency job-ids, might enable a closer look at the data causes, or some disclaimering/filtering of most_similar() results. (The most_similar() method can limit its returned rankings to the more frequent range of the known vocabulary, for cases when the long-tail of rare words, with their rougher vectors, intrudes on higher-quality results from better-represented words. See the restrict_vocab parameter.)
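As an illustration of that restrict_vocab idea (a hedged sketch: 'model' stands for the asker's trained Doc2Vec, and the skill identifier and cutoff are placeholders):

# Limit rankings to the 10,000 most-frequent skills, so rare skills with
# rougher vectors don't intrude on the results.
similar_skills = model.wv.most_similar("aeration", topn=10, restrict_vocab=10000)
for skill, score in similar_skills:
    print(skill, round(score, 3))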
That said, tinkering with training parameters may result in rankings that better reflect your intent. A larger min_count might remove more tokens that, lacking sufficient varied examples, mostly just inject noise into the rest of training. A different vector_size, smaller or larger, might better capture the relationships you're looking for. A more-aggressive (smaller) sample could discard more high-frequency words that might be starving more-interesting less-frequent words of a chance to influence the model.
Note that with dbow_words=1 & a large window, and records with (perhaps?) dozens of skills each, the words are having a much-more neighborly effect on each other, in the model, than the tag<->word correlations. That might be good or bad.
QUESTION
I am building my own vocabulary to measure document similarity. I also attached the log of the run.
...ANSWER
Answered 2021-Dec-01 at 22:11
Generally, due to threading-contention inherent to both the Python 'Global Interpreter Lock' ('GIL') and the default Gensim master-reader-thread, many-worker-thread approach, the training can't keep all cores mostly-busy with separate threads once you get past about 8-16 cores.
If you can accept that the only tag for each text will be its ordinal number in the corpus, the alternate corpus_file method of specifying the training data allows arbitrarily many threads to each open their own readers into the (whitespace-token-delimited) plain-text corpus file, achieving much higher core utilization when you have 16+ cores/workers.
See the Gensim docs for the corpus_file parameter:
https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec
Note though there are some unsolved bugs that hint this mode might mishandle or miss training data at some segmentation boundaries. (This may not be significant in large training data.)
Otherwise, other tweaks to parameters that help Word2Vec/Doc2Vec training run faster may be worth trying, such as altered window, vector_size, or negative values. (Though note that, counterintuitively, when the bottleneck is thread-contention in Gensim's default corpus-iterable mode, some values of these parameters that normally require more computation, and thus imply slower training, manage to mainly soak up previously-idle contention time, and are thus comparatively 'free'. So when suffering contention, trying more-expensive values for window/negative/vector_size may become more practical.)
Generally, a higher min_count (discarding more rare words), or a more-aggressive (smaller) sample value (discarding more of the overrepresented high-frequency words), can also reduce the amount of raw training happening and thus finish training faster, with minimal effect on quality. (Sometimes, more-aggressive sample values manage to both speed training & improve results on downstream evaluations, by letting the model spend relatively more time on rarer words that are still important downstream.)
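Putting those throughput-oriented suggestions together, a rough sketch (the file name and every value are illustrative only):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    corpus_file="corpus.txt",   # hypothetical: whitespace-tokenized plain text, one doc per line
    vector_size=100,
    min_count=10,               # discard rarer words lacking varied examples
    sample=1e-5,                # more-aggressive down-sampling of frequent words
    workers=32,                 # corpus_file mode can keep many workers busy
    epochs=10,
)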
QUESTION
I have a machine-learning classification task that trains from the concatenation of various fixed-length vector representations. How can I perform automatic feature selection, grid search, or any other established technique in scikit-learn to find the best combination of transformers for my data?
Take this text classification flow as an example:
...ANSWER
Answered 2021-Nov-05 at 18:31
While not quite able to "choose the best (all or nothing) transformer subset of features", we can use scikit's feature selection or dimensionality reduction modules to "choose/simplify the best feature subset across ALL transformers" as an extra step before classification:
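For instance, a minimal sketch (not the original answer's code; the transformers, selector, and parameter values are placeholders) of adding a feature-selection step after a FeatureUnion of transformers:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer()),
        ("counts", CountVectorizer(ngram_range=(1, 2))),
    ])),
    ("select", SelectKBest(chi2)),              # keep only the k strongest features
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid-search the number of kept features alongside any other hyperparameters.
search = GridSearchCV(pipe, {"select__k": [100, 500, 1000]}, cv=3)
# search.fit(texts, labels)   # texts: list of raw strings, labels: class labels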
QUESTION
I am new to NLP and Doc2Vec. I want to understand the parameters of Doc2Vec. Thank you.
...ANSWER
Answered 2021-Oct-29 at 02:30
As a beginner, only vector_size will be of initial interest.
Typical values are 100-1000, but larger dimensionalities require far more training data & more memory. There are no hard & fast rules – try different values, & see what works for your purposes.
Very vaguely, you'll want your count of unique vocabulary words to be much larger than the vector_size, at least the square of the vector_size: the gist of the algorithm is to force many words into a smaller number of dimensions. (If for some reason you're running experiments on tiny amounts of data with a tiny vocabulary – for which word2vec isn't really good anyway – you'll have to shrink the vector_size very low.)
The negative value controls a detail of how the internal neural network is adjusted: how many random 'noise' words the network is tuned away from predicting, for each target positive word it's tuned towards predicting. The default of 5 is good unless/until you have a repeatable way to rigorously score other values against it.
Similarly, sample controls how much (if at all) more-frequent words are sometimes randomly skipped (down-sampled). (So many redundant usage examples are overkill, wasting training time/effort that could better be spent on rarer words.) Again, you'd only want to tinker with this if you've got a way to compare the results of alternate values. Smaller values make the downsampling more aggressive (dropping more words); sample=0 would turn off such down-sampling completely, leaving all training-text words used.
Though you didn't ask:
dm=0 turns off the default PV-DM mode in favor of the PV-DBOW mode. That will train doc-vectors a bit faster, and often works very well on short texts, but won't train word-vectors at all (unless you turn on an extra dbow_words=1 mode to add back interleaved skip-gram word-vector training).
hs is an alternate mode to train the neural network that uses multi-node encodings of words, rather than one node per (positive or negative) word. If enabled via hs=1, you should disable the negative-sampling with negative=0. But negative-sampling mode is the default for a reason, & tends to get relatively better with larger amounts of training data – so it's rare to use this mode.
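Pulling those parameters together, a hedged sketch of how they appear when constructing the model (the values are purely illustrative, not recommendations):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    vector_size=100,   # dimensionality of the doc-vectors
    negative=5,        # 'noise' words per positive example (the default)
    sample=1e-4,       # down-sampling threshold for very frequent words
    dm=1,              # 1 = PV-DM (default); dm=0 switches to PV-DBOW
    dbow_words=0,      # with dm=0, set to 1 to also train skip-gram word-vectors
    hs=0,              # hierarchical softmax off; negative-sampling stays on
    epochs=20,
)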
QUESTION
I am new to NLP and Doc2Vec. I noted that some websites train Doc2Vec by shuffling the training data in each epoch (option 1), while other websites use option 2, where there is no shuffling of the training data.
What is the difference? Also, how do I select the optimal alpha? Thank you.
...ANSWER
Answered 2021-Oct-28 at 16:19
If your corpus might have some major difference-in-character between early & late documents – such as certain words/topics that are all front-loaded to early docs, or all back-loaded in later docs – then performing one shuffle up-front to eliminate any such pattern may help a little. It's not strictly necessary & its effects on end results will likely be small.
Re-shuffling between every training pass is not common & I wouldn't expect it to offer a detectable benefit justifying its cost/code-complexity.
Regarding your "Option 1" vs "Option 2": Don't call train() multiple times in your own loop unless you're an expert who knows exactly why you're doing that. (And: any online example suggesting that is often a poor/buggy one.)
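As a hedged sketch of the usual pattern (toy corpus, illustrative values): a single build_vocab()/train() pair manages all epochs and the alpha decay internally, with no manual loop or alpha adjustment.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["some", "example", "tokens"], tags=[0]),
          TaggedDocument(words=["other", "example", "tokens"], tags=[1])]

model = Doc2Vec(vector_size=50, min_count=1, epochs=20)   # min_count=1 only for this toy corpus
model.build_vocab(corpus)
# One train() call; the default alpha-to-min_alpha decay is handled internally.
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)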
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install doc2vec
You can use doc2vec like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.