gensim | Topic Modelling for Humans | Topic Modeling library
kandi X-RAY | gensim Summary
Topic Modelling for Humans
Top functions reviewed by kandi - BETA
- Update the model with a given corpus
- Perform the inference on the given document
- Compute the phinorm
- Evaluate the model
- Updates the Lda model with a given corpus
- Evaluate a single step
- Add metrics to the plot
- Set the model
- Fit LDAPE algorithm
- Merge two projections
- Write a corpus to a file
- Estimate the probability of a boolean sliding window
- Extract articles and positions from file
- Load a model
- Add new documents to the LsiModel
- Return unit vector
- Updates the model with the given corpus
- Update the LDA
- Add a model to the model
- Train the model
- Evaluate the word analogies in the model
- Evaluate a list of words
- Compute the difference between two topics
- Construct a sparse term similarity matrix
- Compute the inner product between two matrices
- Compute the distance between two documents
gensim Examples and Code Snippets
import gensim
gensim.__version__
# 3.6.0
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, window=5, min_count=1, workers=4)  # do not specify vector_size; leave the default (100)
import numpy as np

# get the average vector if more than 1 word is included in the "text" column
def document_vector(items):
    # remove out-of-vocabulary words
    doc = [word for word in items.split() if word in model_glove]
    if doc:
        # average the word vectors of the in-vocabulary words
        doc_vector = np.mean(model_glove[doc], axis=0)
        return doc_vector
from gensim.models.doc2vec import Doc2Vec

filename = 'my_doc2vec_model'
initial_model.save(filename)        # initial_model is a trained Doc2Vec instance
reloaded_model = Doc2Vec.load(filename)
finite_set = set(['word_d', 'word_e', 'word_f'])  # set for efficient 'in' checks
all_candidates = wv_from_bin.most_similar(positive=["word_a", "word_b"],
                                           topn=len(wv_from_bin))
filtered_results = [word_sim for word_sim in all_candidates if word_sim[0] in finite_set]
# relative path
serv/GoogleNews-vectors-negative300.bin
# absolute path
/Users/Ile-Maurice/Desktop/Flask/flaskapp/serv/GoogleNews-vectors-negative300.bin
from gensim.models.phrases import Phrases

trigram_transformer.save(TRIPHRASER_PATH)
reloaded_trigram_transformer = Phrases.load(TRIPHRASER_PATH)
import gensim
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
import plotly.express as px
import plotly.graph_objects as go
import json
import dash
from dash import dcc, html, Input, Output
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']  # e.g. the stylesheet commonly used in Dash examples
from gensim import corpora
corpus = [
['door', 'cat', 'mom'],
]
dictionary = corpora.Dictionary(corpus)
filtered_sentence = ['door', 'cat']  # a tokenised sentence to convert
corpus2 = [dictionary.doc2bow(filtered_sentence)]
# Dockerfile fix for "error: can't find Rust compiler" during pip builds
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
Community Discussions
Trending Discussions on gensim
QUESTION
I have a sample dataframe as below
...ANSWER
Answered 2022-Mar-29 at 18:47

Remove the .vocab in model_glove.vocab; that attribute is no longer supported in current versions of gensim. Edit: you also need split() here, so that you iterate over words rather than characters.
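For illustration, a minimal sketch of that fix, assuming model_glove is a gensim 4.x KeyedVectors instance and df has a "text" column as in the question (those names are otherwise hypothetical):

import numpy as np

def document_vector(text):
    # split() iterates over words (not characters); plain membership
    # replaces the removed .vocab lookup in gensim 4.x
    doc = [word for word in text.split() if word in model_glove.key_to_index]
    return np.mean(model_glove[doc], axis=0) if doc else None

df['doc_vector'] = df['text'].apply(document_vector)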
QUESTION
I have migrated from gensim 3.8.3 to 4.1.2 and I am using this:

claim = [token for token in claim_text if token in w2v_model.wv.vocab]
reference = [token for token in ref_text if token in w2v_model.wv.vocab]

I am not sure how to replace w2v_model.wv.vocab with the newer attribute, and I am getting this error: "'KeyedVectors' object has no attribute 'wv'". Can anyone please help?
...ANSWER
Answered 2022-Mar-20 at 19:43

You only use the .wv property to fetch the KeyedVectors object from another, more complete algorithmic model, like a full Word2Vec model (which contains a KeyedVectors in its .wv attribute).

If you're already working with just-the-vectors, there's no need to request the word-vectors subcomponent. Whatever you were going to do, you just do to the KeyedVectors directly.

However, you're also using the .vocab attribute, which has been replaced. See the migration FAQ for more details:

(Mainly: instead of doing an "in w2v_model.wv.vocab", you may only need to do "in kv_model" or "in kv_model.key_to_index".)
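As a sketch of those replacement idioms (the vectors filename is a placeholder, and claim_text is assumed to be a list of tokens as in the question):

from gensim.models import KeyedVectors

kv_model = KeyedVectors.load('my_vectors.kv')  # placeholder path

# gensim 3.x style (no longer works): token in w2v_model.wv.vocab
# gensim 4.x equivalents:
claim = [token for token in claim_text if token in kv_model.key_to_index]
claim = [token for token in claim_text if token in kv_model]  # same membership test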
QUESTION
I iteratively apply the...
...ANSWER
Answered 2022-Mar-14 at 19:50

By default, to avoid using an unbounded amount of RAM, the Gensim Phrases class uses a default parameter max_vocab_size=40000000, per the source code & docs at:

https://radimrehurek.com/gensim/models/phrases.html#gensim.models.phrases.Phrases
Unfortunately, the mechanism behind this cap is very crude & non-intuitive. Whenever the tally of all known keys in the survey-dict (which includes both unigrams & bigrams) hits this threshold (default 40,000,000), a prune operation is performed that discards all token counts (unigrams & bigrams) at low frequencies until the total of unique keys is under the threshold. And, it sets the low-frequency floor for future prunes to be at least as high as was necessary for this prune.

For example, the 1st time this is hit, it might need to discard all the 1-count tokens. And due to the typical Zipfian distribution of word frequencies, that step alone might not just get the total count of known tokens slightly under the threshold, but massively under the threshold. And, any subsequent prune will start by eliminating at least everything with fewer than 2 occurrences.

This results in the sawtooth counts you're seeing. When the model can't fit in max_vocab_size, it overshrinks. It may do this many times in the course of processing a very large corpus. As a result, final counts of lower-frequency words/bigrams can also be serious undercounts, depending somewhat arbitrarily on whether a key's counts survived the various prune-thresholds. (That's also influenced by where in the corpus a token appears. A token that only appears in the corpus after the last prune will still have a precise count, even if it only appears once! Although rare tokens that appeared any number of times could be severely undercounted, if they were always below the cutoff at each prior prune.)
The best solution would be to use a precise count that uses/correlates some spillover storage on-disk, to only prune (if at all) at the very end, ensuring only the truly-least-frequent keys are discarded. Unfortunately, Gensim's never implemented that option.
The next-best, for many cases, could be to use a memory-efficient approximate counting algorithm, that vaguely maintains the right magnitudes of counts for a much larger number of keys. There's been a little work in Gensim on this in the past, but not yet integrated with the Phrases functionality.

That leaves you with the only practical workaround in the short term: change the max_vocab_size parameter to be larger.
You could try setting it to math.inf (might risk lower performance due to int-vs-float comparisons) or sys.maxsize – essentially turning off the pruning entirely, to see if your survey can complete without exhausting your RAM. But you might run out of memory anyway.

You could also try a larger-but-not-essentially-infinite cap – whatever fits in your RAM – so that far less pruning is done. But you'll still see the non-intuitive decreases in total counts, sometimes, if in fact the threshold is ever enforced. Per the docs, a very rough (perhaps outdated) estimate is that the default max_vocab_size=40000000 consumes about 3.6GB at peak saturation. So if you've got a 64GB machine, you could possibly try a max_vocab_size that's 10-14x larger than the default, etc.
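For example, a minimal sketch of that workaround, assuming sentences is an iterable of tokenised sentences (parameter values other than max_vocab_size are just illustrative):

import sys
from gensim.models.phrases import Phrases

# raise the pruning cap (here effectively disabling it); watch RAM usage
phrases = Phrases(sentences, min_count=5, threshold=10.0,
                  max_vocab_size=sys.maxsize)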
QUESTION
I am curious to know if there are any implications of using a different source when calling build_vocab and train on a Gensim FastText model. Will this impact the contextual representation of the word embedding?

My intention for doing this is that there is a specific set of words I am interested in getting the vector representation for, and when calling model.wv.most_similar I only want words defined in this vocab list to be returned, rather than all possible words in the training corpus. I would use the result of this to decide if I want to group those words as relevant to each other based on a similarity threshold.
Following is the code snippet that I am using, appreciate your thoughts if there are any concerns or implication with this approach.
- vocab.txt contains a list of unique words of interest
- corpus.txt contains full conversation text (i.e. chat messages) where each line represents a paragraph/sentence per chat
A follow-up question to this is: what values should I set for total_examples & total_words during training in this case?
ANSWER
Answered 2022-Mar-07 at 22:50

In case someone has a similar question, I'll paste the reply I got when asking this question in the Gensim Discussion Group, for reference:

You can try it, but I wouldn't expect it to work well for most purposes.

The build_vocab() call establishes the known vocabulary of the model, & caches some stats about the corpus. If you then supply another corpus – & especially one with more words – then:
- You'll want your train() parameters to reflect the actual size of your training corpus. You'll want to provide a true total_examples and total_words count that are accurate for the training corpus.
- Every word in the training corpus that's not in the known vocabulary is ignored completely, as if it wasn't even there. So you might as well filter your corpus down to just the words-of-interest first, then use that same filtered corpus for both steps. Will the example texts still make sense? Will that be enough data to train meaningful, generalizable word-vectors for just the words-of-interest, alongside other words-of-interest, without the full texts? (You could look at your pre-filtered corpus to get a sense of that.) I'm not sure – it could depend on how severely trimming to just the words-of-interest changed the corpus. In particular, to train high-dimensional dense vectors – as with vector_size=300 – you need a lot of varied data. Such pre-trimming might thin the corpus so much as to make the word-vectors for the words-of-interest far less useful.

You could certainly try it both ways – pre-filtered to just your words-of-interest, or with the full original corpus – and see which works better on downstream evaluations.
More generally, if the concern is training time with the full corpus, there are likely other ways to get an adequate model in an acceptable amount of time.
If using corpus_file mode, you can increase workers to equal the local CPU core count for a nearly-linear speedup from the number of cores. (In traditional corpus_iterable mode, max throughput is usually somewhere in the 6-12 workers threads, as long as you have that many cores.)
min_count=1 is usually a bad idea for these algorithms: they tend to train faster, in less memory, leaving better vectors for the remaining words when you discard the lowest-frequency words, as the default min_count=5 does. (It's possible FastText can eke a little bit of benefit out of lower-frequency words via their contribution to character-n-gram training, but I'd only ever lower the default min_count if I could confirm it was actually improving relevant results.)

If your corpus is so large that training time is a concern, often a more-aggressive (smaller) sample parameter value not only speeds training (by dropping many redundant high-frequency words), but often improves final word-vector quality for downstream purposes as well (by letting the rarer words have relatively more influence on the model in the absence of the downsampled words).

And again, if the corpus is so large that training time is a concern, then epochs=100 is likely overkill. I believe the GoogleNews vectors were trained using only 3 passes – over a gigantic corpus. A sufficiently large & varied corpus, with plenty of examples of all words all throughout, could potentially train in 1 pass – because each word-vector can then get more total training-updates than many epochs with a small corpus. (In general, larger epochs values are more often used when the corpus is thin, to eke out something – not on a corpus so large you're considering non-standard shortcuts to speed the steps.)

-- Gordon
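To make the total_examples/total_words advice concrete, here is a hedged sketch of a standard FastText build_vocab + train run where both steps use the same corpus and the counts come from build_vocab (corpus.txt and the parameter values are assumptions, not from the thread):

from gensim.models import FastText
from gensim.models.word2vec import LineSentence

corpus = LineSentence('corpus.txt')  # one tokenised sentence per line
model = FastText(vector_size=300, window=5, min_count=5, workers=4)
model.build_vocab(corpus_iterable=corpus)
model.train(corpus_iterable=corpus,
            total_examples=model.corpus_count,      # sentences seen by build_vocab
            total_words=model.corpus_total_words,   # tokens seen by build_vocab
            epochs=model.epochs)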
QUESTION
I have created a class for word2vec vectorisation which is working fine. But when I create a model pickle file and use that pickle file in a Flask App, I am getting an error like:
AttributeError: module '__main__' has no attribute 'GensimWord2VecVectorizer'
I am creating the model on Google Colab.
Code in Jupyter Notebook:
...ANSWER
Answered 2022-Feb-24 at 11:48

Import GensimWord2VecVectorizer in your Flask web app Python file.
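A sketch of that fix with hypothetical module and file names; the key point is that the class must be resolvable under the name recorded in the pickle (here it was pickled from a notebook's __main__):

import pickle
import __main__

# hypothetical module that contains the class definition from the notebook
from vectorizers import GensimWord2VecVectorizer

# make the name resolvable as __main__.GensimWord2VecVectorizer, which is
# where the pickle recorded it when the model was built in the notebook
__main__.GensimWord2VecVectorizer = GensimWord2VecVectorizer

with open('w2v_vectorizer.pkl', 'rb') as f:  # hypothetical pickle filename
    model = pickle.load(f)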
QUESTION
For the following list:
...ANSWER
Answered 2022-Feb-12 at 13:11

Word2Vec expects a list of lists as input, where the corpus (main list) is composed of individual documents. The individual documents are composed of individual words (tokens). Word2Vec iterates over all documents and all tokens. In your example you have passed a single flat list to Word2Vec, therefore Word2Vec interprets each word as an individual document and iterates over each word character by character, with each character interpreted as a token. Therefore you have built a vocabulary of characters, not words. To build a vocabulary of words you can pass a nested list to Word2Vec, as in the example below.
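A small sketch of that nested-list input (words and parameters are illustrative only):

from gensim.models import Word2Vec

corpus = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],   # document 1, tokenised
    ['dogs', 'chase', 'cats'],                   # document 2, tokenised
]
model = Word2Vec(sentences=corpus, vector_size=50, min_count=1)
print(model.wv.key_to_index)  # vocabulary of words, not characters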
QUESTION
I have this code :
...ANSWER
Answered 2022-Feb-04 at 06:08

The 'current working directory' that the Python process will consider active, and thus will use as the expected location for your plain relative filename GoogleNews-vectors-negative300.bin, will depend on how you launched Flask.

You could print out the directory to be sure – see some ways at "How do you properly determine the current script directory?" – but I suspect it may just be the /Users/Ile-Maurice/Desktop/Flask/flaskapp/ directory.
If so, you could relatively-reference your file with the path relative to the above directory...
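For example, a sketch using the directory layout mentioned in the thread, resolving the file relative to the script instead of the process's current working directory (KeyedVectors.load_word2vec_format is the usual way to read the GoogleNews binary):

import os
from gensim.models import KeyedVectors

# resolve the file relative to this script rather than the current working directory
here = os.path.dirname(os.path.abspath(__file__))
vectors_path = os.path.join(here, 'serv', 'GoogleNews-vectors-negative300.bin')
wv = KeyedVectors.load_word2vec_format(vectors_path, binary=True)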
QUESTION
I would like to know whether I can store the gensim Phrases model after training it on the sentences.
...ANSWER
Answered 2022-Feb-03 at 18:40

Convert the list (or whatever particular format you have) into a NumPy array and save it as a .npy file; it is easy to save and easy to read, and using NumPy gives you the advantage of loading it on almost every platform, like Google Colab, Replit, etc. Refer to numpy.save() for more details on saving a .npy file.

Using pickle is also a good option, but things get a bit tricky when differences in encoding standards and similar problems arise.
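For reference, the Phrases model also has gensim's own save()/load() methods (as in the trigram snippet earlier on this page); a minimal sketch, where sentences and the path are placeholders:

from gensim.models.phrases import Phrases

phrase_model = Phrases(sentences, min_count=5, threshold=10.0)
phrase_model.save('phrases.model')              # gensim-native serialisation
reloaded_model = Phrases.load('phrases.model')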
QUESTION
ANSWER
Answered 2022-Feb-02 at 04:15

In plotly-python, I don't think there's an easy way of retrieving the location of the cursor. You can attempt to use go.FigureWidget to highlight a trace as described in this answer, but I think you're going to be limited with plotly-python and I'm not sure if highlighting the closest n points will be possible.

However, I believe that you can accomplish what you want in plotly-dash, since callbacks are supported – meaning you would be able to retrieve the location of your cursor, then calculate the n closest data points to your cursor and highlight the data points as needed.

Below is an example of such a solution. If you haven't seen it before, it looks complicated, but what is happening is that I am taking the point where you clicked as an input. plotly is plotly.js under the hood, so it comes to us in the form of a dictionary (and not some kind of plotly-python object). Then I calculate the three closest data points to the clicked input point by comparing the coordinates of every other point in the dataframe, add the information from the three closest points as traces to the input with the color teal (or any color of your choosing), and send this modified input back as the output, updating the figure.
I am using click instead of hover because hover would cause the highlighted points to flicker too much as you drag your mouse through the points.
Also the dash app doesn't work perfectly as I believe there is some issue when you double click on points (you can see me click once in the gif below before getting it to start working), but this basic framework is hopefully close enough to what you want. Cheers!
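Below is a compressed, hedged sketch of that callback idea; the dataframe, column names, and figure are made-up stand-ins for the ones in the thread:

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from dash import Dash, dcc, Input, Output

df = pd.DataFrame({'x': np.random.rand(50), 'y': np.random.rand(50)})
fig = go.Figure(go.Scatter(x=df['x'], y=df['y'], mode='markers'))

app = Dash(__name__)
app.layout = dcc.Graph(id='scatter', figure=fig)

@app.callback(Output('scatter', 'figure'), Input('scatter', 'clickData'))
def highlight_closest(clickData):
    if clickData is None:
        return fig
    pt = clickData['points'][0]
    # distance from the clicked point to every point in the dataframe
    dists = np.hypot(df['x'] - pt['x'], df['y'] - pt['y'])
    closest = df.loc[dists.nsmallest(3).index]
    new_fig = go.Figure(fig)
    new_fig.add_trace(go.Scatter(x=closest['x'], y=closest['y'], mode='markers',
                                 marker=dict(color='teal', size=12),
                                 name='closest points'))
    return new_fig

if __name__ == '__main__':
    app.run_server(debug=True)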
QUESTION
I trained w2v on a rather big (> 200 million sentences) corpus, and got, in addition to the file w2v_model.model, the files w2v_model.model.trainables.syn1neg.npy and w2v_model.model.wv.vectors.npy. The model file was successfully loaded and read all npy files without any exceptions. The obtained model performed OK.
Now I retrained the model on much bigger corpus (> 1 billion sentences). The same 3 files were automatically saved, as expected.
When I try to load my new retrained model:
...ANSWER
Answered 2022-Jan-24 at 18:39

If a .save() is creating any files with the word trainables in it, you're using an older version of Gensim. Any new training should definitely prefer using a current version. As of now (January 2022), that's gensim-4.1.2, released 2021-09.

If an attempt at a .load() generated that particular error, then there should've been that file, alongside the others you mention, created when the .save() had been done. (In fact, the only way that the main file you named with path_filename should be able to know that other filename is if that other file was written successfully, allowing the main file to complete writing.)
Are you sure that file wasn't written, but then somehow left behind, perhaps getting deleted or not moving alongside the other few files to some new filesystem path?
In general, I would suggest:
- using the latest Gensim for any new training
- always enabling Python logging at the INFO level, & watching the logging/console output of training/saving processes closely to see confirmation of expected activity/steps
- keeping all files from a .save() that begin with the same main filename (in your examples above, w2v_US.model) together – & keeping in mind that for larger models it may be a larger roster of files than for a small test model
You will probably have to re-train the model, but you might be able to re-generate a compatible lockf file via steps like the following:

- save aside all files of any potential use
- from the exact same configuration as your original .save() – including the same outdated Gensim version, exact same model parameters, & exact same training corpus – repeat all the model-building steps you did before up through the .build_vocab() step. (That is: no extra need to .train().) This will create an untrained dummy model that should exactly match the vocabulary 'shape' of your broken model.
- use .save() to save that dummy model again – watching the logs/output for errors. There should be, alongside the other files, a file with a name like dummy.model.trainables.vectors_lockf.npy. If so, you might be able to copy that away, rename it to be the file expected by the original model whose load failed, then leave it alongside that original model – and the .load() might then succeed, or fail in a different way.
(If there were other problems/corruption at the time of the original model creation, this might not work. In particular, I wonder if, when you talk about retraining the model, you didn't start with a fresh Word2Vec instance, but somehow expanded the older one, which might've added other problems/complications. In that case, a full retraining, ideally in the latest Gensim, would be necessary, and also a better basis for going forward.)
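A rough sketch of the dummy-model steps above, assuming a gensim 3.x install (matching the version that produced the broken model) and that original_corpus and the parameters exactly match the original run:

from gensim.models import Word2Vec

dummy = Word2Vec(size=300, window=5, min_count=5, workers=4)  # gensim 3.x uses 'size'
dummy.build_vocab(original_corpus)   # same corpus as the broken model; no .train() needed
dummy.save('dummy.model')
# a file like dummy.model.trainables.vectors_lockf.npy should appear alongside;
# copy it next to the broken model under the filename its .load() expects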
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported