gensim-word2vec | Train word vectors with gensim
kandi X-RAY | gensim-word2vec Summary
Train word vectors with gensim
Top functions reviewed by kandi - BETA
- Processes the XML dump file
- Generate a list of pages from a string
- Reserve the given size
- Load templates from file
- Return True if the page matches the given criteria
- Extract magic words
- Tries to drop nested blocks
- Drops spans in text
- Convert string to upper
- Cleans text
- Find occurrences of source pattern
- Returns sharp expression
- Lower case
- Return the string substring
- Implements sharp test
- Returns the first part of a string
- Return the position of a string
- Get substlength from string
- Convert a string to TRACITION
- Expand a test value
- Process the given jobs queue
- Replace string occurrences
- Similar to sharp_if_value
- Run the program
- Normalize title
- Performs a sharp switch
- Reduce a single process
- Return a list of all pages
gensim-word2vec Key Features
gensim-word2vec Examples and Code Snippets
Community Discussions
Trending Discussions on gensim-word2vec
QUESTION
ANSWER
Answered 2020-Oct-01 at 17:23

Those answers are correct for reading the declared token-counts out of a model which has them.
But in some cases, your model may only have been initialized with a fake, descending-by-1 count for each word. In Gensim, this is most likely if the model was loaded from a source where the counts either weren't available or weren't used.
In particular, if you created the model using load_word2vec_format(), that simple vectors-only format (whether binary or plain-text) inherently contains no word counts. But such words are almost always, by convention, sorted in most-frequent to least-frequent order.
So, Gensim has chosen, when frequencies are not present, to synthesize fake counts, with linearly descending int values, where the (first) most-frequent word begins with the count of all unique words, and the (last) least-frequent word has a count of 1.
(I'm not sure this is a good idea, but Gensim's been doing it for a while, and it ensures code relying on the per-token count won't break, and will preserve the original order, though obviously not the unknowable original true-proportions.)
In some cases, the original source of the file may have saved a separate .vocab file with the word-frequencies alongside the word2vec_format vectors. (In Google's original word2vec.c code release, this is the file generated by the optional -save-vocab flag. In Gensim's .save_word2vec_format() method, the optional fvocab parameter can be used to generate this side file.)
If so, that 'vocab' frequencies filename may be supplied, when you call .load_word2vec_format(), as the fvocab parameter - and then your vector-set will have true counts.
If your word-vectors were originally created in Gensim from a corpus giving actual frequencies, and were always saved/loaded using the Gensim native functions .save()/.load() (which use an extended form of Python-pickling), then the original true count info will never have been lost.
If you've lost the original frequency data, but you know the data was from a real natural-language source, and you want a more realistic (but still faked) set of frequencies, an option could be to use the Zipfian distribution. (Real natural-language usage frequencies tend to roughly fit this 'tall head, long tail' distribution.) A formula for creating such more-realistic dummy counts is available in the answer:
Gensim: Any chance to get word frequency in Word2Vec format?
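As a hedged illustration of the fvocab loading path described above (the filenames here are hypothetical; the accessor shown is the gensim 4.x one, while 3.x kept counts in kv.vocab[word].count):

```python
from gensim.models import KeyedVectors

# Hypothetical filenames: 'vectors.bin' is a word2vec_format file, and
# 'vectors.vocab' is the side file of "word count" lines written by
# word2vec.c's -save-vocab flag or by save_word2vec_format(..., fvocab=...).
kv = KeyedVectors.load_word2vec_format(
    "vectors.bin", fvocab="vectors.vocab", binary=True
)

# With fvocab supplied, per-word counts are the real frequencies rather than
# the synthesized, linearly descending placeholders described above.
print(kv.get_vecattr("the", "count"))
```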
QUESTION
I am doing research that requires direct manipulation & embedding of one-hot vectors and I am trying to use gensim to load a pretrained word2vec model for this.
The problem is they don't seem to have a direct api for working with 1-hot-vectors. And I am looking for work arounds.
So I wanted to know if anyone knows of a way to do this. Or, more specifically: could these vocab indices (which are defined quite ambiguously) be indices into corresponding one-hot vectors?
Context I have found:
- Seems this question is related but I tried accessing the 'input embeddings' (assuming they were one-hot representations), via model.syn0 (from link in answer), but I got a non-sparse matrix...
- Also appears they refer to word indices as 'doctags' (search for Doctag/index).
- Here is another question giving some context to the indices (although not quite answering my question).
- Here is the official documentation:
################################################
class gensim.models.keyedvectors.Vocab(**kwargs) Bases: object
A single vocabulary item, used internally for collecting per-word frequency/sampling info, and for constructing binary trees (incl. both word leaves and inner nodes).
################################################
...ANSWER
Answered 2020-Aug-23 at 18:14

Yes, you can think of the index (position) of gensim's Word2Vec word-vectors as being the one dimension that would be 1.0 – with all other V dimensions, where V is the count of unique words, being 0.0.
The implementation doesn't actually ever create one-hot vectors, as a sparse or explicit representation. It's just using the word's index as a look-up for its dense vector – following in the path of the word2vec.c code from Google on which the gensim implementation was originally based.
(The term 'doctags' is only relevant in the Doc2Vec – aka 'Paragraph Vector' – implementation. There it is the name for the distinct tokens/ints that are used for looking up document-vectors, using a different namespace from in-document words. That is, in Doc2Vec you could use 'doc_007' as a doc-vector name, aka a 'doctag', and even if the string-token 'doc_007' also appears as a word inside documents, the doc-vector referenced by doctag-key 'doc_007' and the word-vector referenced by word-key 'doc_007' wouldn't be the same internal vector.)
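A minimal sketch of treating the word's index as the 'hot' position (this assumes gensim 4.x, where the mapping is exposed as key_to_index; older releases used model.wv.vocab[word].index, and the toy corpus is purely illustrative):

```python
import numpy as np
from gensim.models import Word2Vec

# Tiny hypothetical corpus, just to have a trained vocabulary to index into.
sentences = [["hello", "world"], ["hello", "gensim"]]
model = Word2Vec(sentences, vector_size=10, min_count=1)

word = "hello"
idx = model.wv.key_to_index[word]  # the word's position in the vocabulary

# The explicit one-hot vector, if you really need it materialized:
one_hot = np.zeros(len(model.wv), dtype=np.float32)
one_hot[idx] = 1.0

# The dense vector stored at that same index is the word's learned embedding.
assert np.allclose(model.wv[word], model.wv.vectors[idx])
```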
QUESTION
I am trying to make a word2vec model with Gensim for Persian, a language which uses the space character as the word delimiter; I use Python 3.5. The problem I encountered was that I gave a text file as input and got back a model which only contains individual characters instead of words. I also gave the input as a list of words, which is recommended in:
Python Gensim word2vec vocabulary key
It doesn't work for me, and I think it doesn't consider the sequence of words in a sentence, so it wouldn't be correct.
I did some preprocessing on my input which consist of:
collapse multiple whitespaces into a single one
tokenize by splitting on whitespace
remove words less than 3 characters long
remove stop words
I gave the text to the original word2vec tool, which gave me correct results, but I need this in Python, so my choice is limited to Gensim.
Also, when I try to load the model built by the word2vec source into Gensim I get an error, so I need to create the word2vec model with Gensim.
my code is:
...ANSWER
Answered 2017-Jul-18 at 16:44

The gensim Word2Vec model does not expect strings as its text examples (sentences), but lists-of-tokens. Thus, it's up to your code to tokenize your text before passing it to Word2Vec.
Your code as shown just passes raw data from the 'aggregate.txt' file into Word2Vec as wFileRead.
Look at examples in the gensim documentation, including the LineSentence class included with gensim, for ideas.
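For instance, a sketch using LineSentence on the question's file (the filename is taken from the question; parameter names assume gensim 4.x, where size became vector_size):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence expects one sentence per line, tokens separated by whitespace,
# and yields each line as a list of string tokens rather than a raw string.
sentences = LineSentence("aggregate.txt")

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
print(model.wv.most_similar(model.wv.index_to_key[0]))
```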
QUESTION
The question is two-fold:
1. How to select the ideal value for size?
2. How to get the vocabulary size dynamically (per row as I intend) to set that ideal size?
My data looks like the following (example)—just one row and one column:
Row 1
...ANSWER
Answered 2019-Jun-15 at 15:33

There's no simple formula for the best size - it will depend on your data and purposes.
The best practice is to devise a robust, automatable way to score a set of word-vectors for your purposes – likely with some hand-constructed representative subset of the kinds of judgments, and preferred results, you need. Then, try many values of size (and other parameters) until you find the value(s) that score highest for your purposes.
In the domain of natural language modeling, where vocabularies are at least in the tens-of-thousands of unique words but possibly in the hundreds-of-thousands or millions, typical size values are usually in the 100-1000 range, but very often in the 200-400 range. So you might start a search of alternate values around there, if your task/vocabulary is similar.
But if your data or vocabulary is small, you may need to try smaller values. (Word2Vec really needs large, diverse training data to work best, though.)
Regarding your code-as-shown:
- there's unlikely any point to computing a new model for every item in your dataset (discarding the previous model on each loop iteration). If you want a count of the unique tokens in any one tokenized item, you could use idiomatic Python like len(set(word_tokenize(item))). Any Word2Vec model of interest would likely need to be trained on the combined corpus of tokens from all items, as sketched below.
- it's usually the case that min_count=1 makes a model worse than larger values (like the default of min_count=5). Words that only appear once generally can't get good word-vectors, as the algorithm needs multiple subtly-contrasting examples to work its magic. But, trying-and-failing to make useful word-vectors from such singletons tends to take up training-effort and model-state that could be more helpful for other words with adequate examples – so retaining those rare words even makes other word-vectors worse. (It is most definitely not the case that "retaining every raw word makes the model better", though it is almost always the case that "more real diverse data makes the model better".)
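A sketch of that combined-corpus approach ('rows' is a hypothetical stand-in for the question's single-column data, and the parameter values are only placeholders to be tuned on your own task):

```python
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize  # needs NLTK's 'punkt' data installed

# Hypothetical stand-in for the question's data: one text per row.
rows = ["first example sentence of row one", "second example sentence of row two"]

# Tokenize every row once, then train a single model on the combined corpus
# instead of rebuilding a model per row.
corpus = [word_tokenize(row.lower()) for row in rows]

unique_tokens = len(set(tok for sent in corpus for tok in sent))
print("unique tokens across all rows:", unique_tokens)

# vector_size is the gensim 4.x name for the older size parameter.
# min_count=1 only so this toy corpus survives pruning; on real data keep the
# default min_count=5 (or higher), for the reasons given in the answer above.
model = Word2Vec(corpus, vector_size=200, min_count=1, workers=4)
```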
QUESTION
I wanted to see if I can simply set new weights for gensim's Word2Vec without training. I got the 20 Newsgroups dataset from scikit-learn (from sklearn.datasets import fetch_20newsgroups) and trained an instance of Word2Vec on it:
...ANSWER
Answered 2019-Jun-14 at 16:06

Generally, your approach should work.
It's likely the specific problem you're encountering was caused by an extra probing step you took that is not shown in your code, because you had no reason to think it significant: some sort of most_similar()-like operation on model_w2v_new after its build_vocab() call but before the later, malfunctioning operations.
Traditionally, most_similar() calculations operate on a version of the vectors that has been normalized to unit-length. The 1st time these unit-normed vectors are needed, they're calculated – and then cached inside the model. So, if you then replace the raw vectors with other values, but don't discard those cached values, you'll see results like you're reporting – essentially random, reflecting the randomly-initialized-but-never-trained starting vector values.
If this is what happened, just discarding the cached values should cause the next most_similar() to refresh them properly, and then you should get the results you expect:
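A minimal self-contained sketch of that fix (the model and weights here are hypothetical; the cached-norm attribute name assumes gensim 4.x, while 3.x kept a separate vectors_norm array):

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical stand-in for the question's model_w2v_new.
sentences = [["alpha", "beta", "gamma"], ["alpha", "delta", "beta"]]
model = Word2Vec(vector_size=20, min_count=1)
model.build_vocab(sentences)

# Overwrite the raw (randomly initialized, untrained) vectors with external weights.
model.wv.vectors = np.random.rand(*model.wv.vectors.shape).astype(np.float32)

# Discard any cached unit-length norms so most_similar() recomputes them.
model.wv.norms = None            # gensim 4.x cache
# model.wv.vectors_norm = None   # gensim 3.x kept a separate normed array

print(model.wv.most_similar("alpha", topn=3))
```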
QUESTION
Following the solution from a related question, I created a Docker container which loads the GoogleNews-vectors-negative300 KeyedVectors inside the container and loads it all into memory.
...ANSWER
Answered 2018-Aug-01 at 06:14

I'm not sure if containerization allows containers to share the same memory-mapped files – but even if it does, it's possible that whatever utility you're using to measure per-container memory usage counts the memory twice even if it's shared. What tool are you using to monitor memory usage, and are you sure it'd indicate true sharing? (What happens if, outside of gensim, you try using Python's mmap.mmap() to open the same giant file in two containers? Do you see the same, more, or less memory usage than in the gensim case?)
But also: in order to do a most_similar(), the KeyedVectors will create a second array of word-vectors, normalized to unit-length, in the property vectors_norm. (This is done once, when first needed.) This normed array isn't saved, because it can always be re-calculated. So for your usage, each container will create its own, non-shared, vectors_norm array - undoing any possible memory savings from shared memory-mapped files.
You can work around this: after loading a model but before triggering the automatic normalization, explicitly force it yourself with a special argument to clobber the original raw vectors in-place. Then save this pre-normed version:
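A sketch of that workaround, assuming gensim 3.x, where init_sims(replace=True) is likely the "special argument" meant above (the output filename is hypothetical; in 4.x init_sims is deprecated and the cached-norm array is handled differently):

```python
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Overwrite the raw vectors in-place with their unit-normed versions, so no
# separate vectors_norm array is ever allocated later.
kv.init_sims(replace=True)
kv.save("GoogleNews-vectors-normed.kv")  # hypothetical output filename

# Later, in each container, memory-map the pre-normed file read-only and tell
# the model the vectors are already normed, so nothing extra gets allocated:
shared = KeyedVectors.load("GoogleNews-vectors-normed.kv", mmap="r")
shared.vectors_norm = shared.vectors
print(shared.most_similar("computer", topn=3))
```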
QUESTION
I have an issue similar to the one discussed here - gensim word2vec - updating word embeddings with newcoming data
I have the following code that saves a model as text8_gensim.bin
...ANSWER
Answered 2018-Jul-26 at 21:30

Those docs aren't in the right format: each text should be a list-of-string-tokens, not a string.
And, the same min_count threshold will apply to incremental updates: words less frequent than that threshold will be ignored. (Since a min_count higher than 1 is almost always a good idea, a word that appears only once in any update will never be added to the model.)
Incrementally adding words introduces lots of murky issues, with unclear proper choices regarding model quality, balancing the effects of early-vs-late training, management of the alpha learning-rate, and so forth. It won't necessarily improve your model; with the wrong choices it could make it worse, by adjusting some words with your new texts in ways that move them out-of-compatible-alignment with earlier-batch-only words.
So be careful and always check with a repeatable automated quantitative quality check that your changes are helping. (The safest approach is to retrain with old and new texts in one combined corpus, so that all words get trained against one another equally on all data.)
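If you do proceed with an incremental update, a minimal sketch (this assumes 'text8_gensim.bin' was written with the native model.save(), which is the form that supports further training, and gensim 4.x parameter names; the new_docs content is hypothetical):

```python
from gensim.models import Word2Vec

# Filename taken from the question; must be a natively saved model, not a
# vectors-only word2vec_format file.
model = Word2Vec.load("text8_gensim.bin")

# New texts must be lists of string tokens, not raw strings.
new_docs = [["human", "machine", "interface"], ["graph", "minors", "survey"]]

# update=True adds newly seen words (still subject to the existing min_count)
# instead of rebuilding the vocabulary from scratch.
model.build_vocab(new_docs, update=True)
model.train(new_docs, total_examples=len(new_docs), epochs=model.epochs)
```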
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install gensim-word2vec
You can use gensim-word2vec like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.