gensim-word2vec | Train word vectors with gensim
kandi X-RAY | gensim-word2vec Summary
Train word vectors with gensim
Top functions reviewed by kandi - BETA
- Processes the XML dump file
- Generate a list of pages from a string
- Reserve the given size
- Load templates from file
- Return True if the page matches the given criteria
- Extract magic words
- Tries to drop nested blocks
- Drops spans in text
- Convert string to upper
- Cleans text
- Find occurrences of source pattern
- Returns sharp expression
- Lower case
- Return the string substring
- Implements sharp test
- Returns the first part of a string
- Return the position of a string
- Get substlength from string
- Convert a string to TRACITION
- Expand a test value
- Process the given jobs queue
- Replace string occurrences
- Similar to sharp_if_value
- Run the program
- Normalize title
- Performs a sharp switch
- Reduce a single process
- Return a list of all pages
gensim-word2vec Key Features
gensim-word2vec Examples and Code Snippets
Community Discussions
Trending Discussions on gensim-word2vec
QUESTION
ANSWER
Answered 2020-Oct-01 at 17:23

Those answers are correct for reading the declared token-counts out of a model which has them.
But in some cases, your model may only have been initialized with a fake, descending-by-1 count for each word. In Gensim, this is most likely if the model was loaded from a source where the counts either weren't available or weren't used.
In particular, if you created the model using load_word2vec_format(), that simple vectors-only format (whether binary or plain-text) inherently contains no word counts. But such words are almost always, by convention, sorted in most-frequent to least-frequent order.
So, Gensim has chosen, when frequencies are not present, to synthesize fake counts, with linearly descending int values, where the (first) most-frequent word begins with the count of all unique words, and the (last) least-frequent word has a count of 1.
(I'm not sure this is a good idea, but Gensim's been doing it for a while, and it ensures code relying on the per-token count won't break, and will preserve the original order, though obviously not the unknowable original true-proportions.)
In some cases, the original source of the file may have saved a separate .vocab file with the word-frequencies alongside the word2vec_format vectors. (In Google's original word2vec.c code release, this is the file generated by the optional -save-vocab flag. In Gensim's .save_word2vec_format() method, the optional fvocab parameter can be used to generate this side file.)
If so, that 'vocab' frequencies filename may be supplied, when you call .load_word2vec_format(), as the fvocab parameter - and then your vector-set will have true counts.
If your word-vectors were originally created in Gensim from a corpus giving actual frequencies, and were always saved/loaded using the Gensim native functions .save()/.load() (which use an extended form of Python-pickling), then the original true count info will never have been lost.
If you've lost the original frequency data, but you know the data was from a real natural-language source, and you want a more realistic (but still faked) set of frequencies, an option could be to use the Zipfian distribution. (Real natural-language usage frequencies tend to roughly fit this 'tall head, long tail' distribution.) A formula for creating such more-realistic dummy counts is available in the answer:
Gensim: Any chance to get word frequency in Word2Vec format?
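As a hedged illustration of the fvocab loading path described above (the filenames here are hypothetical; the accessor shown is the gensim 4.x one, while 3.x kept counts in kv.vocab[word].count):

```python
from gensim.models import KeyedVectors

# Hypothetical filenames: 'vectors.bin' is a word2vec_format file, and
# 'vectors.vocab' is the side file of "word count" lines written by
# word2vec.c's -save-vocab flag or by save_word2vec_format(..., fvocab=...).
kv = KeyedVectors.load_word2vec_format(
    "vectors.bin", fvocab="vectors.vocab", binary=True
)

# With fvocab supplied, per-word counts are the real frequencies rather than
# the synthesized, linearly descending placeholders described above.
print(kv.get_vecattr("the", "count"))
```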
QUESTION
I am doing research that requires direct manipulation & embedding of one-hot vectors and I am trying to use gensim to load a pretrained word2vec model for this.
The problem is they don't seem to have a direct api for working with 1-hot-vectors. And I am looking for work arounds.
So I wanted to know if anyone knows of a way to do this. Or, more specifically: could these vocab indices (which are defined quite ambiguously) be indices into corresponding one-hot vectors?
Context I have found:
- Seems this question is related but I tried accessing the 'input embeddings' (assuming they were one-hot representations), via model.syn0 (from link in answer), but I got a non-sparse matrix...
- Also appears they refer to word indices as 'doctags' (search for Doctag/index).
- Here is another question giving some context to the indices (although not quite answering my question).
- Here is the official documentation:
################################################
class gensim.models.keyedvectors.Vocab(**kwargs) Bases: object
A single vocabulary item, used internally for collecting per-word frequency/sampling info, and for constructing binary trees (incl. both word leaves and inner nodes).
################################################
...ANSWER
Answered 2020-Aug-23 at 18:14

Yes, you can think of the index (position) of gensim's Word2Vec word-vectors as being the one dimension that would be 1.0 – with all other V dimensions, where V is the count of unique words, being 0.0.
The implementation doesn't actually ever create one-hot vectors, as a sparse or explicit representation. It's just using the word's index as a look-up for its dense vector – following in the path of the word2vec.c code from Google on which the gensim implementation was originally based.
(The term 'doctags' is only relevant in the Doc2Vec – aka 'Paragraph Vector' – implementation. There it is the name for the distinct tokens/ints that are used for looking up document-vectors, using a different namespace from in-document words. That is, in Doc2Vec you could use 'doc_007' as a doc-vector name, aka a 'doctag', and even if the string-token 'doc_007' also appears as a word inside documents, the doc-vector referenced by doctag-key 'doc_007' and the word-vector referenced by word-key 'doc_007' wouldn't be the same internal vector.)
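A minimal sketch of treating the word's index as the 'hot' position (this assumes gensim 4.x, where the mapping is exposed as key_to_index; older releases used model.wv.vocab[word].index, and the toy corpus is purely illustrative):

```python
import numpy as np
from gensim.models import Word2Vec

# Tiny hypothetical corpus, just to have a trained vocabulary to index into.
sentences = [["hello", "world"], ["hello", "gensim"]]
model = Word2Vec(sentences, vector_size=10, min_count=1)

word = "hello"
idx = model.wv.key_to_index[word]  # the word's position in the vocabulary

# The explicit one-hot vector, if you really need it materialized:
one_hot = np.zeros(len(model.wv), dtype=np.float32)
one_hot[idx] = 1.0

# The dense vector stored at that same index is the word's learned embedding.
assert np.allclose(model.wv[word], model.wv.vectors[idx])
```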
QUESTION
I am trying to make a word2vec model with Gensim for Persian, a language which uses the space character as the word delimiter; I use Python 3.5. The problem I encountered was that I gave a text file as input and got back a model which only contains individual characters instead of words. I also gave the input as a list of words, which is recommended in:
Python Gensim word2vec vocabulary key
It doesn't work for me, and I think it doesn't consider the sequence of words in a sentence, so it wouldn't be correct.
I did some preprocessing on my input which consist of:
collapse multiple whitespaces into a single one
tokenize by splitting on whitespace
remove words less than 3 characters long
remove stop words
I gave the text to the original word2vec tool, which gave me correct results, but I need this in Python, so my choice is limited to Gensim.
Also, when I try to load the model built by the word2vec source into Gensim I get an error, so I need to create the word2vec model with Gensim.
my code is:
...ANSWER
Answered 2017-Jul-18 at 16:44

The gensim Word2Vec model does not expect strings as its text examples (sentences), but lists-of-tokens. Thus, it's up to your code to tokenize your text before passing it to Word2Vec.
Your code as shown just passes raw data from the 'aggregate.txt' file into Word2Vec as wFileRead.
Look at examples in the gensim documentation, including the LineSentence class included with gensim, for ideas.
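For instance, a sketch using LineSentence on the question's file (the filename is taken from the question; parameter names assume gensim 4.x, where size became vector_size):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence expects one sentence per line, tokens separated by whitespace,
# and yields each line as a list of string tokens rather than a raw string.
sentences = LineSentence("aggregate.txt")

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
print(model.wv.most_similar(model.wv.index_to_key[0]))
```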
QUESTION
The question is two-fold:
1. How to select the ideal value for size?
2. How to get the vocabulary size dynamically (per row as I intend) to set that ideal size?
My data looks like the following (example)—just one row and one column:
Row 1
...ANSWER
Answered 2019-Jun-15 at 15:33

There's no simple formula for the best size - it will depend on your data and purposes.
The best practice is to devise a robust, automatable way to score a set of word-vectors for your purposes – likely with some hand-constructed representative subset of the kinds of judgments, and preferred results, you need. Then, try many values of size (and other parameters) until you find the value(s) that score highest for your purposes.
In the domain of natural language modeling, where vocabularies are at least in the tens-of-thousands of unique words but possibly in the hundreds-of-thousands or millions, typical size values are usually in the 100-1000 range, but very often in the 200-400 range. So you might start a search of alternate values around there, if your task/vocabulary is similar.
But if your data or vocabulary is small, you may need to try smaller values. (Word2Vec really needs large, diverse training data to work best, though.)
Regarding your code-as-shown:
- there's unlikely any point to computing a new model for every item in your dataset (discarding the previous model on each loop iteration). If you want a count of the unique tokens in any one tokenized item, you could use idiomatic Python like len(set(word_tokenize(item))). Any Word2Vec model of interest would likely need to be trained on the combined corpus of tokens from all items, as sketched below.
- it's usually the case that min_count=1 makes a model worse than larger values (like the default of min_count=5). Words that only appear once generally can't get good word-vectors, as the algorithm needs multiple subtly-contrasting examples to work its magic. But, trying-and-failing to make useful word-vectors from such singletons tends to take up training-effort and model-state that could be more helpful for other words with adequate examples – so retaining those rare words even makes other word-vectors worse. (It is most definitely not the case that "retaining every raw word makes the model better", though it is almost always the case that "more real diverse data makes the model better".)
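A sketch of that combined-corpus approach ('rows' is a hypothetical stand-in for the question's single-column data, and the parameter values are only placeholders to be tuned on your own task):

```python
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize  # needs NLTK's 'punkt' data installed

# Hypothetical stand-in for the question's data: one text per row.
rows = ["first example sentence of row one", "second example sentence of row two"]

# Tokenize every row once, then train a single model on the combined corpus
# instead of rebuilding a model per row.
corpus = [word_tokenize(row.lower()) for row in rows]

unique_tokens = len(set(tok for sent in corpus for tok in sent))
print("unique tokens across all rows:", unique_tokens)

# vector_size is the gensim 4.x name for the older size parameter.
# min_count=1 only so this toy corpus survives pruning; on real data keep the
# default min_count=5 (or higher), for the reasons given in the answer above.
model = Word2Vec(corpus, vector_size=200, min_count=1, workers=4)
```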
QUESTION
I wanted to see if I can simply set new weights for gensim's Word2Vec without training. I got the 20 Newsgroups dataset from scikit-learn (from sklearn.datasets import fetch_20newsgroups) and trained an instance of Word2Vec on it:
...ANSWER
Answered 2019-Jun-14 at 16:06

Generally, your approach should work.
It's likely the specific problem you're encountering was caused by an extra probing step you took that is not shown in your code, because you had no reason to think it significant: some sort of most_similar()-like operation on model_w2v_new after its build_vocab() call but before the later, malfunctioning operations.
Traditionally, most_similar() calculations operate on a version of the vectors that has been normalized to unit-length. The 1st time these unit-normed vectors are needed, they're calculated – and then cached inside the model. So, if you then replace the raw vectors with other values, but don't discard those cached values, you'll see results like you're reporting – essentially random, reflecting the randomly-initialized-but-never-trained starting vector values.
If this is what happened, just discarding the cached values should cause the next most_similar() to refresh them properly, and then you should get the results you expect:
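A minimal self-contained sketch of that fix (the model and weights here are hypothetical; the cached-norm attribute name assumes gensim 4.x, while 3.x kept a separate vectors_norm array):

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical stand-in for the question's model_w2v_new.
sentences = [["alpha", "beta", "gamma"], ["alpha", "delta", "beta"]]
model = Word2Vec(vector_size=20, min_count=1)
model.build_vocab(sentences)

# Overwrite the raw (randomly initialized, untrained) vectors with external weights.
model.wv.vectors = np.random.rand(*model.wv.vectors.shape).astype(np.float32)

# Discard any cached unit-length norms so most_similar() recomputes them.
model.wv.norms = None            # gensim 4.x cache
# model.wv.vectors_norm = None   # gensim 3.x kept a separate normed array

print(model.wv.most_similar("alpha", topn=3))
```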
QUESTION
Following the solution from a related question, I created a Docker container which loads the GoogleNews-vectors-negative300 KeyedVectors inside the container and loads it all into memory.
...ANSWER
Answered 2018-Aug-01 at 06:14

I'm not sure if containerization allows containers to share the same memory-mapped files – but even if it does, it's possible that whatever utility you're using to measure per-container memory usage counts the memory twice even if it's shared. What tool are you using to monitor memory usage, and are you sure it'd indicate true sharing? (What happens if, outside of gensim, you try using Python's mmap.mmap() to open the same giant file in two containers? Do you see the same, more, or less memory usage than in the gensim case?)
But also: in order to do a most_similar(), the KeyedVectors will create a second array of word-vectors, normalized to unit-length, in the property vectors_norm. (This is done once, when first needed.) This normed array isn't saved, because it can always be re-calculated. So for your usage, each container will create its own, non-shared, vectors_norm array - undoing any possible memory savings from shared memory-mapped files.
You can work around this: after loading a model but before triggering the automatic normalization, explicitly force it yourself with a special argument to clobber the original raw vectors in-place. Then save this pre-normed version:
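A sketch of that workaround, assuming gensim 3.x, where init_sims(replace=True) is likely the "special argument" meant above (the output filename is hypothetical; in 4.x init_sims is deprecated and the cached-norm array is handled differently):

```python
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Overwrite the raw vectors in-place with their unit-normed versions, so no
# separate vectors_norm array is ever allocated later.
kv.init_sims(replace=True)
kv.save("GoogleNews-vectors-normed.kv")  # hypothetical output filename

# Later, in each container, memory-map the pre-normed file read-only and tell
# the model the vectors are already normed, so nothing extra gets allocated:
shared = KeyedVectors.load("GoogleNews-vectors-normed.kv", mmap="r")
shared.vectors_norm = shared.vectors
print(shared.most_similar("computer", topn=3))
```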
QUESTION
I have an issue similar to the one discussed here - gensim word2vec - updating word embeddings with newcoming data
I have the following code that saves a model as text8_gensim.bin
...ANSWER
Answered 2018-Jul-26 at 21:30

Those docs aren't in the right format: each text should be a list-of-string-tokens, not a string.
And, the same min_count threshold will apply to incremental updates: words less frequent than that threshold will be ignored. (Since a min_count higher than 1 is almost always a good idea, a word that appears only once in any update will never be added to the model.)
Incrementally adding words introduces lots of murky issues, with unclear proper choices regarding model quality, balancing the effects of early-vs-late training, management of the alpha learning-rate, and so forth. It won't necessarily improve your model; with the wrong choices it could make it worse, by adjusting some words with your new texts in ways that move them out-of-compatible-alignment with earlier-batch-only words.
So be careful and always check with a repeatable automated quantitative quality check that your changes are helping. (The safest approach is to retrain with old and new texts in one combined corpus, so that all words get trained against one another equally on all data.)
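If you do proceed with an incremental update, a minimal sketch (this assumes 'text8_gensim.bin' was written with the native model.save(), which is the form that supports further training, and gensim 4.x parameter names; the new_docs content is hypothetical):

```python
from gensim.models import Word2Vec

# Filename taken from the question; must be a natively saved model, not a
# vectors-only word2vec_format file.
model = Word2Vec.load("text8_gensim.bin")

# New texts must be lists of string tokens, not raw strings.
new_docs = [["human", "machine", "interface"], ["graph", "minors", "survey"]]

# update=True adds newly seen words (still subject to the existing min_count)
# instead of rebuilding the vocabulary from scratch.
model.build_vocab(new_docs, update=True)
model.train(new_docs, total_examples=len(new_docs), epochs=model.epochs)
```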
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install gensim-word2vec
You can use gensim-word2vec like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.