word2vec | Sogou news corpus , perform corpus analysis
kandi X-RAY | word2vec Summary
The idea is to use Google's word2vec, use the Sogou news corpus, perform corpus analysis, word segmentation and other operations, conduct word vector training, and obtain the vectors.bin file
Top functions reviewed by kandi - BETA
- Demonstrates how to use this method
- Calculate the distance of each word in the list of words
- Read string from input stream
- Load a model from a file
- Main method to learn a model
- Trains a model
- Skip gram
- Calculate cbowgram
- Main entry point
- Splits word by line
- Demonstrates how to use this model
- Explain words for the classifier
- Compare two words
- Inserts a word in the list
- Returns a string representation of the table
- Runs a test
- Splits a word
- Load model
- Test entry point
- Calculate the exp table
word2vec Key Features
word2vec Examples and Code Snippets
def word2vec(batch_gen):
    """ Build the graph for word2vec model and train it """
    # Step 1: define the placeholders for input and output
    # center_words have to be int to work on embedding lookup
    # TO DO
    # Step 2: define weights.

def word2vec(dataset):
    """ Build the graph for word2vec model and train it """
    # Step 1: get input, output from the dataset
    with tf.name_scope('data'):
        iterator = dataset.make_initializable_iterator()
        center_words, target_

def word2vec(batch_gen):
    """ Build the graph for word2vec model and train it """
    # Step 1: define the placeholders for input and output
    with tf.name_scope('data'):
        center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='
Community Discussions
Trending Discussions on word2vec
QUESTION
The TensorFlow Embedding layer (https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) is easy to use, and there are many articles explaining how to use embeddings (https://machinelearningmastery.com/what-are-word-embeddings/, https://www.sciencedirect.com/topics/computer-science/embedding-method). However, I want to know how the Embedding layer itself is implemented in TensorFlow or PyTorch. Is it word2vec? Is it CBOW? Is it a special Dense layer?
...ANSWER
Answered 2021-Jun-09 at 09:22
Structurally, both the Dense layer and the Embedding layer are hidden layers with neurons in them. The difference is in the way they operate on the given inputs and the weight matrix.
A Dense layer multiplies its inputs by the weight matrix, adds biases, and applies an activation function, whereas an Embedding layer uses the weight matrix as a look-up dictionary.
The Embedding layer is best understood as a dictionary that maps integer indices (which stand for specific words) to dense vectors. It takes integers as input, it looks up these integers in an internal dictionary, and it returns the associated vectors. It’s effectively a dictionary lookup.
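The difference between the two access patterns can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the TensorFlow or PyTorch implementation: the weight values and the helper names (embedding_lookup, one_hot, dense_no_bias) are hypothetical, and the Dense-style path omits bias and activation.

```python
# Hypothetical toy weight matrix: 4 "words" in the vocabulary, 3 dimensions each.
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def embedding_lookup(weights, index):
    """Embedding-style access: return the row stored for the integer index."""
    return weights[index]

def one_hot(index, size):
    """Encode an integer index as a one-hot row vector."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def dense_no_bias(weights, inputs):
    """Dense-style access (no bias, no activation): input vector times weight matrix."""
    n_rows, n_cols = len(weights), len(weights[0])
    return [sum(inputs[r] * weights[r][c] for r in range(n_rows)) for c in range(n_cols)]

looked_up = embedding_lookup(weights, 2)
multiplied = dense_no_bias(weights, one_hot(2, len(weights)))
print(looked_up)
print(multiplied)
```

Both calls produce the same row of the matrix; the lookup simply skips the multiplication by zeros, which is why it is cheaper for large vocabularies.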
QUESTION
I use python 3.9.1 on macOS Big Sur with an M1 chip. And, gensim is 4.0.1
I tried to use the pre-trained Word2Vec model and I ran the code below:
...ANSWER
Answered 2021-Jun-07 at 12:37
The problem is that the referenced repository trained its model on a very old version of Gensim, which makes it incompatible with current versions.
You can check whether the lifecycle metadata gives you any indication of the actual version, and then try to update your model from there. The documentation also gives some tips for upgrading older trained models, but even those are relatively weak and point mostly to re-training. Similarly, the guide for migrating from Gensim 3.X to 4.X does not describe direct upgrade methods, but could give you ideas on which parameters to look out for specifically.
My suggestion would be to try loading it with any of the previous 3.X versions, and see if you have more success loading it there.
QUESTION
I have recently sourced and curated a lot of reddit data from Google Bigquery.
The dataset looks like this:
Before passing this data to word2vec to create a vocabulary and be trained, it is required that I properly tokenize the 'body_cleaned' column.
I have attempted the tokenization with both manually created functions and NLTK's word_tokenize, but for now I'll keep it focused on using word_tokenize.
Because my dataset is rather large, close to 12 million rows, it is impossible for me to open and perform functions on the dataset in one go. Pandas tries to load everything to RAM and as you can understand it crashes, even on a system with 24GB of ram.
I am facing the following issue:
- When I tokenize the dataset (using NLTK word_tokenize), if I perform the function on the dataset as a whole, it correctly tokenizes and word2vec accepts that input and learns/outputs words correctly in its vocabulary.
- When I tokenize the dataset by first batching the dataframe and iterating through it, the resulting token column is not what word2vec prefers; although word2vec trains its model on the data gathered for over 4 hours, the resulting vocabulary it has learnt consists of single characters in several encodings, as well as emojis - not words.
To troubleshoot this, I created a tiny subset of my data and tried to perform the tokenization on that data in two different ways:
- Knowing that my computer can handle performing the action on the dataset, I simply did:
ANSWER
Answered 2021-May-27 at 18:28
First & foremost, beyond a certain size of data, & especially when working with raw text or tokenized text, you probably don't want to be using Pandas dataframes for every interim result.
They add extra overhead & complication that isn't fully 'Pythonic'. This is particularly the case for:
- Python list objects where each word is a separate string: once you've tokenized raw strings into this format, for example to feed such texts to Gensim's Word2Vec model, trying to put those into Pandas just leads to confusing list-representation issues (as with your columns where the same text might be shown as either ['yessir', 'shit', 'is', 'real'], which is a true Python list literal, or [yessir, shit, is, real], which is some other mess likely to break if any tokens have challenging characters).
- the raw word-vectors (or later, text-vectors): these are more compact & natural/efficient to work with in raw Numpy arrays than Dataframes
So, by all means, if Pandas helps for loading or other non-text fields, use it there. But then use more fundamental Python or Numpy datatypes for tokenized text & vectors - perhaps using some field (like a unique ID) in your Dataframe to correlate the two.
Especially for large text corpuses, it's more typical to get away from CSV and instead use large text files, with one text per newline-separated line, and each line being pre-tokenized so that spaces can be fully trusted as token separators.
That is: even if your initial text data has more complicated punctuation-sensitive tokenization, or other preprocessing that combines/changes/splits other tokens, try to do that just once (especially if it involves costly regexes), writing the results to a single simple text file which then fits the simple rules: read one text per line, split each line only by spaces.
Lots of algorithms, like Gensim's Word2Vec or FastText, can either stream such files directly or via very low-overhead iterable-wrappers - so the text is never completely in memory, only read as needed, repeatedly, for multiple training iterations.
For more details on this efficient way to work with large bodies of text, see this article: https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
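A low-overhead iterable wrapper of the kind described can be sketched as follows. This is a minimal illustration, not Gensim's own code: the class name LineCorpus and the sample file are hypothetical, and the sketch splits on any whitespace rather than strictly on single spaces.

```python
import os
import tempfile

class LineCorpus:
    """Restartable iterable: yields one list of tokens per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass, so the corpus can be iterated
        # repeatedly (e.g. once per training epoch) without loading it all.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()  # pre-tokenized text: split on whitespace only
                if tokens:  # skip blank lines
                    yield tokens

# Write a tiny, already-tokenized corpus: one text per line.
tmp_dir = tempfile.mkdtemp()
corpus_path = os.path.join(tmp_dir, "corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as handle:
    handle.write("the quick brown fox\n")
    handle.write("jumps over the lazy dog\n")

corpus = LineCorpus(corpus_path)
first_pass = list(corpus)
second_pass = list(corpus)  # a plain generator would be exhausted here
print(first_pass)
```

The class is used instead of a bare generator function because a generator can be consumed only once, while training typically needs several passes over the same data.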
QUESTION
As we know, Facebook's FastText is a great open-source, free, lightweight library that can be used for text classification. But a problem here is that the pipeline seems to be an end-to-end black box. Yes, we can change the hyper-parameters from these options for setting training configuration. But I couldn't find a way to access the vector embedding it generates internally.
Actually, I want to do some manipulation on the vector embedding, like introducing tf-idf weighting apart from these word2vec representations. Another thing I want to do is oversampling using SMOTE, which requires a numerical representation. For these reasons I need to introduce my custom code into the overall pipeline, which seems to be inaccessible to me. How can I introduce custom steps in this pipeline?
ANSWER
Answered 2021-Jun-06 at 16:30
The full source code is available:
https://github.com/facebookresearch/fastText
So, you can make any changes or extensions you can imagine - if you're comfortable reading & modifying its C++ source code. Nothing is hidden or inaccessible.
Note that both FastText, and its supervised classification mode, are chiefly conventions for training a shallow neural-network. It may not be helpful to think of it as a "pipeline" like in the architecture of other classifier libraries - as none of the internal interfaces use that sort of language or modular layout.
Specifically, if you get the gist of word2vec training, FastText classifier mode really just replaces attempted-predictions of neighboring (in-context-window) vocabulary words, with attempted-predictions of known labels instead.
For the sake of understanding FastText's relationship to other techniques, and potential aspects for further extension, I think it's useful to also review:
- this skeptical blog post comparing FastText to the much-earlier 'vowpal wabbit' tool: "Fast & easy baseline text categorization with vw"
- Facebook's far-less discussed extension of such vector-training for more generic categorical or numerical tasks, "StarSpace"
QUESTION
https://www.tensorflow.org/tutorials/text/word2vec uses tf.random.log_uniform_candidate_sampler for negative sampling.
The tutorial sets true_classes to context_class.
My experiment shows no matter what I set for true_classes, the function always yields good results.
...ANSWER
Answered 2021-Jun-06 at 05:33
The line in the tutorial:
"You can call the function on one skip-grams's target word and pass the context word as a true class to exclude it from being sampled"
That's misleading.
What does true_classes mean in this function?
The function returns true_expected_count, which is defined in this line of the source code.
true_classes seems to be used only to calculate true_expected_count. So this function does not exclude the true classes from the sampled negatives. Every label has a probability to get sampled.
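To see why every label can be drawn, it helps to look at the base distribution that the TensorFlow documentation gives for this sampler, P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1). The sketch below evaluates only that formula in plain Python; it does not call TensorFlow, and the vocabulary size and the "true class" id are hypothetical values chosen for illustration.

```python
import math

def log_uniform_probability(class_id, range_max):
    """Documented base distribution of the log-uniform (Zipfian) sampler."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)

range_max = 8   # hypothetical vocabulary size
true_class = 3  # hypothetical id of the context word

probabilities = [log_uniform_probability(c, range_max) for c in range(range_max)]
total = sum(probabilities)

for class_id, p in enumerate(probabilities):
    marker = "  <- true class" if class_id == true_class else ""
    print(f"class {class_id}: {p:.4f}{marker}")
print(f"sum of probabilities: {total:.4f}")
```

The formula has no term that depends on true_classes, so under this distribution the true class keeps a non-zero probability like any other id, and lower ids (more frequent words, if the vocabulary is sorted by frequency) are favored.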
I copy an example code that can be experimented on (in case something happens to the link), taken from this GitHub issue:
QUESTION
from gensim.models import Word2Vec
import pandas as pd
import time
# Skip-gram model (sg = 1)
size = 1000
window = 3
min_count = 1
workers = 3
sg = 1
word2vec_model_file = 'word2vec_' + str(size) + '.model'
start_time = time.time()
stemmed_tokens = pd.Series(df['STEMMED_TOKENS']).values
# Train the Word2Vec Model
w2v_model = Word2Vec(stemmed_tokens, min_count = min_count, size = size, workers = workers, window = window, sg = sg)
print("Time taken to train word2vec model: " + str(time.time() - start_time))
w2v_model.save(word2vec_model_file)
...ANSWER
Answered 2021-Jun-02 at 16:43
A vector size of 1000 dimensions is very uncommon, and would require massive amounts of data to train. For example, the famous GoogleNews vectors were for 3 million words, trained on something like 100 billion corpus words - and still only 300 dimensions. Your STEMMED_TOKENS may not be enough data to justify 100-dimensional vectors, much less 300 or 1000.
A choice of min_count=1 is a bad idea. This algorithm can't learn anything valuable from words that only appear a few times. Typically people get better results by discarding rare words entirely, as the default min_count=5 will do. (If you have a lot of data, you're likely to increase this value to discard even more words.)
Are you examining the model's size or word-to-word results at all to ensure it's doing what you expect? Despite your column being named STEMMED_TOKENS, I don't see any actual splitting-into-tokens, and the Word2Vec class expects each text to be a list-of-strings, not a string.
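The effect of passing a string where a list-of-strings is expected can be imitated in plain Python. The sketch below is not Gensim's code; the vocabulary helper is a hypothetical stand-in that, like any consumer iterating over each text, treats whatever a text yields as its tokens.

```python
raw_text = "cats chase mice"

as_string_corpus = [raw_text]         # each text is a plain string (wrong shape)
as_token_corpus = [raw_text.split()]  # each text is a list of string tokens

def vocabulary(corpus):
    """Collect the distinct items seen when each text is iterated token by token."""
    vocab = set()
    for text in corpus:
        for token in text:
            vocab.add(token)
    return vocab

char_vocab = vocabulary(as_string_corpus)
word_vocab = vocabulary(as_token_corpus)
print(sorted(char_vocab))
print(sorted(word_vocab))
```

Iterating over a string yields single characters, so the first vocabulary contains letters and a space rather than words. A learned vocabulary made of single characters is therefore a strong hint that untokenized strings reached the model.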
Finally, without seeing all your other choices for feeding word-vector-enriched data to your other classification steps, it is possible (likely even) that there are other errors there.
Given that a binary-classification model can always get at least 50% accuracy by simply classifying every example with whichever class is more common, any accuracy result less than 50% should immediately cause suspicions of major problems in your process like:
- misalignment of examples & labels
- insufficient/unrepresentative training data
- some steps not running at all due to data-prep or invocation errors
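The 50% floor mentioned above can be checked with a trivial majority-class baseline. This is a small sketch with made-up labels; the function name majority_baseline_accuracy is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

labels = [1, 0, 1, 1, 0, 1, 0, 1]  # hypothetical binary labels
majority_label, baseline = majority_baseline_accuracy(labels)
print(majority_label, baseline)
```

With two classes the more common one covers at least half of the examples, so this baseline never drops below 0.5. Comparing a model's accuracy against it is a quick sanity check before looking for deeper problems.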
QUESTION
I'm using a pre-trained word2vec model (word2vec-google-news-300) to get the embeddings for a given list of words. Please note that this is NOT a list of words that we get after tokenizing a sentence, it is just a list of words that describe a given image.
Now I'd like to get a single vector representation for the entire list. Does adding all the individual word embeddings make sense? Or should I consider averaging? Also, I would like the vector to be of a constant size so concatenating the embeddings is not an option.
It would be really helpful if someone can explain the intuition behind considering either one of the above approaches.
...ANSWER
Answered 2021-Jun-01 at 17:03
Averaging is most typical, when someone is looking for a super-simple way to turn a bag-of-words into a single fixed-length vector.
You could try a simple sum, as well.
But note that the key difference between the sum and average is that the average divides by the number of input vectors. Thus they both result in a vector that's pointing in the exact same 'direction', just of different magnitude. And, the most-often-used way of comparing such vectors, cosine-similarity, is oblivious to magnitudes. So for a lot of cosine-similarity-based ways of later comparing the vectors, sum-vs-average will give identical results.
On the other hand, if you're comparing the vectors in other ways, like via euclidean-distances, or feeding them into other classifiers, sum-vs-average could make a difference.
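The sum-versus-average point can be verified with a few lines of plain Python. The vectors below are made-up toy values, not real word embeddings, and the helper names are hypothetical.

```python
import math

def vector_sum(vectors):
    """Component-wise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]

def vector_mean(vectors):
    """Component-wise average: the sum divided by the number of vectors."""
    return [total / len(vectors) for total in vector_sum(vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

word_vectors = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0]]  # toy "word vectors"
query = [1.0, 1.0, 1.0]                                              # toy comparison vector

summed = vector_sum(word_vectors)
averaged = vector_mean(word_vectors)

cos_sum = cosine_similarity(summed, query)
cos_avg = cosine_similarity(averaged, query)
dist_sum = euclidean_distance(summed, query)
dist_avg = euclidean_distance(averaged, query)
print(cos_sum, cos_avg)
print(dist_sum, dist_avg)
```

The two cosine similarities agree up to floating-point rounding because the average is just the sum scaled by 1/3, while the two Euclidean distances differ because they depend on magnitude.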
Similarly, some might try unit-length-normalizing all vectors before use in any comparisons. After such a pre-use normalization, then:
- euclidean-distance (smallest to largest) & cosine-similarity (largest-to-smallest) will generate identical lists of nearest-neighbors
- average-vs-sum will result in different ending directions - as the unit-normalization will have upped some vectors' magnitudes, and lowered others, changing their relative contributions to the average.
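The first point follows from a short identity. For unit-length vectors a and b (so that both norms equal 1), the squared Euclidean distance is a decreasing function of the cosine similarity:

```latex
\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\,a \cdot b
            = 2 - 2\cos(a, b),
\qquad \text{when } \|a\| = \|b\| = 1 .
```

So sorting neighbors by increasing distance gives the same order as sorting them by decreasing cosine similarity.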
What should you do? There's no universally right answer - depending on your dataset & goals, & the ways your downstream steps use the vectors, different choices might offer slight advantages in whatever final quality/desirability evaluation you perform. So it's common to try a few different permutations, along with varying other parameters.
Separately:
- The GoogleNews vectors were trained on news articles back around 2013; their word senses thus may not be optimal for an image-labeling task. If you have enough of your own data, or can collect it, training your own word-vectors might result in better results. (Both the use of domain-specific data, & the ability to tune training parameters based on your own evaluations, could offer benefits - especially when your domain is unique, or the tokens aren't typical natural-language sentences.)
- There are other ways to create a single summary vector for a run-of-tokens, not just arithmetical-combo-of-word-vectors. One that's a small variation on the word2vec algorithm often goes by the name Doc2Vec (or 'Paragraph Vector') - it may also be worth exploring.
- There are also ways to compare bags-of-tokens, leveraging word-vectors, that don't collapse the bag-of-tokens to a single fixed-length vector first - and while they're more expensive to calculate, sometimes offer better pairwise similarity/distance results than simple cosine-similarity. One such alternate comparison is called "Word Mover's Distance" - at some point, you may want to try that as well.
QUESTION
Using the Gensim library, we can load a model and update its vocabulary when new sentences are added. That means if you save the model, you can continue training it later. I checked with sample data: say I have a word in my vocabulary that was previously trained (e.g. "women"). Then I have new sentences, and after calling model.build_vocab(new_sentence, update=True) and model.train(new_sentence), the model is updated. In new_sentence, some words already exist in the previous vocabulary ("women") and some are new ("girl"). After updating the vocabulary, I have both old and new words in the corpus. I checked model.wv['women']: the vector changes after the update and training on the new sentences. I can also get the word embedding vector for the new word, i.e. model.wv['girl']. The vectors of all other previously trained words that are not in new_sentence did not change.
...ANSWER
Answered 2021-May-27 at 17:40
When you perform a new call to .train(), it only trains on the new data. So only words in the new data can possibly be updated.
And to the extent that the new data may be smaller, and more idiosyncratic in its word usages, any words in the new data will be trained to only be consistent with other words being trained in the new data. (Depending on the size of the new data, and the training parameters chosen like alpha & epochs, they might be pulled via the new examples arbitrarily far from their old locations - and thus start to lose comparability to words that were trained earlier.)
(Note also that when providing a different corpus than the original, you shouldn't use a parameter like total_examples=model.corpus_count, reusing model.corpus_count, a value cached in the model from the earlier data. Rather, parameters should describe the current batch of data.)
Frankly, I'm not a fan of this feature. It's possible it could be useful to advanced users. But most people drawn to it are likely misusing it, expecting any number of tiny incremental updates to constantly expand & improve the model - when there's no good support for the idea that will reliably happen with naive use.
In fact, there's reasons to doubt such updates are generally a good idea. There's even an established term for the risk that incremental updates to a neural-network wreck its prior performance: catastrophic forgetting.
The straightforward & best-grounded approach to updating word-vectors for new expanded data is to re-train from scratch, so all words are on equal footing, and go through the same interleaved training, on the same unified optimization (SGD) schedule. (The new vectors at the end of such a process will not be in a coordinate space compatible with the old ones, but should be equivalently useful, or better if the data is now bigger and better.)
QUESTION
I am trying to build a Word2vec model but when I try to reshape the vector for tokens, I am getting this error. Any idea ?
...ANSWER
Answered 2021-May-25 at 15:05
As of Gensim 4.0 & higher, the Word2Vec model doesn't support subscripted-indexed access (the ['...']) to individual words. (Previous versions would display a deprecation warning, "Method will be removed in 4.0.0, use self.wv.__getitem__() instead", for such uses.)
So, when you want to access a specific word, do it via the Word2Vec model's .wv property, which holds just the word-vectors, instead. So, your (unshown) word_vector() function should have its line highlighted in the error stack changed to:
QUESTION
I have a couple of issues regarding Gensim in its Word2Vec model.
The first is: what happens if I set it to train for 0 epochs? Does it just create the random vectors and call it done? So they would be random every time, correct?
The second concerns the WV object; the doc page says:
...ANSWER
Answered 2021-May-20 at 18:08
I've not tried the nonsense parameter epochs=0, but it might behave as you expect. (Have you tried it and seen otherwise?)
However, if your real goal is to be able to tamper with the model after initialization, but before training, the usual way to do that is to not supply any corpus when constructing the model instance, and instead manually do the two followup steps, .build_vocab() & .train(), in your own code - inserting extra steps between the two. (For even finer-grained control, you can examine the source of .build_vocab() & its helper methods, and simply ensure you do all those necessary things, with your own extra steps interleaved.)
The "word vectors" in the .wv property of type KeyedVectors are essentially the "input projection layer" of the model: the data which converts a single word into a vector_size-dimensional dense embedding. (You can think of the keys, the word token strings, as being somewhat like a one-hot word-encoding.)
So, assigning into that structure only changes that "input projection vector", which is the "word vector" usually collected from the model. If you need to tamper with the hidden-to-output weights, you need to look at the model's .syn1neg (or .syn1 for HS mode) property.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install word2vec
You can use word2vec like any standard Java library. Include the jar files in your classpath. You can also use any IDE to run and debug the word2vec component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.