word2vec | Sogou news corpus, perform corpus analysis

by ustbprir1005gao Java Version: Current License: No License


kandi X-RAY | word2vec Summary

word2vec is a Java library. It has no bugs or vulnerabilities reported, a build file is available, and it has high support. You can download it from GitHub.

The idea is to use Google's word2vec, use the Sogou news corpus, perform corpus analysis, word segmentation and other operations, conduct word vector training, and obtain the vectors.bin file


Support

word2vec has a highly active ecosystem.

It has 8 star(s) with 2 fork(s). There are 2 watchers for this library.

It had no major release in the last 6 months.

word2vec has no issues reported. There are no pull requests.

It has a positive sentiment in the developer community.

The latest version of word2vec is current.

Quality

word2vec has no bugs reported.

Security

word2vec has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

License

word2vec does not have a standard license declared.

Check the repository for any license declaration and review the terms closely.

Without a license, all rights are reserved, and you cannot use the library in your applications.

Reuse

word2vec releases are not available. You will need to build from source code and install.

Build file is available. You can build the component from source.

Top functions reviewed by kandi - BETA

kandi has reviewed word2vec and discovered the below as its top functions. This is intended to give you an instant insight into word2vec implemented functionality, and help decide if they suit your requirements.

Demonstrates how to use this method
Calculate the distance of each word in the list of words
Read string from input stream
Load a model from a file
Main method to learn a model
Trains a model
Skip gram
Calculate cbowgram
Main entry point
Splits word by line
Demonstrates how to use this model
Explain words for the classifier
Compare two words
Inserts a word in the list
Returns a string representation of the table
Runs a test
Splits a word
Load model
Test entry point
Calculate the exp table


word2vec Key Features

No Key Features are available at this moment for word2vec.

word2vec Examples and Code Snippets

Define a word2vec .

python

Lines of Code : 59

License : Permissive (MIT License)


def word2vec(batch_gen):
    """ Build the graph for word2vec model and train it """
    # Step 1: define the placeholders for input and output
    # center_words have to be int to work on embedding lookup

    # TO DO


    # Step 2: define weights.

Embed word2vec .

python

Lines of Code : 53

License : Permissive (MIT License)


def word2vec(dataset):
    """ Build the graph for word2vec model and train it """
    # Step 1: get input, output from the dataset
    with tf.name_scope('data'):
        iterator = dataset.make_initializable_iterator()
        center_words, target_

Create a word2vec .

python

Lines of Code : 49

License : Permissive (MIT License)


def word2vec(batch_gen):
    """ Build the graph for word2vec model and train it """
    # Step 1: define the placeholders for input and output
    with tf.name_scope('data'):
        center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='

Community Discussions

Trending Discussions on word2vec

What is the network structure inside a Tensorflow Embedding Layer?

AttributeError: Can't get attribute on

Training Word2Vec Model from sourced data - Issue Tokenizing data

How to access to FastText classifier pipeline?

Understanding true_classes in log_uniform_candidate_sampler

I applied W2V on ML Algorithms. It gives error of negative value for NB and gives 0.48 accuracy for all the other algorithms. How come?

Does adding a list of Word2Vec embeddings give a meaningful representation?

How did online training work in the Word2vec model using Gensim

TypeError: 'Word2Vec' object is not subscriptable

Inner workings of Gensim Word2Vec

QUESTION

What is the network structure inside a Tensorflow Embedding Layer?

Asked 2021-Jun-09 at 09:22

The TensorFlow Embedding layer (https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) is easy to use, and there are many articles about how to use it (https://machinelearningmastery.com/what-are-word-embeddings/, https://www.sciencedirect.com/topics/computer-science/embedding-method). However, I want to know how the Embedding layer itself is implemented in TensorFlow or PyTorch. Is it word2vec? Is it CBOW? Is it a special Dense layer?

...

ANSWER

Answered 2021-Jun-09 at 09:22

Structurally, both the Dense layer and the Embedding layer are hidden layers with neurons in them. The difference is in how they operate on their inputs and weight matrix.

A Dense layer multiplies its inputs by the weight matrix, adds biases, and applies an activation function. An Embedding layer, by contrast, uses the weight matrix as a look-up dictionary.

The Embedding layer is best understood as a dictionary that maps integer indices (which stand for specific words) to dense vectors. It takes integers as input, it looks up these integers in an internal dictionary, and it returns the associated vectors. It’s effectively a dictionary lookup.
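The lookup described above can be sketched in plain Python. This is an illustration only, not TensorFlow's implementation: the names `WEIGHTS`, `embedding_lookup`, and `dense_on_one_hot` and the matrix values are made up for the example. It shows that selecting row `index` of the weight matrix gives the same vector as multiplying a one-hot input by that matrix in a dense layer with no bias and no activation.

```python
# Toy weight matrix: 4 vocabulary entries, 3-dimensional embeddings (made-up values).
WEIGHTS = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def embedding_lookup(weights, index):
    """Embedding-style access: return a copy of row `index` of the weight matrix."""
    return list(weights[index])


def dense_on_one_hot(weights, index):
    """Same result via a dense layer (no bias, no activation) applied to a one-hot input."""
    vocab_size = len(weights)
    dim = len(weights[0])
    one_hot = [1.0 if i == index else 0.0 for i in range(vocab_size)]
    # Matrix product of the one-hot row vector with the weight matrix.
    return [
        sum(one_hot[i] * weights[i][j] for i in range(vocab_size))
        for j in range(dim)
    ]


if __name__ == "__main__":
    for idx in range(len(WEIGHTS)):
        assert embedding_lookup(WEIGHTS, idx) == dense_on_one_hot(WEIGHTS, idx)
    print(embedding_lookup(WEIGHTS, 2))
```

The lookup avoids building the one-hot vector and doing the multiplication, which is why it is the cheaper way to get the same row.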

Source https://stackoverflow.com/questions/67896966

QUESTION

AttributeError: Can't get attribute on

Asked 2021-Jun-07 at 12:37

I use Python 3.9.1 on macOS Big Sur with an M1 chip, and gensim 4.0.1.

I tried to use the pre-trained Word2Vec model and I ran the code below:

...

ANSWER

Answered 2021-Jun-07 at 12:37

The problem is that the referenced repository trained a model on an incredibly old version of Gensim, which makes it incompatible with current versions.

You can potentially check whether the lifecycle metadata gives you any indication of the actual version, and then try to update your model from there. The documentation also gives some tips for upgrading older trained models, but even those are relatively weak and point mostly to re-training. Similarly, the guide for migrating from Gensim 3.X to 4.X does not describe direct upgrade methods, but it could give you ideas on which parameters to look out for specifically.

My suggestion would be to try loading it with any of the previous 3.X versions, and see if you have more success loading it there.

Source https://stackoverflow.com/questions/67865773

QUESTION

Training Word2Vec Model from sourced data - Issue Tokenizing data

Asked 2021-Jun-07 at 01:50

I have recently sourced and curated a lot of reddit data from Google Bigquery.

The dataset looks like this:

Before passing this data to word2vec to create a vocabulary and be trained, it is required that I properly tokenize the 'body_cleaned' column.

I have attempted the tokenization with both manually created functions and NLTK's word_tokenize, but for now I'll keep it focused on using word_tokenize.

Because my dataset is rather large, close to 12 million rows, it is impossible for me to open and perform functions on the dataset in one go. Pandas tries to load everything into RAM and, as you can understand, it crashes, even on a system with 24GB of RAM.

I am facing the following issue:

When I tokenize the dataset (using NLTK word_tokenize), if I perform the function on the dataset as a whole, it correctly tokenizes and word2vec accepts that input and learns/outputs words correctly in its vocabulary.
When I tokenize the dataset by first batching the dataframe and iterating through it, the resulting token column is not what word2vec prefers; although word2vec trains its model on the data gathered for over 4 hours, the resulting vocabulary it has learnt consists of single characters in several encodings, as well as emojis - not words.

To troubleshoot this, I created a tiny subset of my data and tried to perform the tokenization on that data in two different ways:

Knowing that my computer can handle performing the action on the dataset, I simply did:

...

ANSWER

Answered 2021-May-27 at 18:28

First & foremost, beyond a certain size of data, & especially when working with raw text or tokenized text, you probably don't want to be using Pandas dataframes for every interim result.

They add extra overhead & complication that isn't fully 'Pythonic'. This is particularly the case for:

Python list objects where each word is a separate string: once you've tokenized raw strings into this format, as for example to feed such texts to Gensim's Word2Vec model, trying to put those into Pandas just leads to confusing list-representation issues (as with your columns where the same text might be shown as either ['yessir', 'shit', 'is', 'real'] – which is a true Python list literal – or [yessir, shit, is, real] – which is some other mess likely to break if any tokens have challenging characters).
the raw word-vectors (or later, text-vectors): these are more compact & natural/efficient to work with in raw Numpy arrays than Dataframes

So, by all means, if Pandas helps for loading or other non-text fields, use it there. But then use more fundamental Python or Numpy datatypes for tokenized text & vectors - perhaps using some field (like a unique ID) in your Dataframe to correlate the two.

Especially for large text corpuses, it's more typical to get away from CSV and instead use large text files, with one text per newline-separated line, and each line pre-tokenized so that spaces can be fully trusted as token separators.

That is: even if your initial text data has more complicated punctuation-sensitive tokenization, or other preprocessing that combines/changes/splits other tokens, try to do that just once (especially if it involves costly regexes), writing the results to a single simple text file which then fits the simple rules: read one text per line, split each line only by spaces.

Lots of algorithms, like Gensim's Word2Vec or FastText, can either stream such files directly or via very low-overhead iterable-wrappers - so the text is never completely in memory, only read as needed, repeatedly, for multiple training iterations.
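A minimal sketch of such a low-overhead iterable wrapper, using only the standard library. The class name `LineCorpus` and the helper `write_demo_corpus` are hypothetical names chosen for this illustration. Because `__iter__` reopens the file each time, the object can be iterated repeatedly (once per training pass) while holding only one line in memory at a time.

```python
import os
import tempfile


class LineCorpus:
    """Restartable iterable: yields one list of tokens per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated many times.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()  # the file is pre-tokenized: split on whitespace only
                if tokens:             # skip empty lines
                    yield tokens


def write_demo_corpus():
    """Write a tiny two-line corpus to a temporary file and return its path."""
    handle = tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    )
    with handle:
        handle.write("the quick brown fox\n")
        handle.write("jumps over the lazy dog\n")
    return handle.name


if __name__ == "__main__":
    path = write_demo_corpus()
    corpus = LineCorpus(path)
    for epoch in range(2):  # two passes over the same object
        for tokens in corpus:
            print(epoch, tokens)
    os.remove(path)
```

A plain generator function would be exhausted after the first pass, which is why the sketch uses a class with `__iter__`. Gensim ships a similar helper, `gensim.models.word2vec.LineSentence`, for files in this one-text-per-line format.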

For more details on this efficient way to work with large bodies of text, see this article: https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/

Source https://stackoverflow.com/questions/67718791

QUESTION

How to access to FastText classifier pipeline?

Asked 2021-Jun-06 at 16:30

As we know, Facebook's FastText is a great open-source, free, lightweight library which can be used for text classification. But the problem here is that the pipeline seems to be an end-to-end black box. Yes, we can change the hyper-parameters from these options for setting the training configuration. But I couldn't find a way to access the vector embeddings it generates internally.

Actually, I want to do some manipulation on the vector embeddings, like introducing tf-idf weighting on top of these word2vec representations. Another thing I want to do is oversampling using SMOTE, which requires a numerical representation. For these reasons I need to insert my custom code into the overall pipeline, which seems to be inaccessible to me. How can I introduce custom steps in this pipeline?

...

ANSWER

Answered 2021-Jun-06 at 16:30

The full source code is available:

https://github.com/facebookresearch/fastText

So, you can make any changes or extensions you can imagine - if you're comfortable reading & modifying its C++ source code. Nothing is hidden or inaccessible.

Note that both FastText, and its supervised classification mode, are chiefly conventions for training a shallow neural-network. It may not be helpful to think of it as a "pipeline" like in the architecture of other classifier libraries - as none of the internal interfaces use that sort of language or modular layout.

Specifically, if you get the gist of word2vec training, FastText classifier mode really just replaces attempted-predictions of neighboring (in-context-window) vocabulary words, with attempted-predictions of known labels instead.

For the sake of understanding FastText's relationship to other techniques, and potential aspects for further extension, I think it's useful to also review:

this skeptical blog post comparing FastText to the much-earlier 'vowpal wabbit' tool: "Fast & easy baseline text categorization with vw"
Facebook's far-less discussed extension of such vector-training for more generic categorical or numerical tasks, "StarSpace"

Source https://stackoverflow.com/questions/67857840

QUESTION

Understanding true_classes in log_uniform_candidate_sampler

Asked 2021-Jun-06 at 05:33

https://www.tensorflow.org/tutorials/text/word2vec uses tf.random.log_uniform_candidate_sampler for negative sampling.

The tutorial sets true_classes to context_class.

My experiment shows no matter what I set for true_classes, the function always yields good results.

...

ANSWER

Answered 2021-Jun-06 at 05:33

The line in the tutorial:

You can call the function on one skip-grams's target word and pass the context word as a true class to exclude it from being sampled

That's misleading.

What does true_classes mean in this function?

The function returns true_expected_count, which is defined in this line of the source code.

true_classes seems only used to calculate true_expected_count. So this function does not exclude negative classes. Every label has a probability to get sampled.
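As a rough illustration of why every label can be sampled, the sketch below computes the base distribution that the TensorFlow documentation gives for this sampler, P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1). It does not call TensorFlow and is not the example from the linked issue; the function name `log_uniform_probability` is a hypothetical helper for this illustration. The formula has no term for true_classes, so every class in [0, range_max) has a non-zero probability.

```python
import math


def log_uniform_probability(class_id, range_max):
    """Documented log-uniform (Zipfian) base probability for one class id."""
    if not 0 <= class_id < range_max:
        raise ValueError("class_id must be in [0, range_max)")
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)


if __name__ == "__main__":
    range_max = 8
    probabilities = [log_uniform_probability(c, range_max) for c in range(range_max)]
    for class_id, p in enumerate(probabilities):
        print(class_id, round(p, 4))
    print("total:", round(sum(probabilities), 6))
```

Lower class ids are more likely, which is why the sampler assumes classes are sorted by decreasing frequency.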

I copy an example code that can be experimented on (in case something happens to the link), taken from this GitHub issue:

Source https://stackoverflow.com/questions/67847893

QUESTION

I applied W2V on ML Algorithms. It gives error of negative value for NB and gives 0.48 accuracy for all the other algorithms. How come?

Asked 2021-Jun-02 at 16:43

from gensim.models import Word2Vec
import time
# Skip-gram model (sg = 1)
size = 1000
window = 3
min_count = 1
workers = 3
sg = 1

word2vec_model_file = 'word2vec_' + str(size) + '.model'
start_time = time.time()
stemmed_tokens = pd.Series(df['STEMMED_TOKENS']).values
# Train the Word2Vec Model
w2v_model = Word2Vec(stemmed_tokens, min_count = min_count, size = size, workers = workers, window = window, sg = sg)
print("Time taken to train word2vec model: " + str(time.time() - start_time))
w2v_model.save(word2vec_model_file)

...

ANSWER

Answered 2021-Jun-02 at 16:43

A vector size of 1000 dimensions is very uncommon, and would require massive amounts of data to train. For example, the famous GoogleNews vectors were for 3 million words, trained on something like 100 billion corpus words - and still only 300 dimensions. Your STEMMED_TOKENS may not be enough data to justify 100-dimensional vectors, much less 300 or 1000.

A choice of min_count=1 is a bad idea. This algorithm can't learn anything valuable from words that only appear a few times. Typically people get better results by discarding rare words entirely, as the default min_count=5 will do. (If you have a lot of data, you're likely to increase this value to discard even more words.)

Are you examining the model's size or word-to-word results at all to ensure it's doing what you expect? Despite your column being named STEMMED_TOKENS, I don't see any actual splitting-into-tokens, and the Word2Vec class expects each text to be a list-of-strings, not a string.
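The list-of-strings point and the min_count point can both be seen in a small standard-library sketch. The helper `build_vocab` below is a hypothetical stand-in that only mimics vocabulary counting. It is not Gensim's code. Iterating over a plain string yields single characters, so untokenized texts produce a vocabulary of characters rather than words.

```python
from collections import Counter


def build_vocab(texts, min_count=1):
    """Count every item yielded by iterating each text; keep items seen at least min_count times."""
    counts = Counter(token for text in texts for token in text)
    return {token for token, count in counts.items() if count >= min_count}


raw_texts = ["the cat sat", "the dog sat"]               # each text is one string
tokenized_texts = [text.split() for text in raw_texts]   # each text is a list of strings

if __name__ == "__main__":
    print(sorted(build_vocab(raw_texts)))                    # single characters, including the space
    print(sorted(build_vocab(tokenized_texts)))              # actual words
    print(sorted(build_vocab(tokenized_texts, min_count=2))) # rare words discarded
```

Inspecting the learned vocabulary this way before training is a cheap check that the corpus has the shape the model expects.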

Finally, without seeing all your other choices for feeding word-vector-enriched data to your other classification steps, it is possible (likely even) that there are other errors there.

Given that a binary-classification model can always get at least 50% accuracy by simply classifying every example with whichever class is more common, any accuracy result less than 50% should immediately cause suspicions of major problems in your process like:

misalignment of examples & labels
insufficient/unrepresentative training data
some steps not running at all due to data-prep or invocation errors
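The baseline argument above can be checked with a few lines of standard-library Python. The function name `majority_baseline_accuracy` and the label lists are made up for this illustration. It computes the accuracy of always predicting the most common class, which for two classes is never below 0.5.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by predicting the most frequent label for every example."""
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


if __name__ == "__main__":
    print(majority_baseline_accuracy([1, 1, 1, 0, 0]))  # imbalanced binary labels
    print(majority_baseline_accuracy([0, 1, 0, 1]))     # perfectly balanced binary labels
```

Comparing a model's accuracy on the same examples against this number is a quick sanity check before tuning anything else.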

Source https://stackoverflow.com/questions/67801844

QUESTION

Does adding a list of Word2Vec embeddings give a meaningful representation?

Asked 2021-Jun-01 at 17:03

I'm using a pre-trained word2vec model (word2vec-google-news-300) to get the embeddings for a given list of words. Please note that this is NOT a list of words that we get after tokenizing a sentence, it is just a list of words that describe a given image.

Now I'd like to get a single vector representation for the entire list. Does adding all the individual word embeddings make sense? Or should I consider averaging? Also, I would like the vector to be of a constant size so concatenating the embeddings is not an option.

It would be really helpful if someone can explain the intuition behind considering either one of the above approaches.

...

ANSWER

Answered 2021-Jun-01 at 17:03

Averaging is most typical, when someone is looking for a super-simple way to turn a bag-of-words into a single fixed-length vector.

You could try a simple sum, as well.

But note that the key difference between the sum and average is that the average divides by the number of input vectors. Thus they both result in a vector that's pointing in the exact same 'direction', just of different magnitude. And, the most-often-used way of comparing such vectors, cosine-similarity, is oblivious to magnitudes. So for a lot of cosine-similarity-based ways of later comparing the vectors, sum-vs-average will give identical results.

On the other hand, if you're comparing the vectors in other ways, like via euclidean-distances, or feeding them into other classifiers, sum-vs-average could make a difference.
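A small standard-library sketch of the comparison above, with made-up 3-dimensional vectors and hypothetical helper names. The sum and the average point in the same direction, so their cosine similarity to any query vector matches, while their euclidean distances to that query differ.

```python
import math


def vector_sum(vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def vector_mean(vectors):
    """Element-wise average: the sum divided by the number of vectors."""
    count = len(vectors)
    return [value / count for value in vector_sum(vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    word_vectors = [[1.0, 2.0, 3.0], [4.0, 0.0, -1.0], [0.5, 0.5, 2.0]]  # made-up embeddings
    query = [1.0, 1.0, 1.0]
    total, mean = vector_sum(word_vectors), vector_mean(word_vectors)
    print(cosine_similarity(total, query), cosine_similarity(mean, query))    # match up to rounding
    print(euclidean_distance(total, query), euclidean_distance(mean, query))  # differ
```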

Similarly, some might try unit-length-normalizing all vectors before use in any comparisons. After such a pre-use normalization, then:

euclidean-distance (smallest to largest) & cosine-similarity (largest-to-smallest) will generate identical lists of nearest-neighbors
average-vs-sum will result in different ending directions - as the unit-normalization will have upped some vectors' magnitudes, and lowered others, changing their relative contributions to the average.

What should you do? There's no universally right answer - depending on your dataset & goals, & the ways your downstream steps use the vectors, different choices might offer slight advantages in whatever final quality/desirability evaluation you perform. So it's common to try a few different permutations, along with varying other parameters.

Separately:

The GoogleNews vectors were trained on news articles back around 2013; their word senses thus may not be optimal for an image-labeling task. If you have enough of your own data, or can collect it, training your own word-vectors might result in better results. (Both the use of domain-specific data, & the ability to tune training parameters based on your own evaluations, could offer benefits - especially when your domain is unique, or the tokens aren't typical natural-language sentences.)
There are other ways to create a single summary vector for a run-of-tokens, not just arithmetical-combo-of-word-vectors. One that's a small variation on the word2vec algorithm often goes by the name Doc2Vec (or 'Paragraph Vector') - it may also be worth exploring.
There are also ways to compare bags-of-tokens, leveraging word-vectors, that don't first collapse the bag-of-tokens to a single fixed-length vector - and while they're more expensive to calculate, they sometimes offer better pairwise similarity/distance results than simple cosine-similarity. One such alternate comparison is called "Word Mover's Distance" - at some point, you may want to try that as well.

Source https://stackoverflow.com/questions/67788151

QUESTION

How did online training work in the Word2vec model using Gensim

Asked 2021-May-27 at 17:40

Using the Gensim library, we can load a model and update its vocabulary when new sentences are added. That means that if you save the model, you can continue training it later. I checked this with sample data. Say I have a word in my vocabulary that was previously trained (e.g. "women"). Then I get new sentences and call model.build_vocab(new_sentence, update=True) and model.train(new_sentence), so the model is updated. In new_sentence, some words already exist in the previous vocabulary ("women") and some are new ("girl"). After updating the vocabulary, I have both old and new words in the corpus. I checked model.wv['women']: the vector changes after the update and the training on the new sentences. I also get a word embedding vector for the new word, i.e. model.wv['girl']. The vectors of all other words that were previously trained and do not appear in new_sentence did not change.

...

ANSWER

Answered 2021-May-27 at 17:40

When you perform a new call to .train(), it only trains on the new data. So only words in the new data can possibly be updated.

And to the extent that the new data may be smaller, and more idiosyncratic in its word usages, any words in the new data will be trained to only be consistent with other words being trained in the new data. (Depending on the size of the new data, and the training parameters chosen like alpha & epochs, they might be pulled via the new examples arbitrarily far from their old locations - and thus start to lose comparability to words that were trained earlier.)

(Note also that when providing a different corpus than the original, you shouldn't use a parameter like total_examples=model.corpus_count, which reuses model.corpus_count, a value cached in the model from the earlier data. Rather, parameters should describe the current batch of data.)

Frankly, I'm not a fan of this feature. It's possible it could be useful to advanced users. But most people drawn to it are likely misusing it, expecting any number of tiny incremental updates to constantly expand & improve the model - when there's no good support for the idea that will reliably happen with naive use.

In fact, there's reasons to doubt such updates are generally a good idea. There's even an established term for the risk that incremental updates to a neural-network wreck its prior performance: catastrophic forgetting.

The straightforward & best-grounded approach to updating word-vectors for new expanded data is to re-train from scratch, so all words are on equal footing, and go through the same interleaved training, on the same unified optimization (SGD) schedule. (The new vectors at the end of such a process will not be in a coordinate space compatible with the old ones, but should be equivalently useful, or better if the data is now bigger and better.)

Source https://stackoverflow.com/questions/67697776

QUESTION

TypeError: 'Word2Vec' object is not subscriptable

Asked 2021-May-25 at 15:05

I am trying to build a Word2vec model, but when I try to reshape the vector for tokens, I get this error. Any ideas?

...

ANSWER

Answered 2021-May-25 at 15:05

As of Gensim 4.0 & higher, the Word2Vec model doesn't support subscripted-indexed access (the ['...']) to individual words. (Previous versions would display a deprecation warning, "Method will be removed in 4.0.0, use self.wv.__getitem__() instead", for such uses.)

So, when you want to access a specific word, do it via the Word2Vec model's .wv property, which holds just the word-vectors, instead. So, your (unshown) word_vector() function should have its line highlighted in the error stack changed to:

Source https://stackoverflow.com/questions/67687962

QUESTION

Inner workings of Gensim Word2Vec

Asked 2021-May-20 at 18:08

I have a couple of issues regarding Gensim in its Word2Vec model.

The first is: what happens if I set it to train for 0 epochs? Does it just create the random vectors and call it done? So they have to be random every time, correct?

The second concerns the WV object, about which the doc page says:

...

ANSWER

Answered 2021-May-20 at 18:08

I've not tried the nonsense parameter epochs=0, but it might behave as you expect. (Have you tried it and seen otherwise?)

However, if your real goal is to be able to tamper with the model after initialization, but before training, the usual way to do that is to not supply any corpus when constructing the model instance, and instead manually do the two followup steps, .build_vocab() & .train(), in your own code - inserting extra steps between the two. (For even finer-grained control, you can examine the source of .build_vocab() & its helper methods, and simply ensure you do all those necessary things, with your own extra steps interleaved.)

The "word vectors" in the .wv property of type KeyedVectors are essentially the "input projection layer" of the model: the data which converts a single word into a vector_size-dimensional dense embedding. (You can think of the keys – word token strings – as being somewhat like a one-hot word-encoding.)

So, assigning into that structure only changes that "input projection vector", which is the "word vector" usually collected from the model. If you need to tamper with the hidden-to-output weights, you need to look at the model's .syn1neg (or .syn1 for HS mode) property.

Source https://stackoverflow.com/questions/67609635

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install word2vec

You can download it from GitHub.
You can use word2vec like any standard Java library. Please include the jar files in your classpath. You can also use any IDE, and you can run and debug the word2vec component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.

Support

For any new features, suggestions, or bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.
