Word2Vector | Self complemented word embedding methods | Icon library
kandi X-RAY | Word2Vector Summary
kandi X-RAY | Word2Vector Summary
Self complemented word embedding methods using CBOW,skip-Gram,word2doc matrix , word2word matrix ,基于CBOW、skip-gram、词-文档矩阵、词-词矩阵四种方法的词向量生成.
Support
Quality
Security
License
Reuse
Top functions reviewed by kandi - BETA
- Test for similar words
- Train a wordvec
- Train a word2vec
- Save the final embedding
- Train the model
- Compute similarity of a word
- Build a Datasimap
- Build a dataset from a list of words
- Generate batch of data
- Generate batch
- Return a list of similar words for word embedding
- Train model embedding
- Build word2word word map
- The low dimension of the word embedding
- Build a dictionary of reserved words
- Builds a dictionary of wordtfids
- Build word frequency matrix
- Builds worddoc matrix
- Builds word2word matrix
Word2Vector Key Features
Word2Vector Examples and Code Snippets
Community Discussions
Trending Discussions on Word2Vector
QUESTION
Full Description
I am starting to work with word embedding and found a great amount of information about it. I understand, this far, that I can train my own word vectors or use previously trained ones, such as Google's or Wikipedia's, which are available for the English language and aren't useful to me, since I am working with texts in Brazilian Portuguese. Therefore, I went on a hunt for pre-trained word vectors in Portuguese and I ended up finding Hirosan's List of Pretrained Word Embeddings which led me to Kyubyong's WordVectors from which I learned about Rami Al-Rfou's Polyglot. After downloading both, I unsuccessfully have been trying to simply load the word vectors.
Short Description
I can't load pre-trained word vectors; I am trying WordVectors and Polyglot.
Downloads
- Kyubyong's pre-trained word2vector format word vectors for Portuguese;
- Polyglot's pre-trained word vectors for Portuguese;
Loading attempts
Kyubyong's WordVectors First attempt: using Gensim as suggested by Hirosan;
...ANSWER
Answered 2018-May-29 at 08:42For Kyubyong's pre-trained word2vector .bin file: it may have been saved using gensim's save function.
"load the model with load()
. Not load_word2vec_format
(that's for the C-tool compatibility)."
i.e., model = Word2Vec.load(fname)
Let me know if that works.
Reference : Gensim mailing list
QUESTION
I want to calculate the similarity between two sentences using word2vectors, I am trying to get the vectors of a sentence so that i can calculate the average of a sentence vectors to find the cosine similarity. i have tried this code but its not working. the output it gives the sentence-vectors with ones. i want the actual vectors of sentences in sentence_1_avg_vector & sentence_2_avg_vector.
Code:
...ANSWER
Answered 2017-Aug-24 at 21:01I think what you are trying to achieve is the following:
- Obtain vector representations from word2vec for every word in your sentence.
- Average all word vectors of a sentence to obtain a sentence representation.
- Compute cosine similarity between the vectors of two sentences.
While the code for 2 and 3 looks fine to me in general (haven't tested it though), the issue is probably in step 1. What you are doing in your code with
word2vec_model=gensim.models.Word2Vec(sentences, size=100, min_count=5)
is to initialize a new word2vec model. If you would then call word2vec_model.train()
, gensim would train a new model on your sentences so you can use the resulting vectors for each word afterwards. But, in order to obtain useful word vectors that capture things like similarity, you usually need to train the word2vec model on a lot of data - the model provided by Google was trained on 100 billion words.
What you probably want to do instead is to use a pretrained word2vec model and use it with gensim in your code. According to the documentation of gensim, this can be done with the KeyedVectors.load_word2vec_format
method.
QUESTION
private static final Word2Vec word2vectors = getWordVector();
private static Word2Vec getWordVector() {
String PATH;
try {
PATH = new ClassPathResource("models/word2vec_model").getFile().getAbsolutePath();
} catch (Exception e) {
e.printStackTrace();
return null;
}
log.warn("Loading model...");
return WordVectorSerializer.readWord2VecModel(new File(PATH));
}
ExecutorService pools = Executors.newFixedThreadPool(4);
long startTime = System.currentTimeMillis();
List> runnables = new ArrayList<>();
if (word2vectors != null) {
for (int i = 0; i < 3000; i++) {
MyRunnable runnable = new MyRunnable("beautiful", i);
runnables.add(pools.submit(runnable));
}
}
for(Future task: runnables){
try {
task.get();
}catch(InterruptedException ie){
ie.printStackTrace();
}catch(ExecutionException ee){
ee.printStackTrace();
}
}
pools.shutdown();
static class MyRunnable implements Runnable{
private String word;
private int count;
public MyRunnable(String word, int i){
this.word = word;
this.count = i;
}
@Override
public void run() {
Collection words = word2vectors.wordsNearest(word, 5);
log.info("Top 5 cloest words: " + words);
log.info(String.valueOf(count));
}
}
...ANSWER
Answered 2017-Oct-08 at 00:20Your code snippet will be thread-safe if the following conditions are met:
- The
word2vectors.wordsNearest(...)
call is thread-safe - The
word2vectors
data structure is created and initialized by the current thread. - Nothing changes the data structure between the pools.execute call and the computation finishing.
If wordsNearest
doesn't look at other data structures, and if it doesn't change the word2vectors
data structure, then it is a reasonable assumption that it is thread-safe. However, the only way to be sure is to analyse it.
But it is also worth noting that you are only submitting a single task to the executor. Since each task is work to be run by a single thread, your code effectively uses only one thread. To exploit multi-threading you need to split your single big task into multiple small and independent tasks; e.g. put the 10,000 repetitions loop outside of the execute call ...
QUESTION
In Word2Vector, the word embeddings are learned using co-occurrence and updating the vector's dimensions such that words that occur in each other's context come closer together.
My questions are the following:
1) If you already have a pre-trained set of embeddings, let's say a 100 dimensional space with 40k words, can you add 10 additional words onto this embedding space without changing the existing word embeddings. So you would only be updating the dimensions of the new words using the existing word embeddings. I'm thinking of this problem with respect to the "word 2 vector" algorithm, but if people have insights on how GLoVe embeddings work in this case, I am still very interested.
2) Part 2 of the question is; Can you then use the NEW word embeddings in a NN that was trained with the previous embedding set and expect reasonable results. For example, if I had trained a NN for sentiment analysis, and the word "nervous" was previously not in the vocabulary, then would "nervous" be correctly classified as "negative".
This is a question about how sensitive (or robust) NN are with respect to the embeddings. I'd appreciate any thoughts/insight/guidance.
...ANSWER
Answered 2017-Aug-04 at 17:35The initial training used info about known words to plot them in a useful N-dimensional space.
It is of course theoretically possible to then use new information, about new words, to also give them coordinates in the same space. You would want lots of varied examples of the new words being used together with the old words.
Whether you want to freeze the positions of old words, or let them also drift into new positions based on the new examples, could be an important choice to make. If you've already trained a pre-existing classifier (like a sentiment classifier) using the older words, and didn't want to re-train that classifier, you'd probably want to lock the old words in place, and force the new words into compatible positioning (even if the newer combined text examples would otherwise change the relative positions of older words).
Since after an effective train-up of the new words, they should generally be near similar-meaning older words, it would be reasonable to expect classifiers that worked on the old words to still do something useful on the new words. But how well that'd work would depend on lots of things, including how well the original word-set covered all the generalizable 'neighborhoods' of meaning. (If the new words bring in shades of meaning of which there were no examples in the old words, that area of the coordinate-space may be impoverished, and the classifier may have never had a good set of distinguishing examples, so performance could lag.)
QUESTION
This may sound like a naive question, but i am quite new on this. Let's say I use the Google pre-trained word2vector model (https://github.com/dav/word2vec) to train a classification model. I save my classification model. Now I load back the classification model into memory for testing new instances. Do I need to load the Google word2vector model again? Or is it only used for training my model?
...ANSWER
Answered 2017-Jun-13 at 22:48It depends on how your corpuses and test examples are structured and pre-processed.
You are probably using the pre-trained word-vectors to turn text into numerical features. At first, text examples are vectorized to train the classifier. Later, other (test/production) text examples will be vectorized in the same, and presented to get the classifier to get its judgements.
So you will need to use the same text-to-vectors process for test/production text examples as was used during training. Perhaps you've done that in a separate earlier bulk step, in which case you already have the features in the vector form the classifier uses. But often your classifier pipeline will itself take raw text, and vectorize it – in which case it will need the same pre-trained (word)->(vector) mappings available at test time as were available during training.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install Word2Vector
You can use Word2Vector like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.
Support
Reuse Trending Solutions
Find, review, and download reusable Libraries, Code Snippets, Cloud APIs from over 650 million Knowledge Items
Find more librariesStay Updated
Subscribe to our newsletter for trending solutions and developer bootcamps
Share this Page