gensim-data | Data repository for pretrained NLP models and NLP corpora | Dataset library
kandi X-RAY | gensim-data Summary
Research datasets regularly disappear, change over time, become obsolete or come without a sane implementation to handle the data format reading and processing. For this reason, Gensim launched its own dataset storage, committed to long-term support, a sane standardized usage API and focused on datasets for unstructured text processing (no images or audio). This Gensim-data repository serves as that storage. There's no need for you to use this repository directly. Instead, simply install Gensim and use its download API (see the Quickstart below). It will "talk" to this repository automagically. When you use the Gensim download API, all data is stored in your ~/gensim-data home folder. Read more about the project rationale and design decisions in this article: New Download API for Pretrained NLP Models and Datasets.
Top functions reviewed by kandi - BETA
- Generate a table.
- Main entry point.
gensim-data Key Features
gensim-data Examples and Code Snippets
Community Discussions
Trending Discussions on gensim-data
QUESTION
I'm currently trying to do sentiment analysis on the IMDB review dataset as part of a homework assignment for my college. I'm required to first do some preprocessing, e.g. tokenization, stop-word removal, stemming, and lemmatization, then use different ways to convert this data to vectors to be classified by different classifiers. Gensim's FastText library was one of the required models for obtaining word embeddings on the data I got from the text pre-processing step.
The problem I faced with Gensim is that I first tried to train on my data using vectors of feature size 100, 200, and 300, but they always fail at some point. I later tried many of the pre-trained Gensim data vectors, but none of them managed to find word embeddings for all of the words; they too fail at some point with an error.
...ANSWER
Answered 2021-Dec-16 at 21:14
If you train your own word-vector model, then it will contain vectors for all the words you told it to learn. If a word that was in your training data doesn't appear to have a vector, it likely did not appear the required min_count number of times. (These models tend to improve if you discard rare words whose few example usages may not be suitably informative, so the default min_count=5 is a good idea.)
It's often reasonable for downstream tasks, like feature engineering using the text & set of word-vectors, to simply ignore words with no vector. That is, if some_rare_word in model.wv is False, just don't try to use that word – & its missing vector – for anything. So you don't necessarily need to find, or train, a set of word-vectors with every word you need. Just elide, rather than worry about, the rare missing words.
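As an illustration of that filtering step, here's a minimal sketch (assuming gensim 4.x; the toy corpus and helper name are purely illustrative, not from the question):

from gensim.models import Word2Vec

# Toy tokenized corpus; in practice this would be your preprocessed IMDB reviews.
texts = [["this", "movie", "was", "great"], ["terrible", "acting", "but", "great", "score"]]
model = Word2Vec(texts, vector_size=100, min_count=1)  # min_count=1 only so this toy corpus trains at all

def doc_vectors(tokens, wv):
    # Keep only tokens the model actually learned; silently elide the rest.
    return [wv[t] for t in tokens if t in wv]

vecs = doc_vectors(["great", "some_rare_word"], model.wv)
print(len(vecs))  # 1 -- the unknown token was simply skipped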
Separate observations:
- Stemming/lemmatization & stop-word removal aren't always worth the trouble, with all corpora/algorithms/goals. (And, stemming/lemmatization may wind up creating pseudowords that limit the model's interpretability & easy application to any texts that don't go through identical preprocessing.) So if those are required parts of the learning exercise, sure, get some experience using them. But don't assume they're necessarily helping, or worth the extra time/complexity, unless you verify that rigorously.
- FastText models will also be able to supply synthetic vectors for words that aren't known to the model, based on substrings. These are often pretty weak, but may be better than nothing – especially when they give vectors for typos, or rare inflected forms, similar to morphologically-related known words. (Since this deduced similarity, from many similarly-written tokens, provides some of the same value as stemming/lemmatization via a different path – one that requires the original variations to all be present during initial training – you'd especially want to pay attention to whether FastText & stemming/lemmatization mix well for your goals. A short sketch of this out-of-vocabulary behavior follows this list.) Beware, though: for very-short unknown words – for which the model learned no reusable substring vectors – FastText may still return an error or an all-zeros vector.
- FastText has a supervised classification mode, but it's not supported by Gensim. If you want to experiment with that, you'd need to use the Facebook FastText implementation. (You could still use a traditional, non-supervised FastText word vector model as a contributor of features for other possible representations.)
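As mentioned in the second bullet above, here's a minimal sketch of FastText's out-of-vocabulary behavior (assuming gensim 4.x; the toy corpus and the misspelled probe word are illustrative):

from gensim.models import FastText

texts = [["running", "runner", "runs"], ["walking", "walker", "walks"]]
model = FastText(texts, vector_size=32, min_count=1, min_n=3, max_n=5)

# "runninng" (a typo) was never seen in training, so it has no learned whole-word vector...
print("runninng" in model.wv.key_to_index)   # False
# ...but FastText can still synthesize a vector for it from shared character n-grams.
print(model.wv["runninng"].shape)            # (32,)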
QUESTION
I am using Word2Vec with a wiki-trained model that gives out the most similar words. I ran this before and it worked, but now it gives me this error even after rerunning the whole program. I tried to take off return_path=True, but I'm still getting the same error.
ANSWER
Answered 2021-Aug-06 at 18:44
You are probably looking for .wv.most_similar, so please try:
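The answer's original snippet isn't reproduced here, but a minimal sketch of the suggestion might look like this (the model name and file path below are illustrative, not the ones from the question):

import gensim.downloader as api
from gensim.models import Word2Vec

# If you trained or loaded a full Word2Vec model, the vectors (and most_similar) live on .wv:
model = Word2Vec.load("wiki_word2vec.model")        # hypothetical saved model file
print(model.wv.most_similar("computer", topn=5))

# If you used api.load() without return_path=True, the returned object is already a
# KeyedVectors instance, so most_similar can be called on it directly:
wv = api.load("glove-wiki-gigaword-100")
print(wv.most_similar("computer", topn=5))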
QUESTION
Usually, I can use the following code to download the word vector package in JupyterLab:
...ANSWER
Answered 2020-Nov-18 at 18:36
I would not use the gensim.downloader facility at all, given the extra complexity/hidden steps it introduces (which include what I consider an unnecessary security risk of downloading & running extra 'shim' Python code that's not in the normal Gensim release).
Instead, find the plain dataset you want, download it somewhere you can, then use whatever other method you have for transferring files to your firewalled Windows Server.
Specifically, the 50d GloVe vectors appear to be included as part of the glove.6B.zip download available on the canonical GloVe home page:
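Once that file is transferred, loading the vectors directly is straightforward; a minimal sketch, assuming gensim 4.x and that glove.6B.zip has been unzipped into the working directory (the local path is illustrative):

from gensim.models import KeyedVectors

# GloVe text files have no header line; no_header=True (gensim 4.0+) tells gensim to
# infer the vocabulary size and dimensionality from the file itself.
glove_path = "glove.6B.50d.txt"
wv = KeyedVectors.load_word2vec_format(glove_path, binary=False, no_header=True)
print(wv.most_similar("river", topn=3))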
QUESTION
I am trying to understand what is going wrong in the following example.
To train on the 'text8' dataset as described in the docs, one only has to do the following:
...ANSWER
Answered 2020-Jun-23 at 21:09
In your second example, you've created a training dataset with just a single text with the entire contents of the file. That's about 1.1 million word tokens, in a single list.
Word2Vec (& other related algorithms) in gensim have an internal implementation limitation, in their optimized paths, of 10,000 tokens per text item. All additional tokens are ignored.
So, in your 2nd case, 99% of your data is being discarded. Training may seem instant, but very little actual training will have occurred. (Word-vectors for words that only appear past the 1st 10,000 tokens won't have been trained at all, having only their initial randomly-set values.) If you enable logging at the INFO level, you'll see more details about each step of the process, and discrepancies like this may be easier to identify.
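If it helps, enabling that logging is a two-liner (standard Python logging setup, nothing gensim-specific):

import logging

# INFO-level logs surface gensim's per-step progress reports (corpus size, effective
# word count, training epochs), which makes silent truncation like this easier to spot.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)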
Yes, the api.load() variant takes extra steps to break the single-line file into 10,000-token chunks. I believe it's using the LineSentence utility class for this purpose, whose source can be examined in gensim/models/word2vec.py.
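For reference, a minimal sketch of using that utility directly (assuming gensim 4.x and a local, already-unzipped text8 file):

from gensim.models.word2vec import LineSentence

# LineSentence streams a whitespace-delimited text file, splitting any over-long line
# into chunks of at most max_sentence_length tokens (10,000 by default).
sentences = LineSentence("text8")
first = next(iter(sentences))
print(len(first))  # <= 10000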
However, I recommend avoiding the api.load() functionality entirely. It doesn't just download data; it also downloads a shim of additional outside-of-version-control Python code for prepping that data for extra operations. Such code is harder to browse & less well-reviewed than official gensim release code as packaged for PyPI/etc, which also presents a security risk. Each load target (by name like 'text8') might do something different, leaving you with a different object type as the return value.
It's much better for understanding to directly download precisely the data files you need, to known local paths, and do the IO/prep yourself, from those paths, so you know what steps have been applied, and the only code you're running is the officially versioned & released code.
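A minimal end-to-end sketch of that do-it-yourself approach (assuming you've downloaded and unzipped http://mattmahoney.net/dc/text8.zip to a local file named text8; the chunking helper is illustrative):

from gensim.models import Word2Vec

MAX_TOKENS = 10_000   # stay under gensim's per-text limit in the optimized training path

def iter_text8_chunks(path="text8"):
    # text8 is one long line, so read it and split it into <=10,000-token texts.
    with open(path) as f:
        tokens = f.read().split()
    for i in range(0, len(tokens), MAX_TOKENS):
        yield tokens[i:i + MAX_TOKENS]

sentences = list(iter_text8_chunks())
model = Word2Vec(sentences)              # train on the chunked corpus
print(model.wv.most_similar("car", topn=5))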
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install gensim-data
Python API

Example: load a pre-trained model (GloVe word vectors):

import gensim.downloader as api

info = api.info()                     # show info about available models/datasets
model = api.load("glove-twitter-25")  # download the model and return as object ready for use
model.most_similar("cat")
"""
output: [(u'dog', 0.9590819478034973), (u'monkey', 0.9203578233718872), (u'bear', 0.9143137335777283), (u'pet', 0.9108031392097473), (u'girl', 0.8880630135536194), (u'horse', 0.8872727155685425), (u'kitty', 0.8870542049407959), (u'puppy', 0.886769711971283), (u'hot', 0.8865255117416382), (u'lady', 0.8845518827438354)]
"""

Example: load a corpus and use it to train a Word2Vec model:

from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

corpus = api.load('text8')     # download the corpus and return it opened as an iterable
model = Word2Vec(corpus)       # train a model from the corpus
model.wv.most_similar("car")   # in gensim 4.x the trained vectors are queried via .wv
"""
output: [(u'driver', 0.8273754119873047), (u'motorcycle', 0.769528865814209), (u'cars', 0.7356342077255249), (u'truck', 0.7331641912460327), (u'taxi', 0.718338131904602), (u'vehicle', 0.7177008390426636), (u'racing', 0.6697118878364563), (u'automobile', 0.6657308340072632), (u'passenger', 0.6377975344657898), (u'glider', 0.6374964714050293)]
"""

Example: only download a dataset and return the local file path (no opening):

import gensim.downloader as api

print(api.load("20-newsgroups", return_path=True))
# output: /home/user/gensim-data/20-newsgroups/20-newsgroups.gz
print(api.load("glove-twitter-25", return_path=True))
# output: /home/user/gensim-data/glove-twitter-25/glove-twitter-25.gz
The same operations, but from the CLI (command-line interface):

python -m gensim.downloader --info                        # show info about available models/datasets
python -m gensim.downloader --download text8              # download text8 dataset to ~/gensim-data/text8
python -m gensim.downloader --download glove-twitter-25   # download model to ~/gensim-data/glove-twitter-25/