fastText | Facebook fastText based QnA bot demo | Natural Language Processing library
kandi X-RAY | fastText Summary
Facebook fastText based QnA bot demo. fastText is a library for efficient learning of word representations and sentence classification.
Top functions reviewed by kandi - BETA
- Ask for a prediction
- Return a random utterance for a given prediction
- Predict the given query
- Example example
- Train the classifier
Community Discussions
Trending Discussions on fastText
QUESTION
I am trying to migrate from Google Cloud Composer composer-1.16.4-airflow-1.10.15 to composer-2.0.1-airflow-2.1.4. However, we are running into difficulties with the libraries: each time I upload them, the scheduler stops working.
Here is my requirements.txt:
...
ANSWER
Answered 2022-Mar-27 at 07:04
We found out what was happening. The root cause was the performance of the workers. To work properly, Composer expects scanning the DAGs to take less than 15% of the CPU resources. If scanning exceeds this limit, Composer fails to schedule or update the DAGs. We switched to bigger workers and it worked well.
QUESTION
I am curious to know if there are any implications of using a different source while calling build_vocab and train on a Gensim FastText model. Will this impact the contextual representation of the word embedding?
My intention is that there is a specific set of words I am interested in getting vector representations for, and when calling model.wv.most_similar I only want words defined in this vocab list to be returned, rather than all possible words in the training corpus. I would use the result to decide whether to group those words as relevant to each other based on a similarity threshold.
The following is the code snippet I am using; I would appreciate your thoughts on any concerns or implications of this approach.
- vocab.txt contains a list of unique words of interest
- corpus.txt contains full conversation text (i.e. chat messages) where each line represents a paragraph/sentence per chat
A follow-up question: what values should I set for total_examples & total_words during training in this case?
ANSWER
Answered 2022-Mar-07 at 22:50
In case someone has a similar question, I'll paste the reply I got when asking this question in the Gensim Discussion Group, for reference:
You can try it, but I wouldn't expect it to work well for most purposes.
The build_vocab() call establishes the known vocabulary of the model, & caches some stats about the corpus.
If you then supply another corpus – & especially one with more words – then:
- You'll want your train() parameters to reflect the actual size of your training corpus. You'll want to provide true total_examples and total_words counts that are accurate for the training corpus.
- Every word in the training corpus that's not in the known vocabulary is ignored completely, as if it wasn't even there. So you might as well filter your corpus down to just the words-of-interest first, then use that same filtered corpus for both steps.
Will the example texts still make sense? Will that be enough data to train meaningful, generalizable word-vectors for just the words-of-interest, alongside other words-of-interest, without the full texts? (You could look at your pre-filtered corpus to get a sense of that.) I'm not sure - it could depend on how severely trimming to just the words-of-interest changed the corpus.
In particular, to train high-dimensional dense vectors – as with vector_size=300 – you need a lot of varied data. Such pre-trimming might thin the corpus so much as to make the word-vectors for the words-of-interest far less useful.
You could certainly try it both ways – pre-filtered to just your words-of-interest, or with the full original corpus – and see which works better on downstream evaluations.
More generally, if the concern is training time with the full corpus, there are likely other ways to get an adequate model in an acceptable amount of time.
If using corpus_file mode, you can increase workers to equal the local CPU core count for a nearly-linear speedup from number of cores. (In traditional corpus_iterable mode, max throughput is usually somewhere in the 6-12 workers threads, as long as you have that many cores.)
min_count=1 is usually a bad idea for these algorithms: they tend to train faster, in less memory, leaving better vectors for the remaining words when you discard the lowest-frequency words, as the default min_count=5 does. (It's possible FastText can eke a little bit of benefit out of lower-frequency words via their contribution to character-n-gram-training, but I'd only ever lower the default min_count if I could confirm it was actually improving relevant results.)
If your corpus is so large that training time is a concern, often a more-aggressive (smaller) sample parameter value not only speeds training (by dropping many redundant high-frequency words), but often improves final word-vector quality for downstream purposes as well (by letting the rarer words have relatively more influence on the model in the absence of the downsampled words).
And again, if the corpus is so large that training time is a concern, then epochs=100 is likely overkill. I believe the GoogleNews vectors were trained using only 3 passes – over a gigantic corpus. A sufficiently large & varied corpus, with plenty of examples of all words all throughout, could potentially train in 1 pass – because each word-vector can then get more total training-updates than many epochs with a small corpus. (In general larger epochs values are more often used when the corpus is thin, to eke out something – not on a corpus so large you're considering non-standard shortcuts to speed the steps.)
-- Gordon
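The advice above (filter first, then use the same corpus for both steps, with counts taken from that corpus) could be sketched roughly as follows. This is a minimal illustration, not the poster's code: the helper names filter_corpus, corpus_stats and train_on are hypothetical, whitespace tokenization is assumed, and the Gensim calls assume the Gensim 4 parameter names.

```python
def filter_corpus(lines, vocab):
    """Keep only the words of interest in each whitespace-tokenized line."""
    kept = []
    for line in lines:
        tokens = [tok for tok in line.split() if tok in vocab]
        if tokens:  # drop lines left empty by the filter
            kept.append(tokens)
    return kept


def corpus_stats(sentences):
    """Return (total_examples, total_words) for a list of token lists."""
    return len(sentences), sum(len(sentence) for sentence in sentences)


def train_on(sentences, **params):
    """Build the vocabulary and train on the SAME corpus (requires gensim)."""
    from gensim.models import FastText  # third-party, imported lazily

    model = FastText(**params)
    model.build_vocab(corpus_iterable=sentences)
    total_examples, total_words = corpus_stats(sentences)
    model.train(
        corpus_iterable=sentences,
        total_examples=total_examples,  # counts match the training corpus
        total_words=total_words,
        epochs=model.epochs,
    )
    return model
```

Whether the filtered corpus still carries enough context to train useful vectors is exactly the open question raised in the answer, so comparing against a model trained on the full corpus remains advisable.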
QUESTION
I am using azureml sdk in Azure Databricks.
When I write the inference script (%%writefile script.py) in a Databricks cell, I try to load a .bin file that I uploaded to Azure Machine Learning Datasets.
I would like to do this in the script.py:
...
ANSWER
Answered 2022-Feb-24 at 17:26
You can use your model name with the Model.get_model_path() method to retrieve the path of the model file or files on the local file system. If you register a folder or a collection of files, this API returns the path of the directory that contains those files.
For more information, see: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-advanced-entry-script#azureml_model_dir
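As a rough sketch of the approach described at that link, an entry script can build the model path from the AZUREML_MODEL_DIR environment variable that the deployment sets. The helper resolve_model_path and the file name "model.bin" are hypothetical, and the layout assumes a single registered model file.

```python
import os


def resolve_model_path(filename, env=os.environ):
    """Join AZUREML_MODEL_DIR with the registered model's file name."""
    model_dir = env.get("AZUREML_MODEL_DIR")
    if model_dir is None:
        raise RuntimeError("AZUREML_MODEL_DIR is not set")
    return os.path.join(model_dir, filename)


def init():
    """Entry-script hook: load the model once when the service starts."""
    global model
    import fasttext  # third-party, imported lazily

    # "model.bin" is a placeholder for the registered file name.
    model = fasttext.load_model(resolve_model_path("model.bin"))
```

Passing the environment as a parameter only makes the path logic easy to check outside a deployment; inside the service the default os.environ is used.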
QUESTION
I have a .bin file in a blob in Azure Blob Storage.
I would like to pass it to fasttext so that I can call one of its methods.
I tried this:
...
ANSWER
Answered 2022-Feb-23 at 11:19
Please check whether the following helps as a workaround:
As per the error, arg0 (the first positional argument) should be a str (string). fasttext.load_model accepts a string or char as its first argument; the second argument, the utf-8 encoding, is optional.
fasttext.load_model(str) appears to load files only from the local filesystem. Try copying the data to the local filesystem and loading it from there (e.g. check this reference from Stack Overflow): download the blob to a file and load that file.
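The "download to a file, then load that file" workaround might look like the sketch below. The helper load_via_local_file is hypothetical; it takes the raw bytes and any loader that expects a local path string, so it does not depend on a particular storage client.

```python
import os
import tempfile


def load_via_local_file(data, loader, suffix=".bin"):
    """Write bytes to a temporary local file, then call loader(path)."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        return loader(path)  # path is a plain str on the local filesystem
    finally:
        os.remove(path)  # the loader is assumed to have read everything
```

With the azure-storage-blob v12 client, the bytes could presumably come from blob_client.download_blob().readall() and the loader would be fasttext.load_model; both are assumptions about the asker's setup, not something shown in the question. For very large models, streaming the blob to disk instead of holding it in memory would be the safer variant.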
QUESTION
I am hosting a pretrained fasttext model on S3 (uncompressed) and I am trying to load it in a Lambda function. I am using the gensim.models.fasttext module to load the model:
ANSWER
Answered 2022-Jan-27 at 17:46
Unfortunately, the np.fromfile() method on which this load depends doesn't work on a streamed-from-S3 file.
Some alternate options include:
- download the S3 file to a local path first, then use load_facebook_vectors() from there; or…
- while having the FastText file local, load it locally, then use Python's pickle functionality to save it to a single file (now of Python's format), then put that file on S3, and in the future re-load it using Python's unpickling
The utility functions pickle() and unpickle() in gensim.utils (which take a file path, including S3 URLs) may be helpful for the 2nd option, e.g.:
https://radimrehurek.com/gensim/utils.html#gensim.utils.unpickle
Since your prior code only shows using the vectors (via .load_facebook_vectors), not the whole model, you could just pickle & upload the model.wv subcomponent of the loaded model, rather than the whole model, to save some storage/bandwidth.
If perhaps in future Gensim versions, the FastText-model related classes change in shape/operation, an old pickled-model might not cleanly load. In such an eventuality, you could potentially either:
- go back to the original Facebook-format model file (which could then be loaded, & then re-saved in a modern format, again); OR...
- load your pickled model into the older Gensim where it works, save it locally using Gensim's native .save() (which may split it over multiple local files), then in the newer Gensim use Gensim's native FastText.load() to load those older files (which will usually handle older formats), then re-pickle that loaded model, for future re-unpickles into the matching latest Gensim.
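The second option (load locally once, pickle, later unpickle) could be sketched as below. The two wrapper function names are hypothetical; load_facebook_vectors, gensim.utils.pickle and gensim.utils.unpickle are the Gensim functions named in the answer, imported lazily because Gensim is a third-party dependency.

```python
def convert_facebook_vectors_to_pickle(local_fasttext_path, pickle_target):
    """One-time step: load the Facebook-format file locally, then pickle it."""
    from gensim.models.fasttext import load_facebook_vectors
    from gensim.utils import pickle as gensim_pickle

    vectors = load_facebook_vectors(local_fasttext_path)
    # pickle_target may be a local path or, per the answer, an S3 URL
    gensim_pickle(vectors, pickle_target)


def load_pickled_vectors(pickle_source):
    """Later step (e.g. inside the Lambda): restore the pickled vectors."""
    from gensim.utils import unpickle

    return unpickle(pickle_source)
```

Reading or writing an S3 URL this way assumes the S3 support of the underlying file-opening library is installed and that credentials are available in the environment.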
QUESTION
I want to train unsupervised fasttext for word representation. To do this, I installed fasttext from the official website, read the word representation page, and used model = fasttext.train_unsupervised(), but it only shows me the avg.loss.
My question is: how do I know whether my fasttext model is trained well on my dataset, or whether it is not and I must change the hyperparameters?
I want to use fasttext in my embedding layer for text generation. I need a method or some tips to evaluate a fasttext model that was trained unsupervised.
ANSWER
Answered 2022-Jan-11 at 20:25
There's no one 'best' set of word-vectors: it always depends on your data & downstream goals.
The 'loss' that's optimized, & reported, during FastText training is for the model's internal word-to-nearby-word goal. It is only a guide, via its overall trend & eventual inability to improve further, as to whether more of that kind of training can improve on that internal goal. It is not the case that a model that can reach a lower loss has better metaparameters, or is better at any real downstream tasks.
So: if reported loss was still noticeably decreasing from epoch to epoch when training stopped, it may be worth trying a longer run of more iterations, with all the other data/parameters the same, that instead reaches a point of no-further improvement ('convergence' of the underlying optimization). But don't use FastText-training loss to choose between models with different other metaparameters.
For that, you should use some other repeatable quantitative evaluation of the final word-vectors, ideally in a task as close as possible to your real usage. That is: really plug alternate versions of them into your next step, and review how well they work, & how different sets influence where the full system works better, or worse.
This might be very manual & ad hoc at first: running a set of familiar challenges, and merely 'eyeballing' whether one 'seems to' be giving more desirable answers or not. But to do well, & truly search all the possibilities for data preprocessing & model metaparameters, you'd ideally want to use some large, automated, & potentially growing-over-time set of probes you can score as 'better' or 'worse'.
The automated tests used in original word-vector papers are often based on some task like analogy-solving, or matching a human's native language reports of which words should be 'closer' to each other than another. Sometimes it makes sense to try to re-use those, as interim evaluations, but ultimately what makes a word-vector perform best on those may not always be what works in other tasks. (In particular, I've seen where word-vectors worse at analogies work noticeably better as inputs to a classifier.)
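A repeatable set of probes of the "which word should be closer" kind could be scored along these lines. This is only a sketch: the function names and the (anchor, closer, farther) probe format are made up for illustration, and the vectors are taken to be any mapping from word to a list of floats.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def probe_score(vectors, probes):
    """Fraction of (anchor, closer, farther) probes the vectors satisfy.

    Probes containing a word without a vector are skipped.
    """
    usable = hits = 0
    for anchor, closer, farther in probes:
        if not all(word in vectors for word in (anchor, closer, farther)):
            continue
        usable += 1
        near = cosine(vectors[anchor], vectors[closer])
        far = cosine(vectors[anchor], vectors[farther])
        if near > far:
            hits += 1
    return hits / usable if usable else 0.0
```

Running the same probe list against models trained with different metaparameters gives a comparable number for each, though, as the answer notes, a score on such probes is only an interim signal and the real downstream task remains the better judge.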
QUESTION
I would like to create a fasttext model for numbers. Is this a good approach?
Use Case:
I have a given set of about 100,000 integer invoice numbers. Our OCR sometimes creates false invoice numbers like 1000o00 or 383I338, so my idea was to use fasttext to predict the nearest invoice number based on my 100,000 integers. As the correct invoice numbers are known in advance, I trained a fasttext model with all invoice numbers to create a word-embedding space containing only invoice numbers.
But it is not working, and I don't know whether my idea is completely wrong. I would assume that even without sentences, embedding into a vector space should work, and therefore the model should find a similarity between 383I338 and 3831338.
Here some of my code:
...
ANSWER
Answered 2022-Jan-07 at 21:33
I doubt FastText is the right approach for this.
Unlike in natural-languages, where word roots/prefixes/suffixes (character n-grams) can be hints to meaning, most invoice number schemes are just incrementing numbers.
Every '###' or '####' is going to have a similar frequency. (Well, perhaps there'd be a little bit of a bias towards lower digits to the left, for Benford's Law-like reasons.) Unless the exact same invoice numbers repeat often throughout the corpus, so that the whole token, & its fragments, acquire a word-like meaning from surrounding other tokens, FastText's post-training nearest-neighbors are unlikely to offer any hints about correct numbers. (For it to have a chance to help, you'd want the same invoice-numbers to not just repeat many times, but for a lot of those appearances to have similar OCR errors - but I strongly suspect your corpus instead has invoice numbers only on individual texts.)
Is the real goal to correct the invoice-numbers, or just to have them be less-noisy in a model that's trained on a lot more meaningful, text-like tokens? (If the latter, it might be better just to discard anything that looks like an invoice number – with or without OCR glitches – or is similarly so rare it's likely an OCR scanno.)
That said, statistical & edit-distance methods could potentially help if the real need is correcting OCR errors - just not semantic-context-dependent methods like FastText. You might get useful ideas from Peter Norvig's classic writeup on "How to Write a Spelling Corrector".
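As one possible shape for such a statistical/edit-distance approach, the sketch below first undoes a few common letter-for-digit OCR confusions and then falls back to the closest known number by string similarity. The confusion table, the function name and the 0.8 cutoff are illustrative assumptions, and the standard library's difflib similarity ratio stands in for a true edit distance.

```python
import difflib

# Letters that OCR commonly confuses with digits (an illustrative guess).
OCR_DIGIT_FIXES = str.maketrans(
    {"o": "0", "O": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)


def correct_invoice_number(raw, known_numbers, cutoff=0.8):
    """Map an OCR'd token to the most plausible known invoice number.

    known_numbers is a list of strings; returns None if nothing is close.
    """
    candidate = raw.translate(OCR_DIGIT_FIXES)
    if candidate in known_numbers:
        return candidate
    matches = difflib.get_close_matches(candidate, known_numbers, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

For example, correct_invoice_number("383I338", ["3831338", "1000000"]) resolves through the confusion table alone. With around 100,000 known numbers the fallback scan is linear per query, so an index such as the candidate-generation scheme in Norvig's writeup would likely scale better.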
QUESTION
I use Python.NET for C# interaction with Python libraries. I am working on a text classification problem. I use FastText to index the texts and get their vectors, and Sklearn to train the classifier (Knn). During the implementation I encountered a lot of problems, and all were solved except one. After receiving the vectors of the texts on which I train Knn, I save them to a separate text file and then use it when necessary.
ANSWER
Answered 2021-Dec-10 at 21:59I solved this issue for a couple of days and each time I thought it was worth reading the documentation on python.net .
As a result, I found a solution and it turned out to be quite banal, it is necessary to represent X_vec
not as a float[]
, but as a List
QUESTION
I downloaded word embeddings from this link. I want to load them in Gensim to do some work, but I am not able to load them. I have tried many resources and none of them works. I am using Gensim version 4.1.
I have tried
...
ANSWER
Answered 2021-Dec-29 at 17:20
Per the NotImplementedError, those are the one kind of full Facebook FastText model, -supervised mode, that Gensim does not support.
So sadly, the answer to "How do you load these?" is "you don't".
The .vec files contain just the full-word vectors in a plain-text format – no subword info for synthesizing OOV vectors, or supervised-classification output features. Those can be loaded into a KeyedVectors model:
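The original snippet is not preserved here, so the following is only a sketch. The .vec layout is a header line with the word count and dimensionality, followed by one word and its values per line; parse_vec_lines is a hypothetical standard-library reader that illustrates that layout, and load_vec_with_gensim shows the Gensim call that reads the same word2vec-style text format.

```python
def parse_vec_lines(lines):
    """Parse .vec text lines into (dimensionality, {word: [floats]})."""
    iterator = iter(lines)
    _count, dim = (int(field) for field in next(iterator).split())
    vectors = {}
    for line in iterator:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return dim, vectors


def load_vec_with_gensim(path):
    """Load a .vec file as plain full-word vectors (requires gensim)."""
    from gensim.models import KeyedVectors  # third-party, imported lazily

    return KeyedVectors.load_word2vec_format(path, binary=False)
```

Vectors loaded this way cover only the words listed in the file; as the answer says, no subword information comes along for out-of-vocabulary words.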
QUESTION
I'm currently doing sentiment analysis on the IMDB review dataset as part of a homework assignment for my college. I'm required to first do some preprocessing (e.g. tokenization, stop-word removal, stemming, lemmatization), then use different ways to convert this data to vectors to be classified by different classifiers. The Gensim FastText library was one of the required models for obtaining word embeddings from the data I got from the text pre-processing step.
The problem I faced with Gensim is that I first tried to train on my data using vectors of feature size (100, 200, 300), but they always fail at some point. I later tried many pre-trained Gensim data vectors, but none of them could find word embeddings for all of the words; they fail at some point with the error
...
ANSWER
Answered 2021-Dec-16 at 21:14
If you train your own word-vector model, then it will contain vectors for all the words you told it to learn. If a word that was in your training data doesn't appear to have a vector, it likely did not appear the required min_count number of times. (These models tend to improve if you discard rare words whose few example usages may not be suitably-informative, so the default min_count=5 is a good idea.)
It's often reasonable for downstream tasks, like feature engineering using the text & set of word-vectors, to simply ignore words with no vector. That is, if some_rare_word in model.wv is False, just don't try to use that word – & its missing vector – for anything. So you don't necessarily need to find, or train, a set of word-vectors with every word you need. Just elide, rather than worry-about, the rare missing words.
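Skipping words without a vector when building a feature for a text might look like the sketch below. The function name is hypothetical, averaging is just one common way to combine word-vectors, and wv is taken to be any mapping that supports "in" and indexing, such as a plain dict or a Gensim KeyedVectors object.

```python
def average_known_vectors(tokens, wv):
    """Average the vectors of the tokens that have one; skip the rest.

    Returns None when no token has a vector.
    """
    known = [wv[token] for token in tokens if token in wv]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
```

With a FastText model, which can synthesize vectors for unseen words, a membership test against wv.key_to_index may be the stricter check for "was this full word actually learned"; that is an assumption about Gensim 4 behavior worth verifying on the installed version.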
Separate observations:
- Stemming/lemmatization & stop-word removal aren't always worth the trouble, with all corpora/algorithms/goals. (And, stemming/lemmatization may wind up creating pseudowords that limit the model's interpretability & easy application to any texts that don't go through identical preprocessing.) So if those are required parts of a learning exercise, sure, get some experience using them. But don't assume they're necessarily helping, or worth the extra time/complexity, unless you verify that rigorously.
- FastText models will also be able to supply synthetic vectors for words that aren't known to the model, based on substrings. These are often pretty weak, but may be better than nothing - especially when they give vectors for typos, or rare inflected forms, similar to morphologically-related known words. (Since this deduced similarity, from many similarly-written tokens, provides some of the same value as stemming/lemmatization via a different path that required the original variations to all be present during initial training, you'd especially want to pay attention to whether FastText & stemming/lemmatization mix well for your goals.) Beware, though: for very-short unknown words – for which the model learned no reusable substring vectors – FastText may still return an error or all-zeros vector.
- FastText has a supervised classification mode, but it's not supported by Gensim. If you want to experiment with that, you'd need to use the Facebook FastText implementation. (You could still use a traditional, non-supervised FastText word vector model as a contributor of features for other possible representations.)
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install fastText
You can use fastText like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.