fastText | Library for fast text representation and classification | Natural Language Processing library

by facebookresearch HTML Version: 0.9.3 License: MIT

kandi X-RAY | fastText Summary

fastText is a C++ library with Python bindings, typically used in Artificial Intelligence and Natural Language Processing applications. kandi's analysis reports no bugs and no vulnerabilities for fastText; it has a permissive license and medium support. You can download it from GitHub.

fastText is a library for efficient learning of word representations and sentence classification.

Support

fastText has a medium active ecosystem.

It has 24702 star(s) with 4612 fork(s). There are 851 watchers for this library.

It had no major release in the last 12 months.

There are 452 open issues and 608 have been closed. On average issues are closed in 145 days. There are 88 open pull requests and 0 closed requests.

It has a neutral sentiment in the developer community.

The latest version of fastText is 0.9.3

Quality

fastText has 0 bugs and 0 code smells.

Security

fastText has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

fastText code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

fastText is licensed under the MIT License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

fastText releases are available to install and integrate.

Installation instructions are not available. Examples and code snippets are available.

It has 24414 lines of code, 149 functions and 370 files.

It has high code complexity. Code complexity directly impacts maintainability of the code.

fastText Key Features

No Key Features are available at this moment for fastText.

fastText Examples and Code Snippets

FastText: TypeError: loadModel(): incompatible function arguments

Python

Lines of Code : 5

License : Strong Copyleft (CC BY-SA 4.0)

TypeError: loadModel(): incompatible function arguments. The following argument types are supported:
    1. (self: fasttext_pybind.fasttext, arg0: str) -> None

Invoked with: , WindowsPath('../models/cc.de.300.bin')
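
The signature in this error accepts only a str, while the call passed a WindowsPath. A minimal sketch of one way around it, assuming the model location is held as a pathlib.Path; the helper name as_str_path is hypothetical and not part of fastText:

```python
import os
from pathlib import Path


def as_str_path(path):
    """Convert a str or pathlib.Path into the plain str that str-only APIs expect."""
    return os.fspath(Path(path))


model_path = as_str_path(Path("..") / "models" / "cc.de.300.bin")
# With fastText installed, the call would then be: fasttext.load_model(model_path)
```

Calling str(path) directly has the same effect; the helper only makes the conversion explicit at the call site.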

Preparing large txt file for gensim FastText unsupervised model

Python

Lines of Code : 2

License : Strong Copyleft (CC BY-SA 4.0)

yield line.split()
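
A bare yield like this usually lives inside a restartable iterable, because a training library may pass over the corpus more than once (a vocabulary scan, then one pass per epoch). The following is a sketch under that assumption; the class name LineCorpus is hypothetical, and whitespace tokenization is assumed to be good enough for the data:

```python
class LineCorpus:
    """Restartable iterable that streams one whitespace-tokenized line at a time."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding=self.encoding) as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens
```

Because the file is read lazily, memory use stays roughly constant regardless of the size of the text file.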

Loading a FastText Model from s3 without Saving Locally

Python

Lines of Code : 12

License : Strong Copyleft (CC BY-SA 4.0)

import tempfile
import fasttext
import smart_open
from pathlib import Path

file = smart_open.smart_open(f's3://{bucket_name}/{key}')
listed = b''.join([i for i in file])
with tempfile.TemporaryDirectory() as tdir:
    tfile = Path(tdir).j

How can I use Ensemble learning of two models with different features as an input?

Python

Lines of Code : 23

License : Strong Copyleft (CC BY-SA 4.0)

from collections import Counter
clf1 = knn_model_1.fit(X1, y)
clf2 = knn_model_2.fit(X2, y)
clf3 = knn_model_3.fit(X3, y)

class MyVotingClassifier:
    def __init__(self, **models):
        self.models = models
    
    def predict(dict_X

Python

Lines of Code : 2

License : Strong Copyleft (CC BY-SA 4.0)

word_vecs.most_similar(positive=['honest'], negative=['dishonest'])

How to Find Top N Similar Words in a Dictionary of Words / Things?

Python

Lines of Code : 46

License : Strong Copyleft (CC BY-SA 4.0)

import fasttext
import fasttext.util  # explicit import so fasttext.util.download_model resolves
import numpy as np

# download English pretrained model
fasttext.util.download_model('en', if_exists='ignore')
ft = fasttext.load_model('cc.en.300.bin')

def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine

words not available in corpus for Word2Vec training

Python

Lines of Code : 6

License : Strong Copyleft (CC BY-SA 4.0)

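import numpy as np

word = data['word 1']
if word in model.wv:
    vec = model.wv[word]  # index the same KeyedVectors object used in the membership test
else:
    vec = np.zeros(100)  # fallback for unknown words; 100 is assumed to match the model's vector size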

convert dataframe to fasttext data format

Python

Lines of Code : 2

License : Strong Copyleft (CC BY-SA 4.0)

'__label__'+df['label']+' '+df['text']
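
The expression above builds lines in fastText's supervised training format, where each example starts with a label carrying the `__label__` prefix. A sketch of the same idea without pandas, assuming rows are plain dicts; the helper name to_fasttext_line is hypothetical. It also collapses newlines in the text, since each example is expected to sit on a single line:

```python
def to_fasttext_line(label, text, prefix="__label__"):
    """Build one training line in the '__label__<label> <text>' format."""
    # Collapse all whitespace (including newlines) so each example stays on one line.
    clean_text = " ".join(str(text).split())
    return f"{prefix}{label} {clean_text}"


rows = [
    {"label": "positive", "text": "great movie,\nwould watch again"},
    {"label": "negative", "text": "too long"},
]
lines = [to_fasttext_line(row["label"], row["text"]) for row in rows]
```

The resulting lines can be written to a text file, one per row, and that file passed to the trainer.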

FastAPI application serving a ML model has blocking code?

Python

Lines of Code : 6

License : Strong Copyleft (CC BY-SA 4.0)

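@app.get("/dummy")
async def dummy():
    time.sleep(5)  # blocking call inside a coroutine, so the event loop stalls

# Shell command to send three concurrent requests:
for _ in {1..3}; do curl http://127.0.0.1:8000/dummy & done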

How can i optimize my Embedding transformation on a huge dataset?

Python

Lines of Code : 3

License : Strong Copyleft (CC BY-SA 4.0)

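import numpy as np

def sum_vectors(phrase, model):
    # phrase is expected to be a list of tokens, so model.wv[phrase] has one row per token
    return np.sum(model.wv[phrase], axis=0)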

Community Discussions

Trending Discussions on fastText

The airflow scheduler stops working after updating pypi packages on google cloud composer 2.0.1

Can I use a different corpus for fasttext build_vocab than train in Gensim Fasttext?

How give azure machine learning dataset path in an inference script?

How use a .bin file in blob storage to be loaded to a model?

Loading fasttext binary model from s3 fails

By what criteria do we find out that fasttext unsupervised is well trained?

Fasttext model representations for numbers

Passing a vector from C# to Python

How to load pre trained FastText Word Embeddings using Gensim?

rare misspelled words messes my fastText/Word-Embedding Classfiers

QUESTION

The airflow scheduler stops working after updating pypi packages on google cloud composer 2.0.1

Asked 2022-Mar-27 at 07:04

I am trying to migrate from Google Cloud Composer composer-1.16.4-airflow-1.10.15 to composer-2.0.1-airflow-2.1.4. However, we are having difficulties with the libraries: each time I upload them, the scheduler stops working.

Here is my requirements.txt:

...

ANSWER

Answered 2022-Mar-27 at 07:04

We found out what was happening. The root cause was the performance of the workers. To work properly, Composer expects the scanning of the DAGs to take less than 15% of the CPU resources. If it exceeds this limit, it fails to schedule or update the DAGs. We switched to bigger workers and it worked well.

Source https://stackoverflow.com/questions/70684862

QUESTION

Can I use a different corpus for fasttext build_vocab than train in Gensim Fasttext?

Asked 2022-Mar-07 at 22:50

I am curious to know if there are any implications of using a different source when calling build_vocab and train on a Gensim FastText model. Will this impact the contextual representation of the word embedding?

My intention is that there is a specific set of words whose vector representations I am interested in, and when calling model.wv.most_similar I only want words defined in this vocab list to be returned, rather than all possible words in the training corpus. I would use the result to decide whether to group those words as relevant to each other, based on a similarity threshold.

Following is the code snippet that I am using. I would appreciate your thoughts on any concerns or implications with this approach.

vocab.txt contains a list of unique words of interest
corpus.txt contains full conversation text (i.e. chat messages) where each line represents a paragraph/sentence per chat

A follow-up question: what values should I set for total_examples & total_words during training in this case?

...

ANSWER

Answered 2022-Mar-07 at 22:50

In case someone has a similar question, I'll paste the reply I got when asking this question in the Gensim Discussion Group for reference:

You can try it, but I wouldn't expect it to work well for most purposes.

The build_vocab() call establishes the known vocabulary of the model, & caches some stats about the corpus.

If you then supply another corpus – & especially one with more words – then:

You'll want your train() parameters to reflect the actual size of your training corpus. You'll want to provide true total_examples and total_words counts that are accurate for the training corpus.

Every word in the training corpus that's not in the known vocabulary is ignored completely, as if it wasn't even there. So you might as well filter your corpus down to just the words-of-interest first, then use that same filtered corpus for both steps. Will the example texts still make sense? Will that be enough data to train meaningful, generalizable word-vectors for just the words-of-interest, alongside other words-of-interest, without the full texts? (You could look at your pre-filtered corpus to get a sense of that.) I'm not sure - it could depend on how severely trimming to just the words-of-interest changed the corpus. In particular, to train high-dimensional dense vectors – as with vector_size=300 – you need a lot of varied data. Such pre-trimming might thin the corpus so much as to make the word-vectors for the words-of-interest far less useful.

You could certainly try it both ways – pre-filtered to just your words-of-interest, or with the full original corpus – and see which works better on downstream evaluations.

More generally, if the concern is training time with the full corpus, there are likely other ways to get an adequate model in an acceptable amount of time.

If using corpus_file mode, you can increase workers to equal the local CPU core count for a nearly-linear speedup from number of cores. (In traditional corpus_iterable mode, max throughput is usually somewhere in the 6-12 worker threads, as long as you have that many cores.)

min_count=1 is usually a bad idea for these algorithms: they tend to train faster, in less memory, leaving better vectors for the remaining words when you discard the lowest-frequency words, as the default min_count=5 does. (It's possible FastText can eke a little bit of benefit out of lower-frequency words via their contribution to character-n-gram-training, but I'd only ever lower the default min_count if I could confirm it was actually improving relevant results.)

If your corpus is so large that training time is a concern, often a more-aggressive (smaller) sample parameter value not only speeds training (by dropping many redundant high-frequency words), but often improves final word-vector quality for downstream purposes as well (by letting the rarer words have relatively more influence on the model in the absence of the downsampled words).

And again if the corpus is so large that training time is a concern, then epochs=100 is likely overkill. I believe the GoogleNews vectors were trained using only 3 passes – over a gigantic corpus. A sufficiently large & varied corpus, with plenty of examples of all words all throughout, could potentially train in 1 pass – because each word-vector can then get more total training-updates than many epochs with a small corpus. (In general larger epochs values are more often used when the corpus is thin, to eke out something – not on a corpus so large you're considering non-standard shortcuts to speed the steps.)

-- Gordon
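
Two of the points in the reply, accurate size counts for train() and pre-filtering to the words-of-interest, can be sketched with the standard library alone. The function names corpus_stats and filter_to_vocab are hypothetical, the sample sentences are made up, and whitespace tokenization is assumed:

```python
def corpus_stats(sentences):
    """Return (total_examples, total_words) for an iterable of token lists."""
    total_examples = 0
    total_words = 0
    for tokens in sentences:
        total_examples += 1
        total_words += len(tokens)
    return total_examples, total_words


def filter_to_vocab(sentences, vocab):
    """Yield each sentence reduced to words in vocab, skipping emptied sentences."""
    vocab = set(vocab)
    for tokens in sentences:
        kept = [tok for tok in tokens if tok in vocab]
        if kept:
            yield kept


corpus = [line.split() for line in ["hello there friend", "hello again", "bye"]]
filtered = list(filter_to_vocab(corpus, {"hello", "friend"}))
examples, words = corpus_stats(filtered)
```

The two counts are the kind of values that would be passed as total_examples and total_words when the training corpus differs from the one used for build_vocab. Comparing the statistics before and after filtering also gives a quick sense of how much the pre-trimming thins the corpus.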

Source https://stackoverflow.com/questions/71289683

QUESTION

How give azure machine learning dataset path in an inference script?

Asked 2022-Feb-24 at 17:26

I am using azureml sdk in Azure Databricks.

When I write the script for inference model (%%writefile script.py) in a databricks cell, I try to load a .bin file that I loaded in Azure Machine Learning Datasets.

I would like to do this in the script.py:

...

ANSWER

Answered 2022-Feb-24 at 17:26

You can use your model name with the Model.get_model_path() method to retrieve the path of the model file or files on the local file system. If you register a folder or a collection of files, this API returns the path of the directory that contains those files.

For more information, see: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-advanced-entry-script#azureml_model_dir
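
The linked page describes the AZUREML_MODEL_DIR environment variable, which points at the model folder inside a deployment. A minimal sketch of resolving a file under it, assuming a single registered model file; the helper name resolve_model_path and the file name passed to it are hypothetical, and the exact sub-folder layout depends on how the model was registered:

```python
import os


def resolve_model_path(filename, default_dir="."):
    """Join the deployed model directory with a model file name."""
    # AZUREML_MODEL_DIR is set inside an Azure ML deployment; fall back to
    # default_dir when running the script locally.
    model_dir = os.getenv("AZUREML_MODEL_DIR", default_dir)
    return os.path.join(model_dir, filename)
```

In an entry script this would typically be called once in init(), with the returned string handed to the model loader.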

Source https://stackoverflow.com/questions/71024584

QUESTION

How use a .bin file in blob storage to be loaded to a model?

Asked 2022-Feb-23 at 11:19

I have a .bin file in a blob in Azure Blob Storage.

I would like to pass it to fasttext so that I can call one of its methods.

I tried this:

...

ANSWER

Answered 2022-Feb-23 at 11:19

Please check whether the following helps as a workaround:

As per the error, arg0 (the first positional argument) should be a str (string). fasttext.load_model expects the path to the model file as a string.

fasttext.load_model(str) appears to load files only from the local filesystem. Try copying the data to the local filesystem and loading it from there (see the referenced Stack Overflow answer): download the blob to a file, then load that file.
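
The "download to a file, then load that file" step can be sketched as follows. Both callables are stand-ins: fetch_bytes would wrap whatever storage client downloads the blob, and loader would be something like fasttext.load_model. The helper name load_via_local_file is hypothetical, and the sketch assumes the loader reads the whole file before returning:

```python
import os
import tempfile


def load_via_local_file(fetch_bytes, loader, suffix=".bin"):
    """Write remote bytes to a temporary local file, then load it by str path."""
    data = fetch_bytes()
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # The loader receives a plain str path on the local filesystem.
        return loader(local_path)
    finally:
        os.remove(local_path)
```

For large models, holding the whole blob in memory before writing is wasteful; streaming the download straight into the file would be the natural refinement.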

Source https://stackoverflow.com/questions/71010726

QUESTION

Loading fasttext binary model from s3 fails

Asked 2022-Jan-27 at 17:55

I am hosting a pretrained fasttext model on s3 (uncompressed) and I am trying to load it in a lambda function. I am using the gensim.models.fasttext module to load the model:

...

ANSWER

Answered 2022-Jan-27 at 17:46

Unfortunately, the np.fromfile() method on which this load depends doesn't work on a streamed-from-S3 file.

Some alternate options include:

download the S3 file to a local path first, then use load_facebook_vectors() from there; or…
while having the FastText file local, load it locally, then use Python's pickle functionality to save it to a single file (now of Python's format), then put that file on S3, and in the future re-load it using Python's unpickling

The utility functions in gensim.utils pickle() and unpickle() (which take a file path, including S3 URLs) may be helpful for the 2nd option, eg:

https://radimrehurek.com/gensim/utils.html#gensim.utils.unpickle

Since your prior code only shows using the vectors (via .load_facebook_vectors), not the whole model, you could just pickle & upload the model.wv subcomponent of the loaded model, rather than the whole model, to save some storage/bandwidth.
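
The pickle round trip described above can be sketched with the standard library. A plain dict stands in for the loaded vectors here (in the answer's setup that would be model.wv), and the file name is hypothetical:

```python
import os
import pickle
import tempfile

# Stand-in for the loaded vectors (in the answer's setup this would be model.wv).
vectors = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "vectors.pkl")
    # One-time step: serialize to a single file that can then be uploaded.
    with open(pickle_path, "wb") as handle:
        pickle.dump(vectors, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Later: read the single file back without touching the original .bin model.
    with open(pickle_path, "rb") as handle:
        restored = pickle.load(handle)
```

Unpickling executes code embedded in the file, so this route is only appropriate for files you produced yourself.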

If perhaps in future Gensim versions, the FastText-model related classes change in shape/operation, an old pickled-model might not cleanly load. In such an eventuality, you could potentially either:

go back to the original Facebook-format model file (which could then be loaded, & then re-saved in a modern format, again); OR...
load your pickled model into the older Gensim where it works, save it locally using Gensim's native .save() (which may split it over multiple local files), then in the newer Gensim use Gensim's native FastText.load() to load those older files (which will usually handle older formats), then re-pickle that loaded model, for future re-unpickles into the matching latest Gensim.

Source https://stackoverflow.com/questions/70881262

QUESTION

By what criteria do we find out that fasttext unsupervised is well trained?

Asked 2022-Jan-11 at 20:25

I want to train unsupervised fastText for word representations. To do this, I installed fastText from the official website, read the word representation page, and used model = fasttext.train_unsupervised(), but it only shows me the avg.loss. My question is: how do I know whether my fastText model is trained well on my dataset, or whether it is not and I must change the hyperparameters? I want to use fastText in my embedding layer for text generation. I need a method or some tips to evaluate my unsupervised fastText model.

...

ANSWER

Answered 2022-Jan-11 at 20:25

There's no one 'best' set of word-vectors: it always depends on your data & downstream goals.

The 'loss' that's optimized, & reported, during FastText training is for the model's internal word-to-nearby-word goal. It is only a guide, via its overall trend & eventual inability to improve further, as to whether more of that kind of training can improve on that internal goal. It is not the case that a model that can reach a lower loss has better metaparameters, or is better at any real downstream tasks.

So: if reported loss was still noticeably decreasing from epoch to epoch when training stopped, it may be worth trying a longer run of more iterations, with all the other data/parameters the same, that instead reaches a point of no-further improvement ('convergence' of the underlying optimization). But don't use FastText-training loss to choose between models with different other metaparameters.

For that, you should use some other repeatable quantitative evaluation of the final word-vectors, ideally in a task as close as possible to your real usage. That is: really plug alternate versions of them into your next step, and review how well they work, & how different sets influence where the full system works better, or worse.

This might be very manual & ad hoc at first: running a set of familiar challenges, and merely 'eyeballing' whether one 'seems to' be giving more desirable answers or not. But to do well, & truly search all the possibilities for data preprocessing & model metaparameters, you'd ideally want to use some large, automated, & potentially growing-over-time set of probes you can score as 'better' or 'worse'.

The automated tests used in original word-vector papers are often based on some task like analogy-solving, or matching a human's native language reports of which words should be 'closer' to each other than another. Sometimes it makes sense to try to re-use those, as interim evaluations, but ultimately what makes a word-vector perform best on those may not always be what works in other tasks. (In particular, I've seen where word-vectors worse at analogies work noticeably better as inputs to a classifier.)
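
One simple form such an automated probe set could take is a list of (anchor, closer, farther) triples, scored by how often the vectors rank the pair as expected. This is only a sketch of the idea, not a method from the answer; the function names and the toy vectors are invented for illustration:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def probe_score(vectors, probes):
    """Fraction of (anchor, closer, farther) probes that the vectors rank correctly."""
    hits = 0
    for anchor, closer, farther in probes:
        if cosine(vectors[anchor], vectors[closer]) > cosine(vectors[anchor], vectors[farther]):
            hits += 1
    return hits / len(probes)


# Toy vectors, made up purely for illustration.
toy = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0]}
score = probe_score(toy, [("cat", "dog", "car"), ("car", "cat", "dog")])
```

Running the same probe list against models trained with different parameters gives a repeatable number to compare, which can later be replaced by a score from the real downstream task.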

Source https://stackoverflow.com/questions/70671038

QUESTION

Fasttext model representations for numbers

Asked 2022-Jan-07 at 21:33

I would like to create a fasttext model for numbers. Is this a good approach?

Use Case:

I have a given set of about 100,000 integer invoice numbers. Our OCR sometimes creates false invoice numbers like 1000o00 or 383I338, so my idea was to use fastText to predict the nearest invoice number based on my 100,000 integers. As the correct invoice numbers are known in advance, I trained a fastText model with all invoice numbers to create a word-embedding space containing only invoice numbers.

But it is not working, and I don't know if my idea is completely wrong. I would assume that even without sentences, embedding into a vector space should work, and therefore the model should also find a similarity between 383I338 and 3831338.

Here is some of my code:

...

ANSWER

Answered 2022-Jan-07 at 21:33

I doubt FastText is the right approach for this.

Unlike in natural-languages, where word roots/prefixes/suffixes (character n-grams) can be hints to meaning, most invoice number schemes are just incrementing numbers.

Every '###' or '####' is going to have a similar frequency. (Well, perhaps there'd be a little bit of a bias towards lower digits to the left, for Benford's Law-like reasons.) Unless the exact same invoice numbers repeat often throughout the corpus, so that the whole token, & its fragments, acquire a word-like meaning from surrounding other tokens, FastText's post-training nearest-neighbors are unlikely to offer any hints about correct numbers. (For it to have a chance to help, you'd want the same invoice-numbers to not just repeat many times, but for a lot of those appearances to have similar OCR errors - but I strongly suspect your corpus instead has invoice numbers only on individual texts.)

Is the real goal to correct the invoice-numbers, or just to have them be less-noisy in a model that's trained on a lot more meaningful, text-like tokens? (If the latter, it might be better just to discard anything that looks like an invoice number – with or without OCR glitches – or is similarly so rare it's likely an OCR scanno.)

That said, statistical & edit-distance methods could potentially help if the real need is correcting OCR errors - just not semantic-context-dependent methods like FastText. You might get useful ideas from Peter Norvig's classic writeup on "How to Write a Spelling Corrector".
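
As a rough illustration of the edit-distance route, the sketch below computes a plain Levenshtein distance and picks the closest known invoice number. It is not Norvig's corrector; the function names are hypothetical, and the list of known numbers is made up around the examples from the question:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def nearest_known(candidate, known_numbers):
    """Return the known number with the smallest edit distance to candidate."""
    return min(known_numbers, key=lambda known: edit_distance(candidate, known))


known = ["3831338", "1000000", "4720015"]
best = nearest_known("383I338", known)
```

A linear scan like this compares the candidate against every known number, so for a large set it may be worth first mapping common OCR confusions (such as o for 0 or I for 1) and checking for an exact match before falling back to the distance search.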

Source https://stackoverflow.com/questions/70625591

QUESTION

Passing a vector from C# to Python

Asked 2022-Jan-07 at 01:26

I use Python.NET for C# interaction with Python libraries. I am working on a text classification problem. I use FastText to index and get the vectors, and Sklearn to train the classifier (KNN). During the implementation I encountered a lot of problems, and all were solved except one. After receiving the vectors of the texts on which I train KNN, I save them to a separate text file and then use it when necessary.

...

ANSWER

Answered 2021-Dec-10 at 21:59

I worked on this issue for a couple of days, and each time I thought it was worth rereading the Python.NET documentation. In the end I found a solution, and it turned out to be quite simple: X_vec must be represented not as a float[] but as a List

Source https://stackoverflow.com/questions/70310329

QUESTION

How to load pre trained FastText Word Embeddings using Gensim?

Asked 2021-Dec-29 at 17:20

I downloaded word embedding from this link. I want to load it in Gensim to do some work but I am not able to load it. I have found many resources and none of it is working. I am using Gensim version 4.1.

I have tried

...

ANSWER

Answered 2021-Dec-29 at 17:20

Per the NotImplementedError, those are the one kind of full Facebook FastText model, -supervised mode, that Gensim does not support.

So sadly, the answer to "How do you load these?" is "you don't".

The .vec files contain just the full-word vectors in a plain-text format – no subword info for synthesizing OOV vectors, or supervised-classification output features. Those can be loaded into a KeyedVectors model:
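
In Gensim the usual entry point for this is KeyedVectors.load_word2vec_format(path). To show what the plain-text layout contains, here is a standard-library sketch of a reader, assuming the common layout of a header line with the vector count and dimension followed by one word and its values per line; the function name read_vec and the sample data are hypothetical:

```python
import io


def read_vec(handle):
    """Parse word2vec-style text vectors: a 'count dim' header, then 'word v1 ... vN'."""
    count, dim = (int(field) for field in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or blank lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    if len(vectors) != count:
        raise ValueError(f"header promised {count} vectors, found {len(vectors)}")
    return vectors


sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
vectors = read_vec(sample)
```

Since only whole-word rows are stored, a lookup for a word that is not in the file simply fails; there is no subword information to fall back on.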

Source https://stackoverflow.com/questions/70522109

QUESTION

rare misspelled words messes my fastText/Word-Embedding Classfiers

Asked 2021-Dec-16 at 21:14

I'm currently trying to do sentiment analysis on the IMDB review dataset as part of a homework assignment for my college. I'm required to first do some preprocessing (e.g. tokenization, stop-word removal, stemming, lemmatization), then use different ways to convert this data to vectors to be classified by different classifiers. The Gensim FastText library was one of the required models for obtaining word embeddings from the data I got from the text pre-processing step.

The problem I faced with Gensim is that I first tried to train on my data using vectors of feature size (100, 200, 300), but they always fail at some point. I later tried many pre-trained Gensim data vectors, but none of them could find word embeddings for all of the words; they fail at some point with this error:

...

ANSWER

Answered 2021-Dec-16 at 21:14

If you train your own word-vector model, then it will contain vectors for all the words you told it to learn. If a word that was in your training data doesn't appear to have a vector, it likely did not appear the required min_count number of times. (These models tend to improve if you discard rare words whose few example usages may not be suitably-informative, so the default min_count=5 is a good idea.)

It's often reasonable for downstream tasks, like feature engineering using the text & set of word-vectors, to simply ignore words with no vector. That is, if some_rare_word in model.wv is False, just don't try to use that word – & its missing vector – for anything. So you don't necessarily need to find, or train, a set of word-vectors with every word you need. Just elide, rather than worry-about, the rare missing words.
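
Skipping words with no vector can be sketched as below. A plain dict stands in for the set of word-vectors, since both support membership tests and indexing; the function name and the toy values are invented for illustration:

```python
def average_known_vectors(tokens, word_vectors):
    """Average the vectors of tokens present in word_vectors; None if none are."""
    known = [word_vectors[tok] for tok in tokens if tok in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# A plain dict stands in for a set of word-vectors here.
toy_vectors = {"good": [1.0, 0.0], "film": [0.0, 1.0]}
features = average_known_vectors(["good", "flim", "film"], toy_vectors)
```

Returning None for a text with no known words leaves the decision to the caller, who might drop the example or substitute a zero vector.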

Separate observations:

Stemming/lemmatization & stop-word removal aren't always worth the trouble, with all corpora/algorithms/goals. (And, stemming/lemmatization may wind up creating pseudowords that limit the model's interpretability & easy application to any texts that don't go through identical preprocessing.) So if those are required parts of a learning exercise, sure, get some experience using them. But don't assume they're necessarily helping, or worth the extra time/complexity, unless you verify that rigorously.
FastText models will also be able to supply synthetic vectors for words that aren't known to the model, based on substrings. These are often pretty weak, but may be better than nothing - especially when they give vectors for typos, or rare inflected forms, similar to morphologically-related known words. (Since this deduced similarity, from many similarly-written tokens, provides some of the same value as stemming/lemmatization via a different path that required the original variations to all be present during initial training, you'd especially want to pay attention to whether FastText & stemming/lemmatization mix well for your goals.) Beware, though: for very-short unknown words – for which the model learned no reusable substring vectors – FastText may still return an error or all-zeros vector.
FastText has a supervised classification mode, but it's not supported by Gensim. If you want to experiment with that, you'd need to use the Facebook FastText implementation. (You could still use a traditional, non-supervised FastText word vector model as a contributor of features for other possible representations.)

Source https://stackoverflow.com/questions/70384870

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install fastText

You can download it from GitHub.

Support

Invoke a command without arguments to list available arguments and their default values. Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)

Find more information at: