Flaubert | Unsupervised Language Model Pre-training | Machine Learning library
kandi X-RAY | Flaubert Summary
FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. This repository shares everything: the pre-trained models (base and large), the data, the code to use the models, and the code to train them if you need it. Along with FlauBERT comes FLUE, an evaluation setup for French NLP systems similar to the popular GLUE benchmark. The goal is to enable further reproducible experiments in the future and to share models and progress on the French language. This repository is still under construction, and everything will be available soon.
Top functions reviewed by kandi - BETA
- Get the argument parser.
- Generate a beam.
- Initialize distributed mode.
- Read data from a given directory.
- Evaluate BLEU.
- Check that the data params are valid.
- Register command-line arguments.
- Evaluate the MLM model.
- Compute precision scores for each source.
- Build the Transformer model.
Community Discussions
Trending Discussions on Flaubert
QUESTION
Goal: Amend this Notebook to work with Albert and Distilbert models
Kernel: conda_pytorch_p36
I did Restart & Run All and refreshed the file view in the working directory.
The error occurs in Section 1.2, and only for these two new models.
For filenames etc., I've created a variable that is used everywhere:
...ANSWER
Answered 2022-Jan-13 at 14:10
When instantiating AutoModel, you must specify a model_type parameter in the ./MRPC/config.json file (downloaded during Notebook runtime). The list of model_types can be found here.
Code that appends model_type to config.json, in the same format:
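The snippet itself was not captured in this excerpt; below is a minimal sketch of what such a script might look like. The config path comes from the question's setup, and the model_type value ("albert" here) is an assumption that must match the checkpoint family being loaded:

```python
import json

# Path from the question's setup: the config downloaded into ./MRPC
# at notebook runtime.
config_path = "./MRPC/config.json"

with open(config_path) as f:
    config = json.load(f)

# model_type must match the checkpoint family being loaded.
config["model_type"] = "albert"  # or "distilbert" for the DistilBERT run

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```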
QUESTION
I want to re-finetune a transformer model, but I get an unknown error when I try to train it. I can't change "num_labels" when loading the model, so I tried to change it manually.
...ANSWER
Answered 2021-Dec-22 at 13:53
There is a solution for this: just add ignore_mismatched_sizes=True when loading the model, as:
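The answer's code was not captured here; a minimal sketch follows. The checkpoint name and num_labels are placeholders; the key part is ignore_mismatched_sizes=True:

```python
from transformers import AutoModelForSequenceClassification

# ignore_mismatched_sizes=True lets from_pretrained re-initialize the
# classification head instead of failing when num_labels differs from
# the head stored in the fine-tuned checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(
    "flaubert/flaubert_base_cased",  # placeholder checkpoint
    num_labels=3,                    # placeholder label count
    ignore_mismatched_sizes=True,
)
```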
QUESTION
I am trying to save the tokenizer in Hugging Face so that I can load it later from a container where I don't have access to the internet.
...ANSWER
Answered 2020-Oct-28 at 09:27
save_vocabulary() saves only the vocabulary file of the tokenizer (the list of BPE tokens). To save the entire tokenizer, you should use save_pretrained() instead, as follows:
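The snippet is missing from this excerpt; a minimal sketch of the save/reload round trip (the tokenizer name and directory are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flaubert/flaubert_base_cased")

# save_pretrained() writes the vocabulary plus the tokenizer and
# special-tokens configuration, so the directory is self-contained:
tokenizer.save_pretrained("./local_tokenizer")

# Later, inside the container, reload without internet access:
tokenizer = AutoTokenizer.from_pretrained("./local_tokenizer")
```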
QUESTION
The BERT model for Language Modeling and Sequence Classification includes an extra projection layer between the last transformer block and the classification layer (it contains a linear layer of size hidden_dim x hidden_dim, a dropout layer, and a tanh activation). This was not described in the original paper but was clarified here. This intermediate layer is pre-trained together with the rest of the transformer.
In huggingface's BertModel, this layer is called pooler.
According to the paper, the FlauBERT model (an XLMModel fine-tuned on a French corpus) also includes this pooler layer: "The classification head is composed of the following layers, in order: dropout, linear, tanh activation, dropout, and linear." However, when loading a FlauBERT model with huggingface (e.g., with FlaubertModel.from_pretrained(...) or FlaubertForSequenceClassification.from_pretrained(...)), the model seems to include no such layer.
Hence the question: why is there no pooler layer in huggingface's FlauBERT model?
...ANSWER
Answered 2020-Aug-11 at 14:20
Because FlauBERT is an XLM model and not a BERT model.
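A quick way to check this, sketched below (assumes both checkpoints can be downloaded; hasattr simply probes for the pooler submodule):

```python
from transformers import BertModel, FlaubertModel

bert = BertModel.from_pretrained("bert-base-uncased")
print(hasattr(bert, "pooler"))      # True: BertModel defines a pooler

flaubert = FlaubertModel.from_pretrained("flaubert/flaubert_base_cased")
print(hasattr(flaubert, "pooler"))  # False: the XLM architecture has none
```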
QUESTION
My question concerns the example available in the great huggingface/transformers library.
I am using a notebook, provided by the library creators, as a starting point for my pipeline.
The notebook below presents a pipeline for fine-tuning BERT for Sentence Classification on the GLUE dataset.
https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb#scrollTo=uBzDW1FO63pK
When getting into the code, I noticed a very weird thing which I cannot explain.
In the example, input data is introduced to the model as instances of the InputFeatures class from here:
https://github.com/huggingface/transformers/blob/011cc0be51cf2eb0a91333f1a731658361e81d89/src/transformers/data/processors/utils.py
This class has 4 attributes, including the label attribute:
...ANSWER
Answered 2020-Jun-17 at 18:34
The rename happens in the collator. In the trainer init, when data_collator is None, a default one is used:
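The referenced snippet is not reproduced in this excerpt; below is a minimal sketch of the renaming behavior using transformers' default_data_collator. The Example dataclass is a stand-in for InputFeatures:

```python
from dataclasses import dataclass
from transformers import default_data_collator

# The default collator batches the per-example `label` field under the
# `labels` key, which is the argument name the models' forward() expects.
@dataclass
class Example:
    input_ids: list
    attention_mask: list
    label: int

features = [
    Example(input_ids=[101, 7592, 102], attention_mask=[1, 1, 1], label=0),
    Example(input_ids=[101, 2088, 102], attention_mask=[1, 1, 1], label=1),
]

batch = default_data_collator(features)
print(sorted(batch))  # ['attention_mask', 'input_ids', 'labels']
```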
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install Flaubert
In the following, replace $DATA_DIR and $corpus_name with, respectively, the path to the local directory where the downloaded data will be saved and the name of the corpus that you want to download, chosen among the options specified in the scripts.
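For reference, a minimal usage sketch once a model is available; this version loads the base cased checkpoint from the Hugging Face hub rather than from $DATA_DIR (the checkpoint name is an assumption):

```python
import torch
from transformers import FlaubertModel, FlaubertTokenizer

# Assumed hub checkpoint; the download scripts above produce
# equivalent local checkpoints.
tokenizer = FlaubertTokenizer.from_pretrained("flaubert/flaubert_base_cased")
model = FlaubertModel.from_pretrained("flaubert/flaubert_base_cased")

inputs = tokenizer("Le chat mange une pomme.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden size is 768 for the base model.
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```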