Flaubert | Unsupervised Language Model Pre-training | Machine Learning library

by getalp | Python | Version: Current | License: Non-SPDX

kandi X-RAY | Flaubert Summary

Flaubert is a Python library typically used in Artificial Intelligence, Machine Learning, Deep Learning, PyTorch, and TensorFlow applications. Flaubert has no bugs, no reported vulnerabilities, and low support. However, its build file is not available and it has a Non-SPDX license. You can download it from GitHub.

FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes were trained on the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. This repository shares everything: pre-trained models (base and large), the data, the code to use the models, and the code to train them if you need it. Along with FlauBERT comes FLUE: an evaluation setup for French NLP systems, similar to the popular GLUE benchmark. The goal is to enable further reproducible experiments and to share models and progress on the French language. This repository is still under construction and everything will be available soon.

            kandi-Support Support

              Flaubert has a low active ecosystem.
              It has 191 star(s) with 24 fork(s). There are 16 watchers for this library.
              It had no major release in the last 6 months.
              There are 5 open issues and 27 have been closed. On average issues are closed in 18 days. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of Flaubert is current.

            kandi-Quality Quality

              Flaubert has 0 bugs and 0 code smells.

            kandi-Security Security

              Flaubert has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              Flaubert code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            kandi-License License

              Flaubert has a Non-SPDX License.
              A Non-SPDX license may be an open-source license that is not SPDX-compliant, or a non-open-source license; review it closely before use.

            kandi-Reuse Reuse

              Flaubert releases are not available. You will need to build from source code and install.
              Flaubert has no build file. You will need to create the build yourself to build the component from source.
              Installation instructions, examples and code snippets are available.
              It has 6021 lines of code, 289 functions and 47 files.
              It has high code complexity, which directly impacts the maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed Flaubert and discovered the below as its top functions. This is intended to give you an instant insight into the functionality Flaubert implements, and to help you decide if it suits your requirements.
            • Get the argument parser.
            • Generate a beam.
            • Initialize distributed mode.
            • Read data from a given directory.
            • Evaluate the MLEU.
            • Check that the data params are valid.
            • Register command-line arguments.
            • Evaluate the MLM model.
            • Compute precision scores for each source.
            • Build the Transformer model.

            Flaubert Key Features

            No Key Features are available at this moment for Flaubert.

            Flaubert Examples and Code Snippets

            No Code Snippets are available at this moment for Flaubert.

            Community Discussions

            QUESTION

            ValueError: Unrecognized model in ./MRPC/. Should have a `model_type` key in its config.json, or contain one of the following strings in its name
            Asked 2022-Jan-13 at 14:10

            Goal: Amend this Notebook to work with Albert and Distilbert models

             Kernel: conda_pytorch_p36. I did Restart & Run All, and refreshed the file view in the working directory.

             The error occurs in Section 1.2, only for these two new models.

            For filenames etc., I've created a variable used everywhere:

            ...

            ANSWER

            Answered 2022-Jan-13 at 14:10
            Explanation:

             When instantiating AutoModel, you must specify a model_type key in the ./MRPC/config.json file (downloaded during notebook runtime).

             A list of valid model_type values can be found in the transformers documentation.

            Solution:

            Code that appends model_type to config.json, in the same format:
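             The snippet itself was not captured on this page; below is a minimal sketch of what such a script could look like. The config path and the model_type value ("albert") are assumptions to adapt to your checkpoint.

```python
import json

# Hypothetical path and value: adjust to your checkpoint.
config_path = "./MRPC/config.json"
model_type = "albert"  # or "distilbert", depending on the model

# Read the existing config, add the missing key, and write it
# back in the same JSON format.
with open(config_path) as f:
    config = json.load(f)

config["model_type"] = model_type

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```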

            Source https://stackoverflow.com/questions/70697470

            QUESTION

            Change last layer on pretrained huggingface model
            Asked 2021-Dec-25 at 18:14

             I want to re-finetune a transformer model, but I get an unknown error when I try to train it. I can't change num_labels when loading the model, so I tried to change it manually.

            ...

            ANSWER

            Answered 2021-Dec-22 at 13:53

             There is a solution for this: just add ignore_mismatched_sizes=True when loading the model, as follows:
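
             The code itself was not captured here; a minimal sketch, assuming a sequence-classification checkpoint (the model name and num_labels value are placeholders):

```python
from transformers import AutoModelForSequenceClassification

# ignore_mismatched_sizes=True lets from_pretrained replace the old
# classification head with a freshly initialized one of the new size,
# instead of raising a size-mismatch error.
model = AutoModelForSequenceClassification.from_pretrained(
    "flaubert/flaubert_base_cased",  # placeholder checkpoint
    num_labels=3,                    # placeholder label count
    ignore_mismatched_sizes=True,
)
```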

            Source https://stackoverflow.com/questions/70449122

            QUESTION

            Huggingface saving tokenizer
            Asked 2020-Oct-28 at 09:27

             I am trying to save the tokenizer in huggingface so that I can load it later from a container where I don't have access to the internet.

            ...

            ANSWER

            Answered 2020-Oct-28 at 09:27

             save_vocabulary() saves only the vocabulary file of the tokenizer (the list of BPE tokens).

             To save the entire tokenizer, you should use save_pretrained().

             Thus, as follows:
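
             The original snippet is missing from this page; a minimal sketch, with the checkpoint name and save directory as placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("flaubert/flaubert_base_cased")

# save_pretrained() writes the vocabulary plus all tokenizer config
# files, so the directory can be reloaded fully offline.
tokenizer.save_pretrained("./local_tokenizer")

# Later, inside the container with no internet access:
tokenizer = AutoTokenizer.from_pretrained("./local_tokenizer")
```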

            Source https://stackoverflow.com/questions/64550503

            QUESTION

             Why is there no pooler layer in huggingface's FlauBERT model?
            Asked 2020-Aug-25 at 14:51

             The BERT model for language modeling and sequence classification includes an extra projection layer between the last transformer layer and the classification layer (it contains a linear layer of size hidden_dim x hidden_dim, a dropout layer, and a tanh activation). This was not described in the original paper but was clarified later. This intermediate layer is pre-trained together with the rest of the transformer.

            In huggingface's BertModel, this layer is called pooler.

             According to the paper, the FlauBERT model (an XLMModel fine-tuned on a French corpus) also includes this pooler layer: "The classification head is composed of the following layers, in order: dropout, linear, tanh activation, dropout, and linear." However, when loading a FlauBERT model with huggingface (e.g., with FlaubertModel.from_pretrained(...) or FlaubertForSequenceClassification.from_pretrained(...)), the model seems to include no such layer.

             Hence the question: why is there no pooler layer in huggingface's FlauBERT model?

            ...

            ANSWER

            Answered 2020-Aug-11 at 14:20

             Because Flaubert is an XLM model and not a BERT model.
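
             One way to verify this yourself (a small sketch; the checkpoint names are illustrative):

```python
from transformers import BertModel, FlaubertModel

bert = BertModel.from_pretrained("bert-base-uncased")
flaubert = FlaubertModel.from_pretrained("flaubert/flaubert_base_cased")

# BertModel exposes a pooler submodule; the XLM-based
# FlaubertModel does not.
print(hasattr(bert, "pooler"))      # True
print(hasattr(flaubert, "pooler"))  # False
```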

            Source https://stackoverflow.com/questions/63358768

            QUESTION

             Where in the code of pytorch or huggingface/transformers does label get "renamed" into labels?
            Asked 2020-Jun-17 at 18:34

             My question concerns an example available in the huggingface/transformers library.

             I am using a notebook provided by the library's creators as a starting point for my pipeline. The notebook below presents a pipeline for fine-tuning BERT for sentence classification on the GLUE dataset: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb#scrollTo=uBzDW1FO63pK
             When digging into the code, I noticed a very strange thing that I cannot explain.

             In the example, input data is passed to the model as instances of the InputFeatures class from https://github.com/huggingface/transformers/blob/011cc0be51cf2eb0a91333f1a731658361e81d89/src/transformers/data/processors/utils.py. This class has 4 attributes, including the label attribute:

            ...

            ANSWER

            Answered 2020-Jun-17 at 18:34

            The rename happens in the collator. In the trainer init, when data_collator is None, a default one is used:
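
             The referenced code was not captured here. In outline, the default collator does something like the following when building a batch (a simplified sketch, not the library's exact implementation):

```python
import torch

def collate(features):
    # Simplified sketch of the renaming step in the default collator:
    # the per-example "label" field becomes the batched "labels"
    # tensor that model.forward() expects.
    batch = {}
    if "label" in features[0]:
        batch["labels"] = torch.tensor([f["label"] for f in features])
    # ... the remaining fields ("input_ids", "attention_mask", ...)
    # are stacked into tensors under their original names.
    return batch
```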

            Source https://stackoverflow.com/questions/62435022

             Community Discussions and Code Snippets include sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install Flaubert

             Clone this repo, then install WikiExtractor, fastBPE, and the Moses tokenizer under tools/.
             In the following, replace $DATA_DIR and $corpus_name respectively with the path to the local directory where the downloaded data will be saved and the name of the corpus you want to download, among the options specified in the scripts.
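
             Beyond training from source, the published checkpoints can also be loaded through Hugging Face transformers; a minimal sketch (the checkpoint name is one of the released FlauBERT models):

```python
import torch
from transformers import FlaubertModel, FlaubertTokenizer

tokenizer = FlaubertTokenizer.from_pretrained("flaubert/flaubert_base_cased")
model = FlaubertModel.from_pretrained("flaubert/flaubert_base_cased")

# Encode a French sentence and pull out the contextual embeddings.
inputs = tokenizer("Le chat mange une pomme.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```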

            Support

             For any new features, suggestions, or bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.
             Find more information at the project's GitHub repository.

             CLONE

           • HTTPS: https://github.com/getalp/Flaubert.git
           • CLI: gh repo clone getalp/Flaubert
           • SSH: git@github.com:getalp/Flaubert.git
