scibert | A BERT model for scientific text | Natural Language Processing library

 by allenai | Python | Version: Current | License: Apache-2.0

kandi X-RAY | scibert Summary

scibert is a Python library typically used in Artificial Intelligence, Natural Language Processing, Deep Learning, PyTorch, TensorFlow, and BERT applications. scibert has no bugs and no reported vulnerabilities, has a build file available, has a Permissive License, and has medium support. You can download it from GitHub.

SciBERT models include all the files necessary to plug them into your own model and are in the same format as BERT. If you are using TensorFlow, refer to Google's BERT repo; if you use PyTorch, refer to Hugging Face's repo, where detailed instructions on using BERT models are provided.
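
For example, since the released checkpoints follow the standard BERT format, they can be loaded directly with the Hugging Face transformers Auto classes. A minimal sketch, assuming the allenai/scibert_scivocab_uncased checkpoint published on the Hugging Face Hub:

import torch
from transformers import AutoTokenizer, AutoModel

# Load the SciBERT vocabulary and encoder weights (uncased SciVocab variant)
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

# Encode a scientific sentence and take the [CLS] token embedding
inputs = tokenizer("The glucocorticoid receptor binds cortisol.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, hidden_size)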

            Support

              scibert has a medium active ecosystem.
              It has 1299 star(s) with 200 fork(s). There are 50 watchers for this library.
              It had no major release in the last 6 months.
              There are 56 open issues and 34 have been closed. On average issues are closed in 81 days. There are 6 open pull requests and 0 closed requests.
              It has a neutral sentiment in the developer community.
              The latest version of scibert is current.

            Quality

              scibert has 0 bugs and 0 code smells.

            Security

              scibert has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              scibert code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            License

              scibert is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              scibert releases are not available. You will need to build from source code and install.
              Build file is available. You can build the component from source.
              Installation instructions are not available. Examples and code snippets are available.
              scibert saves you 527 person hours of effort in developing the same functionality from scratch.
              It has 1236 lines of code, 49 functions and 23 files.
              It has high code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed scibert and discovered the functions below as its top functions. This is intended to give you an instant insight into the functionality scibert implements and to help you decide if it suits your requirements.
            • Compute the full table
            • Calculate the difference between two scores
            • Compute the conservative argument matrix
            • Processes a part of a paper
            • Convert a paper record to spacy text
            • Process a paper record
            • Get a list of sentences from spacy text
            • Get spacy nlp
            • Convert a JSON file to JSON
            • Converts a list of dictionaries into text format
            • Processes a paper file
            • Processes a piece of papers

            scibert Key Features

            No Key Features are available at this moment for scibert.

            scibert Examples and Code Snippets

            Finetuning SciBERT on NER downstream task, Model Usage, Example of usage
            Python | Lines of Code: 79 | License: Permissive (MIT)
            from transformers import pipeline
            
            text = "Mouse thymus was used as a source of glucocorticoid receptor from normal CS lymphocytes."
            
             nlp_ner = pipeline("ner",
                                model='fran-martinez/scibert_scivocab_cased_ner_jnlpba',
                                tokenizer='fran-martinez/scibert_scivocab_cased_ner_jnlpba')

             # Run NER over the example sentence
             nlp_ner(text)
            Usage
            Python | Lines of Code: 32 | License: Permissive (MIT)
            export NEW_TSV_DIR=sample/tsv
            export FINGERPRINT_DIR=sample/radius1
            export RADIUS=1
            python3 fingerprint/preprocessor.py $NEW_TSV_DIR none $RADIUS $FINGERPRINT_DIR
            
            cd main
            python run_ddie.py \
                --task_name MRPC \
                --model_type bert \
                --data  
            SciBERT, Model training, Training new models using AllenNLP
            Python | Lines of Code: 22 | License: Permissive (Apache-2.0)
            ├── ner
            │   ├── JNLPBA
            │   ├── NCBI-disease
            │   ├── bc5cdr
            │   └── sciie
            ├── parsing
            │   └── genia
            ├── pico
            │   └── ebmnlp
            └── text_classification
                ├── chemprot
                ├── citation_intent
                

            Community Discussions

            QUESTION

            How to use SciBERT in the best manner?
            Asked 2021-Oct-03 at 14:21

            I'm trying to use BERT models to do text classification. As the texts are scientific, I intend to use the SciBERT pre-trained model: https://github.com/allenai/scibert

            I have faced several limitations, and I want to know if there are any solutions for them:

            1. When I want to do tokenization and batching, it only allows me to use a max_length of <=512. Is there any way to use more tokens? Doesn't this limitation of 512 mean that I am actually not using all the text information during training? Any solution to use all the text?

            2. I have tried to use this pretrained library with other models such as DeBERTa or RoBERTa, but it doesn't let me; it has only worked with BERT. Is there any way I can do that?

            3. I know this is a general question, but are there any suggestions for improving my fine-tuning (from data to hyperparameters, etc.)? Currently, I'm getting ~75% accuracy. Thanks

            Codes:

            ...

            ANSWER

            Answered 2021-Oct-03 at 14:21

            When I want to do tokenization and batching, it only allows me to use a max_length of <=512. Is there any way to use more tokens? Doesn't this limitation of 512 mean that I am actually not using all the text information during training? Any solution to use all the text?

            Yes, you are not using the complete text. This is one of the limitations of BERT and T5 models, which are limited to 512 and 1024 tokens respectively, to the best of my knowledge.

            I can suggest using Longformer, Bigbird, or Reformer models, which can handle sequence lengths of up to 16k, 4096, and 64k tokens respectively. These are really good for processing longer texts like scientific documents.
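
            If you want to stay with SciBERT, a common workaround (not part of the original answer) is to split long documents into overlapping 512-token windows using the tokenizer's overflow support, encode each window separately, and pool the results. A minimal sketch, assuming the allenai/scibert_scivocab_uncased checkpoint and its fast tokenizer:

            from transformers import AutoTokenizer

            tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

            long_text = " ".join(["glucocorticoid receptor"] * 1500)  # stand-in for a long document

            # Split into overlapping 512-token windows instead of truncating to a single window
            enc = tokenizer(
                long_text,
                max_length=512,
                truncation=True,
                stride=64,                      # tokens shared between consecutive windows
                return_overflowing_tokens=True,
                padding="max_length",
                return_tensors="pt",
            )
            print(enc["input_ids"].shape)  # (num_windows, 512); each window is encoded separately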

            I have tried to use this pretrained library with other models such as DeBERTa or RoBERTa, but it doesn't let me; it has only worked with BERT. Is there any way I can do that?

            SciBERT is actually a pre-trained BERT model. See this issue for more details, where they mention the feasibility of converting BERT to RoBERTa:

            Since you're working with a BERT model that was pre-trained, you unfortunately won't be able to change the tokenizer now from a WordPiece (BERT) to a Byte-level BPE (RoBERTa).

            I know this is a general question, but are there any suggestions for improving my fine-tuning (from data to hyperparameters, etc.)? Currently, I'm getting ~79% accuracy.

            I would first try to tune the most important hyperparameter, the learning_rate. I would then explore different values for the hyperparameters of the AdamW optimizer and the num_warmup_steps hyperparameter of the scheduler.
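
            A minimal sketch of wiring those pieces together for fine-tuning; the model name and the numeric values are illustrative starting points, not recommendations from the answer:

            import torch
            from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

            model = AutoModelForSequenceClassification.from_pretrained(
                "allenai/scibert_scivocab_uncased", num_labels=2
            )

            num_training_steps = 1000  # e.g. len(train_dataloader) * num_epochs

            # learning_rate, weight_decay, and num_warmup_steps are the main knobs to tune
            optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
            scheduler = get_linear_schedule_with_warmup(
                optimizer, num_warmup_steps=100, num_training_steps=num_training_steps
            )

            # In the training loop, call optimizer.step() and then scheduler.step() after each batch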

            Source https://stackoverflow.com/questions/69406937

            QUESTION

            Huggingface SciBERT predict masked word not working
            Asked 2021-Jun-07 at 14:28

            I am trying to use the pretrained SciBERT model (https://huggingface.co/allenai/scibert_scivocab_uncased) from Huggingface to predict masked words in scientific/biomedical text. This produces errors, and I am not sure how to move forward from this point.

            Here is the code so far -

            ...

            ANSWER

            Answered 2021-Jun-07 at 14:28

            As the error message tells you, you need to use AutoModelForMaskedLM:
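
            A hedged sketch of what the suggested AutoModelForMaskedLM usage might look like with the allenai/scibert_scivocab_uncased checkpoint (the answer's own snippet is not reproduced on this page):

            import torch
            from transformers import AutoTokenizer, AutoModelForMaskedLM

            tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
            model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_uncased")

            text = f"Mouse thymus was used as a source of glucocorticoid {tokenizer.mask_token}."
            inputs = tokenizer(text, return_tensors="pt")

            with torch.no_grad():
                logits = model(**inputs).logits

            # Take the highest-scoring vocabulary token at the [MASK] position
            mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
            predicted_id = logits[0, mask_index].argmax(dim=-1)
            print(tokenizer.decode(predicted_id))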

            Source https://stackoverflow.com/questions/67872803

            QUESTION

            How to train BERT from scratch on a new domain for both MLM and NSP?
            Asked 2021-Jun-01 at 14:42

            I’m trying to train a BERT model from scratch on my own dataset using the HuggingFace library. I would like to train the model in a way that it has the exact architecture of the original BERT model.

            In the original paper, it stated that: “BERT is trained on two tasks: predicting randomly masked tokens (MLM) and predicting whether two sentences follow each other (NSP). SCIBERT follows the same architecture as BERT but is instead pretrained on scientific text.”

            I’m trying to understand how to train the model on two tasks as above. At the moment, I initialised the model as below:

            ...

            ANSWER

            Answered 2021-Feb-10 at 14:04

            I would suggest doing the following:

            1. First pre-train BERT on the MLM objective. HuggingFace provides a script especially for training BERT on the MLM objective on your own data. You can find it here. As you can see in the run_mlm.py script, they use AutoModelForMaskedLM, and you can specify any architecture you want.

            2. Second, if you want to train on the next sentence prediction task, you can define a BertForPreTraining model (which has both the MLM and NSP heads on top), load in the weights from the model you trained in step 1, and then further pre-train it on the next sentence prediction task.
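
            A minimal sketch of step 2, assuming the checkpoint from step 1 was saved to the hypothetical path ./mlm-checkpoint:

            from transformers import BertTokenizerFast, BertForPreTraining

            tokenizer = BertTokenizerFast.from_pretrained("./mlm-checkpoint")

            # BertForPreTraining carries both the MLM and NSP heads; the encoder and MLM head
            # weights come from step 1, while the NSP head is freshly initialized.
            model = BertForPreTraining.from_pretrained("./mlm-checkpoint")

            # Each NSP training example is a sentence pair; label 0 means sentence B follows A
            encoding = tokenizer("The receptor was purified.", "It was then assayed.", return_tensors="pt")
            outputs = model(**encoding)
            print(outputs.prediction_logits.shape)        # MLM head: (1, seq_len, vocab_size)
            print(outputs.seq_relationship_logits.shape)  # NSP head: (1, 2)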

            UPDATE: apparently the next sentence prediction task did help improve performance of BERT on some GLUE tasks. See this talk by the author of BERT.

            Source https://stackoverflow.com/questions/65646925

            QUESTION

            Cannot register text_classifier as Model; name already in use for TextClassifier
            Asked 2021-Feb-17 at 13:55

            Trying to use the text classifier model shared at https://github.com/allenai/scibert/blob/master/scibert/models/text_classifier.py

            Everything used to work, but suddenly I keep getting this error: Cannot register text_classifier as Model; name already in use for TextClassifier

            What might be the reason? Any suggestions?

            ...

            ANSWER

            Answered 2021-Feb-17 at 13:55

            The name is already taken. Something that's part of AllenNLP already uses that name, so you need to pick a different one.

            For the curious, AllenNLP creates a registry of models, so that you can select a model at the command line. (That’s what the decorator is doing.) This requires the names to be unique.

            The name text_classifier was used by AllenNLP only after the external package you’re using used it. It worked in May 2019, when that file was last updated. But 17 months ago, AllenNLP started using it. So it’s not your fault; it’s a mismatch between those two packages (at least, in their current versions).
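
            A minimal sketch of the workaround: copy the model code from the scibert repository into your own project and register it under a name AllenNLP does not already use (the name scibert_text_classifier below is hypothetical):

            from allennlp.models import Model

            # Registering under an unused name avoids the
            # "Cannot register text_classifier as Model; name already in use" error.
            @Model.register("scibert_text_classifier")
            class SciBertTextClassifier(Model):
                """Renamed copy of scibert/models/text_classifier.py."""
                # ... copied implementation goes here ...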

            Source https://stackoverflow.com/questions/66242860

            QUESTION

            How to truncate a Bert tokenizer in Transformers library
            Asked 2020-Nov-27 at 13:48

            I am using the SciBERT pretrained model to get embeddings for various texts. The code is as follows:

            ...

            ANSWER

            Answered 2020-Nov-27 at 13:48

            truncation is not a parameter of the class constructor (class reference), but a parameter of the __call__ method. Therefore you should use:
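
            A hedged sketch of passing truncation to the tokenizer's __call__ with the allenai/scibert_scivocab_uncased checkpoint (the answer's own snippet is not shown on this page):

            import torch
            from transformers import AutoTokenizer, AutoModel

            tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
            model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

            texts = ["A very long scientific abstract ...", "Another document ..."]

            # truncation and max_length belong to __call__, not to from_pretrained()
            batch = tokenizer(texts, truncation=True, max_length=512, padding=True, return_tensors="pt")
            with torch.no_grad():
                embeddings = model(**batch).last_hidden_state[:, 0, :]  # [CLS] embeddings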

            Source https://stackoverflow.com/questions/65034771

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install scibert

            You can download it from GitHub.
            You can use scibert like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.

            Support

            For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask them on the Stack Overflow community page.
            CLONE
          • HTTPS

            https://github.com/allenai/scibert.git

          • CLI

            gh repo clone allenai/scibert

          • SSH

            git@github.com:allenai/scibert.git


            Consider Popular Natural Language Processing Libraries

            transformers

            by huggingface

            funNLP

            by fighting41love

            bert

            by google-research

            jieba

            by fxsjy

            Python

            by geekcomputers

            Try Top Libraries by allenai

            allennlp

            by allenai | Python

            longformer

            by allenai | Python

            bilm-tf

            by allenai | Python

            RL4LMs

            by allenai | Python

            bi-att-flow

            by allenai | Python