scibert | A BERT model for scientific text | Natural Language Processing library
kandi X-RAY | scibert Summary
SciBERT models include all necessary files to be plugged into your own model and are in the same format as BERT. If you are using TensorFlow, refer to Google's BERT repo; if you use PyTorch, refer to Hugging Face's repo, where detailed instructions on using BERT models are provided.
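For the PyTorch route, a minimal sketch of loading SciBERT through the Hugging Face transformers library is shown below; allenai/scibert_scivocab_uncased is the uncased SciVocab checkpoint on the Hugging Face Hub, and the example sentence is only illustrative.

# Minimal sketch: load SciBERT with Hugging Face transformers (PyTorch).
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

inputs = tokenizer("Glucocorticoid receptors bind to DNA.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # contextual embeddings: (1, seq_len, 768)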
Top functions reviewed by kandi - BETA
- Compute the full table
- Calculate the difference between two scores
- Compute the conservative argument matrix
- Processes a part of a paper
- Convert a paper record to spacy text
- Process a paper record
- Get a list of sentences from spacy text
- Get spacy nlp
- Convert a JSON file to JSON
- Converts a list of dictionaries into text format
- Processes a paper file
- Processes a chunk of papers
scibert Key Features
scibert Examples and Code Snippets
from transformers import pipeline

# Token-level NER over biomedical text, using a SciBERT model fine-tuned on JNLPBA
text = "Mouse thymus was used as a source of glucocorticoid receptor from normal CS lymphocytes."

nlp_ner = pipeline("ner",
                   model='fran-martinez/scibert_scivocab_cased_ner_jnlpba')
nlp_ner(text)
export NEW_TSV_DIR=sample/tsv
export FINGERPRINT_DIR=sample/radius1
export RADIUS=1
python3 fingerprint/preprocessor.py $NEW_TSV_DIR none $RADIUS $FINGERPRINT_DIR
cd main
python run_ddie.py \
--task_name MRPC \
--model_type bert \
--data
├── ner
│   ├── JNLPBA
│   ├── NCBI-disease
│   ├── bc5cdr
│   └── sciie
├── parsing
│   └── genia
├── pico
│   └── ebmnlp
└── text_classification
    ├── chemprot
    ├── citation_intent
Community Discussions
Trending Discussions on scibert
QUESTION
I'm trying to use BERT models to do text classification. As the texts are scientific, I intend to use the SciBERT pre-trained model: https://github.com/allenai/scibert
I have faced several limitations, and I want to know if there are any solutions for them:
When I want to do tokenization and batching, it only allows me to use a max_length of <=512. Is there any way to use more tokens? Doesn't this limitation of 512 mean that I am not actually using all the text information during training? Any solution to use all the text?

I have tried to use this pretrained library with other models such as DeBERTa or RoBERTa, but it doesn't let me; it has only worked with BERT. Is there any way I can do that?
I know this is a general question, but is there any suggestion for improving my fine-tuning (from data to hyperparameters, etc.)? Currently, I'm getting ~75% accuracy. Thanks
Code:
...ANSWER
Answered 2021-Oct-03 at 14:21

When I want to do tokenization and batching, it only allows me to use a max_length of <=512. Is there any way to use more tokens? Doesn't this limitation of 512 mean that I am not actually using all the text information during training? Any solution to use all the text?
Yes, you are not using the complete text. This is one of the limitations of BERT and T5 models, which are limited to 512 and 1024 tokens respectively, to the best of my knowledge.
I can suggest you use Longformer, BigBird, or Reformer models, which can handle sequence lengths of up to 16k, 4096, and 64k tokens respectively. These are really good for processing longer texts like scientific documents.
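As a hedged illustration of that suggestion, the sketch below loads the standard allenai/longformer-base-4096 checkpoint, which accepts sequences of up to 4096 tokens; the exact limit depends on the checkpoint you pick.

# Sketch: encode a long document with Longformer (up to 4096 tokens for this checkpoint).
from transformers import LongformerTokenizer, LongformerModel

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

long_text = "..."  # a full scientific document, far beyond 512 tokens
inputs = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")
outputs = model(**inputs)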
I have tried to use this pretrained library with other models such as DeBERTa or RoBERTa, but it doesn't let me; it has only worked with BERT. Is there any way I can do that?
SciBERT is actually a pre-trained BERT model.
See this issue for more details, where they discuss the feasibility of converting BERT to RoBERTa:
Since you're working with a BERT model that was pre-trained, you unfortunately won't be able to change the tokenizer now from a WordPiece (BERT) to a Byte-level BPE (RoBERTa).
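A quick way to see this for yourself (a small sketch, not part of the original answer) is to inspect the checkpoint's configuration, which reports a plain BERT model type and a WordPiece tokenizer:

# Sketch: SciBERT resolves to ordinary BERT classes under the Auto* API.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("allenai/scibert_scivocab_uncased")
print(config.model_type)           # "bert"
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
print(type(tokenizer).__name__)    # a BERT WordPiece tokenizer class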
I know this is a general question, but is there any suggestion for improving my fine-tuning (from data to hyperparameters, etc.)? Currently, I'm getting ~79% accuracy.
I would first try to tune the most important hyperparameter, learning_rate. I would then explore different values for the hyperparameters of the AdamW optimizer and the num_warmup_steps hyperparameter of the scheduler.
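A rough sketch of what such a setup could look like with PyTorch's AdamW and the transformers warmup scheduler is given below; the specific values are placeholders to tune, not recommendations.

# Sketch: optimizer and warmup scheduler for fine-tuning (values are placeholders).
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)

num_training_steps = 1000  # epochs * batches per epoch
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=num_training_steps)

# Inside the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()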
QUESTION
I am trying to use the pretrained SciBERT model (https://huggingface.co/allenai/scibert_scivocab_uncased) from Huggingface to predict masked words in scientific/biomedical text. This produces errors, and I am not sure how to move forward from this point.
Here is the code so far -
...ANSWER
Answered 2021-Jun-07 at 14:28

As the error message tells you, you need to use AutoModelForMaskedLM:
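The answer's code is not reproduced on this page; a minimal sketch of what the fix could look like for masked-word prediction with SciBERT is:

# Sketch: masked-token prediction with SciBERT via AutoModelForMaskedLM.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_uncased")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("The patient was treated with [MASK] for the infection."))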
QUESTION
I'm trying to train a BERT model from scratch on my own dataset using the HuggingFace library. I would like to train the model so that it has the exact architecture of the original BERT model.
In the original paper, it is stated that: "BERT is trained on two tasks: predicting randomly masked tokens (MLM) and predicting whether two sentences follow each other (NSP). SCIBERT follows the same architecture as BERT but is instead pretrained on scientific text."
I'm trying to understand how to train the model on the two tasks above. At the moment, I have initialised the model as below:
...ANSWER
Answered 2021-Feb-10 at 14:04

I would suggest doing the following:

First, pre-train BERT on the MLM objective. HuggingFace provides a script especially for training BERT on the MLM objective on your own data. You can find it here. As you can see in the run_mlm.py script, they use AutoModelForMaskedLM, and you can specify any architecture you want.

Second, if you want to train on the next sentence prediction task, you can define a BertForPreTraining model (which has both the MLM and NSP heads on top), then load in the weights from the model you trained in step 1, and then further pre-train it on a next sentence prediction task (a rough sketch follows after the update below).
UPDATE: apparently the next sentence prediction task did help improve performance of BERT on some GLUE tasks. See this talk by the author of BERT.
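A rough sketch of the second step, assuming ./mlm-checkpoint is the hypothetical output directory produced by the run_mlm.py step:

# Sketch: load MLM-trained weights into a BertForPreTraining model (MLM + NSP heads).
from transformers import BertForPreTraining, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("./mlm-checkpoint")
model = BertForPreTraining.from_pretrained("./mlm-checkpoint")
# The NSP head is freshly initialised (transformers will warn about this);
# continue pre-training on sentence-pair data so that head gets trained.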
QUESTION
Trying to use the text classifier model shared at https://github.com/allenai/scibert/blob/master/scibert/models/text_classifier.py
Everything used to work, but suddenly I keep getting this error: Cannot register text_classifier as Model; name already in use for TextClassifier
What might be the reason? Any suggestions?
...ANSWER
Answered 2021-Feb-17 at 13:55

The name is already taken. Something that is part of AllenNLP already uses that name, so you need to pick a different one.
For the curious, AllenNLP creates a registry of models, so that you can select a model at the command line. (That’s what the decorator is doing.) This requires the names to be unique.
The name text_classifier was used by AllenNLP only after the external package you're using had already claimed it. It worked in May 2019, when that file was last updated, but about 17 months ago AllenNLP started using the name itself. So it's not your fault; it's a mismatch between those two packages (at least in their current versions).
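A minimal sketch of the suggested fix, assuming you can edit your local copy of text_classifier.py; the replacement name scibert_text_classifier is hypothetical, and anything unused works:

# Sketch: register the external model under a name AllenNLP does not already use.
from allennlp.data import Vocabulary
from allennlp.models.model import Model

@Model.register("scibert_text_classifier")  # was: @Model.register("text_classifier")
class TextClassifier(Model):
    def __init__(self, vocab: Vocabulary) -> None:
        super().__init__(vocab)
        # ... rest of the original model definition unchanged ...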
QUESTION
I am using the SciBERT pretrained model to get embeddings for various texts. The code is as follows:
...ANSWER
Answered 2020-Nov-27 at 13:48

truncation is not a parameter of the class constructor (class reference), but a parameter of the __call__ method. Therefore you should use:
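The answer's code is not shown here; a small sketch of passing truncation to the tokenizer call rather than the constructor:

# Sketch: truncation/padding/max_length go into the tokenizer call, not from_pretrained().
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
encoded = tokenizer(
    "A long scientific abstract ...",
    truncation=True,
    padding="max_length",
    max_length=512,
    return_tensors="pt",
)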
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install scibert
You can use scibert like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.