BERT | modification of official bert for downstream task | Natural Language Processing library
kandi X-RAY | BERT Summary
Modification of official BERT for downstream tasks.
Top functions reviewed by kandi - BETA
- Perform beam search.
- Run a training loop.
- Sample a sequence of features.
- Create a multitask model function for the given model.
- Compute the ML score.
- Evolution decoder.
- Return basic hyperparameters.
- Sample a sequence without caching.
- Gumbel generator.
- Return a function that builds a distribution function for the given model.
BERT Key Features
BERT Examples and Code Snippets
Community Discussions
Trending Discussions on BERT
QUESTION
I am doing sentiment analysis, and I was wondering how to show the other sentiment scores from classifying my sentence: "Tesla's stock just increased by 20%."
I have three sentiments: positive, negative and neutral.
This is my code, which contains the sentence I want to classify:
...ANSWER
Answered 2021-Jun-15 at 14:44 Because HappyTransformer does not support multi-class probabilities, I suggest using another library. The flair library provides even more functionality and can give you the desired multi-class probabilities with something like this:
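A minimal sketch of that approach, assuming flair's stock "en-sentiment" classifier (which scores POSITIVE/NEGATIVE; a three-class result would need a classifier trained with a neutral label, and the keyword name may differ slightly across flair versions):

from flair.models import TextClassifier
from flair.data import Sentence

# Load a pre-trained sentiment classifier (POSITIVE/NEGATIVE).
classifier = TextClassifier.load("en-sentiment")

sentence = Sentence("Tesla's stock just increased by 20%.")
# Ask for the probability of every class, not just the top one.
classifier.predict(sentence, return_probabilities_for_all_classes=True)

for label in sentence.labels:
    print(label.value, label.score)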
QUESTION
I'm using a BERT pre-trained model for question answering. It returns the correct result, but with a lot of spaces in the text.
The code is below :
...ANSWER
Answered 2021-Jun-15 at 17:14 You can just use the tokenizer decode function:
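A hedged sketch of that idea (the model name, question, and context below are illustrative, not the asker's exact code): decode the predicted answer span with the tokenizer instead of joining the tokens yourself, which removes the stray spaces and "##" word-piece artifacts.

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "How much did the stock increase?"
context = "Tesla's stock just increased by 20% after the earnings call."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely start/end token positions, then decode that span.
start = torch.argmax(outputs.start_logits)
end = torch.argmax(outputs.end_logits) + 1
answer = tokenizer.decode(inputs["input_ids"][0][start:end], skip_special_tokens=True)
print(answer)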
QUESTION
I am not sure how to extract multiple pages from a search result using Python's Wikipedia module. Some advice would be appreciated.
My code so far:
...ANSWER
Answered 2021-Jun-15 at 13:10 You have done the hard part: the results are already in the results variable.
But each result still needs to be parsed by the wiki.page() method, which only takes one argument.
The solution? Use a loop to parse all results one by one.
The easiest way is a plain for loop, but a list comprehension is more concise.
Replace the last two lines with the following:
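The elided snippet presumably looked something like the following hedged sketch (the query string and variable names are illustrative, and the wikipedia package is assumed to be imported as wiki):

import wikipedia as wiki

results = wiki.search("BERT language model")   # placeholder query

# Parse every search result into a full page object, one title at a time
# (some titles may raise a DisambiguationError; handle as needed).
pages = [wiki.page(title) for title in results]

for page in pages:
    print(page.title, page.url)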
QUESTION
I am following this tutorial here: https://huggingface.co/transformers/training.html - though I am running into an error, and I think the tutorial is missing an import, but I do not know which one.
These are my current imports:
...ANSWER
Answered 2021-Jun-14 at 15:08 The error states that you do not have a variable called sentences in scope. I believe the tutorial presumes you already have a list of sentences and are tokenizing it.
Have a look at the documentation: the first argument can be a string, a list of strings, or a list of lists of strings.
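A minimal sketch of what the tutorial assumes (the sentences list below is an illustrative placeholder you would replace with your own data):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# The tutorial assumes a list like this already exists.
sentences = [
    "Tesla's stock just increased by 20%.",
    "The market reacted calmly to the news.",
]

batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
print(batch["input_ids"].shape)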
QUESTION
I want to force the Huggingface transformer (BERT) to make use of CUDA.
nvidia-smi showed that my GPU was at 0% utilization while all my CPU cores were maxed out during execution. Unfortunately, I'm new to the Huggingface library as well as PyTorch and don't know where to place the CUDA attributes device = cuda:0 or .to(cuda:0).
The code below is basically a customized part of the German Sentiment BERT working example.
...ANSWER
Answered 2021-Jun-12 at 16:19 You can make the entire class inherit torch.nn.Module, like so:
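A hedged sketch of that pattern (the class name, the model id oliverguhr/german-sentiment-bert, and the method layout are illustrative, not the answerer's exact code): once everything lives inside an nn.Module, a single .to(device) moves all the weights to the GPU, and the inputs are moved in the forward pass.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class SentimentModel(torch.nn.Module):
    def __init__(self, model_name="oliverguhr/german-sentiment-bert"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)

    def forward(self, texts):
        # Tokenize onto the same device the model parameters live on.
        device = next(self.model.parameters()).device
        batch = self.tokenizer(texts, padding=True, truncation=True,
                               return_tensors="pt").to(device)
        logits = self.model(**batch).logits
        return torch.argmax(logits, dim=-1)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = SentimentModel().to(device)   # moves the underlying BERT weights to the GPU

with torch.no_grad():
    print(model(["Das Produkt ist wirklich gut."]))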
QUESTION
I am trying to use the pretrained SciBERT model (https://huggingface.co/allenai/scibert_scivocab_uncased) from Huggingface to predict masked words in scientific/biomedical text. This produces errors, and I am not sure how to move forward from this point.
Here is the code so far -
...ANSWER
Answered 2021-Jun-07 at 14:28 As the error message tells you, you need to use AutoModelForMaskedLM:
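A minimal sketch of that fix (the example sentence is illustrative): load SciBERT with AutoModelForMaskedLM and let the fill-mask pipeline predict the masked token.

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_uncased")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

text = f"The enzyme catalyzes the {tokenizer.mask_token} of the substrate."
for prediction in fill_mask(text):
    print(prediction["token_str"], round(prediction["score"], 4))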
QUESTION
I have followed this tutorial for masked language modelling from Hugging Face using BERT, but I am unsure how to actually deploy the model.
Tutorial: https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb
I have trained the model using my own dataset, which has worked fine, but I don't know how to actually use the model, as the notebook does not include an example of how to do this, sadly.
Example of what I want to do with my trained model
On the Hugging Face website, this is the code used in the example; hence, I want to do this exact thing but with my model:
...ANSWER
Answered 2021-Jun-06 at 16:53 This depends a lot on your task. Your task seems to be masked language modelling, that is, predicting one or more masked words:
today I ate ___ .
(pizza) or (pasta) could be equally correct, so you cannot use a metric such as accuracy. But (water) should be less "correct" than the other two. So what you normally do is check how "surprised" the language model is on an evaluation data set. This metric is called perplexity. Therefore, before and after you fine-tune a model on your specific dataset, you would calculate the perplexity and expect it to be lower after fine-tuning, because the model has become more used to your specific vocabulary. That is how you test your model.
As you can see, they calculate the perplexity in the tutorial you mentioned:
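For reference, the perplexity computation in that notebook boils down to exponentiating the evaluation loss; a minimal sketch, assuming trainer is the Trainer built earlier in the notebook:

import math

# eval_loss is the mean cross-entropy over the evaluation set;
# perplexity is its exponential.
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")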
QUESTION
import time
import pandas as pd            # a DataFrame `df` with a 'STEMMED_TOKENS' column is assumed to exist
from gensim.models import Word2Vec

# Skip-gram model (sg = 1)
size = 1000
window = 3
min_count = 1
workers = 3
sg = 1

word2vec_model_file = 'word2vec_' + str(size) + '.model'
start_time = time.time()
stemmed_tokens = pd.Series(df['STEMMED_TOKENS']).values

# Train the Word2Vec model (note: in gensim >= 4.0 the `size` argument is named `vector_size`)
w2v_model = Word2Vec(stemmed_tokens, min_count=min_count, size=size,
                     workers=workers, window=window, sg=sg)
print("Time taken to train word2vec model: " + str(time.time() - start_time))
w2v_model.save(word2vec_model_file)
...ANSWER
Answered 2021-Jun-02 at 16:43 A vector size of 1000 dimensions is very uncommon and would require massive amounts of data to train. For example, the famous GoogleNews vectors were for 3 million words, trained on something like 100 billion corpus words - and still only 300 dimensions. Your STEMMED_TOKENS may not be enough data to justify 100-dimensional vectors, much less 300 or 1000.
A choice of min_count=1 is a bad idea. This algorithm can't learn anything valuable from words that only appear a few times. Typically people get better results by discarding rare words entirely, as the default min_count=5 will do. (If you have a lot of data, you're likely to increase this value to discard even more words.)
Are you examining the model's size or word-to-word results at all to ensure it's doing what you expect? Despite your column being named STEMMED_TOKENS, I don't see any actual splitting-into-tokens, and the Word2Vec class expects each text to be a list of strings, not a single string (see the sketch below).
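A minimal illustration of that point (toy data; gensim 4.x names, where `size` has become `vector_size`; min_count=1 only so this tiny corpus trains at all, the default of 5 is better on real data):

from gensim.models import Word2Vec

raw_texts = [
    "tesla stock increased sharply today",
    "the market fell after the report",
    "tesla shares fell in early trading",
]
# Word2Vec expects a list of token lists, not raw strings.
tokenized = [text.split() for text in raw_texts]

model = Word2Vec(tokenized, vector_size=50, window=3, min_count=1, sg=1)
print(model.wv.most_similar("tesla", topn=2))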
Finally, without seeing all your other choices for feeding word-vector-enriched data to your other classification steps, it is possible (likely even) that there are other errors there.
Given that a binary-classification model can always get at least 50% accuracy by simply classifying every example with whichever class is more common, any accuracy result less than 50% should immediately cause suspicions of major problems in your process like:
- misalignment of examples & labels
- insufficient/unrepresentative training data
- some steps not running at all due to data-prep or invocation errors
QUESTION
def tokenized_dataset(self, dataset):
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    print("\n" + "=" * 10, "Start Tokenizing", "=" * 10)
    start = time.process_time()
    train_articles = [self.encode(document, tokenizer) for document in dataset["train"]["article"]]
    test_articles = [self.encode(document, tokenizer) for document in dataset["test"]["article"]]
    val_articles = [self.encode(document, tokenizer) for document in dataset["val"]["article"]]
    train_abstracts = [self.encode(document, tokenizer) for document in dataset["train"]["abstract"]]
    test_abstracts = [self.encode(document, tokenizer) for document in dataset["test"]["abstract"]]
    val_abstracts = [self.encode(document, tokenizer) for document in dataset["val"]["abstract"]]
    print("Time:", time.process_time() - start)
    print("=" * 10, "End Tokenizing", "=" * 10 + "\n")
    return {"train": (dataset["train"]["id"], train_articles, train_abstracts),
            "test": (dataset["test"]["id"], test_articles, test_abstracts),
            "val": (dataset["val"]["id"], val_articles, val_abstracts)}
...ANSWER
Answered 2021-Jun-01 at 15:17 You can easily do this by using plain Python functions.
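A hedged sketch of that kind of refactor (not the answerer's exact code; encode_fn stands in for the class's encode method): factor the repeated comprehension into one helper and build every split in a loop.

def encode_split(encode_fn, tokenizer, dataset, split, field):
    """Encode every document in dataset[split][field]."""
    return [encode_fn(document, tokenizer) for document in dataset[split][field]]

def tokenized_dataset(encode_fn, tokenizer, dataset):
    splits = {}
    for split in ("train", "test", "val"):
        splits[split] = (
            dataset[split]["id"],
            encode_split(encode_fn, tokenizer, dataset, split, "article"),
            encode_split(encode_fn, tokenizer, dataset, split, "abstract"),
        )
    return splits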
QUESTION
I’m trying to train a BERT model from scratch on my own dataset using the HuggingFace library. I would like to train the model so that it has the exact architecture of the original BERT model.
In the original paper, it stated that: “BERT is trained on two tasks: predicting randomly masked tokens (MLM) and predicting whether two sentences follow each other (NSP). SCIBERT follows the same architecture as BERT but is instead pretrained on scientific text.”
I’m trying to understand how to train the model on two tasks as above. At the moment, I initialised the model as below:
...ANSWER
Answered 2021-Feb-10 at 14:04 I would suggest doing the following:
- First, pre-train BERT on the MLM objective. HuggingFace provides a script especially for training BERT on the MLM objective on your own data. You can find it here. As you can see in the run_mlm.py script, they use AutoModelForMaskedLM, and you can specify any architecture you want.
- Second, if you want to train on the next sentence prediction task, you can define a BertForPreTraining model (which has both the MLM and NSP heads on top), then load in the weights from the model you trained in step 1, and further pre-train it on a next sentence prediction task (see the sketch below).
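A minimal sketch of that second step (the checkpoint path is a placeholder for the output directory of run_mlm.py): from_pretrained loads the encoder and MLM-head weights that match and randomly initializes the NSP head, which you then continue to pre-train.

from transformers import BertForPreTraining, BertTokenizerFast

# "path/to/mlm_checkpoint" is a placeholder for the run_mlm.py output directory.
model = BertForPreTraining.from_pretrained("path/to/mlm_checkpoint")
tokenizer = BertTokenizerFast.from_pretrained("path/to/mlm_checkpoint")

# From here, build NSP (sentence-pair) examples and continue pre-training,
# e.g. with the Trainer API or a custom training loop.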
UPDATE: apparently the next sentence prediction task did help improve performance of BERT on some GLUE tasks. See this talk by the author of BERT.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install BERT
You can use BERT like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.