bert | TensorFlow code and pre-trained models for BERT | Natural Language Processing library
kandi X-RAY | bert Summary
- Writes predictions
- Compute softmax
- Returns the n_best_size of the logits
- Return the final prediction
- Convert examples to features
- Convert a single example
- Return a string representation of text
- Truncate a sequence pair
- Validate flags
- Validate the case-insensitivity setting
- Returns a list of input examples
- Look up word embeddings
- Return a list of input examples
- Builds the input function
- Tokenize text
- Validates that the case matches the given checkpoint
- Build a file-based input function
- Create TrainingInstances
- Reads input_file
- Creates an attention mask from from_tensor
- Converts examples into features
- Reads squad examples
- Process a feature
- Write instances to example files
- Transformer model
- Embedding postprocessor
- Build a function for TPUEstimator
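The list above mirrors the repo's tokenization, modeling, and input-pipeline helpers. A hedged usage sketch of the repo's tokenizer follows; the vocab path is a placeholder for whichever checkpoint you downloaded, and it assumes the repository is on PYTHONPATH.
# Sketch: tokenize text with the repository's FullTokenizer
# (vocab path below is a placeholder for a downloaded BERT checkpoint).
import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt",
    do_lower_case=True)

tokens = tokenizer.tokenize("TensorFlow code and pre-trained models for BERT")
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(input_ids)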
bert Key Features
bert Examples and Code Snippets
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
@inproceedings{reimers-2020-multilingual-sentence-bert,
  title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2020",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/2004.09813",
}
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
pip install -U sentence-transformers
conda install -c conda-forge sentence-transformers
pip install -e .
def generator(X_data, y_data, batch_size):
    # Yield successive (X, y) batches from the dataframes indefinitely
    while True:
        for step in range(X_data.shape[0] // batch_size):
            start = step * batch_size
            end = (step + 1) * batch_size
            current_x = X_data.iloc[start:end]
            current_y = y_data.iloc[start:end]
            # Or, if it's a numpy array, just slice the rows directly
            yield current_x, current_y

batch_size = 32
number_of_steps = X.shape[0] // batch_size
Generator = generator(X, y, batch_size)

clf = xgb.XGBClassifier(max_depth=200, n_estimators=400, subsample=1,
                        learning_rate=0.07, reg_lambda=0.1, reg_alpha=0.1,
                        gamma=1)

for step in range(number_of_steps):
    X_g, y_g = next(Generator)
    clf.fit(X_g, y_g)
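One caveat not addressed in the snippet above: each fit call retrains the classifier from scratch on the latest batch only. If the intent is to keep learning across batches, XGBoost's xgb_model argument can continue training the previous booster; a hedged sketch, assuming the same Generator, number_of_steps, and clf as above:
# Sketch: warm-start each fit from the previous booster instead of
# refitting from scratch (assumes Generator, number_of_steps and clf
# are defined as above).
booster = None
for step in range(number_of_steps):
    X_g, y_g = next(Generator)
    clf.fit(X_g, y_g, xgb_model=booster)
    booster = clf.get_booster()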
# TF 1.x original:
class AdamWeightDecayOptimizer(tf.train.Optimizer):
# TF 2.x compatibility fix:
class AdamWeightDecayOptimizer(tf.compat.v1.train.Optimizer):
elif self.pooling == "mean":
    result = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)["sequence_output"]
    pooled = result
embedding_size = 768
in_id = Input(shape=(max_seq_length,), name="input_ids")
in_mask = Input(shape=(max_seq_length,), name="input_masks")
in_segment = Input(shape=(max_seq_length,), name="segment_ids")
bert_inputs = [in_id, in_mask, in_segment]
bert_output = BertLayer(n_fine_tune_layers=12, pooling="mean")(bert_inputs)
bert_output = Reshape((max_seq_length, embedding_size))(bert_output)
bilstm = Bidirectional(LSTM(128, dropout=0.2,recurrent_dropout=0.2,return_sequences=True))(bert_output)
output = Dense(output_size, activation="softmax")(bilstm)
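For completeness, a hedged sketch of wrapping the layers above into a trainable model. It assumes the snippet's Input/Dense layers come from tensorflow.keras and that BertLayer, max_seq_length, and output_size are already defined; the optimizer and loss are illustrative choices, not part of the original snippet.
# Sketch only: build and compile the model from the tensors defined above.
from tensorflow.keras.models import Model

model = Model(inputs=bert_inputs, outputs=output)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # assumes one-hot labels
              metrics=["accuracy"])
model.summary()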
embeddings = bert_model.bert.get_input_embeddings()
word_embeddings = embeddings.word_embeddings
inputs_embeds = tf.gather(word_embeddings, input_ids)
full_embeddings = embeddings(inputs=[None, None, token_type_ids, inputs_embeds])
inputs_embeds = result[-1][0]
embeddings = bert_model.bert.get_input_embeddings().word_embeddings
inputs_embeds = tf.gather(embeddings, input_ids)
Trending Discussions on bert
QUESTION
I cannot find anywhere how to convert a pandas dataframe to type datasets.dataset_dict.DatasetDict, for optimal use in a BERT workflow with a huggingface model. Take these simple dataframes, for example.
train_df = pd.DataFrame({
"label" : [1, 2, 3],
"text" : ["apple", "pear", "strawberry"]
})
test_df = pd.DataFrame({
"label" : [2, 2, 1],
"text" : ["banana", "pear", "apple"]
})
What is the most efficient way to convert these to the type above?
ANSWER
Answered 2022-Mar-25 at 15:47
One possibility is to first create two Datasets and then join them:
import datasets
import pandas as pd
train_df = pd.DataFrame({
"label" : [1, 2, 3],
"text" : ["apple", "pear", "strawberry"]
})
test_df = pd.DataFrame({
"label" : [2, 2, 1],
"text" : ["banana", "pear", "apple"]
})
# from_pandas builds a Dataset directly from a DataFrame
train_dataset = datasets.Dataset.from_pandas(train_df)
test_dataset = datasets.Dataset.from_pandas(test_df)
my_dataset_dict = datasets.DatasetDict({"train":train_dataset,"test":test_dataset})
The result is:
DatasetDict({
train: Dataset({
features: ['label', 'text'],
num_rows: 3
})
test: Dataset({
features: ['label', 'text'],
num_rows: 3
})
})
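Since the question mentions a BERT workflow, a hedged follow-up (the checkpoint name is only an example) showing how the resulting DatasetDict could be tokenized with a Hugging Face tokenizer:
# Sketch: tokenize every split of the DatasetDict for a BERT model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint

tokenized = my_dataset_dict.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)
print(tokenized["train"][0].keys())  # label, text, input_ids, token_type_ids, attention_mask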
QUESTION
What is the loss function used in Trainer from the Transformers library of Hugging Face?
I am trying to fine-tune a BERT model using the Trainer class from the Transformers library of Hugging Face.
In their documentation, they mention that one can specify a customized loss function by overriding the compute_loss method in the class. However, if I do not override the method and use the Trainer to fine-tune a BERT model directly for sentiment classification, what is the default loss function being used? Is it the categorical cross-entropy? Thanks!
ANSWER
Answered 2022-Mar-23 at 10:12
It depends! Especially given your relatively vague setup description, it is not clear what loss will be used. But to start from the beginning, let's first check what the default compute_loss() function in the Trainer class looks like.
You can find the corresponding function here, if you want to have a look for yourself (current version at time of writing is 4.17). The actual loss that will be returned with default parameters is taken from the model's output values:
loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
which means that the model itself is (by default) responsible for computing some sort of loss and returning it in outputs.
Following this, we can then look into the actual model definitions for BERT (source: here), and in particular check out the model that will be used in your Sentiment Analysis task (I assume a BertForSequenceClassification model).
The code relevant for defining a loss function looks like this:
if labels is not None:
    if self.config.problem_type is None:
        if self.num_labels == 1:
            self.config.problem_type = "regression"
        elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
            self.config.problem_type = "single_label_classification"
        else:
            self.config.problem_type = "multi_label_classification"

    if self.config.problem_type == "regression":
        loss_fct = MSELoss()
        if self.num_labels == 1:
            loss = loss_fct(logits.squeeze(), labels.squeeze())
        else:
            loss = loss_fct(logits, labels)
    elif self.config.problem_type == "single_label_classification":
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    elif self.config.problem_type == "multi_label_classification":
        loss_fct = BCEWithLogitsLoss()
        loss = loss_fct(logits, labels)
Based on this information, you should be able to either set the correct loss function yourself (by changing model.config.problem_type accordingly), or otherwise at least be able to determine whichever loss will be chosen, based on the hyperparameters of your task (number of labels, label scores, etc.).
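For completeness, a hedged sketch of the override route the question mentions; the class weights are purely illustrative and not part of the original answer:
# Sketch: subclass Trainer and override compute_loss with a custom loss.
import torch
from torch.nn import CrossEntropyLoss
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Example class weights; replace with values suited to your data
        loss_fct = CrossEntropyLoss(weight=torch.tensor([1.0, 2.0], device=logits.device))
        loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss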
QUESTION
I am following this tutorial on how to train a siamese BERT network:
https://keras.io/examples/nlp/semantic_similarity_with_bert/
All good, but I am not sure what is the best way to save the model after training it. Any suggestions?
I was trying with
model.save('models/bert_siamese_v1')
which creates a folder with saved_model.pb, keras_metadata.pb, and two subfolders (variables and assets)
then I try to load it with:
model.load_weights('models/bert_siamese_v1/')
and it gives me this error:
2022-03-08 14:11:52.567762: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open models/bert_siamese_v1/: Failed precondition: models/bert_siamese_v1; Is a directory: perhaps your file is in a different file format and you need to use a different restore operator?
what is the best way to proceed?
ANSWER
Answered 2022-Mar-08 at 16:13
Try using tf.saved_model.save to save your model:
tf.saved_model.save(model, 'models/bert_siamese_v1')
model = tf.saved_model.load('models/bert_siamese_v1')
The warning you get during saving can apparently be ignored. After loading your model, you can use it for inference f(test_data):
f = model.signatures["serving_default"]
x1 = tf.random.uniform((1, 128), maxval=100, dtype=tf.int32)
x2 = tf.random.uniform((1, 128), maxval=100, dtype=tf.int32)
x3 = tf.random.uniform((1, 128), maxval=100, dtype=tf.int32)
print(f)
print(f(attention_masks = x1, input_ids = x2, token_type_ids = x3))
ConcreteFunction signature_wrapper(*, token_type_ids, attention_masks, input_ids)
Args:
attention_masks: int32 Tensor, shape=(None, 128)
input_ids: int32 Tensor, shape=(None, 128)
token_type_ids: int32 Tensor, shape=(None, 128)
Returns:
{'dense': <1>}
<1>: float32 Tensor, shape=(None, 3)
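As an aside, since model.save wrote a SavedModel directory, reloading it with tf.keras.models.load_model (rather than load_weights, which expects a checkpoint) may also work; a hedged sketch using the path from the question:
# Sketch: reload the directory written by model.save() as a Keras model.
import tensorflow as tf

reloaded = tf.keras.models.load_model('models/bert_siamese_v1')
# reloaded can then be used for prediction like the original model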
QUESTION
Currently I'm able to train a Semantic Role Labeling model using the config file below. This config file is based on the one provided by AllenNLP and works for the default bert-base-uncased model and also GroNLP/bert-base-dutch-cased.
{
  "dataset_reader": {
    "type": "srl_custom",
    "bert_model_name": "GroNLP/bert-base-dutch-cased"
  },
  "data_loader": {
    "batch_sampler": {
      "type": "bucket",
      "batch_size": 32
    }
  },
  "train_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
  "validation_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
  "model": {
    "type": "srl_bert",
    "embedding_dropout": 0.1,
    "bert_model": "GroNLP/bert-base-dutch-cased"
  },
  "trainer": {
    "optimizer": {
      "type": "huggingface_adamw",
      "lr": 5e-5,
      "correct_bias": false,
      "weight_decay": 0.01,
      "parameter_groups": [
        [
          [
            "bias",
            "LayerNorm.bias",
            "LayerNorm.weight",
            "layer_norm.weight"
          ],
          {
            "weight_decay": 0.0
          }
        ]
      ]
    },
    "learning_rate_scheduler": {
      "type": "slanted_triangular"
    },
    "checkpointer": {
      "keep_most_recent_by_count": 2
    },
    "grad_norm": 1.0,
    "num_epochs": 3,
    "validation_metric": "+f1-measure-overall"
  }
}
Swapping the values of the bert_model_name and bert_model parameters from GroNLP/bert-base-dutch-cased to roberta-base won't work out of the box, since the SRL datareader only supports the BertTokenizer and not the RobertaTokenizer. So I changed the config file to the following:
{
  "dataset_reader": {
    "type": "srl_custom",
    "token_indexers": {
      "tokens": {
        "type": "pretrained_transformer",
        "model_name": "roberta-base"
      }
    }
  },
  "data_loader": {
    "batch_sampler": {
      "type": "bucket",
      "batch_size": 32
    }
  },
  "train_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
  "validation_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
  "model": {
    "type": "srl_bert",
    "embedding_dropout": 0.1,
    "bert_model": "roberta-base"
  },
  "trainer": {
    "optimizer": {
      "type": "huggingface_adamw",
      "lr": 5e-5,
      "correct_bias": false,
      "weight_decay": 0.01,
      "parameter_groups": [
        [
          [
            "bias",
            "LayerNorm.bias",
            "LayerNorm.weight",
            "layer_norm.weight"
          ],
          {
            "weight_decay": 0.0
          }
        ]
      ]
    },
    "learning_rate_scheduler": {
      "type": "slanted_triangular"
    },
    "checkpointer": {
      "keep_most_recent_by_count": 2
    },
    "grad_norm": 1.0,
    "num_epochs": 15,
    "validation_metric": "+f1-measure-overall"
  }
}
However, this is still not working. I'm receiving the following error:
2022-02-22 16:19:34,122 - INFO - allennlp.training.gradient_descent_trainer - Training
0%| | 0/1546 [00:00
sys.exit(run())
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\__main__.py", line 39, in run
main(prog="allennlp")
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\__init__.py", line 119, in main
args.func(args)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 111, in train_model_from_args
train_model_from_file(
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 177, in train_model_from_file
return train_model(
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 258, in train_model
model = _train_worker(
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 508, in _train_worker
metrics = train_loop.run()
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 581, in run
return self.trainer.train()
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 771, in train
metrics, epoch = self._try_train()
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 793, in _try_train
train_metrics = self._train_epoch(epoch)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 510, in _train_epoch
batch_outputs = self.batch_outputs(batch, for_training=True)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 403, in batch_outputs
output_dict = self._pytorch_model(**batch)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp_models\structured_prediction\models\srl_bert.py", line 141, in forward
bert_embeddings, _ = self.bert_model(
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\transformers\models\bert\modeling_bert.py", line 989, in forward
embedding_output = self.embeddings(
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\transformers\models\bert\modeling_bert.py", line 215, in forward
token_type_embeddings = self.token_type_embeddings(token_type_ids)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\sparse.py", line 156, in forward
return F.embedding(
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\functional.py", line 1916, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
I don't fully understand what's going wrong and couldn't find any documentation on how to change the config file to load in a 'custom' BERT/RoBERTa model (one that's not mentioned here). I'm running the default allennlp train config.jsonnet command to start training. allennlp train config.jsonnet --dry-run produces no errors, however.
Thanks in advance! Thijs
EDIT: I've now swapped out "srl_bert" for a custom "srl_roberta" class that inherits from it, to make use of the RobertaModel. This, however, still produces the same error.
EDIT 2: I'm now using the AutoTokenizer as suggested by Dirk Groeneveld. It looks like changing the SrlReader class to support RoBERTa-based models involves many more changes, such as swapping BERT's wordpiece tokenizer for RoBERTa's BPE tokenizer. Is there an easy way to adapt the SrlReader class, or is it better to write a new RobertaSrlReader from scratch?
I've inherited the SrlReader class and changed this line to the following:
self.bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
It produces the following error since RoBERTa tokenization differs from BERT:
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp_models\structured_prediction\dataset_readers\srl.py", line 255, in text_to_instance
wordpieces, offsets, start_offsets = self._wordpiece_tokenize_input(
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp_models\structured_prediction\dataset_readers\srl.py", line 196, in _wordpiece_tokenize_input
word_pieces = self.bert_tokenizer.wordpiece_tokenizer.tokenize(token)
AttributeError: 'RobertaTokenizerFast' object has no attribute 'wordpiece_tokenizer'
ANSWER
Answered 2022-Feb-24 at 02:14
The easiest way to resolve this is to patch SrlReader so that it uses PretrainedTransformerTokenizer (from AllenNLP) or AutoTokenizer (from Huggingface) instead of BertTokenizer. SrlReader is an old class, and was written against an old version of the Huggingface tokenizer API, so it's not so easy to upgrade.
If you want to submit a pull request in the AllenNLP project, I'd be happy to help you get it merged into AllenNLP!
QUESTION
I have a simple transformers script looking like this.
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs
args = Seq2SeqArgs()
args.num_train_epoch=5
model = Seq2SeqModel(
"roberta",
"roberta-base",
"bert-base-cased",
)
import pandas as pd
df = pd.read_csv('english-french.csv')
df['input_text'] = df['english'].values
df['target_text'] =df['french'].values
model.train_model(df.head(1000))
print(model.eval_model(df.tail(10)))
The eval_loss is {'eval_loss': 0.0001931049264385365}
However when I run my prediction script
to_predict = ["They went to the public swimming pool."]
predictions=model.predict(to_predict)
I get this
['']
The dataset I used is here
I'm very confused on the output. Any help or explanation why it returns nothing would be much appreciated.
ANSWER
Answered 2022-Feb-22 at 11:54
Use this model instead.
model = Seq2SeqModel(
    encoder_decoder_type="marian",
    encoder_decoder_name="Helsinki-NLP/opus-mt-en-mul",
    args=args,
    use_cuda=True,
)
RoBERTa is not a good option for your task.
I have rewritten your code in this Colab notebook.
Results
# Input
to_predict = ["They went to the public swimming pool.", "she was driving the shiny black car."]
predictions = model.predict(to_predict)
print(predictions)
# Output
['Ils aient cher à la piscine publice.', 'elle conduit la véricine noir glancer.']
QUESTION
I have a corpus of synonyms and non-synonyms. These are stored in a list of Python dictionaries like {"sentence1": <text>, "sentence2": <text>, "label": <1.0 or 0.0>}. Note that these words (or sentences) do not have to be a single token in the tokenizer.
I want to fine-tune a BERT-based model to take both sentences like [[CLS], <sentence1 tokens>, ..., [SEP], <sentence2 tokens>, ..., [SEP]] and predict the "label" (a measurement between 0.0 and 1.0).
What is the best approach to organize this data to facilitate the fine-tuning of the huggingface transformer?
ANSWER
Answered 2022-Feb-02 at 14:58
You can use the Tokenizer __call__ method to join both sentences when encoding them.
In case you're using the PyTorch implementation, here is an example:
import torch
from transformers import AutoTokenizer
sentences1 = ... # List containing all sentences 1
sentences2 = ... # List containing all sentences 2
labels = ... # List containing all labels (0 or 1)
TOKENIZER_NAME = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
encodings = tokenizer(
    sentences1,
    sentences2,
    padding=True,       # pad to the longest pair so the batch can be returned as tensors
    truncation=True,
    return_tensors="pt"
)
labels = torch.tensor(labels)
Then you can create your custom Dataset to use it for training:
class CustomRealDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: value[idx] for key, value in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)
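A short usage sketch of the pieces above; the DataLoader settings are illustrative assumptions, not part of the original answer:
# Sketch: wrap the encodings and labels and iterate over mini-batches.
from torch.utils.data import DataLoader

dataset = CustomRealDataset(encodings, labels)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

batch = next(iter(loader))
print(batch["input_ids"].shape, batch["labels"].shape)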
QUESTION
I am getting the following error: AttributeError: 'DataFrame' object has no attribute 'data_type'. I am trying to recreate the code from this link, which is based on this article, with my own dataset, which is similar to the one in the article.
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(df.index.values,
df.label.values,
test_size=0.15,
random_state=42,
stratify=df.label.values)
df['data_type'] = ['not_set']*df.shape[0]
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'
df.groupby(['Conference', 'label', 'data_type']).count()
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
do_lower_case=True)
encoded_data_train = tokenizer.batch_encode_plus(
df[df.data_type=='train'].example.values,
add_special_tokens=True,
return_attention_mask=True,
pad_to_max_length=True,
max_length=256,
return_tensors='pt'
)
and this is the error I get:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_24180/2662883887.py in
3
4 encoded_data_train = tokenizer.batch_encode_plus(
----> 5 df[df.data_type=='train'].example.values,
6 add_special_tokens=True,
7 return_attention_mask=True,
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5485 ):
5486 return self[name]
-> 5487 return object.__getattribute__(self, name)
5488
5489 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'data_type'
I am using Python 3.9, PyTorch 1.10.1, pandas 1.3.5, and transformers 4.15.0.
ANSWER
Answered 2022-Jan-10 at 08:41
The error means you have no data_type column in your dataframe, because you missed this step:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(df.index.values,
df.label.values,
test_size=0.15,
random_state=42,
stratify=df.label.values)
df['data_type'] = ['not_set']*df.shape[0] # <- HERE
df.loc[X_train, 'data_type'] = 'train' # <- HERE
df.loc[X_val, 'data_type'] = 'val' # <- HERE
df.groupby(['Conference', 'label', 'data_type']).count()
Demo
- Setup
import pandas as pd
from sklearn.model_selection import train_test_split
# The Data
df = pd.read_csv('data/title_conference.csv')
df['label'] = pd.factorize(df['Conference'])[0]
# Train and Validation Split
X_train, X_val, y_train, y_val = train_test_split(df.index.values,
df.label.values,
test_size=0.15,
random_state=42,
stratify=df.label.values)
df['data_type'] = ['not_set']*df.shape[0]
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'
- Code
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
do_lower_case=True)
encoded_data_train = tokenizer.batch_encode_plus(
df[df.data_type=='train'].Title.values,
add_special_tokens=True,
return_attention_mask=True,
pad_to_max_length=True,
max_length=256,
return_tensors='pt'
)
Output:
>>> encoded_data_train
{'input_ids': tensor([[ 101, 8144, 1999, ..., 0, 0, 0],
[ 101, 2152, 2836, ..., 0, 0, 0],
[ 101, 22454, 25806, ..., 0, 0, 0],
...,
[ 101, 1037, 2047, ..., 0, 0, 0],
[ 101, 13229, 7375, ..., 0, 0, 0],
[ 101, 2006, 1996, ..., 0, 0, 0]]), 'token_type_ids': tensor([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
...,
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]])}
QUESTION
I am attempting to fine-tune a BERT model on Google Colab from TensorFlow Hub using this link.
However, I run into the following error:
InternalError: RET_CHECK failure (third_party/tensorflow/core/tpu/graph_rewrite/distributed_tpu_rewrite_pass.cc:2047) arg_shape.handle_type != DT_INVALID input edge: [id=2693 model_preprocessing_67660:0 -> cluster_train_function:628]
when I run my model.fit(...) call.
This error only occurs when I try to use TPU (runs fine on CPU, but has a very long training time).
Here is my code for setting up the TPU and model:
TPU Setup:
import os
os.environ["TFHUB_MODEL_LOAD_FORMAT"]="UNCOMPRESSED"
cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(cluster_resolver)
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
strategy = tf.distribute.TPUStrategy(cluster_resolver)
Model Setup:
def build_classifier_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer('https://tfhub.dev/google/experts/bert/wiki_books/sst2/2', trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
    return tf.keras.Model(text_input, net)
Model Training
with strategy.scope():
    bert_model = build_classifier_model()
    loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    metrics = tf.metrics.BinaryAccuracy()
    epochs = 1
    steps_per_epoch = 1280000
    num_train_steps = steps_per_epoch * epochs
    num_warmup_steps = int(0.1*num_train_steps)
    init_lr = 3e-5
    optimizer = optimization.create_optimizer(init_lr=init_lr,
                                              num_train_steps=num_train_steps,
                                              num_warmup_steps=num_warmup_steps,
                                              optimizer_type='adamw')
    bert_model.compile(optimizer=optimizer,
                       loss=loss,
                       metrics=metrics)

print(f'Training model')
history = bert_model.fit(x=X_train, y=y_train,
                         validation_data=(X_val, y_val),
                         epochs=epochs)
Note that X_train is a numpy array of type str with shape (1280000,), and y_train is a numpy array of shape (1280000, 1).
ANSWER
Answered 2021-Dec-31 at 08:18
As I don't know exactly what changes you have made in the code, I don't have any idea about your dataset. But I can see that you are trying to train the whole dataset with one epoch and passing the steps per epoch directly. I would recommend writing it like this:
Set some batch_size as a power of 2 (for example 16 or 32); if you don't want to batch the dataset, just set batch_size to 1.
batch_size = 16
steps_per_epoch = training_data_size // batch_size
The problem with the code is most probably the training dataset size. I think you're making a mistake by passing the size of the training dataset manually.
If you're loading the dataset from tfds use (as shown in the link):
train_dataset, train_data_size = load_dataset_from_tfds(
in_memory_ds, tfds_info, train_split, batch_size, bert_preprocess_model)
If you're using a custom dataset, store the size of the cleaned dataset in a variable and then use that variable for the training data size. Try to avoid hard-coding values as far as possible.
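To make that concrete, a hedged sketch follows; the batch size and the tf.data pipeline are assumptions, and the variable names follow the question:
# Sketch: batch the arrays with tf.data and derive steps_per_epoch from the
# actual data size instead of hard-coding 1280000.
import tensorflow as tf

batch_size = 32
train_data_size = len(X_train)              # X_train / y_train from the question
steps_per_epoch = train_data_size // batch_size

train_ds = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
            .shuffle(10_000)
            .batch(batch_size)
            .prefetch(tf.data.AUTOTUNE))

history = bert_model.fit(train_ds,
                         epochs=epochs,
                         steps_per_epoch=steps_per_epoch)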
QUESTION
I have several masked language models (mainly Bert, Roberta, Albert, Electra). I also have a dataset of sentences. How can I get the perplexity of each sentence?
From the huggingface documentation here they mentioned that perplexity "is not well defined for masked language models like BERT", though I still see people somehow calculate it.
For example in this SO question they calculated it using the function
def score(model, tokenizer, sentence, mask_token_id=103):
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, 103)
    labels = repeat_input.masked_fill(masked_input != 103, -100)
    loss, _ = model(masked_input, masked_lm_labels=labels)
    result = np.exp(loss.item())
    return result

score(model, tokenizer, '我爱你') # returns 45.63794545581973
However, when I try to use the code I get TypeError: forward() got an unexpected keyword argument 'masked_lm_labels'.
I tried it with a couple of my models:
from transformers import pipeline, BertForMaskedLM, AutoTokenizer, RobertaForMaskedLM, AlbertForMaskedLM, ElectraForMaskedLM
import torch
1)
tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-cased-v1.0")
model = BertForMaskedLM.from_pretrained("bioformers/bioformer-cased-v1.0")
2)
tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ELECTRA-Large-Generator")
model = ElectraForMaskedLM.from_pretrained("sultan/BioM-ELECTRA-Large-Generator")
This SO question also used the masked_lm_labels as an input and it seemed to work somehow.
ANSWER
Answered 2021-Dec-25 at 21:51
There is a paper, Masked Language Model Scoring, that explores pseudo-perplexity from masked language models and shows that pseudo-perplexity, while not being theoretically well justified, still performs well for comparing the "naturalness" of texts.
As for the code, your snippet is perfectly correct but for one detail: in recent implementations of Huggingface BERT, masked_lm_labels has been renamed to simply labels, to make the interfaces of various models more compatible. I have also replaced the hard-coded 103 with the generic tokenizer.mask_token_id. So the snippet below should work:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
import numpy as np
model_name = 'cointegrated/rubert-tiny'
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
def score(model, tokenizer, sentence):
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
    labels = repeat_input.masked_fill(masked_input != tokenizer.mask_token_id, -100)
    with torch.inference_mode():
        loss = model(masked_input, labels=labels).loss
    return np.exp(loss.item())
print(score(sentence='London is the capital of Great Britain.', model=model, tokenizer=tokenizer))
# 4.541251105675365
print(score(sentence='London is the capital of South America.', model=model, tokenizer=tokenizer))
# 6.162017238332462
You can try this code in Google Colab by running this gist.
QUESTION
So what I want to do is identify the 1st node in some subtree of an XML tree.
Here's an example.
Now I want the 1st person mentioned per road.
So let's have a go...
/root/road/households/household/occupants/person[1]/@name
That returns the 1st person per occupants node.
Let's try
(/root/road/households/household/occupants/person)[1]/@name
That returns the 1st person in the whole tree.
What I sort of want to do is:
/root/road/(households/household/occupants/person)[1]/@name
i.e. take the 1st person in the set of people in a road,
but that's not valid XPath 1.0.
ANSWER
Answered 2021-Dec-23 at 19:40
This seems to be what you're after, using the descendant axis:
/root/road/descendant::person[1]/@name
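To illustrate the difference in Python, a hedged sketch with lxml; the XML below is an invented document matching the paths in the question, not the asker's actual data:
# Sketch: evaluate the accepted XPath with lxml on a made-up document that
# mirrors the structure implied by the question's paths.
from lxml import etree

xml = b"""
<root>
  <road>
    <households>
      <household><occupants><person name="Alice"/><person name="Bob"/></occupants></household>
      <household><occupants><person name="Carol"/></occupants></household>
    </households>
  </road>
  <road>
    <households>
      <household><occupants><person name="Dave"/></occupants></household>
    </households>
  </road>
</root>
"""

tree = etree.fromstring(xml)
# First person per road, via the descendant axis
print(tree.xpath("/root/road/descendant::person[1]/@name"))   # ['Alice', 'Dave']
# First person in the whole tree, via the parenthesised expression
print(tree.xpath("(/root/road/households/household/occupants/person)[1]/@name"))  # ['Alice']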
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install bert
You can use bert like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.
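A hedged example of a typical setup (the repository URL is the public google-research mirror; the requirements file name is an assumption about the repo layout):
# Sketch: clone the repository and install its dependencies in a virtualenv.
python -m venv .venv && source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
git clone https://github.com/google-research/bert.git
cd bert
pip install -r requirements.txt   # assumes the repo ships a requirements.txt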