bert | TensorFlow code and pre-trained models for BERT | Natural Language Processing library

by google-research · Python · Version: Current · License: Apache-2.0

kandi X-RAY | bert Summary

bert is a Python library typically used in Institutions, Learning, Education, Artificial Intelligence, Natural Language Processing, TensorFlow, BERT, Neural Network, and Transformer applications. bert has no reported bugs or vulnerabilities, has a build file available, carries a Permissive License, and has high support. You can install it with 'pip install bert' or download it from GitHub or PyPI.
BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. Our academic paper which describes BERT in detail and provides full results on a number of tasks can be found here:

                      kandi-support Support

bert has a highly active ecosystem.
It has 33,500 stars, 9,099 forks, and 989 watchers.
It has had no major release in the last 6 months.
There are 773 open issues and 349 closed issues. On average, issues are closed in 180 days. There are 95 open pull requests and 0 closed pull requests.
It has a positive sentiment in the developer community.
The latest version of bert is current.

                                  kandi-Quality Quality

bert has 0 bugs and 0 code smells.

                                              kandi-Security Security

bert has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
bert code analysis shows 0 unresolved vulnerabilities.
There are 0 security hotspots that need review.

                                                          kandi-License License

bert is licensed under the Apache-2.0 License. This license is Permissive.
Permissive licenses have the least restrictions, and you can use them in most projects.

                                                                      kandi-Reuse Reuse

bert releases are not available. You will need to build from source code and install.
A deployable package is available on PyPI.
A build file is available, so you can build the component from source.
Installation instructions are not available. Examples and code snippets are available.
bert saves you 1764 person hours of effort in developing the same functionality from scratch.
It has 3902 lines of code, 187 functions and 13 files.
It has high code complexity. Code complexity directly impacts the maintainability of the code.
                                                                                  Top functions reviewed by kandi - BETA
kandi has reviewed bert and identified the functions below as its top functions. This is intended to give you instant insight into the functionality bert implements and to help you decide if it suits your requirements.
• Write predictions
  • Compute softmax
  • Return the indexes of the n_best_size largest logits
  • Return the final prediction
• Convert examples to features
  • Convert a single example
  • Return a string representation of text
  • Truncate a sequence pair
• Validate flags
  • Validate the case-sensitivity setting
• Return a list of input examples
• Look up word embeddings
• Return a list of input examples
• Build the input function
• Tokenize text
• Validate that the case matches the given checkpoint
• Build a file-based input function
• Create training instances
• Read the input file
• Create an attention mask from from_tensor
• Convert examples into features
• Read SQuAD examples
• Process a feature
• Write instances to example files
• Build the Transformer model
• Apply the embedding postprocessor
• Build a model function for TPUEstimator
Get all kandi verified functions for this library.
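For a quick feel of the tokenization functions listed above, here is a minimal sketch using the repository's tokenization module. It assumes the bert-tensorflow pip package and the vocab.txt that ships with a downloaded checkpoint (the path below is an assumption):

from bert import tokenization   # pip install bert-tensorflow

# vocab.txt comes with every pre-trained BERT checkpoint (path is an assumption)
tokenizer = tokenization.FullTokenizer(vocab_file="uncased_L-12_H-768_A-12/vocab.txt",
                                       do_lower_case=True)

tokens = tokenizer.tokenize("BERT splits text into WordPieces.")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)   # e.g. ['bert', 'splits', 'text', 'into', 'word', '##piece', '##s', '.']
print(ids)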

                                                                                  bert Key Features

                                                                                  TensorFlow code and pre-trained models for BERT

                                                                                  bert Examples and Code Snippets

                                                                                  
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}

@inproceedings{reimers-2020-multilingual-sentence-bert,
  title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2020",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/2004.09813",
}
                                                                                  
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
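A common next step is comparing the embeddings. A small sketch with plain NumPy (cosine similarity of the first two sentences; numpy is the only added dependency):

import numpy as np

a, b = sentence_embeddings[0], sentence_embeddings[1]
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print("Cosine similarity:", cos_sim)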
                                                                                  
                                                                                                                      pip install -U sentence-transformers
                                                                                  conda install -c conda-forge sentence-transformers
                                                                                  pip install -e .
                                                                                  Can't import bert.tokenization
Python · Lines of Code: 2 · License: Strong Copyleft (CC BY-SA 4.0)
                                                                                  !pip install bert-tensorflow
                                                                                  
                                                                                  How can I train an XGBoost with a generator?
Python · Lines of Code: 21 · License: Strong Copyleft (CC BY-SA 4.0)
import xgboost as xgb

def generator(X_data, y_data, batch_size):
    while True:
        for step in range(X_data.shape[0] // batch_size):
            start = step * batch_size
            end = (step + 1) * batch_size        # was step*(batch_size+1), an off-by-one bug
            current_x = X_data.iloc[start:end]   # slice the whole batch, not a single row
            current_y = y_data.iloc[start:end]
            # or, if X_data/y_data are numpy arrays, just index the rows: X_data[start:end]
            yield current_x, current_y

batch_size = 32
batch_gen = generator(X, y, batch_size)          # batch_size was missing from the original call
number_of_steps = X.shape[0] // batch_size

clf = xgb.XGBClassifier(max_depth=200, n_estimators=400, subsample=1, learning_rate=0.07,
                        reg_lambda=0.1, reg_alpha=0.1, gamma=1)

for step in range(number_of_steps):              # was "for step in number_of_steps" (an int is not iterable)
    X_g, y_g = next(batch_gen)
    clf.fit(X_g, y_g)
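Note that each clf.fit() call above starts training from scratch on a single batch. A hedged sketch of carrying training across batches, assuming the scikit-learn wrapper's xgb_model parameter (which accepts a previously fitted booster):

booster = None
for step in range(number_of_steps):
    X_g, y_g = next(batch_gen)
    clf.fit(X_g, y_g, xgb_model=booster)   # continue boosting from the previous rounds
    booster = clf.get_booster()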
                                                                                  
                                                                                  Error importing BERT: module 'tensorflow._api.v2.train' has no attribute 'Optimizer'
Python · Lines of Code: 4 · License: Strong Copyleft (CC BY-SA 4.0)
# Before (fails on TensorFlow 2.x, where tf.train.Optimizer no longer exists):
class AdamWeightDecayOptimizer(tf.train.Optimizer):

# After (use the TF1 compatibility alias):
class AdamWeightDecayOptimizer(tf.compat.v1.train.Optimizer):
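In TensorFlow 2.x the 1.x APIs live under tf.compat.v1, so editing the class definition in bert's optimization.py as shown above (or pinning tensorflow 1.x) resolves the import error.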
                                                                                  
elif self.pooling == "mean":
    # inside the custom BertLayer from the question: "mean" pooling keeps the
    # full per-token sequence output instead of the pooled [CLS] vector
    result = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)["sequence_output"]
    pooled = result

embedding_size = 768
in_id = Input(shape=(max_seq_length,), name="input_ids")
in_mask = Input(shape=(max_seq_length,), name="input_masks")
in_segment = Input(shape=(max_seq_length,), name="segment_ids")

bert_inputs = [in_id, in_mask, in_segment]
bert_output = BertLayer(n_fine_tune_layers=12, pooling="mean")(bert_inputs)
bert_output = Reshape((max_seq_length, embedding_size))(bert_output)

# per-token BERT embeddings feed a BiLSTM, then a softmax over output_size classes
bilstm = Bidirectional(LSTM(128, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))(bert_output)
output = Dense(output_size, activation="softmax")(bilstm)
                                                                                  
                                                                                  
                                                                                  Getting embedding lookup result from BERT
Python · Lines of Code: 6 · License: Strong Copyleft (CC BY-SA 4.0)
# embedding layer of a Hugging Face TF BERT model
embeddings = bert_model.bert.get_input_embeddings()
word_embeddings = embeddings.word_embeddings
# static per-token vectors, before position/segment embeddings are added
inputs_embeds = tf.gather(word_embeddings, input_ids)

# full embeddings: word + position + token-type, with LayerNorm and dropout applied
full_embeddings = embeddings(inputs=[None, None, token_type_ids, inputs_embeds])
                                                                                  
                                                                                  HuggingFace BERT `inputs_embeds` giving unexpected result
Python · Lines of Code: 5 · License: Strong Copyleft (CC BY-SA 4.0)
# first hidden state of the question's model output, i.e. the embedding-layer output
inputs_embeds = result[-1][0]

# equivalent lookup taken directly from the embedding matrix
embeddings = bert_model.bert.get_input_embeddings().word_embeddings
inputs_embeds = tf.gather(embeddings, input_ids)
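To feed the gathered vectors back into the model, recent versions of transformers accept an inputs_embeds keyword in place of input_ids (a sketch; attention_mask is whatever mask matches your original inputs):

outputs = bert_model(inputs_embeds=inputs_embeds, attention_mask=attention_mask)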
                                                                                  
                                                                                  Saving a 'fine-tuned' bert model
Python · Lines of Code: 10 · License: Strong Copyleft (CC BY-SA 4.0)
                                                                                  python run_classifier.py \
                                                                                    --task_name=MRPC \
                                                                                    --do_predict=true \
                                                                                    --data_dir=$GLUE_DIR/MRPC \
                                                                                    --vocab_file=$BERT_BASE_DIR/vocab.txt \
                                                                                    --bert_config_file=$BERT_BASE_DIR/bert_config.json \
                                                                                    --init_checkpoint=$TRAINED_CLASSIFIER 
                                                                                  
                                                                                  bert-serving-start -model_dir $TRAINED_CLASSIFIER
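Note that bert-serving-start comes from the separate bert-as-service project (installed with pip install bert-serving-server), not from this repository; point -model_dir at the directory containing the fine-tuned checkpoint.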
                                                                                  
                                                                                  Community Discussions

                                                                                  Trending Discussions on bert

Convert pandas dataframe to datasetDict
What is the loss function used in Trainer from the Transformers library of Hugging Face?
how to save and load custom siamese bert model
How to change AllenNLP BERT based Semantic Role Labeling to RoBERTa in AllenNLP
Simple Transformers producing nothing?
Organize data for transformer fine-tuning
attributeerror: 'dataframe' object has no attribute 'data_type'
InternalError when using TPU for training Keras model
How to calculate perplexity of a sentence using huggingface masked language models?
XPath 1.0, 1st node in subtree

                                                                                  QUESTION

                                                                                  Convert pandas dataframe to datasetDict
                                                                                  Asked 2022-Mar-25 at 15:47

                                                                                  I cannot find anywhere how to convert a pandas dataframe to type datasets.dataset_dict.DatasetDict, for optimal use in a BERT workflow with a huggingface model. Take these simple dataframes, for example.

                                                                                  train_df = pd.DataFrame({
                                                                                       "label" : [1, 2, 3],
                                                                                       "text" : ["apple", "pear", "strawberry"]
                                                                                  })
                                                                                  
                                                                                  test_df = pd.DataFrame({
                                                                                       "label" : [2, 2, 1],
                                                                                       "text" : ["banana", "pear", "apple"]
                                                                                  })
                                                                                  

                                                                                  What is the most efficient way to convert these to the type above?

                                                                                  ANSWER

                                                                                  Answered 2022-Mar-25 at 15:47

                                                                                  One possibility is to first create two Datasets and then join them:

import datasets
import pandas as pd


train_df = pd.DataFrame({
     "label" : [1, 2, 3],
     "text" : ["apple", "pear", "strawberry"]
})

test_df = pd.DataFrame({
     "label" : [2, 2, 1],
     "text" : ["banana", "pear", "apple"]
})

# from_pandas builds a Dataset directly from a DataFrame; the original snippet
# called Dataset.from_dict, which expects a plain dict of columns and was never imported
train_dataset = datasets.Dataset.from_pandas(train_df)
test_dataset = datasets.Dataset.from_pandas(test_df)
my_dataset_dict = datasets.DatasetDict({"train": train_dataset, "test": test_dataset})
                                                                                  

                                                                                  The result is:

                                                                                  DatasetDict({
                                                                                      train: Dataset({
                                                                                          features: ['label', 'text'],
                                                                                          num_rows: 3
                                                                                      })
                                                                                      test: Dataset({
                                                                                          features: ['label', 'text'],
                                                                                          num_rows: 3
                                                                                      })
                                                                                  })
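Each split also converts back to a DataFrame when needed:

train_df_roundtrip = my_dataset_dict["train"].to_pandas()   # Dataset -> pandas DataFrame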
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/71618974

                                                                                  QUESTION

                                                                                  What is the loss function used in Trainer from the Transformers library of Hugging Face?
                                                                                  Asked 2022-Mar-23 at 10:12

                                                                                  What is the loss function used in Trainer from the Transformers library of Hugging Face?

I am trying to fine-tune a BERT model using the Trainer class from the Transformers library of Hugging Face.

In their documentation, they mention that one can specify a customized loss function by overriding the compute_loss method of the class. However, if I do not override the method and use the Trainer to fine-tune a BERT model directly for sentiment classification, what is the default loss function being used? Is it categorical cross-entropy? Thanks!

                                                                                  ANSWER

                                                                                  Answered 2022-Mar-23 at 10:12

It depends! Especially given your relatively vague setup description, it is not clear which loss will be used. But to start from the beginning, let's first check what the default compute_loss() function in the Trainer class looks like.

                                                                                  You can find the corresponding function here, if you want to have a look for yourself (current version at time of writing is 4.17). The actual loss that will be returned with default parameters is taken from the model's output values:

                                                                                  loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

                                                                                  which means that the model itself is (by default) responsible for computing some sort of loss and returning it in outputs.

Following this, we can look into the actual model definitions for BERT (source: here), and in particular the model that will be used in your sentiment analysis task (I assume a BertForSequenceClassification model).

                                                                                  The code relevant for defining a loss function looks like this:

                                                                                  if labels is not None:
                                                                                      if self.config.problem_type is None:
                                                                                          if self.num_labels == 1:
                                                                                              self.config.problem_type = "regression"
                                                                                          elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                                                                                              self.config.problem_type = "single_label_classification"
                                                                                          else:
                                                                                              self.config.problem_type = "multi_label_classification"
                                                                                  
                                                                                      if self.config.problem_type == "regression":
                                                                                          loss_fct = MSELoss()
                                                                                          if self.num_labels == 1:
                                                                                              loss = loss_fct(logits.squeeze(), labels.squeeze())
                                                                                          else:
                                                                                              loss = loss_fct(logits, labels)
                                                                                      elif self.config.problem_type == "single_label_classification":
                                                                                          loss_fct = CrossEntropyLoss()
                                                                                          loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
                                                                                      elif self.config.problem_type == "multi_label_classification":
                                                                                          loss_fct = BCEWithLogitsLoss()
                                                                                          loss = loss_fct(logits, labels)
                                                                                  
                                                                                  

Based on this information, you should be able to either set the correct loss function yourself (by changing model.config.problem_type accordingly) or at least determine which loss will be chosen, based on the hyperparameters of your task (number of labels, label dtypes, etc.).
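For reference, a minimal sketch of the override route mentioned in the question, assuming the Trainer signature of transformers ~4.17 (the class weights here are purely illustrative):

import torch
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")        # pull labels out so the model skips its built-in loss
        outputs = model(**inputs)
        logits = outputs.logits
        # hypothetical per-class weights, shown only to motivate the override
        loss_fct = torch.nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0], device=logits.device))
        loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss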

                                                                                  Source https://stackoverflow.com/questions/71581197

                                                                                  QUESTION

                                                                                  how to save and load custom siamese bert model
                                                                                  Asked 2022-Mar-09 at 10:34

                                                                                  I am following this tutorial on how to train a siamese bert network:

                                                                                  https://keras.io/examples/nlp/semantic_similarity_with_bert/

All good, but I am not sure of the best way to save the model after training it. Any suggestions?

                                                                                  I was trying with

                                                                                  model.save('models/bert_siamese_v1')

which creates a folder containing saved_model.pb and keras_metadata.pb plus two subfolders (variables and assets).

                                                                                  then I try to load it with:

                                                                                  model.load_weights('models/bert_siamese_v1/')
                                                                                  

                                                                                  and it gives me this error:

                                                                                  2022-03-08 14:11:52.567762: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open models/bert_siamese_v1/: Failed precondition: models/bert_siamese_v1; Is a directory: perhaps your file is in a different file format and you need to use a different restore operator?
                                                                                  

                                                                                  what is the best way to proceed?

                                                                                  ANSWER

                                                                                  Answered 2022-Mar-08 at 16:13

                                                                                  Try using tf.saved_model.save to save your model:

                                                                                  tf.saved_model.save(model, 'models/bert_siamese_v1')
                                                                                  model = tf.saved_model.load('models/bert_siamese_v1')
                                                                                  

The warning you get during saving can apparently be ignored. After loading your model, you can use its serving signature for inference:

                                                                                  f = model.signatures["serving_default"]
                                                                                  x1 = tf.random.uniform((1, 128), maxval=100, dtype=tf.int32)
                                                                                  x2 = tf.random.uniform((1, 128), maxval=100, dtype=tf.int32)
                                                                                  x3 = tf.random.uniform((1, 128), maxval=100, dtype=tf.int32)
                                                                                  print(f)
                                                                                  print(f(attention_masks = x1, input_ids = x2, token_type_ids = x3))
                                                                                  
                                                                                  ConcreteFunction signature_wrapper(*, token_type_ids, attention_masks, input_ids)
                                                                                    Args:
                                                                                      attention_masks: int32 Tensor, shape=(None, 128)
                                                                                      input_ids: int32 Tensor, shape=(None, 128)
                                                                                      token_type_ids: int32 Tensor, shape=(None, 128)
                                                                                    Returns:
                                                                                      {'dense': <1>}
                                                                                        <1>: float32 Tensor, shape=(None, 3)
                                                                                  {'dense': }
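Alternatively, since the model was saved with model.save(...), the matching Keras loader is tf.keras.models.load_model rather than model.load_weights, which expects a weights checkpoint, not a SavedModel directory; that mismatch is what triggered the error above. A sketch (custom layers may also require the custom_objects argument):

model = tf.keras.models.load_model('models/bert_siamese_v1')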
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/71396540

                                                                                  QUESTION

                                                                                  How to change AllenNLP BERT based Semantic Role Labeling to RoBERTa in AllenNLP
                                                                                  Asked 2022-Feb-24 at 12:34

Currently I'm able to train a Semantic Role Labeling model using the config file below. This config file is based on the one provided by AllenNLP and works for the default bert-base-uncased model and also for GroNLP/bert-base-dutch-cased.

                                                                                  {
                                                                                    "dataset_reader": {
                                                                                      "type": "srl_custom",
                                                                                      "bert_model_name": "GroNLP/bert-base-dutch-cased"
                                                                                    },
                                                                                    "data_loader": {
                                                                                      "batch_sampler": {
                                                                                        "type": "bucket",
                                                                                        "batch_size": 32
                                                                                      }
                                                                                    },
                                                                                    "train_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
                                                                                    "validation_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
                                                                                    "model": {
                                                                                      "type": "srl_bert",
                                                                                      "embedding_dropout": 0.1,
                                                                                      "bert_model": "GroNLP/bert-base-dutch-cased"
                                                                                    },
                                                                                    "trainer": {
                                                                                      "optimizer": {
                                                                                        "type": "huggingface_adamw",
                                                                                        "lr": 5e-5,
                                                                                        "correct_bias": false,
                                                                                        "weight_decay": 0.01,
                                                                                        "parameter_groups": [
                                                                                          [
                                                                                            [
                                                                                              "bias",
                                                                                              "LayerNorm.bias",
                                                                                              "LayerNorm.weight",
                                                                                              "layer_norm.weight"
                                                                                            ],
                                                                                            {
                                                                                              "weight_decay": 0.0
                                                                                            }
                                                                                          ]
                                                                                        ]
                                                                                      },
                                                                                      "learning_rate_scheduler": {
                                                                                        "type": "slanted_triangular"
                                                                                      },
                                                                                      "checkpointer": {
                                                                                        "keep_most_recent_by_count": 2
                                                                                      },
                                                                                      "grad_norm": 1.0,
                                                                                      "num_epochs": 3,
                                                                                      "validation_metric": "+f1-measure-overall"
                                                                                    }
                                                                                  }
                                                                                  

Swapping the values of the bert_model_name and bert_model parameters from GroNLP/bert-base-dutch-cased to roberta-base won't work out of the box, since the SRL dataset reader only supports the BertTokenizer, not the RobertaTokenizer. So I changed the config file to the following:

                                                                                  {
                                                                                    "dataset_reader": {
                                                                                      "type": "srl_custom",
                                                                                      "token_indexers": {
                                                                                        "tokens": {
                                                                                          "type": "pretrained_transformer",
                                                                                          "model_name": "roberta-base"
                                                                                        }
                                                                                      }
                                                                                    },
                                                                                    "data_loader": {
                                                                                      "batch_sampler": {
                                                                                        "type": "bucket",
                                                                                        "batch_size": 32
                                                                                      }
                                                                                    },
                                                                                    "train_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
                                                                                    "validation_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
                                                                                    "model": {
                                                                                      "type": "srl_bert",
                                                                                      "embedding_dropout": 0.1,
                                                                                      "bert_model": "roberta-base"
                                                                                    },
                                                                                    "trainer": {
                                                                                      "optimizer": {
                                                                                        "type": "huggingface_adamw",
                                                                                        "lr": 5e-5,
                                                                                        "correct_bias": false,
                                                                                        "weight_decay": 0.01,
                                                                                        "parameter_groups": [
                                                                                          [
                                                                                            [
                                                                                              "bias",
                                                                                              "LayerNorm.bias",
                                                                                              "LayerNorm.weight",
                                                                                              "layer_norm.weight"
                                                                                            ],
                                                                                            {
                                                                                              "weight_decay": 0.0
                                                                                            }
                                                                                          ]
                                                                                        ]
                                                                                      },
                                                                                      "learning_rate_scheduler": {
                                                                                        "type": "slanted_triangular"
                                                                                      },
                                                                                      "checkpointer": {
                                                                                        "keep_most_recent_by_count": 2
                                                                                      },
                                                                                      "grad_norm": 1.0,
                                                                                      "num_epochs": 15,
                                                                                      "validation_metric": "+f1-measure-overall"
                                                                                    }
                                                                                  }
                                                                                  

                                                                                  However, this is still not working. I'm receiving the following error:

                                                                                  2022-02-22 16:19:34,122 - INFO - allennlp.training.gradient_descent_trainer - Training
  0%|          | 0/1546 [00:00<?, ?it/s]
Traceback (most recent call last):
    sys.exit(run())
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\__main__.py", line 39, in run
                                                                                      main(prog="allennlp")
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\__init__.py", line 119, in main
                                                                                      args.func(args)
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 111, in train_model_from_args
                                                                                      train_model_from_file(
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 177, in train_model_from_file
                                                                                      return train_model(
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 258, in train_model
                                                                                      model = _train_worker(
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 508, in _train_worker
                                                                                      metrics = train_loop.run()
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 581, in run
                                                                                      return self.trainer.train()
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 771, in train
                                                                                      metrics, epoch = self._try_train()
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 793, in _try_train
                                                                                      train_metrics = self._train_epoch(epoch)
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 510, in _train_epoch
                                                                                      batch_outputs = self.batch_outputs(batch, for_training=True)
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 403, in batch_outputs
                                                                                      output_dict = self._pytorch_model(**batch)
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
                                                                                      result = self.forward(*input, **kwargs)
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp_models\structured_prediction\models\srl_bert.py", line 141, in forward
                                                                                      bert_embeddings, _ = self.bert_model(
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
                                                                                      result = self.forward(*input, **kwargs)
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\transformers\models\bert\modeling_bert.py", line 989, in forward
                                                                                      embedding_output = self.embeddings(
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
                                                                                      result = self.forward(*input, **kwargs)
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\transformers\models\bert\modeling_bert.py", line 215, in forward
                                                                                      token_type_embeddings = self.token_type_embeddings(token_type_ids)
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
                                                                                      result = self.forward(*input, **kwargs)
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\sparse.py", line 156, in forward
                                                                                      return F.embedding(
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\functional.py", line 1916, in embedding
                                                                                      return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
                                                                                  IndexError: index out of range in self
                                                                                  

I don't fully understand what's going wrong and couldn't find any documentation on how to change the config file to load a 'custom' BERT/RoBERTa model (one that's not mentioned here). I'm running the default allennlp train config.jsonnet command to start training; allennlp train config.jsonnet --dry-run produces no errors, however.

                                                                                  Thanks in advance! Thijs

EDIT: I've now swapped out "srl_bert" for a custom "srl_roberta" class that inherits from it, in order to use RobertaModel. This, however, still produces the same error.

EDIT2: I'm now using the AutoTokenizer, as suggested by Dirk Groeneveld. It looks like changing the SrlReader class to support RoBERTa-based models involves many more changes, such as swapping BERT's wordpiece tokenizer for RoBERTa's BPE tokenizer. Is there an easy way to adapt the SrlReader class, or is it better to write a new RobertaSrlReader from scratch?

                                                                                  I've inherited the SrlReader class and changed this line to the following:

                                                                                  self.bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
                                                                                  

It produces the following error, since RoBERTa's tokenization differs from BERT's:

                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp_models\structured_prediction\dataset_readers\srl.py", line 255, in text_to_instance
                                                                                      wordpieces, offsets, start_offsets = self._wordpiece_tokenize_input(
                                                                                    File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp_models\structured_prediction\dataset_readers\srl.py", line 196, in _wordpiece_tokenize_input
                                                                                      word_pieces = self.bert_tokenizer.wordpiece_tokenizer.tokenize(token)
                                                                                  AttributeError: 'RobertaTokenizerFast' object has no attribute 'wordpiece_tokenizer'
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-24 at 02:14

The easiest way to resolve this is to patch SrlReader so that it uses PretrainedTransformerTokenizer (from AllenNLP) or AutoTokenizer (from Huggingface) instead of BertTokenizer. SrlReader is an old class that was written against an old version of the Huggingface tokenizer API, so it is not easy to upgrade.
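
A rough sketch of such a patch, for orientation only: it subclasses SrlReader, attaches an AutoTokenizer, and overrides the one method that touches the BERT-only wordpiece_tokenizer attribute. The class name RobertaSrlReader, the constructor handling, and the offset bookkeeping are assumptions based on the srl.py frames in the traceback above, not a tested fix:

from typing import List, Tuple

from allennlp_models.structured_prediction.dataset_readers.srl import SrlReader
from transformers import AutoTokenizer


class RobertaSrlReader(SrlReader):
    """Hypothetical SrlReader variant for BPE-based models like RoBERTa."""

    def __init__(self, model_name: str = "roberta-base", **kwargs) -> None:
        # Don't pass bert_model_name, so the parent skips its BertTokenizer
        # setup; then attach a tokenizer that matches the actual model.
        super().__init__(**kwargs)
        self.bert_tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.lowercase_input = False  # RoBERTa checkpoints are cased

    def _wordpiece_tokenize_input(
        self, tokens: List[str]
    ) -> Tuple[List[str], List[int], List[int]]:
        word_pieces: List[str] = []
        start_offsets: List[int] = []
        end_offsets: List[int] = []
        cumulative = 0
        for token in tokens:
            # tokenize() exists on every Huggingface tokenizer, unlike the
            # BERT-only .wordpiece_tokenizer attribute that raised above.
            # Caveat: tokenizing word by word drops RoBERTa's leading-space
            # markers, so ids will differ slightly from whole-sentence BPE.
            pieces = self.bert_tokenizer.tokenize(token)
            start_offsets.append(cumulative + 1)  # +1 for the start token
            cumulative += len(pieces)
            end_offsets.append(cumulative)
            word_pieces.extend(pieces)
        # Use the tokenizer's own special tokens instead of hard-coded
        # [CLS]/[SEP] strings (RoBERTa uses <s> and </s>).
        wordpieces = (
            [self.bert_tokenizer.cls_token]
            + word_pieces
            + [self.bert_tokenizer.sep_token]
        )
        return wordpieces, end_offsets, start_offsets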

                                                                                  If you want to submit a pull request in the AllenNLP project, I'd be happy to help you get it merged into AllenNLP!

                                                                                  Source https://stackoverflow.com/questions/71223907

                                                                                  QUESTION

                                                                                  Simple Transformers producing nothing?
                                                                                  Asked 2022-Feb-22 at 11:54

I have a Simple Transformers script that looks like this:

                                                                                  from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs
                                                                                  args = Seq2SeqArgs()
                                                                                  args.num_train_epoch=5
                                                                                  model = Seq2SeqModel(
                                                                                      "roberta",
                                                                                      "roberta-base",
                                                                                      "bert-base-cased",
                                                                                  )
                                                                                  import pandas as pd
                                                                                  df = pd.read_csv('english-french.csv')
                                                                                  df['input_text'] = df['english'].values
                                                                                  df['target_text'] =df['french'].values
                                                                                  model.train_model(df.head(1000))
                                                                                  print(model.eval_model(df.tail(10)))
                                                                                  

                                                                                  The eval_loss is {'eval_loss': 0.0001931049264385365}

                                                                                  However when I run my prediction script

                                                                                  to_predict = ["They went to the public swimming pool."]
                                                                                  predictions=model.predict(to_predict)
                                                                                  

                                                                                  I get this

                                                                                  ['']
                                                                                  

                                                                                  The dataset I used is here

I'm very confused by the output. Any help or explanation of why it returns nothing would be much appreciated.

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-22 at 11:54

                                                                                  Use this model instead.

                                                                                  model = Seq2SeqModel(
                                                                                      encoder_decoder_type="marian",
                                                                                      encoder_decoder_name="Helsinki-NLP/opus-mt-en-mul",
                                                                                      args=args,
                                                                                      use_cuda=True,
                                                                                  )
                                                                                  

RoBERTa is not a good option for your task. Warm-starting a seq2seq model from two encoder-only checkpoints ("roberta-base" as encoder, "bert-base-cased" as decoder) leaves the cross-attention weights randomly initialized, and neither checkpoint was pretrained to generate text, so after a short fine-tuning run the model tends to decode empty or degenerate output. A model pretrained for translation, like Marian, is a much better fit.

I have rewritten your code in this Colab notebook.

                                                                                  Results

                                                                                  # Input
                                                                                  to_predict = ["They went to the public swimming pool.", "she was driving the shiny black car."]
                                                                                  predictions = model.predict(to_predict)
                                                                                  print(predictions)
                                                                                  
                                                                                  # Output
                                                                                  ['Ils aient cher à la piscine publice.', 'elle conduit la véricine noir glancer.']
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/71200243

                                                                                  QUESTION

                                                                                  Organize data for transformer fine-tuning
                                                                                  Asked 2022-Feb-02 at 14:58

I have a corpus of synonyms and non-synonyms. These are stored in a list of Python dictionaries like {"sentence1": <str>, "sentence2": <str>, "label": <1.0 or 0.0>}. Note that these words (or sentences) do not have to be a single token in the tokenizer.

I want to fine-tune a BERT-based model to take both sentences, as in [[CLS], <sentence1 tokens>, [SEP], <sentence2 tokens>, [SEP]], and predict the "label" (a measurement between 0.0 and 1.0).

What is the best approach to organizing this data to facilitate fine-tuning of the Huggingface transformer?

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-02 at 14:58

                                                                                  You can use the Tokenizer __call__ method to join both sentences when encoding them.

                                                                                  In case you're using the PyTorch implementation, here is an example:

                                                                                  import torch
                                                                                  from transformers import AutoTokenizer
                                                                                  
                                                                                  sentences1 = ... # List containing all sentences 1
                                                                                  sentences2 = ... # List containing all sentences 2
                                                                                  labels = ... # List containing all labels (0 or 1)
                                                                                  
                                                                                  TOKENIZER_NAME = "bert-base-cased"
                                                                                  tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
                                                                                  
encodings = tokenizer(
    sentences1,
    sentences2,
    padding=True,      # pad pairs to a common length so "pt" tensors can be built
    truncation=True,   # guard against pairs longer than the model's limit
    return_tensors="pt"
)
                                                                                  
                                                                                  labels = torch.tensor(labels)
                                                                                  

                                                                                  Then you can create your custom Dataset to use it on training:

                                                                                  class CustomRealDataset(torch.utils.data.Dataset):
                                                                                      def __init__(self, encodings, labels):
                                                                                          self.encodings = encodings
                                                                                          self.labels = labels
                                                                                  
                                                                                      def __getitem__(self, idx):
                                                                                          item = {key: value[idx] for key, value in self.encodings.items()}
                                                                                          item["labels"] = self.labels[idx]
                                                                                          return item
                                                                                  
                                                                                      def __len__(self):
                                                                                          return len(self.labels)
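
A possible usage sketch (the batch size is illustrative; the same dataset object also plugs directly into transformers.Trainer via its train_dataset argument):

from torch.utils.data import DataLoader

dataset = CustomRealDataset(encodings, labels)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

batch = next(iter(loader))
# Each batch is a dict that can be unpacked straight into the model:
# outputs = model(**batch); loss = outputs.loss
print(batch["input_ids"].shape, batch["labels"].shape)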
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/70957390

                                                                                  QUESTION

AttributeError: 'DataFrame' object has no attribute 'data_type'
                                                                                  Asked 2022-Jan-10 at 08:41

I am getting the following error: AttributeError: 'DataFrame' object has no attribute 'data_type'. I am trying to recreate the code from this link, which is based on this article, with my own dataset, which is similar to the one in the article.

                                                                                  from sklearn.model_selection import train_test_split
                                                                                  
                                                                                  X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                                                                                                    df.label.values, 
                                                                                                                                    test_size=0.15, 
                                                                                                                                    random_state=42, 
                                                                                                                                    stratify=df.label.values)
                                                                                  
                                                                                  df['data_type'] = ['not_set']*df.shape[0]
                                                                                  
                                                                                  df.loc[X_train, 'data_type'] = 'train'
                                                                                  df.loc[X_val, 'data_type'] = 'val'
                                                                                  
                                                                                  df.groupby(['Conference', 'label', 'data_type']).count()
                                                                                  
                                                                                  tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                                                                                                            do_lower_case=True)
                                                                                  
                                                                                  encoded_data_train = tokenizer.batch_encode_plus(
                                                                                      df[df.data_type=='train'].example.values,
                                                                                      add_special_tokens=True,
                                                                                      return_attention_mask=True,
                                                                                      pad_to_max_length=True,
                                                                                      max_length=256,
                                                                                      return_tensors='pt'
                                                                                  )
                                                                                  

                                                                                  and this is the error I get:

                                                                                  ---------------------------------------------------------------------------
                                                                                  AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_24180/2662883887.py in <module>
                                                                                        3 
                                                                                        4 encoded_data_train = tokenizer.batch_encode_plus(
                                                                                  ----> 5     df[df.data_type=='train'].example.values,
                                                                                        6     add_special_tokens=True,
                                                                                        7     return_attention_mask=True,
                                                                                  
                                                                                  C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
                                                                                     5485         ):
                                                                                     5486             return self[name]
                                                                                  -> 5487         return object.__getattribute__(self, name)
                                                                                     5488 
                                                                                     5489     def __setattr__(self, name: str, value) -> None:
                                                                                  
                                                                                  AttributeError: 'DataFrame' object has no attribute 'data_type'
                                                                                  

I am using Python 3.9, PyTorch 1.10.1, pandas 1.3.5, and transformers 4.15.0.

                                                                                  ANSWER

                                                                                  Answered 2022-Jan-10 at 08:41

The error means you have no data_type column in your DataFrame because you missed this step:

                                                                                  from sklearn.model_selection import train_test_split
                                                                                  
                                                                                  X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                                                                                                    df.label.values, 
                                                                                                                                    test_size=0.15, 
                                                                                                                                    random_state=42, 
                                                                                                                                    stratify=df.label.values)
                                                                                  
                                                                                  df['data_type'] = ['not_set']*df.shape[0]  # <- HERE
                                                                                  
                                                                                  df.loc[X_train, 'data_type'] = 'train'  # <- HERE
                                                                                  df.loc[X_val, 'data_type'] = 'val'  # <- HERE
                                                                                  
                                                                                  df.groupby(['Conference', 'label', 'data_type']).count()
                                                                                  

                                                                                  Demo

                                                                                  1. Setup
                                                                                  import pandas as pd
                                                                                  from sklearn.model_selection import train_test_split
                                                                                  
                                                                                  # The Data
                                                                                  df = pd.read_csv('data/title_conference.csv')
                                                                                  df['label'] = pd.factorize(df['Conference'])[0]
                                                                                  
                                                                                  # Train and Validation Split
                                                                                  X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                                                                                                    df.label.values, 
                                                                                                                                    test_size=0.15, 
                                                                                                                                    random_state=42, 
                                                                                                                                    stratify=df.label.values)
                                                                                  
                                                                                  df['data_type'] = ['not_set']*df.shape[0]
                                                                                  
                                                                                  df.loc[X_train, 'data_type'] = 'train'
                                                                                  df.loc[X_val, 'data_type'] = 'val'
                                                                                  
2. Code
                                                                                  from transformers import BertTokenizer
                                                                                  
                                                                                  tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                                                                                                            do_lower_case=True)
                                                                                  
                                                                                  encoded_data_train = tokenizer.batch_encode_plus(
                                                                                      df[df.data_type=='train'].Title.values, 
                                                                                      add_special_tokens=True, 
                                                                                      return_attention_mask=True, 
                                                                                      pad_to_max_length=True, 
                                                                                      max_length=256, 
                                                                                      return_tensors='pt'
                                                                                  )
                                                                                  

                                                                                  Output:

                                                                                  >>> encoded_data_train
                                                                                  {'input_ids': tensor([[  101,  8144,  1999,  ...,     0,     0,     0],
                                                                                          [  101,  2152,  2836,  ...,     0,     0,     0],
                                                                                          [  101, 22454, 25806,  ...,     0,     0,     0],
                                                                                          ...,
                                                                                          [  101,  1037,  2047,  ...,     0,     0,     0],
                                                                                          [  101, 13229,  7375,  ...,     0,     0,     0],
                                                                                          [  101,  2006,  1996,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
                                                                                          [0, 0, 0,  ..., 0, 0, 0],
                                                                                          [0, 0, 0,  ..., 0, 0, 0],
                                                                                          ...,
                                                                                          [0, 0, 0,  ..., 0, 0, 0],
                                                                                          [0, 0, 0,  ..., 0, 0, 0],
                                                                                          [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
                                                                                          [1, 1, 1,  ..., 0, 0, 0],
                                                                                          [1, 1, 1,  ..., 0, 0, 0],
                                                                                          ...,
                                                                                          [1, 1, 1,  ..., 0, 0, 0],
                                                                                          [1, 1, 1,  ..., 0, 0, 0],
                                                                                          [1, 1, 1,  ..., 0, 0, 0]])}
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/70649379

                                                                                  QUESTION

                                                                                  InternalError when using TPU for training Keras model
                                                                                  Asked 2021-Dec-31 at 08:18

I am attempting to fine-tune a BERT model on Google Colab from TensorFlow Hub using this link.

                                                                                  However, I run into the following error:

                                                                                  InternalError: RET_CHECK failure (third_party/tensorflow/core/tpu/graph_rewrite/distributed_tpu_rewrite_pass.cc:2047) arg_shape.handle_type != DT_INVALID  input edge: [id=2693 model_preprocessing_67660:0 -> cluster_train_function:628]
                                                                                  

This happens when I call model.fit(...).

This error only occurs when I try to use the TPU (the model runs fine on CPU, but training takes very long).

                                                                                  Here is my code for setting up the TPU and model:

                                                                                  TPU Setup:

                                                                                  import os
                                                                                  os.environ["TFHUB_MODEL_LOAD_FORMAT"]="UNCOMPRESSED"
                                                                                  
                                                                                  cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
                                                                                  tf.config.experimental_connect_to_cluster(cluster_resolver)
                                                                                  tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
                                                                                  strategy = tf.distribute.TPUStrategy(cluster_resolver)
                                                                                  

                                                                                  Model Setup:

                                                                                  def build_classifier_model():
                                                                                    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
                                                                                    preprocessing_layer = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', name='preprocessing')
                                                                                    encoder_inputs = preprocessing_layer(text_input)
                                                                                    encoder = hub.KerasLayer('https://tfhub.dev/google/experts/bert/wiki_books/sst2/2', trainable=True, name='BERT_encoder')
                                                                                    outputs = encoder(encoder_inputs)
                                                                                    net = outputs['pooled_output']
                                                                                    net = tf.keras.layers.Dropout(0.1)(net)
                                                                                    net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
                                                                                    return tf.keras.Model(text_input, net)
                                                                                  

Model Training:

                                                                                  with strategy.scope():
                                                                                  
                                                                                    bert_model = build_classifier_model()
                                                                                    loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
                                                                                    metrics = tf.metrics.BinaryAccuracy()
                                                                                    epochs = 1
                                                                                    steps_per_epoch = 1280000
                                                                                    num_train_steps = steps_per_epoch * epochs
                                                                                    num_warmup_steps = int(0.1*num_train_steps)
                                                                                  
                                                                                    init_lr = 3e-5
                                                                                    optimizer = optimization.create_optimizer(init_lr=init_lr,
                                                                                                                            num_train_steps=num_train_steps,
                                                                                                                            num_warmup_steps=num_warmup_steps,
                                                                                                                            optimizer_type='adamw')
                                                                                    bert_model.compile(optimizer=optimizer,
                                                                                                           loss=loss,
                                                                                                           metrics=metrics)
                                                                                    print(f'Training model')
                                                                                    history = bert_model.fit(x=X_train, y=y_train,
                                                                                                                 validation_data=(X_val, y_val),
                                                                                                                 epochs=epochs)
                                                                                  

Note that X_train is a numpy array of type str with shape (1280000,) and y_train is a numpy array of shape (1280000, 1).

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-31 at 08:18

I don't know exactly what changes you have made to the code, and I don't have details about your dataset. But I can see that you are trying to train on the whole dataset in one epoch and passing steps_per_epoch manually. I would recommend writing it like this:

Set batch_size to some power of two (for example 16 or 32); if you don't want to batch the dataset, just set batch_size to 1:

                                                                                  batch_size = 16
                                                                                  steps_per_epoch = training_data_size // batch_size
                                                                                  

The problem with the code is most probably the training dataset size. I think you're making a mistake by passing the size of the training dataset manually.

If you're loading the dataset from TFDS, use (as shown in the link):

                                                                                  train_dataset, train_data_size = load_dataset_from_tfds(
                                                                                    in_memory_ds, tfds_info, train_split, batch_size, bert_preprocess_model)
                                                                                  

If you're using a custom dataset, store the size of the cleaned dataset in a variable and then use that variable wherever the training-data size is needed. Avoid hard-coding values as far as possible.
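
Applied to the code in the question, that would look something like the following (the batch_size value is an illustrative assumption, not from the original post):

batch_size = 32
training_data_size = len(X_train)                # derive it, don't hard-code 1280000
steps_per_epoch = training_data_size // batch_size
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1 * num_train_steps)

history = bert_model.fit(x=X_train, y=y_train,
                         validation_data=(X_val, y_val),
                         batch_size=batch_size,
                         epochs=epochs)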

                                                                                  Source https://stackoverflow.com/questions/70479279

                                                                                  QUESTION

                                                                                  How to calculate perplexity of a sentence using huggingface masked language models?
                                                                                  Asked 2021-Dec-25 at 21:51

                                                                                  I have several masked language models (mainly Bert, Roberta, Albert, Electra). I also have a dataset of sentences. How can I get the perplexity of each sentence?

                                                                                  From the huggingface documentation here they mentioned that perplexity "is not well defined for masked language models like BERT", though I still see people somehow calculate it.

                                                                                  For example in this SO question they calculated it using the function

                                                                                  def score(model, tokenizer, sentence,  mask_token_id=103):
                                                                                    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
                                                                                    repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
                                                                                    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
                                                                                    masked_input = repeat_input.masked_fill(mask == 1, 103)
                                                                                    labels = repeat_input.masked_fill( masked_input != 103, -100)
                                                                                    loss,_ = model(masked_input, masked_lm_labels=labels)
                                                                                    result = np.exp(loss.item())
                                                                                    return result
                                                                                  
                                                                                  score(model, tokenizer, '我爱你') # returns 45.63794545581973
                                                                                  

However, when I try to use the code, I get TypeError: forward() got an unexpected keyword argument 'masked_lm_labels'.

                                                                                  I tried it with a couple of my models:

                                                                                  from transformers import pipeline, BertForMaskedLM, BertForMaskedLM, AutoTokenizer, RobertaForMaskedLM, AlbertForMaskedLM, ElectraForMaskedLM
                                                                                  import torch
                                                                                  
                                                                                  1)
                                                                                  tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-cased-v1.0")
                                                                                  model = BertForMaskedLM.from_pretrained("bioformers/bioformer-cased-v1.0")
                                                                                  2)
                                                                                  tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ELECTRA-Large-Generator")
                                                                                  model = ElectraForMaskedLM.from_pretrained("sultan/BioM-ELECTRA-Large-Generator")
                                                                                  

This SO question also used masked_lm_labels as an input, and it seemed to work somehow.

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-25 at 21:51

                                                                                  There is a paper Masked Language Model Scoring that explores pseudo-perplexity from masked language models and shows that pseudo-perplexity, while not being theoretically well justified, still performs well for comparing "naturalness" of texts.
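
For reference, the pseudo-perplexity of a sentence W with N tokens masks out each token in turn and exponentiates the average negative log-likelihood of the masked tokens (notation adapted from that paper):

\mathrm{PPPL}(W) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log P_{\mathrm{MLM}}\!\left(w_i \mid W_{\setminus i}\right)\right)

This is exactly what the score function below computes, building one masked copy of the sentence per token.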

As for the code, your snippet is perfectly correct but for one detail: in recent implementations of Huggingface BERT, masked_lm_labels has been renamed to simply labels, to make the interfaces of various models more compatible. I have also replaced the hard-coded 103 with the generic tokenizer.mask_token_id. So the snippet below should work:

                                                                                  from transformers import AutoModelForMaskedLM, AutoTokenizer
                                                                                  import torch
                                                                                  import numpy as np
                                                                                  
                                                                                  model_name = 'cointegrated/rubert-tiny'
                                                                                  model = AutoModelForMaskedLM.from_pretrained(model_name)
                                                                                  tokenizer = AutoTokenizer.from_pretrained(model_name)
                                                                                  
                                                                                  def score(model, tokenizer, sentence):
                                                                                      tensor_input = tokenizer.encode(sentence, return_tensors='pt')
                                                                                      repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
                                                                                      mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
                                                                                      masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
                                                                                      labels = repeat_input.masked_fill( masked_input != tokenizer.mask_token_id, -100)
                                                                                      with torch.inference_mode():
                                                                                          loss = model(masked_input, labels=labels).loss
                                                                                      return np.exp(loss.item())
                                                                                  
                                                                                  print(score(sentence='London is the capital of Great Britain.', model=model, tokenizer=tokenizer)) 
                                                                                  # 4.541251105675365
                                                                                  print(score(sentence='London is the capital of South America.', model=model, tokenizer=tokenizer)) 
                                                                                  # 6.162017238332462
                                                                                  

                                                                                  You can try this code in Google Colab by running this gist.

                                                                                  Source https://stackoverflow.com/questions/70464428

                                                                                  QUESTION

                                                                                  XPath 1.0, 1st node in subtree
                                                                                  Asked 2021-Dec-23 at 19:40

So what I want to do is identify the 1st node in some subtree of an XML tree.

here's an example (structure per the XPaths below; names are illustrative):

<root>
  <road name="Abbey Road">
    <households>
      <household>
        <occupants>
          <person name="Alice"/>
          <person name="Bob"/>
        </occupants>
      </household>
      <household>
        <occupants>
          <person name="Carol"/>
        </occupants>
      </household>
    </households>
  </road>
  <road name="Baker Street">
    <households>
      <household>
        <occupants>
          <person name="Dave"/>
          <person name="Erin"/>
        </occupants>
      </household>
    </households>
  </road>
</root>

Now I want the first person mentioned per road.

So let's have a go:

                                                                                  /root/road/households/household/occupants/person[1]/@name
                                                                                  

That returns the first person per occupants node.

Let's try:

                                                                                  (/root/road/households/household/occupants/person)[1]/@name
                                                                                  

That returns the first person in the whole tree.

What I sort of want to do is this:

                                                                                  /root/road/(households/household/occupants/person)[1]/@name
                                                                                  

i.e., take the first person in the set of people under each road,

but that's not valid XPath 1.0.

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-23 at 19:40

This seems to be what you’re after, using the descendant axis. Because the [1] predicate binds to the descendant::person step rather than to the whole path expression, it is evaluated once per road, selecting the first person under each road:

                                                                                  /root/road/descendant::person[1]/@name
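
You can check the difference between the three expressions with lxml, which implements XPath 1.0. This is a minimal sketch, not part of the original answer; it reuses the illustrative document from the question:

from lxml import etree

# Illustrative document mirroring the example in the question.
xml = b"""<root>
  <road name="Abbey Road">
    <households>
      <household>
        <occupants>
          <person name="Alice"/>
          <person name="Bob"/>
        </occupants>
      </household>
      <household>
        <occupants>
          <person name="Carol"/>
        </occupants>
      </household>
    </households>
  </road>
  <road name="Baker Street">
    <households>
      <household>
        <occupants>
          <person name="Dave"/>
          <person name="Erin"/>
        </occupants>
      </household>
    </households>
  </road>
</root>"""

tree = etree.fromstring(xml)

# Predicate on the last step: first person per occupants node.
print(tree.xpath('/root/road/households/household/occupants/person[1]/@name'))
# ['Alice', 'Carol', 'Dave']

# Parenthesised path: first person in the whole tree.
print(tree.xpath('(/root/road/households/household/occupants/person)[1]/@name'))
# ['Alice']

# descendant axis: the predicate is evaluated relative to each road.
print(tree.xpath('/root/road/descendant::person[1]/@name'))
# ['Alice', 'Dave']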
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/70466321

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

                                                                                  Vulnerabilities

                                                                                  No vulnerabilities reported

                                                                                  Install bert

                                                                                  You can install using 'pip install bert' or download it from GitHub, PyPI.
You can use bert like any standard Python library. You will need a development environment with a Python distribution (including header files), a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid making changes to the system Python.
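
As a minimal sketch of that recommendation (the environment name .venv is illustrative), the command sequence looks like this:

python -m venv .venv
source .venv/bin/activate        # on Windows: .venv\Scripts\activate
python -m pip install --upgrade pip setuptools wheel
pip install bert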

                                                                                  Support

                                                                                  For help or issues using BERT, please submit a GitHub issue. For personal communication related to BERT, please contact Jacob Devlin (jacobdevlin@google.com), Ming-Wei Chang (mingweichang@google.com), or Kenton Lee (kentonl@google.com).
                                                                                  CLONE
                                                                                • HTTPS

                                                                                  https://github.com/google-research/bert.git

                                                                                • CLI

                                                                                  gh repo clone google-research/bert

                                                                                • sshUrl

                                                                                  git@github.com:google-research/bert.git


Consider Popular Natural Language Processing Libraries

transformers by huggingface
funNLP by fighting41love
bert by google-research
jieba by fxsjy
Python by geekcomputers

Try Top Libraries by google-research

google-research by google-research (Jupyter Notebook)
vision_transformer by google-research (Jupyter Notebook)
text-to-text-transfer-transformer by google-research (Python)
simclr by google-research (Jupyter Notebook)
arxiv-latex-cleaner by google-research (Python)

Compare Natural Language Processing Libraries with Highest Support

transformers by huggingface
bert by google-research
allennlp by allenai
flair by flairNLP
spaCy by explosion
