
bert | TensorFlow code and pre-trained models for BERT | Natural Language Processing library

by google-research | Python | Version: Current | License: Apache-2.0

kandi X-RAY | bert Summary

bert is a Python library typically used in Institutions, Learning, Education, Artificial Intelligence, Natural Language Processing, TensorFlow, BERT, Neural Network, and Transformer applications. bert has no reported bugs or vulnerabilities, a build file is available, it carries a permissive license, and it has high support. You can install it with 'pip install bert' or download it from GitHub or PyPI.
BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. Our academic paper which describes BERT in detail and provides full results on a number of tasks can be found here: https://arxiv.org/abs/1810.04805.
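
The repository's core API lives in modeling.py. Below is a minimal sketch along the lines of the repo's documented usage, assuming TensorFlow 1.x, the repo on the Python path, and a downloaded uncased_L-12_H-768_A-12 checkpoint (the paths and toy inputs are placeholders):

import tensorflow as tf  # the repository targets TensorFlow 1.x
import modeling  # modeling.py from google-research/bert

bert_config = modeling.BertConfig.from_json_file(
    "uncased_L-12_H-768_A-12/bert_config.json")

# Toy batch of already-converted WordPiece ids (batch_size=2, seq_len=3).
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
token_type_ids = tf.constant([[0, 0, 1], [0, 0, 0]])

model = modeling.BertModel(
    config=bert_config,
    is_training=True,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=token_type_ids)

# [batch_size, hidden_size] pooled representation, e.g. for a classifier head.
pooled_output = model.get_pooled_output()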

Support

  • bert has a highly active ecosystem.
  • It has 28940 star(s) with 8174 fork(s). There are 982 watchers for this library.
  • It had no major release in the last 12 months.
  • There are 731 open issues and 337 have been closed. On average, issues are closed in 152 days. There are 87 open pull requests and 0 closed pull requests.
  • It has a positive sentiment in the developer community.
  • The latest version of bert is current.

Quality

  • bert has 0 bugs and 0 code smells.

Security

  • bert has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • bert code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

License

  • bert is licensed under the Apache-2.0 License. This license is Permissive.
  • Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

  • bert releases are not available. You will need to build from source code and install.
  • A deployable package is available on PyPI.
  • Build file is available. You can build the component from source.
  • Installation instructions are not available. Examples and code snippets are available.
  • bert saves you 1764 person hours of effort in developing the same functionality from scratch.
  • It has 3902 lines of code, 187 functions and 13 files.
  • It has high code complexity. Code complexity directly impacts maintainability of the code.
Top functions reviewed by kandi - BETA

kandi has reviewed bert and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality that bert implements and help you decide whether it suits your requirements.

  • Writes predictions
    • Compute softmax
    • Return the indexes of the n_best_size largest logits
    • Return the final prediction
  • Convert examples to features
    • Convert a single example
    • Return a string representation of text
    • Truncate a sequence pair
  • Validate flags
    • Validate a case-insensitivity flag
  • Returns a list of input examples
    • Embed word embeddings
  • Return a list of input examples
  • Builds the input function
  • Tokenize text
  • Validates that the casing matches the given checkpoint
  • Build a file-based input function
  • Create training instances
  • Reads input_file
  • Creates an attention mask from from_tensor
  • Converts examples into features
  • Reads SQuAD examples
  • Process a feature
  • Write examples to an output file
  • Transformer model
  • Embedding postprocessor
  • Build a function for TPUEstimator

Get all kandi verified functions for this library.

bert Key Features

TensorFlow code and pre-trained models for BERT

bert Examples and Code Snippets

BERT

@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2},
  year={2019}
}
                                    

What is BERT?

Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon
                                    
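As a quick illustration (not part of this repository), the same masked-word prediction can be reproduced with the Hugging Face transformers fill-mask pipeline; the model name and single-mask input are assumptions of this sketch:

from transformers import pipeline

# Hypothetical illustration: predicts the masked word with a ported BERT model.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("the man went to the [MASK] . he bought a gallon of milk."):
    print(prediction["token_str"], round(prediction["score"], 3))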

Fine-tuning with Cloud TPUs

These flags are appended to the standard fine-tuning command (run_classifier.py or run_squad.py); when training on a Cloud TPU, the output directory must also point at a Google Cloud Storage bucket:

  --use_tpu=True \
  --tpu_name=$TPU_NAME
                                    

Sentence (and sentence-pair) classification tasks

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/
                                    

SQuAD 1.1

python run_squad.py \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=$SQUAD_DIR/train-v1.1.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v1.1.json \
  --train_batch_size=12 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=/tmp/squad_base/
                                    

SQuAD 2.0

python run_squad.py \
  --vocab_file=$BERT_LARGE_DIR/vocab.txt \
  --bert_config_file=$BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=$SQUAD_DIR/train-v2.0.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v2.0.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=gs://some_bucket/squad_large/ \
  --use_tpu=True \
  --tpu_name=$TPU_NAME \
  --version_2_with_negative=True
                                    

Using BERT to extract fixed feature vectors (like ELMo)

# Sentence A and Sentence B are separated by the ||| delimiter for sentence
# pair tasks like question answering and entailment.
# For single sentence inputs, put one sentence per line and DON'T use the
# delimiter.
echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8
                                    
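extract_features.py writes one JSON object per input line. Below is a sketch for reading the vectors back, assuming the output layout described in the repo's README (a "features" list per record, each token carrying a "layers" list of {"index", "values"}):

import json

with open("/tmp/output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        for feature in record["features"]:       # one entry per WordPiece token
            token = feature["token"]
            top_layer = feature["layers"][0]     # layer -1, per --layers=-1,-2,-3,-4
            vector = top_layer["values"]         # hidden_size floats
            print(token, len(vector))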

Tokenization

Input:  John Johanson 's   house
Labels: NNP  NNP      POS NN
                                    
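To keep such word-level labels aligned with BERT's WordPiece output, the README recommends recording a map from each original token to its first sub-token. A sketch along those lines, assuming the repo's tokenization.py is importable and the vocab path is a placeholder:

import tokenization  # tokenization.py from google-research/bert

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

orig_tokens = ["John", "Johanson", "'s", "house"]
labels = ["NNP", "NNP", "POS", "NN"]

bert_tokens = ["[CLS]"]
orig_to_tok_map = []  # index of each original token's first WordPiece
for orig_token in orig_tokens:
    orig_to_tok_map.append(len(bert_tokens))
    bert_tokens.extend(tokenizer.tokenize(orig_token))
bert_tokens.append("[SEP]")

# bert_tokens ~ ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
# orig_to_tok_map ~ [1, 2, 4, 6]; the labels apply at those positions.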

Pre-training with BERT

python create_pretraining_data.py \
  --input_file=./sample_text.txt \
  --output_file=/tmp/tf_examples.tfrecord \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
                                    

FAQ

@article{devlin2018bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}
                                    

Convert pandas dataframe to DatasetDict

import pandas as pd
from datasets import Dataset, DatasetDict


train_df = pd.DataFrame({
     "label" : [1, 2, 3],
     "text" : ["apple", "pear", "strawberry"]
})

test_df = pd.DataFrame({
     "label" : [2, 2, 1],
     "text" : ["banana", "pear", "apple"]
})

# Dataset.from_pandas converts a DataFrame directly;
# Dataset.from_dict expects a plain dict of columns.
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)
my_dataset_dict = DatasetDict({"train": train_dataset, "test": test_dataset})

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 3
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 3
    })
})
                                    

What is the loss function used in Trainer from the Transformers library of Hugging Face?

Trainer delegates the loss to the model's forward pass, which picks it from config.problem_type: MSELoss for regression, CrossEntropyLoss for single-label classification, and BCEWithLogitsLoss for multi-label classification. The relevant excerpt from the sequence-classification head:

if labels is not None:
    if self.config.problem_type is None:
        if self.num_labels == 1:
            self.config.problem_type = "regression"
        elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
            self.config.problem_type = "single_label_classification"
        else:
            self.config.problem_type = "multi_label_classification"

    if self.config.problem_type == "regression":
        loss_fct = MSELoss()
        if self.num_labels == 1:
            loss = loss_fct(logits.squeeze(), labels.squeeze())
        else:
            loss = loss_fct(logits, labels)
    elif self.config.problem_type == "single_label_classification":
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    elif self.config.problem_type == "multi_label_classification":
        loss_fct = BCEWithLogitsLoss()
        loss = loss_fct(logits, labels)
                                    
                                    
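Since the branch taken depends on config.problem_type, you can pin the loss when loading the model. A sketch (the model name and label count are placeholders):

from transformers import AutoModelForSequenceClassification

# Setting problem_type up front makes Trainer use BCEWithLogitsLoss.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=5,
    problem_type="multi_label_classification",
)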

How to save and load a custom siamese bert model

tf.saved_model.save(model, 'models/bert_siamese_v1')
model = tf.saved_model.load('models/bert_siamese_v1')

f = model.signatures["serving_default"]
x1 = tf.random.uniform((1, 128), maxval=100, dtype=tf.int32)
x2 = tf.random.uniform((1, 128), maxval=100, dtype=tf.int32)
x3 = tf.random.uniform((1, 128), maxval=100, dtype=tf.int32)
print(f)
print(f(attention_masks=x1, input_ids=x2, token_type_ids=x3))

ConcreteFunction signature_wrapper(*, token_type_ids, attention_masks, input_ids)
  Args:
    attention_masks: int32 Tensor, shape=(None, 128)
    input_ids: int32 Tensor, shape=(None, 128)
    token_type_ids: int32 Tensor, shape=(None, 128)
  Returns:
    {'dense': <1>}
      <1>: float32 Tensor, shape=(None, 3)
{'dense': <tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[0.40711606, 0.13456087, 0.45832306]], dtype=float32)>}

Alternatively, save just the weights and restore them into a freshly built model, or save and reload the whole Keras model:

model.save_weights('./checkpoints/my_checkpoint')

model = create_model()

model.load_weights('./checkpoints/my_checkpoint')

# Create and train a new model instance.
model = create_model()
model.fit(train_images, train_labels, epochs=5)

# Save the entire model as a SavedModel.
!mkdir -p saved_model
model.save('saved_model/my_model')

new_model = tf.keras.models.load_model('saved_model/my_model')

Simple Transformers producing nothing?

model = Seq2SeqModel(
    encoder_decoder_type="marian",
    encoder_decoder_name="Helsinki-NLP/opus-mt-en-mul",
    args=args,
    use_cuda=True,
)

# Input
to_predict = ["They went to the public swimming pool.", "she was driving the shiny black car."]
predictions = model.predict(to_predict)
print(predictions)

# Output
['Ils aient cher à la piscine publice.', 'elle conduit la véricine noir glancer.']
                                    

Organize data for transformer fine-tuning

import torch
from transformers import AutoTokenizer

sentences1 = ... # List containing all sentences 1
sentences2 = ... # List containing all sentences 2
labels = ... # List containing all labels (0 or 1)

TOKENIZER_NAME = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)

encodings = tokenizer(
    sentences1,
    sentences2,
    padding=True,      # pad to a common length so the tensors are rectangular
    truncation=True,
    return_tensors="pt"
)

labels = torch.tensor(labels)

class CustomRealDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: value[idx] for key, value in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)
                                    
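A short usage sketch for the class above; the batch size is arbitrary and the DataLoader wiring is an assumption (a transformers Trainer would accept the dataset directly as train_dataset):

from torch.utils.data import DataLoader

dataset = CustomRealDataset(encodings, labels)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Each batch is a dict of rectangular tensors ready for the model.
batch = next(iter(loader))
print(batch["input_ids"].shape, batch["labels"].shape)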

AttributeError: 'DataFrame' object has no attribute 'data_type'

The data_type column has to be created on the frame before it is referenced; the lines marked "# <- HERE" are the ones that add it:

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15, 
                                                  random_state=42, 
                                                  stratify=df.label.values)

df['data_type'] = ['not_set']*df.shape[0]  # <- HERE

df.loc[X_train, 'data_type'] = 'train'  # <- HERE
df.loc[X_val, 'data_type'] = 'val'  # <- HERE

df.groupby(['Conference', 'label', 'data_type']).count()

import pandas as pd
from sklearn.model_selection import train_test_split

# The Data
df = pd.read_csv('data/title_conference.csv')
df['label'] = pd.factorize(df['Conference'])[0]

# Train and Validation Split
X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15, 
                                                  random_state=42, 
                                                  stratify=df.label.values)

df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)

encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].Title.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True,  # deprecated in newer transformers; padding='max_length' is the replacement
    max_length=256, 
    return_tensors='pt'
)

>>> encoded_data_train
{'input_ids': tensor([[  101,  8144,  1999,  ...,     0,     0,     0],
        [  101,  2152,  2836,  ...,     0,     0,     0],
        [  101, 22454, 25806,  ...,     0,     0,     0],
        ...,
        [  101,  1037,  2047,  ...,     0,     0,     0],
        [  101, 13229,  7375,  ...,     0,     0,     0],
        [  101,  2006,  1996,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
                                    
                                    

InternalError when using TPU for training Keras model

batch_size = 16
steps_per_epoch = training_data_size // batch_size

train_dataset, train_data_size = load_dataset_from_tfds(
  in_memory_ds, tfds_info, train_split, batch_size, bert_preprocess_model)
                                    

How to calculate perplexity of a sentence using huggingface masked language models?

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
import numpy as np

model_name = 'cointegrated/rubert-tiny'
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def score(model, tokenizer, sentence):
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
    labels = repeat_input.masked_fill(masked_input != tokenizer.mask_token_id, -100)
    with torch.inference_mode():
        loss = model(masked_input, labels=labels).loss
    return np.exp(loss.item())

print(score(sentence='London is the capital of Great Britain.', model=model, tokenizer=tokenizer))
# 4.541251105675365
print(score(sentence='London is the capital of South America.', model=model, tokenizer=tokenizer))
# 6.162017238332462
                                    

XPath 1.0, 1st node in subtree

/root/road/descendant::person[1]/@name
                                    
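A quick way to sanity-check the expression, sketched with Python's lxml against a made-up document shaped to match the path:

from lxml import etree

xml = b"""<root>
  <road>
    <town><person name="Alice"/></town>
    <person name="Bob"/>
  </road>
</root>"""

tree = etree.fromstring(xml)
# descendant::person[1] takes the first person, at any depth, under /root/road.
print(tree.xpath("/root/road/descendant::person[1]/@name"))  # ['Alice']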

                                    Remove inconsistent duplicate entries from data frame with Base R

                                    copy iconCopydownload iconDownload
                                      new_df<-do.call("rbind",
                                                Filter(function(x) all(x$amount == x$amount[1]),
                                                        split(df,df$name)))
                                    
                                                name amount
                                    Andy         Andy    100
                                    Bert         Bert     50
                                    Cindy.3     Cindy     30
                                    Cindy.4     Cindy     30
                                    David       David    200
                                    Frank       Frank     90
                                    George.9   George    120
                                    George.10  George    120
                                    George.11  George    120
                                    Herbert   Herbert    300
                                    
new_df <- new_df[!duplicated(new_df$name), ]
row.names(new_df) <- 1:nrow(new_df)
                                    
                                    new_df
                                             name amount
                                        1    Andy    100
                                        2    Bert     50
                                        3   Cindy     30
                                        4   David    200
                                        5   Frank     90
                                        6  George    120
                                        7 Herbert    300
                                    
The same logic with dplyr: blank out any name whose amounts disagree, then drop those rows:

library(dplyr)
(df %>%
   group_by(name) %>%
   mutate(name = ifelse(!all(amount == first(amount)), NA, name)) %>%
   na.omit() -> new_df)
                                    
# A tibble: 10 x 2
                                    # Groups:   name [7]
                                       name    amount
                                       <chr>    <dbl>
                                     1 Andy       100
                                     2 Bert        50
                                     3 Cindy       30
                                     4 Cindy       30
                                     5 David      200
                                     6 Frank       90
                                     7 George     120
                                     8 George     120
                                     9 George     120
                                    10 Herbert    300
                                    
                                    new_df %>%
                                       filter(!duplicated(name)) %>% 
                                       ungroup()
                                    # A tibble: 7 x 2
                                      name    amount
                                      <chr>    <dbl>
                                    1 Andy       100
                                    2 Bert        50
                                    3 Cindy       30
                                    4 David      200
                                    5 Frank       90
                                    6 George     120
                                    7 Herbert    300
                                    
A shorter alternative drops exact duplicate rows and then duplicate names, but note from the output that it keeps inconsistent names such as Edgar and Iris:

df[!duplicated(df), ]
df[!duplicated(df$name), ]
                                    
                                          name amount
                                    1     Andy    100
                                    2     Bert     50
                                    3    Cindy     30
                                    5    David    200
                                    6    Edgar     65
                                    8    Frank     90
                                    9   George    120
                                    12 Herbert    300
                                    13    Iris     15
                                    
To remove the inconsistent names as well, deduplicate first and keep only the split groups that collapse to a single row:

df <- unique(df)
df <- split(df, df$name)

df <- df[sapply(df, nrow) == 1]
df <- do.call(rbind, df)
rownames(df) <- 1:nrow(df)
                                    
                                         name amount
                                    1    Andy    100
                                    2    Bert     50
                                    3   Cindy     30
                                    4   David    200
                                    5   Frank     90
                                    6  George    120
                                    7 Herbert    300
                                    
An aggregate()-based answer: collapse each name to its unique amounts plus a row count, drop names with more than one unique amount, then expand back to one row per original entry:

t <- aggregate(amount ~ name, df, function(x) c(unique(x), length(x)))

t_m <- t[!sapply(t$amount, function(x) length(x) > 2), ]

setNames(stack(setNames(lapply(t_m$amount, function(x)
  rep(x[1], x[2])), t_m$name))[, c("ind", "values")], colnames(df))
                                          name amount
                                    1     Andy    100
                                    2     Bert     50
                                    3    Cindy     30
                                    4    Cindy     30
                                    5    David    200
                                    6    Frank     90
                                    7   George    120
                                    8   George    120
                                    9   George    120
                                    10 Herbert    300
                                    
If one row per consistent name is enough, the same idea is much shorter:

t <- aggregate(amount ~ name, df, unique)

t[lengths(t$amount) == 1, ]
                                         name amount
                                    1    Andy    100
                                    2    Bert     50
                                    3   Cindy     30
                                    4   David    200
                                    6   Frank     90
                                    7  George    120
                                    8 Herbert    300
                                    
A reproducible copy of the example data:

df <- data.frame(
  name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank",
           "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"),
  amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25)
)
                                    
                                    df_unq <- unique(df)
                                    df3 <- df_unq[!(duplicated(df_unq$name)|duplicated(df_unq$name, fromLast = TRUE)), ]
                                    
                                    df3
                                    #>       name amount
                                    #> 1     Andy    100
                                    #> 2     Bert     50
                                    #> 3    Cindy     30
                                    #> 5    David    200
                                    #> 8    Frank     90
                                    #> 9   George    120
                                    #> 12 Herbert    300
                                    
                                    df[df$name %in% df3$name, ]
                                    #>       name amount
                                    #> 1     Andy    100
                                    #> 2     Bert     50
                                    #> 3    Cindy     30
                                    #> 4    Cindy     30
                                    #> 5    David    200
                                    #> 8    Frank     90
                                    #> 9   George    120
                                    #> 10  George    120
                                    #> 11  George    120
                                    #> 12 Herbert    300
                                    
Finally, a compact base-R one-liner using var(): the variance of a name's amounts is positive exactly when they disagree (0 when consistent, NA for single rows; Filter() keeps only the positive ones), so the inconsistent names can be excluded directly:

with(df, df[!name %in% names(Filter(var, split(amount, name))), ])
                                    
                                    #       name amount
                                    # 1     Andy    100
                                    # 2     Bert     50
                                    # 3    Cindy     30
                                    # 4    Cindy     30
                                    # 5    David    200
                                    # 8    Frank     90
                                    # 9   George    120
                                    # 10  George    120
                                    # 11  George    120
                                    # 12 Herbert    300
                                    
                                    with(df, df[!name %in% names(Filter(var, split(amount, name))), ]) |>
                                        unique()
                                    
                                    #       name amount
                                    # 1     Andy    100
                                    # 2     Bert     50
                                    # 3    Cindy     30
                                    # 5    David    200
                                    # 8    Frank     90
                                    # 9   George    120
                                    # 12 Herbert    300
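
Since this page documents a Python library, a hedged pandas sketch of the same logic may be useful (it mirrors the R data above and is not part of the original answers):

import pandas as pd

# Same toy data as the R answers above.
df = pd.DataFrame({
    "name": ["Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank",
             "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"],
    "amount": [100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25],
})

# Keep only names whose amounts all agree, then drop exact duplicate rows.
consistent = df.groupby("name")["amount"].transform("nunique") == 1
print(df[consistent].drop_duplicates().reset_index(drop=True))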
                                    


                                    Community Discussions

                                    Trending Discussions on bert
                                    • Convert pandas dataframe to datasetDict
                                    • What is the loss function used in Trainer from the Transformers library of Hugging Face?
                                    • how to save and load custom siamese bert model
                                    • How to change AllenNLP BERT based Semantic Role Labeling to RoBERTa in AllenNLP
                                    • Simple Transformers producing nothing?
                                    • Organize data for transformer fine-tuning
                                    • attributeerror: 'dataframe' object has no attribute 'data_type'
                                    • InternalError when using TPU for training Keras model
                                    • How to calculate perplexity of a sentence using huggingface masked language models?
                                    • XPath 1.0, 1st node in subtree

                                    QUESTION

                                    Convert pandas dataframe to datasetDict

                                    Asked 2022-Mar-25 at 15:47

                                    I cannot find anywhere how to convert a pandas dataframe to type datasets.dataset_dict.DatasetDict, for optimal use in a BERT workflow with a huggingface model. Take these simple dataframes, for example.

                                    train_df = pd.DataFrame({
                                         "label" : [1, 2, 3],
                                         "text" : ["apple", "pear", "strawberry"]
                                    })
                                    
                                    test_df = pd.DataFrame({
                                         "label" : [2, 2, 1],
                                         "text" : ["banana", "pear", "apple"]
                                    })
                                    

                                    What is the most efficient way to convert these to the type above?

                                    ANSWER

                                    Answered 2022-Mar-25 at 15:47

One possibility is to first create a Dataset from each dataframe and then combine them:

                                    import datasets
                                    import pandas as pd
                                    
                                    
                                    train_df = pd.DataFrame({
                                         "label" : [1, 2, 3],
                                         "text" : ["apple", "pear", "strawberry"]
                                    })
                                    
                                    test_df = pd.DataFrame({
                                         "label" : [2, 2, 1],
                                         "text" : ["banana", "pear", "apple"]
                                    })
                                    
train_dataset = datasets.Dataset.from_pandas(train_df)
test_dataset = datasets.Dataset.from_pandas(test_df)
my_dataset_dict = datasets.DatasetDict({"train": train_dataset, "test": test_dataset})
                                    

                                    The result is:

                                    DatasetDict({
                                        train: Dataset({
                                            features: ['label', 'text'],
                                            num_rows: 3
                                        })
                                        test: Dataset({
                                            features: ['label', 'text'],
                                            num_rows: 3
                                        })
                                    })
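
A side note not in the original answer: Dataset.from_pandas stores a non-default dataframe index as an extra column. If the index carries no information, preserve_index=False keeps it out of the features:

train_dataset = datasets.Dataset.from_pandas(train_df, preserve_index=False)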
                                    

                                    Source https://stackoverflow.com/questions/71618974

                                    Community Discussions, Code Snippets contain sources that include Stack Exchange Network

                                    Vulnerabilities

                                    No vulnerabilities reported

                                    Install bert

                                    You can install using 'pip install bert' or download it from GitHub, PyPI.
You can use bert like any standard Python library. You will need a development environment with a Python distribution (including header files), a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid system-wide changes.

                                    Support

                                    For help or issues using BERT, please submit a GitHub issue. For personal communication related to BERT, please contact Jacob Devlin (jacobdevlin@google.com), Ming-Wei Chang (mingweichang@google.com), or Kenton Lee (kentonl@google.com).

© 2022 Open Weaver Inc.