
sentence-transformers | Multilingual Sentence & Image Embeddings with BERT | Natural Language Processing library

by UKPLab | Python | Version: v2.0.0 | License: Apache-2.0


kandi X-RAY | sentence-transformers Summary

sentence-transformers is a Python library typically used in Artificial Intelligence, Natural Language Processing, Deep Learning, PyTorch, TensorFlow, BERT, and Transformer applications. sentence-transformers has no reported bugs or vulnerabilities, a build file is available, it carries a permissive license, and it has medium support. You can install it with 'pip install sentence-transformers' or download it from GitHub or PyPI.
This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks such as BERT, RoBERTa, and XLM-RoBERTa and achieve state-of-the-art performance on a variety of tasks. Text is embedded in a vector space such that similar texts lie close together and can be found efficiently using cosine similarity. We provide a growing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use cases. The framework also makes it easy to fine-tune custom embedding models to achieve maximal performance on your specific task. For the full documentation, see www.SBERT.net.
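
As a quick illustration of the workflow described above, the sketch below embeds a few sentences and compares them with cosine similarity. It assumes the all-MiniLM-L6-v2 pretrained model and a recent library version where util.cos_sim is available; any other pretrained Sentence Transformer name would work the same way.

from sentence_transformers import SentenceTransformer, util

# Load a pretrained Sentence Transformer model (downloaded on first use).
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The sky is blue today.",
]

# Encode the sentences into dense vectors.
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the first sentence and the other two.
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)  # higher score = semantically closer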

Support

  • sentence-transformers has a medium active ecosystem.
  • It has 5,944 stars, 1,145 forks, and 104 watchers.
  • It had no major release in the last 12 months.
  • There are 503 open issues and 539 closed issues. On average, issues are closed in 37 days. There are 11 open pull requests and 0 closed pull requests.
  • It has a neutral sentiment in the developer community.
  • The latest version of sentence-transformers is v2.0.0.

Quality

  • sentence-transformers has 0 bugs and 0 code smells.

Security

  • sentence-transformers has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • sentence-transformers code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

License

  • sentence-transformers is licensed under the Apache-2.0 License. This license is Permissive.
  • Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

  • sentence-transformers releases are available to install and integrate.
  • Deployable package is available in PyPI.
  • Build file is available. You can build the component from source.
  • Installation instructions, examples and code snippets are available.
  • sentence-transformers saves you 3939 person hours of effort in developing the same functionality from scratch.
  • It has 12029 lines of code, 425 functions and 182 files.
  • It has high code complexity. Code complexity directly impacts maintainability of the code.
Top functions reviewed by kandi - BETA

kandi has reviewed sentence-transformers and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality sentence-transformers implements and help you decide whether it suits your requirements. A short example using one of these utilities follows the list.

  • Fit the model with the AdamW optimizer.
  • Compute metrics for each query.
  • Perform community detection.
  • Predict from sentences.
  • Perform a semantic search.
  • Download a model repository.
  • Perform paraphrase mining.
  • Compute scores for the given model.
  • Forward a mini-batch.
  • Create a TREC dataset.
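
Several of these functions (semantic search, paraphrase mining, community detection) are exposed through sentence_transformers.util. Below is a minimal sketch of semantic search, assuming the same all-MiniLM-L6-v2 model used earlier and a small toy corpus.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

corpus = ["A man is eating food.", "A cheetah chases its prey.", "The new movie is great."]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("What do cheetahs hunt?", convert_to_tensor=True)

# Returns, for each query, a ranked list of {'corpus_id': ..., 'score': ...} dicts.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit['corpus_id']], round(hit['score'], 3))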

sentence-transformers Key Features

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019)

Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation (EMNLP 2020)

Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks (NAACL 2021)

The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes (arXiv 2020)

TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning (arXiv 2021)

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models (arXiv 2021)

Installation

pip install -U sentence-transformers

Getting Started

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
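
The getting-started snippet above only loads the model; a typical next step (a small sketch reusing the same paraphrase-MiniLM-L6-v2 model) is to encode some sentences:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# encode() returns one embedding (a NumPy array) per input sentence.
sentences = ['This framework generates embeddings for each input sentence']
embeddings = model.encode(sentences)
print(embeddings.shape)  # (1, 384) for this model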

Citing & Authors

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

The last dimension of the inputs to a Dense layer should be defined. Found None. Full input shape received: <unknown>

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

MODEL_PATH = 'sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = TFAutoModel.from_pretrained(MODEL_PATH, from_pt=True)

class SBert(tf.keras.layers.Layer):
    def __init__(self, tokenizer, model):
        super(SBert, self).__init__()
        
        self.tokenizer = tokenizer
        self.model = model
        
    def tf_encode(self, inputs):
        def encode(inputs):
            inputs = [x[0].decode("utf-8") for x in inputs.numpy()]
            outputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='tf')
            return outputs['input_ids'], outputs['token_type_ids'], outputs['attention_mask']
        return tf.py_function(func=encode, inp=[inputs], Tout=[tf.int32, tf.int32, tf.int32])
    
    def process(self, i, t, a):
      def __call(i, t, a):
        model_output = self.model({'input_ids': i.numpy(), 'token_type_ids': t.numpy(), 'attention_mask': a.numpy()})
        return model_output[0]
      return tf.py_function(func=__call, inp=[i, t, a], Tout=[tf.float32])

    def mean_pooling(self, model_output, attention_mask):
        token_embeddings = tf.squeeze(tf.stack(model_output), axis=0)
        input_mask_expanded = tf.cast(
            tf.broadcast_to(tf.expand_dims(attention_mask, -1), tf.shape(token_embeddings)),
            tf.float32
        )
        a = tf.math.reduce_sum(token_embeddings * input_mask_expanded, axis=1)
        b = tf.clip_by_value(tf.math.reduce_sum(input_mask_expanded, axis=1), 1e-9, tf.float32.max)
        embeddings = a / b
        embeddings, _ = tf.linalg.normalize(embeddings, 2, axis=1)
        return embeddings

    def call(self, inputs):
        input_ids, token_type_ids, attention_mask = self.tf_encode(inputs)
        input_ids.set_shape(tf.TensorShape((None, None)))
        token_type_ids.set_shape(tf.TensorShape((None, None)))
        attention_mask.set_shape(tf.TensorShape((None, None)))

        model_output = self.process(input_ids, token_type_ids, attention_mask)
        model_output[0].set_shape(tf.TensorShape((None, None, 384)))
        embeddings = self.mean_pooling(model_output, attention_mask)
        return embeddings

    
sbert = SBert(tokenizer, model)
inputs = tf.keras.layers.Input((1,), dtype=tf.string)
outputs = sbert(inputs)
outputs = tf.keras.layers.Dense(32)(outputs)
model = tf.keras.Model(inputs, outputs)
print(model(tf.constant(['some text', 'more text'])))
print(model.summary())
tf.Tensor(
[[-0.06719425 -0.02954631 -0.05811356 -0.1456391  -0.13001677  0.00145465
   0.0401044   0.05949172 -0.02589339  0.07255618 -0.00958113  0.01159782
   0.02508018  0.03075579 -0.01910635 -0.03231853  0.00875124  0.01143366
  -0.04365401 -0.02090197  0.07030752 -0.02872834  0.10535908  0.05691438
  -0.017165   -0.02044982  0.02580127 -0.04564123 -0.0631128  -0.00303708
   0.00133517  0.01613527]
 [-0.11922387  0.02304137 -0.02670465 -0.13117084 -0.11492493  0.03961402
   0.08129141 -0.05999354  0.0039564   0.02892766  0.00493046  0.00440936
  -0.07966737  0.11354238  0.03141225  0.00048972  0.04658606 -0.03658888
  -0.05292419 -0.04639702  0.08445395  0.00522146  0.04359548  0.0290177
  -0.02171512 -0.03399373 -0.00418095 -0.04019783 -0.04733383 -0.03972956
   0.01890458 -0.03927581]], shape=(2, 32), dtype=float32)
Model: "model_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_18 (InputLayer)       [(None, 1)]               0         
                                                                 
 s_bert_17 (SBert)           (None, 384)               22713216  
                                                                 
 dense_78 (Dense)            (None, 32)                12320     
                                                                 
=================================================================
Total params: 22,725,536
Trainable params: 22,725,536
Non-trainable params: 0
_________________________________________________________________
None

Use `sentence-transformers` inside of a keras model

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel


MODEL_PATH = 'sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = TFAutoModel.from_pretrained(MODEL_PATH, from_pt=True)


class SBert(tf.keras.layers.Layer):
    def __init__(self, tokenizer, model):
        super(SBert, self).__init__()
        
        self.tokenizer = tokenizer
        self.model = model
        
    def tf_encode(self, inputs):
        def encode(inputs):
            inputs = [x.decode("utf-8") for x in inputs.numpy()]
            outputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='tf')
            return outputs['input_ids'], outputs['token_type_ids'], outputs['attention_mask']
        return tf.py_function(func=encode, inp=[inputs], Tout=[tf.int32, tf.int32, tf.int32])
    
    def mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0]
        input_mask_expanded = tf.cast(
            tf.broadcast_to(tf.expand_dims(attention_mask, -1), tf.shape(token_embeddings)),
            tf.float32
        )
        a = tf.math.reduce_sum(token_embeddings * input_mask_expanded, axis=1)
        b = tf.clip_by_value(tf.math.reduce_sum(input_mask_expanded, axis=1), 1e-9, tf.float32.max)
        embeddings = a / b
        embeddings, _ = tf.linalg.normalize(embeddings, 2, axis=1)
        return embeddings
    def call(self, inputs):
        input_ids, token_type_ids, attention_mask = self.tf_encode(inputs)
        model_output = self.model({'input_ids': input_ids, 'token_type_ids': token_type_ids, 'attention_mask': attention_mask})
        embeddings = self.mean_pooling(model_output, attention_mask)
        return embeddings
    
    
sbert = SBert(tokenizer, model)
sbert(['some text', 'more text'])
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

MODEL_PATH = 'sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = TFAutoModel.from_pretrained(MODEL_PATH, from_pt=True)

class SBert(tf.keras.layers.Layer):
    def __init__(self, tokenizer, model):
        super(SBert, self).__init__()
        
        self.tokenizer = tokenizer
        self.model = model
        
    def tf_encode(self, inputs):
        def encode(inputs):
            inputs = [x[0].decode("utf-8") for x in inputs.numpy()]
            outputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='tf')
            return outputs['input_ids'], outputs['token_type_ids'], outputs['attention_mask']
        return tf.py_function(func=encode, inp=[inputs], Tout=[tf.int32, tf.int32, tf.int32])
    
    def process(self, i, t, a):
      def __call(i, t, a):
        model_output = self.model({'input_ids': i.numpy(), 'token_type_ids': t.numpy(), 'attention_mask': a.numpy()})
        return model_output[0]
      return tf.py_function(func=__call, inp=[i, t, a], Tout=[tf.float32])

    def mean_pooling(self, model_output, attention_mask):
        token_embeddings = tf.squeeze(tf.stack(model_output), axis=0)
        input_mask_expanded = tf.cast(
            tf.broadcast_to(tf.expand_dims(attention_mask, -1), tf.shape(token_embeddings)),
            tf.float32
        )
        a = tf.math.reduce_sum(token_embeddings * input_mask_expanded, axis=1)
        b = tf.clip_by_value(tf.math.reduce_sum(input_mask_expanded, axis=1), 1e-9, tf.float32.max)
        embeddings = a / b
        embeddings, _ = tf.linalg.normalize(embeddings, 2, axis=1)
        return embeddings
    def call(self, inputs):
        input_ids, token_type_ids, attention_mask = self.tf_encode(inputs)
        model_output = self.process(input_ids, token_type_ids, attention_mask)
        embeddings = self.mean_pooling(model_output, attention_mask)
        return embeddings
    
    
sbert = SBert(tokenizer, model)
inputs = tf.keras.layers.Input((1,), dtype=tf.string)
outputs = sbert(inputs)
model = tf.keras.Model(inputs, outputs)
model(tf.constant(['some text', 'more text']))
TensorShape([2, 384])

Using sentence transformers with limited access to internet

['1_Pooling', 'config_sentence_transformers.json', 'tokenizer.json', 'tokenizer_config.json', 'modules.json', 'sentence_bert_config.json', 'pytorch_model.bin', 'special_tokens_map.json', 'config.json', 'train_script.py', 'data_config.json', 'README.md', '.gitattributes', 'vocab.txt']
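
Those are the files that make up a Sentence Transformer model repository. A sketch of one way to work offline: download the model once on a machine with internet access, copy the directory over, and point SentenceTransformer at the local path (the path name below is an assumption).

from sentence_transformers import SentenceTransformer

# On a machine with internet access: download the model and save the full directory
# (it will contain the files listed above).
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
model.save('./all-MiniLM-L6-v2-local')  # hypothetical local path

# On the offline machine: point SentenceTransformer at the copied directory.
offline_model = SentenceTransformer('./all-MiniLM-L6-v2-local')
print(offline_model.encode(['works without internet access']).shape)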

Using Sentence-Bert with other features in scikit-learn

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.naive_bayes import GaussianNB
from sentence_transformers import SentenceTransformer

X_train = pd.DataFrame({
    'tweet': ['foo', 'foo', 'bar'],
    'feature1': [1, 1, 0],
    'feature2': [1, 0, 1],
})
y_train = [1, 1, 0]
model = SentenceTransformer('mrm8488/bert-tiny-finetuned-squadv2')  # small model chosen to save time and compute :)
embedder = FunctionTransformer(lambda item: model.encode(item, convert_to_tensor=True, show_progress_bar=False).detach().cpu().numpy())
preprocessor = ColumnTransformer(
    transformers=[('embedder', embedder, 'tweet')],
    remainder='passthrough'
)
X_train = preprocessor.fit_transform(X_train)  # X_train.shape => (len(df), transformer_hidden_dim + feature_count)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
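
To apply the fitted pipeline to new rows, reuse the same ColumnTransformer at prediction time. A short continuation of the sketch above, with hypothetical test data:

X_test = pd.DataFrame({
    'tweet': ['baz'],
    'feature1': [0],
    'feature2': [1],
})

# Reuse the fitted ColumnTransformer so test rows get the same embedding + passthrough layout.
X_test_transformed = preprocessor.transform(X_test)
print(gnb.predict(X_test_transformed))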

String comparison with BERT seems to ignore "not" in sentence

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def test_entailment(text1, text2):
    batch = tokenizer(text1, text2, return_tensors='pt').to(model.device)
    with torch.no_grad():
        proba = torch.softmax(model(**batch).logits, -1)
    return proba.cpu().numpy()[0, model.config.label2id['ENTAILMENT']]

def test_equivalence(text1, text2):
    return test_entailment(text1, text2) * test_entailment(text2, text1)

print(test_equivalence("I'm a good person", "I'm not a good person"))  # 2.0751484e-07
print(test_equivalence("I'm a good person", "You are a good person"))  # 0.49342492
print(test_equivalence("I'm a good person", "I'm not a bad person"))   # 0.94236994
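
The same pairwise check can also be done with the CrossEncoder class that sentence-transformers itself provides. The sketch below is an assumption-laden alternative, not the answer's original method: the model name and label order are taken from the public cross-encoder/nli-roberta-base model card and should be verified before relying on them.

import numpy as np
from sentence_transformers import CrossEncoder

# A cross-encoder scores a sentence pair directly instead of comparing two independent embeddings.
nli_model = CrossEncoder('cross-encoder/nli-roberta-base')
labels = ['contradiction', 'entailment', 'neutral']  # assumed label order; check the model card

scores = nli_model.predict([("I'm a good person", "I'm not a good person")])
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax over the three labels
print(dict(zip(labels, probs[0])))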

python packages not being installed on the virtual environment using ubuntu

pip3 install package_name --user

AttributeError: Caught AttributeError in DataLoader worker process 0. - fine tuning pre-trained transformer model

import pandas as pd
# initialise data of lists.
data = {'input':[
          "Alpro, Cioccolato bevanda a base di soia 1 ltr", #Alpro, Chocolate soy drink 1 ltr
          "Milka  cioccolato al latte 100 g", #Milka milk chocolate 100 g
          "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml", #Danone, HiPRO 25g Protein chocolate flavor 330 ml
         ]
        }
 
# Creates pandas DataFrame.
x_sample = pd.DataFrame(data)
print(x_sample['input'])

# load model
from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses, evaluation
from torch.utils.data import DataLoader

embedder = SentenceTransformer('sentence-transformers/paraphrase-xlm-r-multilingual-v1') # or any other pretrained model
print("embedder loaded...")

# define your train dataset, the dataloader, and the train loss
# train_dataset = SentencesDataset(x_sample["input"].tolist(), embedder)
# train_dataloader = DataLoader(train_dataset, shuffle=False, batch_size=4, num_workers=1)
# train_loss = losses.CosineSimilarityLoss(embedder)

# dummy evaluator to make the api work
sentences1 = ['latte al cioccolato', 'latte al cioccolato','latte al cioccolato']
sentences2 = ['Alpro, Cioccolato bevanda a base di soia 1 ltr', 'Danone, HiPRO 25g Proteine gusto cioccolato 330 ml','Milka  cioccolato al latte 100 g']
scores = [0.99,0.95,0.4]
evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)

examples = []
for s1,s2,l in zip(sentences1, sentences2, scores):
  examples.append(InputExample(texts=[s1, s2], label=l))
train_dataloader = DataLoader(examples, shuffle=False, batch_size=4, num_workers=1)
train_loss = losses.CosineSimilarityLoss(embedder)
# tune the model
embedder.fit(train_objectives=[(train_dataloader, train_loss)], 
    epochs=5, 
    warmup_steps=500, 
    evaluator=evaluator, 
    evaluation_steps=1,
    output_path='fine_tuned_bert',
    save_best_model= True,
    show_progress_bar= True
    )
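
After embedder.fit(...) finishes, the best checkpoint is written to the directory passed as output_path; it can be reloaded like any other Sentence Transformer. A small sketch:

from sentence_transformers import SentenceTransformer

# Reload the fine-tuned model saved to output_path above.
tuned = SentenceTransformer('fine_tuned_bert')
print(tuned.encode(['latte al cioccolato']).shape)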

Heroku: Compiled Slug Size is too large Python

heroku run bash -a <appname>
du -ha --max-depth 1 /app | sort -hr

Download pre-trained BERT model locally

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
tokenizer.save_pretrained('./local_directory/')
model.save_pretrained('./local_directory/')
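
Once saved, both the tokenizer and the model can be loaded from the local directory without contacting the Hub, for example:

from transformers import AutoTokenizer, AutoModel

# Load from the directory populated by save_pretrained() above; no network access is needed.
tokenizer = AutoTokenizer.from_pretrained('./local_directory/')
model = AutoModel.from_pretrained('./local_directory/')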

Sentence embedding using T5

output = model.encoder(input_ids=s, attention_mask=attn, return_dict=True)
pooled_sentence = output.last_hidden_state  # shape is [batch_size, seq_len, hidden_size]
# pooled_sentence contains the embedding of every token in the sentence;
# sum/average over the sequence dimension to obtain a single sentence embedding
pooled_sentence = torch.mean(pooled_sentence, dim=1)
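
The snippet above assumes that model, s (the input ids), and attn (the attention mask) are already defined. A self-contained sketch of the same idea, using the t5-small checkpoint as an assumption:

import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = T5EncoderModel.from_pretrained('t5-small')

batch = tokenizer(["A sentence to embed with T5"], return_tensors='pt',
                  padding=True, truncation=True)
with torch.no_grad():
    output = model(input_ids=batch['input_ids'],
                   attention_mask=batch['attention_mask'],
                   return_dict=True)

# Mean-pool the token embeddings into a single vector per sentence.
sentence_embedding = output.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 512]) for t5-small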

Community Discussions

Trending Discussions on sentence-transformers
  • The last dimension of the inputs to a Dense layer should be defined. Found None. Full input shape received: <unknown>
  • Use `sentence-transformers` inside of a keras model
  • Using sentence transformers with limited access to internet
  • Using Sentence-Bert with other features in scikit-learn
  • String comparison with BERT seems to ignore "not" in sentence
  • Token indices sequence length Issue
  • python packages not being installed on the virtual environment using ubuntu
  • AttributeError: Caught AttributeError in DataLoader worker process 0. - fine tuning pre-trained transformer model
  • How to increase dimension-vector size of BERT sentence-transformers embedding
  • BERT problem with context/semantic search in Italian language

QUESTION

The last dimension of the inputs to a Dense layer should be defined. Found None. Full input shape received: <unknown>

Asked 2022-Mar-10 at 08:57

I am having trouble when switching a model from some local dummy data to using a TF dataset.

Sorry for the long model code, I have tried to shorten it as much as possible.

The following works fine:

import tensorflow as tf
import tensorflow_recommenders as tfrs
from transformers import AutoTokenizer, TFAutoModel


MODEL_PATH = 'sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = TFAutoModel.from_pretrained(MODEL_PATH, from_pt=True)


class SBert(tf.keras.layers.Layer):
    def __init__(self, tokenizer, model):
        super(SBert, self).__init__()
        
        self.tokenizer = tokenizer
        self.model = model
        
    def tf_encode(self, inputs):
        def encode(inputs):
            inputs = [x[0].decode("utf-8") for x in inputs.numpy()]
            outputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='tf')
            return outputs['input_ids'], outputs['token_type_ids'], outputs['attention_mask']
        return tf.py_function(func=encode, inp=[inputs], Tout=[tf.int32, tf.int32, tf.int32])
    
    def process(self, i, t, a):
        def __call(i, t, a):
            model_output = self.model(
                {'input_ids': i.numpy(), 'token_type_ids': t.numpy(), 'attention_mask': a.numpy()}
            )
            return model_output[0]
        return tf.py_function(func=__call, inp=[i, t, a], Tout=[tf.float32])

    def mean_pooling(self, model_output, attention_mask):
        token_embeddings = tf.squeeze(tf.stack(model_output), axis=0)
        input_mask_expanded = tf.cast(
            tf.broadcast_to(tf.expand_dims(attention_mask, -1), tf.shape(token_embeddings)),
            tf.float32
        )
        a = tf.math.reduce_sum(token_embeddings * input_mask_expanded, axis=1)
        b = tf.clip_by_value(tf.math.reduce_sum(input_mask_expanded, axis=1), 1e-9, tf.float32.max)
        embeddings = a / b
        embeddings, _ = tf.linalg.normalize(embeddings, 2, axis=1)
        return embeddings
    
    def call(self, inputs):
        input_ids, token_type_ids, attention_mask = self.tf_encode(inputs)
        model_output = self.process(input_ids, token_type_ids, attention_mask)
        embeddings = self.mean_pooling(model_output, attention_mask)
        return embeddings


sbert = SBert(tokenizer, model)
inputs = tf.keras.layers.Input(shape=(1,), dtype=tf.string)
outputs = sbert(inputs)
model = tf.keras.Model(inputs, outputs)
model(tf.constant(['some text', 'more text']))

The call to the model outputs tensors - yippee :)

Now I want to use this layer inside of a larger two tower model:

class Encoder(tf.keras.Model):
    def __init__(self):
        super().__init__()
        
        self.text_embedding = self._build_text_embedding()
    
    def _build_text_embedding(self):
        sbert = SBert(tokenizer, model)
        inputs = tf.keras.layers.Input(shape=(1,), dtype=tf.string)
        outputs = sbert(inputs)
        return tf.keras.Model(inputs, outputs)
    
    def call(self, inputs):
        return self.text_embedding(inputs)

    
class RecModel(tfrs.models.Model):
    def __init__(self):
        super().__init__()
        
        self.query_model = tf.keras.Sequential([
            Encoder(),
            tf.keras.layers.Dense(32)
        ])
        
        self.candidate_model = tf.keras.Sequential([
            Encoder(),
            tf.keras.layers.Dense(32)
        ])
    
        self.retrieval_task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=tf.data.Dataset.from_tensor_slices(
                    data['text']
                ).batch(1).map(self.candidate_model),
            ),
            batch_metrics=[
                tf.keras.metrics.TopKCategoricalAccuracy(k=5)
            ]
        )

    def call(self, features):
        query_embeddings = self.query_model(features['query'])
        candidate_embeddings = self.candidate_model(features['text'])
        return (
            query_embeddings,
            candidate_embeddings,
        )   

    def compute_loss(self, features, training=False):
        query_embeddings, candidate_embeddings = self(features)
        retrieval_loss = self.retrieval_task(query_embeddings, candidate_embeddings)
        return retrieval_loss

Create a small dummy dataset:

data = {
    'query': ['blue', 'cat', 'football'],
    'text': ['a nice colour', 'a type of animal', 'a sport']
}

ds = tf.data.Dataset.from_tensor_slices(data).batch(1)

Try to compile:

model = RecModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad())

And we hit the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-df4cc46e0307> in <module>
----> 1 model = RecModel()
      2 model.compile(optimizer=tf.keras.optimizers.Adagrad())

<ipython-input-8-a774041744b9> in __init__(self)
     33                 candidates=tf.data.Dataset.from_tensor_slices(
     34                     data['text']
---> 35                 ).batch(1).map(self.candidate_model),
     36             ),
     37             batch_metrics=[

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py in map(self, map_func, num_parallel_calls, deterministic, name)
   2014         warnings.warn("The `deterministic` argument has no effect unless the "
   2015                       "`num_parallel_calls` argument is specified.")
-> 2016       return MapDataset(self, map_func, preserve_cardinality=True, name=name)
   2017     else:
   2018       return ParallelMapDataset(

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py in __init__(self, input_dataset, map_func, use_inter_op_parallelism, preserve_cardinality, use_legacy_function, name)
   5193         self._transformation_name(),
   5194         dataset=input_dataset,
-> 5195         use_legacy_function=use_legacy_function)
   5196     self._metadata = dataset_metadata_pb2.Metadata()
   5197     if name:

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tensorflow/python/data/ops/structured_function.py in __init__(self, func, transformation_name, dataset, input_classes, input_shapes, input_types, input_structure, add_to_graph, use_legacy_function, defun_kwargs)
    269         fn_factory = trace_tf_function(defun_kwargs)
    270 
--> 271     self._function = fn_factory()
    272     # There is no graph to add in eager mode.
    273     add_to_graph &= not context.executing_eagerly()

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tensorflow/python/eager/function.py in get_concrete_function(self, *args, **kwargs)
   3069     """
   3070     graph_function = self._get_concrete_function_garbage_collected(
-> 3071         *args, **kwargs)
   3072     graph_function._garbage_collector.release()  # pylint: disable=protected-access
   3073     return graph_function

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tensorflow/python/eager/function.py in _get_concrete_function_garbage_collected(self, *args, **kwargs)
   3034       args, kwargs = None, None
   3035     with self._lock:
-> 3036       graph_function, _ = self._maybe_define_function(args, kwargs)
   3037       seen_names = set()
   3038       captured = object_identity.ObjectIdentitySet(

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tensorflow/python/eager/function.py in _maybe_define_function(self, args, kwargs)
   3290 
   3291           self._function_cache.add_call_context(cache_key.call_context)
-> 3292           graph_function = self._create_graph_function(args, kwargs)
   3293           self._function_cache.add(cache_key, cache_key_deletion_observer,
   3294                                    graph_function)

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tensorflow/python/eager/function.py in _create_graph_function(self, args, kwargs, override_flat_arg_shapes)
   3138             arg_names=arg_names,
   3139             override_flat_arg_shapes=override_flat_arg_shapes,
-> 3140             capture_by_value=self._capture_by_value),
   3141         self._function_attributes,
   3142         function_spec=self.function_spec,

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py in func_graph_from_py_func(name, python_func, args, kwargs, signature, func_graph, autograph, autograph_options, add_control_dependencies, arg_names, op_return_value, collections, capture_by_value, override_flat_arg_shapes, acd_record_initial_resource_uses)
   1159         _, original_func = tf_decorator.unwrap(python_func)
   1160 
-> 1161       func_outputs = python_func(*func_args, **func_kwargs)
   1162 
   1163       # invariant: `func_outputs` contains only Tensors, CompositeTensors,

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tensorflow/python/data/ops/structured_function.py in wrapped_fn(*args)
    246           attributes=defun_kwargs)
    247       def wrapped_fn(*args):  # pylint: disable=missing-docstring
--> 248         ret = wrapper_helper(*args)
    249         ret = structure.to_tensor_list(self._output_structure, ret)
    250         return [ops.convert_to_tensor(t) for t in ret]

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tensorflow/python/data/ops/structured_function.py in wrapper_helper(*args)
    175       if not _should_unpack(nested_args):
    176         nested_args = (nested_args,)
--> 177       ret = autograph.tf_convert(self._func, ag_ctx)(*nested_args)
    178       if _should_pack(ret):
    179         ret = tuple(ret)

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py in wrapper(*args, **kwargs)
    687       try:
    688         with conversion_ctx:
--> 689           return converted_call(f, args, kwargs, options=options)
    690       except Exception as e:  # pylint:disable=broad-except
    691         if hasattr(e, 'ag_error_metadata'):

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py in converted_call(f, args, kwargs, caller_fn_scope, options)
    375 
    376   if not options.user_requested and conversion.is_allowlisted(f):
--> 377     return _call_unconverted(f, args, kwargs, options)
    378 
    379   # internal_convert_user_code is for example turned off when issuing a dynamic

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py in _call_unconverted(f, args, kwargs, options, update_cache)
    456 
    457   if kwargs is not None:
--> 458     return f(*args, **kwargs)
    459   return f(*args)
    460 

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
     65     except Exception as e:  # pylint: disable=broad-except
     66       filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67       raise e.with_traceback(filtered_tb) from None
     68     finally:
     69       del filtered_tb

~/.pyenv/versions/3.7.8/lib/python3.7/site-packages/keras/layers/core/dense.py in build(self, input_shape)
    137     last_dim = tf.compat.dimension_value(input_shape[-1])
    138     if last_dim is None:
--> 139       raise ValueError('The last dimension of the inputs to a Dense layer '
    140                        'should be defined. Found None. '
    141                        f'Full input shape received: {input_shape}')

ValueError: Exception encountered when calling layer "sequential_5" (type Sequential).

The last dimension of the inputs to a Dense layer should be defined. Found None. Full input shape received: <unknown>

Call arguments received:
  • inputs=tf.Tensor(shape=(None,), dtype=string)
  • training=None
  • mask=None

I am not quite sure where I should set the shape, since using regular tensors rather than a TF dataset works fine.

ANSWER

Answered 2022-Mar-10 at 08:57

You will have to explicitly set the shapes of the tensors coming from tf.py_function. Using None allows variable input lengths, but the BERT output dimension (384) does need to be set explicitly:

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

MODEL_PATH = 'sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = TFAutoModel.from_pretrained(MODEL_PATH, from_pt=True)

class SBert(tf.keras.layers.Layer):
    def __init__(self, tokenizer, model):
        super(SBert, self).__init__()
        
        self.tokenizer = tokenizer
        self.model = model
        
    def tf_encode(self, inputs):
        def encode(inputs):
            inputs = [x[0].decode("utf-8") for x in inputs.numpy()]
            outputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='tf')
            return outputs['input_ids'], outputs['token_type_ids'], outputs['attention_mask']
        return tf.py_function(func=encode, inp=[inputs], Tout=[tf.int32, tf.int32, tf.int32])
    
    def process(self, i, t, a):
      def __call(i, t, a):
        model_output = self.model({'input_ids': i.numpy(), 'token_type_ids': t.numpy(), 'attention_mask': a.numpy()})
        return model_output[0]
      return tf.py_function(func=__call, inp=[i, t, a], Tout=[tf.float32])

    def mean_pooling(self, model_output, attention_mask):
        token_embeddings = tf.squeeze(tf.stack(model_output), axis=0)
        input_mask_expanded = tf.cast(
            tf.broadcast_to(tf.expand_dims(attention_mask, -1), tf.shape(token_embeddings)),
            tf.float32
        )
        a = tf.math.reduce_sum(token_embeddings * input_mask_expanded, axis=1)
        b = tf.clip_by_value(tf.math.reduce_sum(input_mask_expanded, axis=1), 1e-9, tf.float32.max)
        embeddings = a / b
        embeddings, _ = tf.linalg.normalize(embeddings, 2, axis=1)
        return embeddings

    def call(self, inputs):
        input_ids, token_type_ids, attention_mask = self.tf_encode(inputs)
        input_ids.set_shape(tf.TensorShape((None, None)))
        token_type_ids.set_shape(tf.TensorShape((None, None)))
        attention_mask.set_shape(tf.TensorShape((None, None)))

        model_output = self.process(input_ids, token_type_ids, attention_mask)
        model_output[0].set_shape(tf.TensorShape((None, None, 384)))
        embeddings = self.mean_pooling(model_output, attention_mask)
        return embeddings

    
sbert = SBert(tokenizer, model)
inputs = tf.keras.layers.Input((1,), dtype=tf.string)
outputs = sbert(inputs)
outputs = tf.keras.layers.Dense(32)(outputs)
model = tf.keras.Model(inputs, outputs)
print(model(tf.constant(['some text', 'more text'])))
print(model.summary())
tf.Tensor(
[[-0.06719425 -0.02954631 -0.05811356 -0.1456391  -0.13001677  0.00145465
   0.0401044   0.05949172 -0.02589339  0.07255618 -0.00958113  0.01159782
   0.02508018  0.03075579 -0.01910635 -0.03231853  0.00875124  0.01143366
  -0.04365401 -0.02090197  0.07030752 -0.02872834  0.10535908  0.05691438
  -0.017165   -0.02044982  0.02580127 -0.04564123 -0.0631128  -0.00303708
   0.00133517  0.01613527]
 [-0.11922387  0.02304137 -0.02670465 -0.13117084 -0.11492493  0.03961402
   0.08129141 -0.05999354  0.0039564   0.02892766  0.00493046  0.00440936
  -0.07966737  0.11354238  0.03141225  0.00048972  0.04658606 -0.03658888
  -0.05292419 -0.04639702  0.08445395  0.00522146  0.04359548  0.0290177
  -0.02171512 -0.03399373 -0.00418095 -0.04019783 -0.04733383 -0.03972956
   0.01890458 -0.03927581]], shape=(2, 32), dtype=float32)
Model: "model_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_18 (InputLayer)       [(None, 1)]               0         
                                                                 
 s_bert_17 (SBert)           (None, 384)               22713216  
                                                                 
 dense_78 (Dense)            (None, 32)                12320     
                                                                 
=================================================================
Total params: 22,725,536
Trainable params: 22,725,536
Non-trainable params: 0
_________________________________________________________________
None

Source https://stackoverflow.com/questions/71414627

Community Discussions and Code Snippets include sources from the Stack Exchange Network.

Vulnerabilities

No vulnerabilities reported

Install sentence-transformers

We recommend Python 3.6 or higher, PyTorch 1.6.0 or higher, and transformers v4.6.0 or higher. The code does not work with Python 2.7.
See the Quickstart in our documentation. It shows how to use an already trained Sentence Transformer model to embed sentences for another task: first download a pretrained model, then provide some sentences to it, and you get back a list of NumPy arrays with the embeddings.

Support

For any new features, suggestions, and bugs, create an issue on GitHub. If you have questions, check and ask them on Stack Overflow.

  • © 2022 Open Weaver Inc.