bert-sense | Source code accompanying the KONVENS 2019 paper | Natural Language Processing library
kandi X-RAY | bert-sense Summary
Source code accompanying the KONVENS 2019 paper "Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings"
Top functions reviewed by kandi - BETA
- Compute the accuracy of the trained model
- Train Embeddings
- Return a list of SemCor Sentence objects
- Collects a list of the words and their senses
- Parse Sentence
- Create the word embedding map
- Compute tensorflow embeddings
- Given a sentence return a list of tokens
- Loads the word sense embedding
- Opens an XML file
- Applies the BERT tokenizer
Community Discussions
Trending Discussions on bert-sense
QUESTION
I am working on a word-level classification task on multilingual data using XLM-R. I know that XLM-R uses SentencePiece
as its tokenizer, which sometimes splits words into sub-words.
For example, the sentence "deception master" is tokenized as
de
ception
master
so the word "deception" has been split into two sub-words.
How can I get the embedding of "deception"?
I can take the mean of the sub-word embeddings to get the embedding of the word, as done here (a sketch of that idea is shown after this question). But I have to implement my code in TensorFlow, and the TensorFlow computational graph doesn't support NumPy.
I could take the mean of the sub-word embeddings, store the final hidden embeddings in a NumPy array, and give this array as input to the model, but I want to fine-tune the transformer.
How can I get the word embeddings from the sub-word embeddings given by the transformer?
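For reference, this is roughly what the averaging idea from the question looks like when done inside a TensorFlow graph rather than in NumPy; the tensors below are hypothetical and not from the original post:

```python
import tensorflow as tf

# Hypothetical sub-word embeddings for ["de", "ception", "master"], shape [num_subwords, hidden_dim]
subword_embeddings = tf.constant([[1.0, 2.0],
                                  [3.0, 4.0],
                                  [5.0, 6.0]])

# Word index of each sub-word: "de" and "ception" belong to word 0, "master" to word 1
word_ids = tf.constant([0, 0, 1])

# Mean-pool sub-words into words entirely inside the TF graph (no NumPy needed)
word_embeddings = tf.math.segment_mean(subword_embeddings, word_ids)
print(word_embeddings)  # [[2. 3.], [5. 6.]] -> one vector per word
```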
ANSWER
Answered 2021-Mar-30 at 08:16
Joining subword embeddings into words for word labeling is not how this problem is usually approached. The usual approach is the opposite: keep the subwords as they are, but adjust the labels to respect the tokenization of the pre-trained model.
One of the reasons is that the data typically comes in batches. When merging subwords into words, every sentence in the batch would end up with a different length, which would require processing each sentence independently and padding the batch again – this would be slow. Also, if you do not average the neighboring embeddings, you get more fine-grained information from the loss function, which tells you explicitly which subword is responsible for an error.
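The following is not from the original answer, but sketches the label-adjustment idea using the HuggingFace fast tokenizer for XLM-R (the checkpoint name and label values are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

words = ["deception", "master"]   # pre-split words
word_labels = [1, 0]              # one (hypothetical) label per word

encoding = tokenizer(words, is_split_into_words=True)

# word_ids() maps every sub-word position back to its source word (None for special tokens),
# so the word-level labels can be expanded to match the sub-word tokenization
# instead of merging sub-word embeddings.
token_labels = [
    -100 if word_id is None else word_labels[word_id]
    for word_id in encoding.word_ids()
]
# -100 is conventionally ignored by the loss, so special tokens do not contribute to it
```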
When tokenizing using SentencePiece, you can get the indices in the original string:
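The code snippet that originally followed this sentence is not reproduced on this page. A minimal sketch of the same idea, assuming the HuggingFace fast tokenizer (which wraps the SentencePiece model and exposes character offsets), might look like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

text = "deception master"
encoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)

# Each sub-word carries its (start, end) character span in the original string,
# so sub-words whose spans touch can be grouped back into the same word.
for token, (start, end) in zip(encoding.tokens(), encoding["offset_mapping"]):
    print(token, (start, end), repr(text[start:end]))
```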
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install bert-sense
You can use bert-sense like any standard Python library. You will need a development environment consisting of a Python distribution (including header files), a compiler, pip, and git. Make sure that your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changes to the system.