self-attention | Transformer | Artificial Intelligence library
kandi X-RAY | self-attention Summary
Top functions reviewed by kandi - BETA
- Construct attention layer
- Dense layer
- Mask inputs
- Compute the Q sequence
- Mask input tensors
self-attention Key Features
self-attention Examples and Code Snippets
import torch
from vit_pytorch.scalable_vit import ScalableViT

model = ScalableViT(
    num_classes = 1000,
    dim = 64,               # starting model dimension; at every stage, the dimension is doubled
    heads = (2, 4, 8, 16),  # number of attention heads at each stage
    # (remaining constructor arguments omitted)
)
import torch
from vit_pytorch.sep_vit import SepViT

v = SepViT(
    num_classes = 1000,
    dim = 32,       # dimensions of first stage, which doubles every stage (32, 64, 128, 256) for SepViT-Lite
    dim_head = 32,  # attention head dimension
    # (remaining constructor arguments omitted)
)
Community Discussions
Trending Discussions on self-attention
QUESTION
The configuration file for the HuggingFace google/mt5-small Model (https://huggingface.co/google/mt5-small) defines
...ANSWER
Answered 2022-Jan-20 at 09:48
This is a very good question, and it shows a common misconception about Transformers, stemming from an (unfortunate) formulation in the original Transformer paper. In particular, the authors write the following in Section 3.2.2:
In this work, we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model / h = 64. [...]
Note that the equality d_k = d_v = d_model / h is not strictly necessary; it is only important that you match the final hidden representation size (d_model) after the feed-forward portion of each layer. Specifically for mt5-small, the authors actually use an internal dimension of 384, which is simply the product d_kv * num_heads = 64 * 6.
Now, the problem is that many libraries make a similar assumption about the enforced relation between d_kv and d_model, because it saves some implementation effort that most people won't use anyway. I suspect (not being super familiar with AllenNLP) that they have made similar assumptions here, which is why you cannot load the model.
Also, to clarify this, here is a peek at the modules of a loaded mt5-small:
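For illustration, a rough sketch of such a peek, assuming the Hugging Face transformers module layout for T5/MT5 models (each encoder block exposes a SelfAttention submodule):
from transformers import MT5ForConditionalGeneration

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Inner self-attention of the first encoder block
attn = model.encoder.block[0].layer[0].SelfAttention
print(attn.q)  # Linear(in_features=512, out_features=384, bias=False): d_model -> d_kv * num_heads
print(attn.o)  # Linear(in_features=384, out_features=512, bias=False): back to d_model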
QUESTION
I have a model that needs to implement self-attention and this is how I wrote my code:
...ANSWER
Answered 2021-Aug-11 at 11:48
"In other words, is this necessary?"
In short, No.
The SelfAttention class will be loaded automatically if it has been registered as an nn.Module, an nn.Parameter, or a manually registered buffer.
A quick example:
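As a minimal sketch of such an example (the SelfAttention module here is a hypothetical stand-in, not the one from the question):
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1)
        return self.out(attn @ v)

class Model(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # Assigning the submodule as an attribute registers it automatically,
        # so its parameters appear in model.parameters() and the state dict.
        self.attention = SelfAttention(dim)

    def forward(self, x):
        return self.attention(x)

model = Model()
print(any("attention" in name for name, _ in model.named_parameters()))  # True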
QUESTION
I have n vectors which need to be influenced by each other and output n vectors with the same dimensionality d. I believe this is what torch.nn.MultiheadAttention does. But the forward function expects query, key and value as inputs. According to this blog, I need to initialize a random weight matrix of shape (d x d) for each of q, k and v, multiply each of my vectors by these weight matrices, and get three (n x d) matrices. Now, are the q, k and v expected by torch.nn.MultiheadAttention just these three matrices, or do I have it mistaken?
ANSWER
Answered 2021-Jan-09 at 16:34
When you want to use self-attention, just pass your input vector into torch.nn.MultiheadAttention for the query, key and value.
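A minimal sketch of that usage (shapes follow the default sequence-first layout of nn.MultiheadAttention):
import torch
import torch.nn as nn

n, d = 10, 64             # n vectors of dimensionality d
x = torch.randn(n, 1, d)  # (sequence, batch, embedding)

attn = nn.MultiheadAttention(embed_dim=d, num_heads=8)

# For self-attention, query, key and value are all the same input;
# the module applies its own learned projections internally.
out, weights = attn(x, x, x)
print(out.shape)      # torch.Size([10, 1, 64])
print(weights.shape)  # torch.Size([1, 10, 10]), averaged over heads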
QUESTION
I'm implementing the self-attention part of a transformer encoder using PyTorch's nn.MultiheadAttention and I am confused about the padding masking of the transformer.
The following picture shows the self-attention weights of the query (rows) and key (columns).
As you can see, there are some padding tokens, and I have already masked them in the key, so those tokens receive no attention weight.
There are still two questions:
In the query part, can I also mask the padding tokens, except for the red square part? Is this reasonable?
How can I mask the padding tokens in the query?
The attention weights also use the softmax function along the row, given a mask in the src_mask or src_key_padding_mask argument. If I set a whole padding row to -inf, the softmax will return nan and the loss will be nan.
ANSWER
Answered 2020-Dec-15 at 10:41
There is no need to mask the queries during self-attention. It should be enough if you do not use the states corresponding to the padding tokens later in the network (either as hidden states or as keys/values); they will not influence the loss function or anything else in the network.
If you want to make sure that you did not introduce a bug that lets gradients flow through the padding tokens, you can explicitly zero out the self-attention output using torch.where after it is computed.
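A small sketch of that zeroing step (attn_out and pad_mask are illustrative names, not from the original code):
import torch

attn_out = torch.randn(2, 5, 8)   # (batch, seq_len, d_model) self-attention output
pad_mask = torch.tensor([[False, False, False, True, True],
                         [False, False, True, True, True]])  # True where the token is padding

# Zero out the states at padded positions so nothing flows through them.
attn_out = torch.where(pad_mask.unsqueeze(-1), torch.zeros_like(attn_out), attn_out)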
QUESTION
I am implementing Multi-Head Self-Attention in PyTorch now. I looked at a couple of implementations and they seem a bit wrong, or at least I am not sure why they are done the way they are. They would often apply the linear projection just once:
...ANSWER
Answered 2020-Dec-17 at 13:03
They are equivalent.
Theoretically (and in paper writing), it is easier to consider them as separate linear projections: if you have 8 heads, and each head has an M->N projection, then you would have 8 N-by-M matrices.
In implementation, though, it is faster to perform an M->8N transformation with a single 8N-by-M matrix.
One can concatenate the matrices in the first formulation to obtain the matrix in the second formulation.
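A quick numerical check of that equivalence (dimensions chosen arbitrarily for illustration):
import torch
import torch.nn as nn

d_model, n_heads, d_head = 64, 8, 8
x = torch.randn(3, d_model)

# First formulation: one M->N projection per head
heads = [nn.Linear(d_model, d_head, bias=False) for _ in range(n_heads)]
per_head = torch.cat([h(x) for h in heads], dim=-1)

# Second formulation: a single M->8N projection whose weight matrix is the
# per-head weight matrices concatenated along the output dimension
fused = nn.Linear(d_model, n_heads * d_head, bias=False)
with torch.no_grad():
    fused.weight.copy_(torch.cat([h.weight for h in heads], dim=0))

print(torch.allclose(per_head, fused(x), atol=1e-6))  # True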
QUESTION
At training time, as far as I understand from the "Attention is all you need" paper, the way that masked-self-attention is used in the decoder is by feeding the output sequence multiple times, each time removing the mask from the next token.
Q1. At inference time, the expected output sequence length is not known. How do you decide on how many masked tokens to add? Do you always fill the max-length of your input with masked tokens and stop when an end of sequence symbol is predicted?
Q2. The GPT inference objective task is a little different. A "query" vector is injected into the model (for example [text1;text2] and [text2;text1] in the similarity task). How is the masking used in this scenario? I would expect that the whole sequence is injected in only one step with no masking; however, this contradicts the masked self-attention methodology.
...ANSWER
Answered 2020-Nov-12 at 09:05
In the standard Transformer, the target sentence is provided to the decoder only once (you might be confusing that with the masked language-model objective for BERT).
The purpose of the masking is to make sure that the states do not attend to tokens that are "in the future" but only to those "in the past". The mask looks like this (queries are on the vertical axis; keys and values on the horizontal axis):
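Such a causal mask can be built as follows; True marks the blocked "future" positions, which matches the boolean convention of PyTorch's attn_mask argument:
import torch

seq_len = 5
# Row i (query) may attend to columns <= i (keys/values); True marks blocked positions.
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
print(mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])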
QUESTION
I am confused by these two structures. In theory, the outputs of both are connected to all of their inputs. What magic makes the 'self-attention mechanism' more powerful than the fully-connected layer?
...ANSWER
Answered 2020-Oct-06 at 03:33
Ignoring details like normalization, biases, and such, fully connected networks are fixed-weights:
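Roughly, the contrast is between y = W x, where W is fixed after training, and self-attention, where the mixing weights are computed from the input itself. A small sketch of that difference:
import torch
import torch.nn as nn

d, n = 8, 5
x = torch.randn(n, d)

# Fully connected layer: the mixing weights W are fixed after training
# and applied identically to every input.
fc = nn.Linear(d, d, bias=False)
y_fc = fc(x)

# Self-attention: the mixing weights (the attention matrix) depend on x,
# so different inputs are combined with different, data-dependent weights.
wq, wk, wv = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)
q, k, v = wq(x), wk(x), wv(x)
attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)  # (n, n), a function of the input
y_attn = attn @ v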
QUESTION
Following an amazing blog, I implemented my own self-attention module. However, I found PyTorch has already implemented a multi-head attention module. The input to the forward pass of the MultiheadAttention module includes Q (the query vector), K (the key vector), and V (the value vector). It is strange that PyTorch wouldn't just take the input embedding and compute the Q, K, V vectors on the inside. In the self-attention module that I implemented, I compute these Q, K, V vectors from the input embeddings multiplied by the Q, K, V weights. At this point, I am not sure what the Q, K, and V vector inputs required by the MultiheadAttention module should be. Should they be the Q, K, and V weights or vectors, and should these be normal tensors or Parameters?
ANSWER
Answered 2020-Aug-04 at 16:13
If you look at the implementation of multi-head attention in PyTorch, the Q, K and V projections are learned during the training process, and in most cases they should be smaller than the embedding vectors. So you just need to define their dimensions; everything else is handled by the module. You have two choices:
QUESTION
I'm trying to understand the newly implemented keras transformer class: https://keras.io/examples/nlp/text_classification_with_transformer/
I see text is first embedded and then self-attention is used. But what if I want to use another embedding than the TokenAndPositionEmbedding, e.g. in my case I have pre-embedded sentences and would like to use self-attention on them?
What I don't understand is what the self.pos_emb does. The class TokenAndPositionEmbedding is returning x and positions, with x being the token_embedding and positions being the number of words to consider? So it's basically returning two things? I don't understand that.
ANSWER
Answered 2020-May-27 at 16:53
As you know, the transformer is a structure based on nothing but lots of Dense layers together with the concept of residual connections; however, this makes sequential data lose its time dependence. So for a transformer you need to encode the position, which you can consider as additional information for this structure so that it won't miss the time dependence. If you would like to understand it better using Keras, I suggest the official tutorial written by TensorFlow, https://www.tensorflow.org/tutorials/text/transformer, which details the things you would like to know.
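If the sentences are already embedded, only the position half of TokenAndPositionEmbedding is needed; a rough sketch along the lines of the Keras example (stripped down to the position embedding) could look like this:
import tensorflow as tf
from tensorflow.keras import layers

class PositionEmbedding(layers.Layer):
    """Adds a learned position embedding to inputs that are already embedded."""
    def __init__(self, maxlen, embed_dim):
        super().__init__()
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[1], delta=1)
        return x + self.pos_emb(positions)  # broadcasts over the batch dimension

# x: pre-embedded sentences of shape (batch, maxlen, embed_dim)
x = tf.random.normal((2, 50, 64))
y = PositionEmbedding(maxlen=50, embed_dim=64)(x)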
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install self-attention
You can use self-attention like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.