self-attention | Transformer | Artificial Intelligence library
kandi X-RAY | self-attention Summary
Top functions reviewed by kandi - BETA
- Construct attention layer
- Dense layer
- Mask inputs
- Compute the Q sequence
- Mask input tensors
self-attention Key Features
self-attention Examples and Code Snippets
import torch
from vit_pytorch.scalable_vit import ScalableViT

model = ScalableViT(
    num_classes = 1000,
    dim = 64,               # starting model dimension; at every stage, the dimension is doubled
    heads = (2, 4, 8, 16),  # number of attention heads at each stage
    # (remaining constructor arguments omitted)
)
import torch
from vit_pytorch.sep_vit import SepViT

v = SepViT(
    num_classes = 1000,
    dim = 32,       # dimensions of first stage, which doubles every stage (32, 64, 128, 256) for SepViT-Lite
    dim_head = 32,  # attention head dimension
    # (remaining constructor arguments omitted)
)
Community Discussions
Trending Discussions on self-attention
QUESTION
The configuration file for the HuggingFace google/mt5-small Model (https://huggingface.co/google/mt5-small) defines
...ANSWER
Answered 2022-Jan-20 at 09:48
This is a very good question, and it shows a common misconception about Transformers, stemming from an (unfortunate) formulation in the original Transformer paper. In particular, the authors write the following in Section 3.2.2:
In this work, we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model / h = 64. [...]
Note that the equality d_k = d_v = d_model / h is not strictly necessary; it is only important that you match the final hidden representation size (d_model) after the feed-forward portion of each layer. Specifically for mt5-small, the authors actually use an internal dimension of 384, which is simply the product d_kv * num_heads = 64 * 6.
Now, the problem is that many libraries make a similar assumption about the enforced relation between d_kv and d_model, because it saves some implementation effort that most people won't use anyway. I suspect (not being super familiar with AllenNLP) that they have made similar assumptions here, which is why you cannot load the model.
Also, to clarify this, here is a peek at the modules of a loaded mt5-small:
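For illustration, a rough sketch of such a peek, assuming the Hugging Face transformers module layout for T5/MT5 models (each encoder block exposes a SelfAttention submodule):
from transformers import MT5ForConditionalGeneration

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Inner self-attention of the first encoder block
attn = model.encoder.block[0].layer[0].SelfAttention
print(attn.q)  # Linear(in_features=512, out_features=384, bias=False): d_model -> d_kv * num_heads
print(attn.o)  # Linear(in_features=384, out_features=512, bias=False): back to d_model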
QUESTION
I have a model that needs to implement self-attention and this is how I wrote my code:
...ANSWER
Answered 2021-Aug-11 at 11:48
"In other words, is this necessary?"
In short, No.
The SelfAttention class will be loaded automatically if it has been registered as an nn.Module, an nn.Parameter, or a manually registered buffer.
A quick example:
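As a minimal sketch of such an example (the SelfAttention module here is a hypothetical stand-in, not the one from the question):
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1)
        return self.out(attn @ v)

class Model(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # Assigning the submodule as an attribute registers it automatically,
        # so its parameters appear in model.parameters() and the state dict.
        self.attention = SelfAttention(dim)

    def forward(self, x):
        return self.attention(x)

model = Model()
print(any("attention" in name for name, _ in model.named_parameters()))  # True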
QUESTION
I have n vectors which need to be influenced by each other and output n vectors with the same dimensionality d. I believe this is what torch.nn.MultiheadAttention does. But the forward function expects query, key and value as inputs. According to this blog, I need to initialize a random weight matrix of shape (d x d) for each of q, k and v, multiply each of my vectors by these weight matrices, and get three (n x d) matrices. Now, are the q, k and v expected by torch.nn.MultiheadAttention just these three matrices, or do I have it mistaken?
ANSWER
Answered 2021-Jan-09 at 16:34
When you want to use self-attention, just pass your input vector into torch.nn.MultiheadAttention for the query, key and value.
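A minimal sketch of that usage (shapes follow the default sequence-first layout of nn.MultiheadAttention):
import torch
import torch.nn as nn

n, d = 10, 64             # n vectors of dimensionality d
x = torch.randn(n, 1, d)  # (sequence, batch, embedding)

attn = nn.MultiheadAttention(embed_dim=d, num_heads=8)

# For self-attention, query, key and value are all the same input;
# the module applies its own learned projections internally.
out, weights = attn(x, x, x)
print(out.shape)      # torch.Size([10, 1, 64])
print(weights.shape)  # torch.Size([1, 10, 10]), averaged over heads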
QUESTION
I'm implementing the self-attention part of a transformer encoder using PyTorch's nn.MultiheadAttention and I am confused about the padding masking of the transformer.
The following picture shows the self-attention weights of the query (rows) and key (columns).
As you can see, there are some padding tokens, and I have already masked them in the key, so those tokens receive no attention weight.
There are still two questions:
In the query part, can I also mask the padding tokens, except for the red square part? Is this reasonable?
How can I mask the padding tokens in the query?
The attention weights also use the softmax function along the row, given a mask in the src_mask or src_key_padding_mask argument. If I set a whole padding row to -inf, the softmax will return nan and the loss will be nan.
ANSWER
Answered 2020-Dec-15 at 10:41
There is no need to mask the queries during self-attention. It should be enough if you do not use the states corresponding to the padding tokens later in the network (either as hidden states or as keys/values); they will not influence the loss function or anything else in the network.
If you want to make sure that you did not introduce a bug that lets gradients flow through the padding tokens, you can explicitly zero out the self-attention output using torch.where after it is computed.
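A small sketch of that zeroing step (attn_out and pad_mask are illustrative names, not from the original code):
import torch

attn_out = torch.randn(2, 5, 8)   # (batch, seq_len, d_model) self-attention output
pad_mask = torch.tensor([[False, False, False, True, True],
                         [False, False, True, True, True]])  # True where the token is padding

# Zero out the states at padded positions so nothing flows through them.
attn_out = torch.where(pad_mask.unsqueeze(-1), torch.zeros_like(attn_out), attn_out)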
QUESTION
I am implementing Multi-Head Self-Attention in PyTorch now. I looked at a couple of implementations and they seem a bit wrong, or at least I am not sure why they are done the way they are. They would often apply the linear projection just once:
...ANSWER
Answered 2020-Dec-17 at 13:03
They are equivalent.
Theoretically (and in paper writing), it is easier to consider them as separate linear projections: if you have 8 heads, and each head has an M->N projection, then you would have 8 N-by-M matrices.
In implementation, though, it is faster to perform an M->8N transformation with a single 8N-by-M matrix.
One can concatenate the matrices in the first formulation to obtain the matrix in the second formulation.
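A quick numerical check of that equivalence (dimensions chosen arbitrarily for illustration):
import torch
import torch.nn as nn

d_model, n_heads, d_head = 64, 8, 8
x = torch.randn(3, d_model)

# First formulation: one M->N projection per head
heads = [nn.Linear(d_model, d_head, bias=False) for _ in range(n_heads)]
per_head = torch.cat([h(x) for h in heads], dim=-1)

# Second formulation: a single M->8N projection whose weight matrix is the
# per-head weight matrices concatenated along the output dimension
fused = nn.Linear(d_model, n_heads * d_head, bias=False)
with torch.no_grad():
    fused.weight.copy_(torch.cat([h.weight for h in heads], dim=0))

print(torch.allclose(per_head, fused(x), atol=1e-6))  # True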
QUESTION
At training time, as far as I understand from the "Attention is all you need" paper, the way that masked-self-attention is used in the decoder is by feeding the output sequence multiple times, each time removing the mask from the next token.
Q1. At inference time, the expected output sequence length is not known. How do you decide on how many masked tokens to add? Do you always fill the max-length of your input with masked tokens and stop when an end of sequence symbol is predicted?
Q2. The GPT inference objective task is a little different. A "query" vector is injected into the model (for example [text1;text2] and [text2;text1] in the similarity task). How is the masking used in this scenario? I would expect that the whole sequence is injected in only one step with no masking; however, this contradicts the masked self-attention methodology.
...ANSWER
Answered 2020-Nov-12 at 09:05
In the standard Transformer, the target sentence is provided to the decoder only once (you might be confusing that with the masked language-model objective for BERT).
The purpose of the masking is to make sure that the states do not attend to tokens that are "in the future" but only to those "in the past". The mask looks like this (queries are on the vertical axis; keys and values on the horizontal axis):
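Such a causal mask can be built as follows; True marks the blocked "future" positions, which matches the boolean convention of PyTorch's attn_mask argument:
import torch

seq_len = 5
# Row i (query) may attend to columns <= i (keys/values); True marks blocked positions.
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
print(mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])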
QUESTION
I am confused by these two structures. In theory, the outputs of both are connected to all of their inputs. What magic makes the 'self-attention mechanism' more powerful than the fully-connected layer?
...ANSWER
Answered 2020-Oct-06 at 03:33
Ignoring details like normalization, biases, and such, fully connected networks are fixed-weights:
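Roughly, the contrast is between y = W x, where W is fixed after training, and self-attention, where the mixing weights are computed from the input itself. A small sketch of that difference:
import torch
import torch.nn as nn

d, n = 8, 5
x = torch.randn(n, d)

# Fully connected layer: the mixing weights W are fixed after training
# and applied identically to every input.
fc = nn.Linear(d, d, bias=False)
y_fc = fc(x)

# Self-attention: the mixing weights (the attention matrix) depend on x,
# so different inputs are combined with different, data-dependent weights.
wq, wk, wv = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)
q, k, v = wq(x), wk(x), wv(x)
attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)  # (n, n), a function of the input
y_attn = attn @ v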
QUESTION
Following an amazing blog, I implemented my own self-attention module. However, I found PyTorch has already implemented a multi-head attention module. The input to the forward pass of the MultiheadAttention module includes Q (the query vector), K (the key vector), and V (the value vector). It is strange that PyTorch wouldn't just take the input embedding and compute the Q, K, V vectors on the inside. In the self-attention module that I implemented, I compute these Q, K, V vectors from the input embeddings multiplied by the Q, K, V weights. At this point, I am not sure what the Q, K, and V vector inputs required by the MultiheadAttention module should be. Should they be the Q, K, and V weights or vectors, and should these be normal tensors or Parameters?
ANSWER
Answered 2020-Aug-04 at 16:13
If you look at the implementation of multi-head attention in PyTorch, the Q, K and V projections are learned during the training process, and in most cases they should be smaller than the embedding vectors. So you just need to define their dimensions; everything else is handled by the module. You have two choices:
QUESTION
I'm trying to understand the newly implemented keras transformer class: https://keras.io/examples/nlp/text_classification_with_transformer/
I see text is first embedded and then self-attention is used. But what if I want to use another embedding than the TokenAndPositionEmbedding, e.g. in my case I have pre-embedded sentences and would like to use self-attention on them?
What I don't understand is what the self.pos_emb does. The class TokenAndPositionEmbedding is returning x and positions, with x being the token_embedding and positions being the number of words to consider? So it's basically returning two things? I don't understand that.
ANSWER
Answered 2020-May-27 at 16:53
As you know, the transformer is a structure based on nothing but lots of Dense layers together with the concept of residual connections; however, this makes sequential data lose its time dependence. So for a transformer you need to encode the position, which you can consider as additional information for this structure so that it won't miss the time dependence. If you would like to understand it better using Keras, I suggest the official tutorial written by TensorFlow, https://www.tensorflow.org/tutorials/text/transformer, which details the things you would like to know.
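If the sentences are already embedded, only the position half of TokenAndPositionEmbedding is needed; a rough sketch along the lines of the Keras example (stripped down to the position embedding) could look like this:
import tensorflow as tf
from tensorflow.keras import layers

class PositionEmbedding(layers.Layer):
    """Adds a learned position embedding to inputs that are already embedded."""
    def __init__(self, maxlen, embed_dim):
        super().__init__()
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[1], delta=1)
        return x + self.pos_emb(positions)  # broadcasts over the batch dimension

# x: pre-embedded sentences of shape (batch, maxlen, embed_dim)
x = tf.random.normal((2, 50, 64))
y = PositionEmbedding(maxlen=50, embed_dim=64)(x)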
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install self-attention
You can use self-attention like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.