pytorch-seq2seq | An open source framework for seq2seq models in PyTorch | Machine Learning library
kandi X-RAY | pytorch-seq2seq Summary
An open source framework for seq2seq models in PyTorch.
Top functions reviewed by kandi - BETA
- Forward computation
- Inflate a tensor
- Validate inputs
- Compute the predicted softmax
- Update the cuda
- Backtrack decoding
- Predict n features from src_seq
- Get features from src_seq
- Load a model
- Flatten the parameters
- Evaluate the model
- Reset acc_loss
- Generate a dataset
- Run the decoder
- Predict a sequence from src_seq
- Build the vocab
- Draw the cuda
pytorch-seq2seq Key Features
pytorch-seq2seq Examples and Code Snippets
Seq2seq(
  (encoder): EncoderRNN(
    (input_dropout): Dropout(p=0.5, inplace=False)
    (conv): Sequential(
      (0): Conv2d(1, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, a
make download-datasets
make normalize-datasets
make apply-transforms-sri-py150
make apply-transforms-c2s-java-small
make extract-transformed-tokens
./experiments/normal_seq2seq_train.sh
./experiments/run_attack_0.sh
./experiments/run_attack_1.sh
Community Discussions
Trending Discussions on pytorch-seq2seq
QUESTION
I am doing the following operation,
...
ANSWER
Answered 2020-Aug-27 at 06:07
I took a look at your code (which, by the way, didn't run with seq_len = 10), and the problem is that you hard-coded the batch_size to be equal to 1 (line 143 in your code). It looks like the example you are trying to run the model on has batch_size = 2. Just uncomment the previous line, where you wrote batch_size = query.shape[0], and everything runs fine.
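The asker's code is not reproduced above, so the following is only a minimal sketch of the point being made, assuming query and key are tensors of shape (batch_size, seq_len, hidden); the attention_scores helper is hypothetical:

import torch

def attention_scores(query, key):
    # Take the batch size from the input instead of hard-coding it,
    # so the same code works for batch_size = 1, 2, or anything else.
    batch_size = query.shape[0]
    assert key.shape[0] == batch_size
    # (batch, q_len, hidden) x (batch, hidden, k_len) -> (batch, q_len, k_len)
    return torch.bmm(query, key.transpose(1, 2))

q = torch.randn(2, 10, 64)   # the failing example had batch_size = 2
k = torch.randn(2, 10, 64)
print(attention_scores(q, k).shape)  # torch.Size([2, 10, 10])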
QUESTION
I'm implementing attention in PyTorch, and some questions came up while implementing the mechanism.
What is the initial state of the decoder, $s_0$? Some posts represent it as a zero vector and some implement it as the final hidden state of the encoder. So what is the real $s_0$? The original paper doesn't mention it.
Should I replace the maxout layer with a dropout layer? The original paper uses Goodfellow's maxout layer.
Is there any difference between the encoder's dropout probability and the decoder's? Some implementations set different dropout probabilities for the encoder and the decoder.
When calculating $a_{ij}$ in the alignment model (concat), there are two trainable weights, $W$ and $U$. I think the better way to implement this is with two linear layers. If I use linear layers, should I remove the bias term in them?
The dimension of the encoder's output ($H$) doesn't match the decoder's hidden state. $H$ is a concatenation, so its dimension is 2000 (in the original paper), while the decoder's hidden dimension is 1000. Do I need to add a linear layer after the encoder so that the encoder's dimension fits the decoder's?
ANSWER
Answered 2020-Jun-18 at 07:53
In general, many of the answers are: it is different in different implementations. The original implementation from the paper is at https://github.com/lisa-groundhog/GroundHog/tree/master/experiments/nmt. For later implementations that reached better translation quality, you can check:
Neural Monkey or Nematus in TensorFlow
OpenNMT in PyTorch
Marian in C++
Now to your points:
In the original paper, it was a zero vector. Later implementations use a projection of either the encoder's final state or the average of the encoder states. The argument for using the average is that it propagates the gradients more directly into the encoder states. However, this decision does not seem to influence translation quality much.
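A minimal sketch of the three choices described above, assuming a concatenated encoder state of dimension 2000 and a decoder hidden size of 1000; the bridge projection and init_decoder_state helper are illustrative names, not taken from any particular implementation:

import torch
import torch.nn as nn

ENC_DIM, DEC_DIM = 2000, 1000          # assumed sizes, matching the paper's setup
bridge = nn.Linear(ENC_DIM, DEC_DIM)   # projection used by later implementations

def init_decoder_state(encoder_states, mode="zero"):
    # encoder_states: (batch, src_len, ENC_DIM)
    batch = encoder_states.size(0)
    if mode == "zero":    # original paper: start from a zero vector
        return encoder_states.new_zeros(batch, DEC_DIM)
    if mode == "final":   # projection of the final encoder state
        return torch.tanh(bridge(encoder_states[:, -1]))
    if mode == "mean":    # projection of the averaged encoder states
        return torch.tanh(bridge(encoder_states.mean(dim=1)))
    raise ValueError(mode)

print(init_decoder_state(torch.randn(2, 7, ENC_DIM), mode="mean").shape)  # (2, 1000)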
A maxout layer is a variant of a non-linear layer. It is sort of two ReLU layers in one: you do two independent linear projections and take the maximum of them. You can happily replace maxout with ReLU (modern implementations do so), but you should still use dropout.
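A rough sketch of what that means in code, with hypothetical sizes; the Maxout module below is just the "two independent linear projections, take the element-wise maximum" idea, shown next to the plain ReLU-plus-dropout replacement:

import torch
import torch.nn as nn

class Maxout(nn.Module):
    # Two independent linear projections; the output is their element-wise maximum.
    def __init__(self, d_in, d_out):
        super().__init__()
        self.a = nn.Linear(d_in, d_out)
        self.b = nn.Linear(d_in, d_out)

    def forward(self, x):
        return torch.max(self.a(x), self.b(x))

# Modern replacement: a single projection with ReLU, still followed by dropout.
relu_block = nn.Sequential(nn.Linear(500, 500), nn.ReLU(), nn.Dropout(p=0.3))

x = torch.randn(4, 500)
print(Maxout(500, 500)(x).shape, relu_block(x).shape)  # both torch.Size([4, 500])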
I don't know of any meaningful use case in MT where I would set the dropout rates differently. Note, however, that seq2seq models are used in many wild scenarios where it might make sense.
Most implementations do use a bias when computing the attention energies. If you use two linear layers, the bias will be split across two variables. Biases are usually zero-initialized, so they will get the same gradients and the same updates. However, you can always disable the bias in a linear layer.
Yes, if you want to initialize $s_0$ with the encoder states. Within the attention mechanism itself, the matrix $U$ takes care of it.
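To tie the last two points together, here is a sketch of the additive (concat) alignment score with $W$ and $U$ as two linear layers; the bias is kept on only one of them so it is not duplicated, and $U$ maps the 2000-dimensional $H$ directly into the attention space, so no extra layer is needed for the attention itself. The dimensions follow the paper; the module name is made up:

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    # e_ij = v^T tanh(W s_{i-1} + U h_j), a_ij = softmax_j(e_ij)
    def __init__(self, dec_dim=1000, enc_dim=2000, attn_dim=1000):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim)               # acts on the decoder state
        self.U = nn.Linear(enc_dim, attn_dim, bias=False)   # acts on H; one bias is enough
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, H):
        # s_prev: (batch, dec_dim), H: (batch, src_len, enc_dim)
        energies = self.v(torch.tanh(self.W(s_prev).unsqueeze(1) + self.U(H)))
        return torch.softmax(energies.squeeze(-1), dim=-1)  # a_ij over source positions

attn = AdditiveAttention()
print(attn(torch.randn(2, 1000), torch.randn(2, 7, 2000)).shape)  # torch.Size([2, 7])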
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install pytorch-seq2seq
Currently, installation is only supported from source code using setuptools. Check out the source code and run the following commands. If you already have a version of PyTorch installed on your system, please verify that the active torch package is at least version 0.1.11.
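The install commands themselves are not shown on this page; as a quick check of the PyTorch requirement, you can print the active torch version and compare it against 0.1.11 by hand:

import torch

# The install notes require at least torch 0.1.11; this only prints the
# active version so it can be compared against that minimum.
print(torch.__version__)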