image_captioning | generate captions for images | Machine Learning library
kandi X-RAY | image_captioning Summary
Build a model to generate captions from images. Given an image, the model describes in English what is in the image. To achieve this, the model consists of a CNN encoder and an RNN decoder. The CNN encoder is given images for a classification task, and its output is fed into the RNN decoder, which outputs English sentences. The model and the tuning of its hyperparameters are based on ideas presented in the papers Show and Tell: A Neural Image Caption Generator and Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
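The repository's own source is not reproduced on this page, but the architecture described above can be sketched roughly as follows in PyTorch (a hypothetical sketch; the ResNet-50 backbone and all sizes are illustrative assumptions, not the project's exact code).

import torch
import torch.nn as nn
import torchvision.models as models

# Hypothetical sketch of the CNN encoder / RNN decoder described above.
class EncoderCNN(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(pretrained=True)   # pretrained CNN backbone (illustrative choice)
        modules = list(resnet.children())[:-1]      # drop the classification head
        self.resnet = nn.Sequential(*modules)
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        with torch.no_grad():
            features = self.resnet(images)          # (B, 2048, 1, 1)
        return self.embed(features.flatten(1))      # (B, embed_size)

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first "word" of the sequence.
        embeddings = self.embed(captions[:, :-1])                   # (B, S-1, E)
        inputs = torch.cat([features.unsqueeze(1), embeddings], 1)  # (B, S, E)
        hiddens, _ = self.lstm(inputs)                              # (B, S, H)
        return self.linear(hiddens)                                 # (B, S, vocab_size)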
Top functions reviewed by kandi - BETA
- Get a loader for a given transformation
- Return a list of random indices
- Validate the validation mode
- Return a list of words from the given indices
- Save the loss checkpoint
- Train model
- Save checkpoint
- Sample prediction
- Sample a beam search
- Generate a list of tokens corresponding to the input tensor
- Clean a sentence
- Load the vocab
- Add captions to corpus
- Build the vocabulary
- Add a word to the corpus
image_captioning Key Features
image_captioning Examples and Code Snippets
Community Discussions
Trending Discussions on image_captioning
QUESTION
Let us suppose I have a model like:
...ANSWER
Answered 2021-Jan-14 at 23:28 Absolutely. One way to demonstrate which words have the greatest impact is through integrated gradients methods. For PyTorch, one package you can use is Captum. I would check out this page for a good example: https://captum.ai/tutorials/IMDB_TorchText_Interpret
For Tensorflow, one package that you can use is Seldon. I would check out this page for a good example: https://docs.seldon.io/projects/alibi/en/stable/examples/integrated_gradients_imdb.html
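As a rough illustration of the Captum approach (a sketch only: model, model.embedding, and input_ids are assumed names for your classifier, its embedding layer, and a batch of token ids; see the linked tutorial for a complete example):

import torch
from captum.attr import LayerIntegratedGradients

# Attribute the prediction for class 1 to the inputs of the embedding layer.
lig = LayerIntegratedGradients(model, model.embedding)

attributions, delta = lig.attribute(
    input_ids,                               # (1, seq_len) token ids to explain
    baselines=torch.zeros_like(input_ids),   # baseline: an all-padding input
    target=1,                                # class index being explained
    return_convergence_delta=True,
)

# Sum over the embedding dimension to get one importance score per token.
token_scores = attributions.sum(dim=-1).squeeze(0)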
QUESTION
I am trying to understand the TensorFlow implementation of Image captioning with visual attention. I understand what SparseCategoricalCrossentropy is, but what is loss_function doing? Can someone explain?
TensorFlow implementation
ANSWER
Answered 2021-Mar-04 at 13:14 We need to go back to what is in real. In real we have words encoded as numbers with tf.keras.preprocessing.text.Tokenizer. In the tutorial, the value 0 is reserved for the padding token, so loss_function masks those positions out of the loss.
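For reference, the masking that the tutorial's loss_function performs looks roughly like this (a paraphrased sketch, not a verbatim copy of the tutorial):

import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))  # True wherever the target is not padding
    loss_ = loss_object(real, pred)                      # per-token loss
    loss_ *= tf.cast(mask, dtype=loss_.dtype)            # zero out padded positions
    return tf.reduce_mean(loss_)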
QUESTION
import cv2
from PIL import Image

def camera(transform):
    """Capture one frame from the webcam and return it as a model-ready image."""
    capture = cv2.VideoCapture(0)
    while True:
        ret, frame = capture.read()
        cv2.imshow('video', frame)
        # press Esc to take the photo
        if cv2.waitKey(1) == 27:
            photo = frame
            break
    capture.release()
    cv2.destroyAllWindows()
    # OpenCV uses BGR; convert to RGB before creating the PIL image
    img = Image.fromarray(cv2.cvtColor(photo, cv2.COLOR_BGR2RGB))
    img = img.resize([224, 224], Image.LANCZOS)
    if transform is not None:
        img = transform(img).unsqueeze(0)
    return img
...ANSWER
Answered 2021-Feb-17 at 07:50 You could convert your PIL.Image to a torch.Tensor with torchvision.transforms.ToTensor:
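For example, a transform along these lines could be passed to camera() (the normalization statistics are the usual ImageNet values and are only an assumption about the downstream model):

import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),                       # PIL.Image -> (C, H, W) float tensor in [0, 1]
    transforms.Normalize((0.485, 0.456, 0.406),  # ImageNet mean (assumed)
                         (0.229, 0.224, 0.225)), # ImageNet std (assumed)
])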
QUESTION
I have been checking out models with attention in the tutorials below.
https://www.tensorflow.org/tutorials/text/nmt_with_attention
and
https://www.tensorflow.org/tutorials/text/image_captioning
In both tutorials, I do not understand the part that defines the decoder.
In NMT with attention, the decoder part is as below:
...ANSWER
Answered 2020-Apr-17 at 07:19 The reason for the reshaping is the call to the fully connected layer, which in TensorFlow (unlike PyTorch) accepts only two-dimensional inputs.
In the first example, the call method of the decoder is supposed to be executed within a for loop for each time step (both at training and inference time). However, the GRU needs input of shape batch × length × dim, and if you call it step by step, the length is 1. A small illustrative sketch of that point follows below.
In the second example, you can call the decoder on the entire ground-truth sequence at training time, but it will still work with length 1, so you can use it in a for loop at inference time.
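Illustrative sketch (the 512 GRU units and 10,000-word vocabulary are made up; this is not the tutorial's code):

import tensorflow as tf

gru = tf.keras.layers.GRU(512, return_sequences=True, return_state=True)
fc = tf.keras.layers.Dense(10000)            # vocabulary size (illustrative)

x = tf.random.uniform((64, 1, 256))          # one decoding step: (batch, length=1, dim)
output, state = gru(x)                       # output: (batch, 1, 512)
output = tf.reshape(output, (-1, output.shape[2]))  # flatten to (batch * 1, 512) for the Dense layer
logits = fc(output)                          # (batch, vocab_size)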
QUESTION
I am looking here at the Bahdanau attention class. I noticed that the final shape of the context vector is (batch_size, hidden_size). I am wondering how they got that shape, given that attention_weights has shape (batch_size, 64, 1) and features has shape (batch_size, 64, embedding_dim). They multiplied the two (I believe it is a matrix product) and then summed up over the first axis. Where is the hidden size coming from in the context vector?
ANSWER
Answered 2020-Feb-03 at 00:31 The context vector resulting from Bahdanau attention is a weighted average of all the hidden states of the encoder. The following steps show how this is calculated. Essentially we do the following:
- Compute attention weights, which is a (batch size, encoder time steps, 1) sized tensor
- Multiply each hidden state (batch size, hidden size) element-wise with these weights, resulting in (batch size, encoder time steps, hidden size)
- Average over the time dimension, resulting in (batch size, hidden size)
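In code, that shape bookkeeping looks roughly like this (all sizes are illustrative):

import tensorflow as tf

batch, steps, hidden = 16, 64, 256
attention_weights = tf.random.uniform((batch, steps, 1))  # (batch, encoder time steps, 1)
features = tf.random.uniform((batch, steps, hidden))      # (batch, encoder time steps, hidden)

weighted = attention_weights * features            # broadcast element-wise product
context_vector = tf.reduce_sum(weighted, axis=1)   # weighted sum over time -> (batch, hidden)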
QUESTION
Now, I want image features to compute their similarity. We can easily get features using a pre-trained VGG19 model in TensorFlow. But the VGG19 model has many layers, and I don't know which layer I should use to get features. Which layer's output is appropriate for this problem?
...ANSWER
Answered 2019-Jul-06 at 06:56 The include_top=False option may be used because the last 3 layers (for that specific model) are fully connected layers, which are not typically good feature vectors. If the model directly outputs a feature vector, then you don't need it.
Most people use the last layer for transfer learning, but it may depend on your application. For example, Gatys et al. show that the first few layers of VGG are sensitive to the style of the image and later layers are sensitive to the content.
I would probably try all of them in a hyperparameter search and see which gives the best performance. If by image similarity you mean the similarity of objects contained inside, I would probably start with the last layer.
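A sketch of one way to do this with Keras' pretrained VGG19 (choosing block4_conv4 as the intermediate layer is only an example):

import tensorflow as tf

# include_top=False drops the fully connected classification layers;
# pooling='avg' turns the last convolutional block into a flat feature vector.
vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet', pooling='avg')

img = tf.random.uniform((1, 224, 224, 3)) * 255.0            # placeholder for an image batch
img = tf.keras.applications.vgg19.preprocess_input(img)
features = vgg(img)                                          # (1, 512) feature vector

# To try an earlier layer instead, build a model that stops at a named layer.
intermediate = tf.keras.Model(vgg.input, vgg.get_layer('block4_conv4').output)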
QUESTION
I'm new to PyTorch, and I have a doubt about the Image Captioning example code. In the DecoderRNN class, the LSTM is defined as,
...ANSWER
Answered 2018-Mar-05 at 06:00 You can analyze the shapes of all input and output tensors, and then it will become easier for you to understand what changes you need to make.
Let's say: captions = B x S, where S = sentence (caption) length.
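A small sketch of that shape analysis, loosely following the example's DecoderRNN (all sizes are made up):

import torch
import torch.nn as nn

B, S, E, H, V = 4, 12, 256, 512, 5000     # batch, caption length, embed, hidden, vocab

captions = torch.randint(0, V, (B, S))    # B x S word indices
features = torch.randn(B, E)              # image features from the encoder

embed = nn.Embedding(V, E)
lstm = nn.LSTM(E, H, batch_first=True)

embeddings = embed(captions)                                 # B x S x E
inputs = torch.cat([features.unsqueeze(1), embeddings], 1)   # B x (S+1) x E
hiddens, _ = lstm(inputs)                                    # B x (S+1) x H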
QUESTION
I'm trying convert a working image captioning CNN-LSTM network from TensorFlow to CNTK, and have what I think is a correctly trained model, but am having trouble figuring out how to extract predictions from the final trained CNTK model.
This is the general architecture I'm working with:
This is my CNTK model:
...ANSWER
Answered 2017-Dec-04 at 18:28 I think the function you are looking for is RecurrenceFrom(). Its documentation contains the following example:
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install image_captioning
Clone the COCO API repo into this project's directory:
Set up the COCO API (also described in the readme here):
Install PyTorch (0.4.0 recommended) and torchvision:
- Linux or Mac: conda install pytorch torchvision -c pytorch
- Windows: conda install -c peterjc123 pytorch-cpu and then pip install torchvision
Others:
Python 3
pycocotools
nltk
numpy
scikit-image
matplotlib
tqdm