Build a Generative AI based Automated Image Captioning and Visual QnA Engine
by kandikits Updated: Jul 25, 2023
1-Click Kit
Vision and language, two of the most fundamental methods for humans to perceive the world, are also two key cornerstones of AI. A longstanding goal of AI has been to build intelligent agents that can understand the world through vision and language inputs, and communicate with humans through natural language.
To achieve this goal, vision-language pre-training has emerged as an effective approach: deep neural network models are pre-trained on large-scale image-text datasets to improve performance on downstream vision-language tasks such as image-text retrieval, image captioning, and visual question answering.
Image captioning and visual question answering rely on Large Multimodal Models (LMMs). Multimodal learning seeks to allow computers to represent real-world objects and concepts using multiple data streams. This kit uses one such model: Salesforce's BLIP (Bootstrapping Language-Image Pre-training).
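As a rough illustration of how BLIP can be driven from the Hugging Face transformers library, here is a minimal sketch. The checkpoint names `Salesforce/blip-image-captioning-base` and `Salesforce/blip-vqa-base`, and the helper names `load_blip` and `run_blip`, are assumptions made for this example; the notebook in the kit may use different checkpoints and code.

```python
# Sketch only: checkpoint names and helper names are assumptions, not taken from the kit.
CHECKPOINTS = {
    "caption": "Salesforce/blip-image-captioning-base",
    "vqa": "Salesforce/blip-vqa-base",
}


def load_blip(task):
    """Load a (processor, model) pair for the 'caption' or 'vqa' task."""
    if task not in CHECKPOINTS:
        raise ValueError(f"unknown task: {task!r}")
    # Imported lazily because transformers is a heavy dependency.
    import transformers

    if task == "caption":
        model_cls = transformers.BlipForConditionalGeneration
    else:
        model_cls = transformers.BlipForQuestionAnswering
    checkpoint = CHECKPOINTS[task]
    processor = transformers.BlipProcessor.from_pretrained(checkpoint)
    model = model_cls.from_pretrained(checkpoint)
    return processor, model


def run_blip(processor, model, image, question=None, max_new_tokens=30):
    """Caption the image, or answer `question` about it when one is given."""
    if question:
        inputs = processor(images=image, text=question, return_tensors="pt")
    else:
        inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the first (and only) generated sequence back into text.
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()
```

With Pillow installed, a call such as `run_blip(*load_blip("caption"), image)` on an RGB `PIL.Image` would return a caption string, and passing `question="..."` together with the `"vqa"` pair would return a short answer instead. The first call downloads the model weights, so it needs network access.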
Deployment Information
For Windows OS,
- Download the zip file, extract it, and run the kit. Make sure the zip file is extracted before you run it.
- After successful installation of the kit, press 'Y' to run the kit and execute cells in the notebook.
- To run the kit manually, press 'N' and locate the zip file 'image_captioning.zip'.
- Extract the zip file and navigate to the directory 'image_captioning'.
- Open a command prompt in the extracted directory 'image_captioning' and run the command 'jupyter notebook'.
- Locate and open the 'Captioning_visualQnA.ipynb' notebook from the Jupyter Notebook browser window.
- Execute cells in the notebook.
For other Operating Systems,
- Click here to download the repository.
- Locate the zip file 'image_captioning.zip'.
- Extract the zip file and navigate to the directory 'image_captioning'.
- Open a terminal in the extracted directory 'image_captioning' and run the command 'jupyter notebook'.
- Locate and open the 'Captioning_visualQnA.ipynb' notebook from the Jupyter Notebook browser window.
- Execute cells in the notebook.
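The manual steps above can also be scripted. The following is a minimal sketch, assuming the archive has already been downloaded and that the `jupyter` command is on the PATH; the helper names `extract_kit` and `launch_notebook` are hypothetical, and the internal layout of the archive is not known, so the notebook is searched for recursively.

```python
import subprocess
import zipfile
from pathlib import Path

NOTEBOOK_NAME = "Captioning_visualQnA.ipynb"


def extract_kit(zip_path, dest_dir):
    """Extract the kit archive and return the path of the notebook inside it."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # The archive layout is an assumption, so search the whole tree.
    matches = sorted(dest.rglob(NOTEBOOK_NAME))
    if not matches:
        raise FileNotFoundError(f"{NOTEBOOK_NAME} not found under {dest}")
    return matches[0]


def launch_notebook(notebook_path):
    """Run 'jupyter notebook' in the notebook's directory (blocks until it exits)."""
    notebook_path = Path(notebook_path)
    subprocess.run(
        ["jupyter", "notebook", notebook_path.name],
        cwd=notebook_path.parent,
        check=True,
    )
```

For example, `launch_notebook(extract_kit("image_captioning.zip", "."))` would unpack the archive into the current directory and open the notebook in the browser.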
Click the button below to download the solution and follow the deployment information to begin set-up. This 1-click kit has all the required dependencies and resources to build your Image Captioning and Visual QnA Engine.
Libraries used in this solution
Development Environment
VSCode and Jupyter Notebook are used for development and debugging. Jupyter Notebook is a web-based interactive environment often used for experiments, whereas VSCode offers a typical IDE experience for developers.
Jupyter Notebook is used for our development.
jupyter by jupyter
Jupyter metapackage for installation, docs and chat
Python 14404 Version: Current License: Permissive (BSD-3-Clause)
Machine Learning
The machine learning libraries and frameworks listed here help in building state-of-the-art machine learning solutions.
transformers by huggingface
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Python 104111 Version: v4.30.2 License: Permissive (Apache-2.0)
pytorch by pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Python 67874 Version: v2.0.1 License: Others (Non-SPDX)
Kit Solution Source
App User Interface
gradio by gradio-app
Create UIs for your machine learning model in Python in 3 minutes
Python 18771 Version: v3.35.2 License: Permissive (Apache-2.0)
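To give a feel for how Gradio might front a captioning and visual QnA model, here is a small sketch. It is not the kit's actual interface: `make_handler` and `build_demo` are hypothetical helpers, and `caption_fn` and `vqa_fn` stand for any callables that take a PIL image (plus a question for `vqa_fn`) and return text.

```python
def make_handler(caption_fn, vqa_fn):
    """Return a callback that captions when the question is blank, else answers it."""

    def handle(image, question):
        if image is None:
            return "Please upload an image."
        question = (question or "").strip()
        if question:
            return vqa_fn(image, question)
        return caption_fn(image)

    return handle


def build_demo(caption_fn, vqa_fn):
    """Build a Gradio interface with an image input and an optional question box."""
    # Imported lazily so the handler logic can be used without gradio installed.
    import gradio as gr

    return gr.Interface(
        fn=make_handler(caption_fn, vqa_fn),
        inputs=[
            gr.Image(type="pil", label="Image"),
            gr.Textbox(label="Question (optional)"),
        ],
        outputs=gr.Textbox(label="Caption or answer"),
        title="Image Captioning and Visual QnA",
    )
```

Calling `build_demo(caption_fn, vqa_fn).launch()` would start a local web app. Keeping the routing logic in `make_handler` separate from the UI makes it easy to check without starting a server.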