In this video, Adam Lucek explores the concept of multimodal retrieval augmented generation (RAG) using images instead of just text. The video walks through setting up a multimodal RAG application that lets users query a collection of photos and use the retrieved images as context, built on CLIP models and ChromaDB's multimodal integrations.
First, Adam explains the concept of CLIP (Contrastive Language-Image Pre-training) models, which connect images and text by learning from large amounts of internet data. These models map images and captions into a shared embedding space where semantically similar images and texts sit close together, enabling robust zero-shot learning. Although OpenAI never released CLIP's training data, open-source reimplementations trained on public datasets, such as OpenCLIP, are available, and OpenCLIP is what this tutorial uses.
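To make the shared embedding space concrete, here is a minimal sketch of zero-shot matching with OpenCLIP. The checkpoint name, the image file `outfit.jpg`, and the candidate captions are illustrative assumptions, not taken from the video.

```python
import torch
import open_clip
from PIL import Image

# Load an OpenCLIP model; the checkpoint name is illustrative.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Embed one image and a few candidate captions into the shared space.
image = preprocess(Image.open("outfit.jpg")).unsqueeze(0)
text = tokenizer(["a red evening dress", "a leather jacket", "a pair of sneakers"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so cosine similarity reduces to a dot product.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # the highest score is the best zero-shot caption for the image
```

Because images and text land in the same space, the caption closest to the image embedding wins, with no task-specific training.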
The video then moves on to setting up the environment. Adam chooses the Fashionpedia dataset from Hugging Face, which contains thousands of fashion images. He demonstrates how to load and prepare this dataset, saving a subset of images to a local directory.
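A sketch of that preparation step is shown below, assuming the `detection-datasets/fashionpedia` dataset ID, the `train` split, an `image` column of PIL images, and a subset size of 500; all of these are assumptions rather than details confirmed by the video.

```python
import os
from datasets import load_dataset

# Load the Fashionpedia images from Hugging Face (dataset ID is an assumption).
dataset = load_dataset("detection-datasets/fashionpedia", split="train")

# Save a small subset of images to a local folder for indexing later.
output_dir = "fashion_images"
os.makedirs(output_dir, exist_ok=True)

for i, example in enumerate(dataset.select(range(500))):
    example["image"].save(os.path.join(output_dir, f"image_{i}.png"))
```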
Next, Adam sets up the vector database using ChromaDB, which now supports multimodal embeddings. He shows how to instantiate the ChromaDB client, load images into the database, and query the database using text inputs to retrieve relevant images.
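A minimal sketch of that setup follows, using ChromaDB's built-in OpenCLIP embedding function and image data loader; the database path, collection name, and query string are illustrative.

```python
import os
import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader

# Persistent client; the path and collection name are assumptions.
client = chromadb.PersistentClient(path="fashion_db")

collection = client.get_or_create_collection(
    name="fashion",
    embedding_function=OpenCLIPEmbeddingFunction(),  # CLIP embeddings for images and text
    data_loader=ImageLoader(),                       # lets the collection load images from URIs
)

# Index the locally saved images by file path (URI).
image_dir = "fashion_images"
image_paths = sorted(
    os.path.join(image_dir, f) for f in os.listdir(image_dir) if f.endswith(".png")
)
collection.add(
    ids=[str(i) for i in range(len(image_paths))],
    uris=image_paths,
)

# Query with text; CLIP places the query in the same space as the image embeddings.
results = collection.query(
    query_texts=["a floral summer dress"],
    n_results=3,
    include=["uris", "distances"],
)
print(results["uris"][0])  # file paths of the most relevant images
```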
With the retrieval part set up, Adam explains the augmented generation step using a vision-capable GPT-4 model. He uses LangChain to create a prompt that combines text with image data, encoding the images in Base64 so they can be passed to the language model.
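A sketch of that prompt construction is below, using LangChain's multimodal message format. The model name `gpt-4o` and the helper names `encode_image` and `answer_with_images` are assumptions for illustration.

```python
import base64
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

def encode_image(path: str) -> str:
    """Read an image file and return its Base64-encoded contents."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Model name is an assumption; any vision-capable GPT-4 variant works the same way.
llm = ChatOpenAI(model="gpt-4o")

def answer_with_images(question: str, image_paths: list[str]) -> str:
    # Build one message that mixes text with a Base64 data URL per retrieved image.
    content = [{"type": "text", "text": question}]
    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image(path)}"},
        })
    return llm.invoke([HumanMessage(content=content)]).content
```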
Finally, Adam puts everything together, demonstrating the complete flow from user query to image retrieval to generating a response using the vision model. He tests the setup with various queries, showing how the system retrieves relevant images and generates detailed responses based on the images and text input.
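Tying the pieces together, a sketch of the end-to-end flow might look like the following, reusing the `collection` and `answer_with_images` objects from the earlier snippets; the system instructions and sample query are illustrative.

```python
def multimodal_rag(query: str, n_results: int = 3) -> str:
    # 1. Retrieve the most relevant images for the text query from ChromaDB.
    results = collection.query(query_texts=[query], n_results=n_results, include=["uris"])
    retrieved_paths = results["uris"][0]

    # 2. Pass the query plus the retrieved images to the vision model.
    prompt = (
        "You are a fashion assistant. Answer the question using the attached images.\n"
        f"Question: {query}"
    )
    return answer_with_images(prompt, retrieved_paths)

print(multimodal_rag("What would pair well with a floral summer dress?"))
```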
The video concludes with a fully functioning multimodal RAG setup, showcasing the potential of combining text and image data for more robust AI applications.