Question Answering

Bringing It All Together

You've successfully loaded, split, and stored your documents in a searchable vectorstore. Now it's time for the final, most exciting step in the Retrieval Augmented Generation (RAG) workflow: Question Answering.

This is where we combine the power of our retriever with a Large Language Model (LLM) to generate answers grounded in the content of our documents.

The RAG Workflow: A Quick Recap

  1. Document Loading: Ingesting data from sources.

  2. Splitting: Breaking documents into smaller, manageable chunks.

  3. Storage & Retrieval: Embedding chunks and storing them in a vectorstore for efficient searching.

  4. Question Answering (Generation): Using a retriever and an LLM to generate an answer to a user's query.

Let's begin by loading our vector database and initializing our LLM.

from getpass import getpass
import os
from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI

# Set up the API Key and load the database
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")

persist_directory = 'learn/chroma/'
embedding = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

# Initialize the LLM
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro-latest")

print(f"Vectors in DB: {vectordb._collection.count()}")
# Expected Output: a number like 660
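
Before building the chain, it's worth a quick sanity check that retrieval returns sensible chunks. A minimal example, assuming the vectordb loaded above (the query string is just a placeholder):

docs = vectordb.similarity_search("What topics do these documents cover?", k=3)
print(len(docs))                   # 3, if the store holds at least three chunks
print(docs[0].page_content[:200])  # preview of the most relevant chunk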

The RetrievalQA Chain

The simplest way to perform question answering over your documents is with the RetrievalQA chain. This chain performs the core RAG logic for you:

  1. It takes your question.

  2. It uses the provided retriever to fetch the most relevant documents.

  3. It "stuffs" those documents, together with your question, into a single prompt (the default chain type is aptly named "stuff").

  4. It sends that prompt to the LLM to generate the final answer.

Here's how to set it up:
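
The sketch below assumes the vectordb and llm objects created earlier; the question is only a placeholder, so swap in one that fits your own documents. We also set return_source_documents=True, which we'll rely on shortly to inspect the retrieved chunks.

from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
)

question = "What are the major topics covered in these documents?"  # placeholder question
result = qa_chain.invoke({"query": question})
print(result["result"])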

Result:

The LLM synthesizes the retrieved documents to provide a comprehensive answer, drawing directly from the knowledge stored in our vector database.

Customizing with Prompts

What if you want the LLM to answer in a specific style, persona, or format? You can gain fine-grained control by providing a custom PromptTemplate. The template instructs the model on how to behave and formats the input, using the retrieved documents ({context}) and the user's query ({question}).
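
One possible template is sketched below; the exact wording is an example, and the only hard requirement is that it keeps the {context} and {question} placeholders for the chain to fill in. This version asks for a bulleted answer plus a further-reading suggestion:

from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know; don't try to make one up.
Give the answer as a short bulleted list, and finish by suggesting one topic for further reading.

{context}

Question: {question}
Helpful Answer:"""

qa_prompt = PromptTemplate.from_template(template)

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": qa_prompt},
)

result = qa_chain.invoke({"query": question})
print(result["result"])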

Result with Custom Prompt:

Now the answer follows our instructions, providing a structured response and suggesting further reading, just as we asked.

Verifying the Source

A key advantage of RAG is transparency. Since we set return_source_documents=True, we can inspect exactly which chunks of text the LLM used to generate its answer. This is invaluable for fact-checking and debugging.
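
A quick way to look at them, assuming the result dictionary returned by the chain above:

for doc in result["source_documents"]:
    print(doc.metadata)            # e.g. which file (and page) the chunk came from
    print(doc.page_content[:200])  # the first part of the chunk itself
    print("---")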

Each returned Document carries its page_content plus metadata (typically including the source it was loaded from), so you can trace every part of the answer back to the chunk it came from.

A Key Limitation: No Memory

The standard RetrievalQA chain is stateless. This means it treats every query as a brand-new question and has no memory of your previous interactions.

Notice how each question is independent:
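
For illustration, here is the kind of exchange that exposes the problem; the questions assume the corpus covers these topics, so adapt them to your own data:

first = qa_chain.invoke({"query": "What does the material say about the moon?"})
print(first["result"])

# A follow-up that only makes sense if the chain remembered the previous turn
followup = qa_chain.invoke({"query": "Is the same true for the sun?"})
print(followup["result"])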

The chain answers the question about the sun without any awareness that you just asked about the moon. This is fine for simple Q&A, but it falls short for building a conversational chatbot.

Keep Exploring 🚀

You've now built a complete Retrieval Augmented Generation (RAG) pipeline! You've gone from raw documents to an intelligent Q&A system. But this is just the beginning. The world of conversational AI is vast and constantly evolving.
