Vectorstores and Embeddings

Making Your Data Searchable

We're on a roll! We've successfully loaded our data and split it into manageable chunks. Now, we're about to transform that data into a format that our LLM can truly understand and use for intelligent retrieval. This is where embeddings and vectorstores come into play.

Recall the RAG Workflow

Let's quickly recap the Retrieval Augmented Generation (RAG) process:

  1. Document Loading: We've loaded data from various sources (PDFs, etc.).

  2. Document Splitting: We've split the data into smaller chunks.

  3. Embeddings: We'll convert text chunks into numerical representations.

  4. Vectorstore: We'll store these representations for efficient retrieval.

  5. Retrieval: We'll fetch the most relevant chunks based on a user's query.

  6. Generation: The LLM will use the retrieved information to generate a response.

You've already seen the first two steps in action with this code:

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("English_THE_CREATION_OF_THE_UNIVERSE.pdf")
docs = loader.load()

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_documents(docs)

print(f"Number of splits: {len(splits)}")  # Output: 330

Now, let's dive into embeddings!

What are Embeddings?

Imagine you want to teach a computer the meaning of words. You can't just feed it raw text; it needs numbers. Embeddings are a way to convert text (or any data) into a numerical representation called a vector.

  • Embedding: A technique that converts text (or other data) into a vector in a fixed-dimensional space, so that semantically similar items end up close to each other.

These vectors capture the semantic meaning of the text. Words or phrases with similar meanings will have similar vectors (i.e., they'll be close to each other in the vector space).

Generating Embeddings with LangChain

LangChain makes it easy to generate embeddings using various models. Here's how you can use Google's generative AI embedding model:

from getpass import getpass
import os

if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

sentence1 = "I like dogs"
sentence2 = "I like canines"
sentence3 = "The weather is ugly outside"

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

import numpy as np

# Calculate the dot product (similarity) between embeddings
print(f"Similarity between '{sentence1}' and '{sentence2}': {np.dot(embedding1, embedding2):.4f}")  # Output: ~0.98
print(f"Similarity between '{sentence1}' and '{sentence3}': {np.dot(embedding1, embedding3):.4f}")  # Output: ~0.80
print(f"Similarity between '{sentence2}' and '{sentence3}': {np.dot(embedding2, embedding3):.4f}")  # Output: ~0.80

As you can see, "dogs" and "canines" have a high similarity score (close to 1), while "weather" is less similar. This is because the embeddings capture the semantic relationship between the words.
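A quick aside: the plain dot product works as a similarity score here most likely because these embedding vectors are close to unit length. If you want a score that doesn't depend on vector length at all, you can compute cosine similarity explicitly. Here's a minimal sketch with NumPy (the dimension printed at the end is just a sanity check and depends on the embedding model):

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity = dot product divided by the product of the vector norms.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"cosine('{sentence1}', '{sentence2}'): {cosine_similarity(embedding1, embedding2):.4f}")
print(f"cosine('{sentence1}', '{sentence3}'): {cosine_similarity(embedding1, embedding3):.4f}")
print(f"Embedding dimension: {len(embedding1)}")  # model-dependent, e.g. 768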

What is a Vectorstore?

Now that we can convert text into vectors, we need a place to store them for efficient retrieval. A vectorstore is a database optimized for storing and searching vectors.

  • Vector Store: A data structure that stores embedding vectors (typically alongside the original text and metadata) and supports fast nearest-neighbor search over the vector space.

LangChain integrates with many vectorstores, including Chroma, Pinecone, FAISS, and more. Here's how to use Chroma:

# !pip install chromadb  # Install Chroma (if you haven't already)

from langchain.vectorstores import Chroma
import os

persist_directory = "learn/chroma/"
os.makedirs(persist_directory, exist_ok=True)

# !rm -rf ./learn/chroma  # Remove old database files if re-running (Linux/macOS; on Windows: rmdir /s /q learn\chroma) - commented out for safety

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory,
)

print(f"Number of vectors in the database: {vectordb._collection.count()}")

Similarity Search: Finding Relevant Information

With our data embedded and stored in a vectorstore, we can now perform similarity search. This allows us to find the text chunks that are most semantically similar to a given query.

question = "What is the universe?"
docs = vectordb.similarity_search(question, k=3)  # Retrieve the top 3 most similar chunks

print(f"Number of relevant documents found: {len(docs)}")
print("Retrieved document:\n")
print(docs[0].page_content)

We can also run a different query, retrieve more documents, and inspect their metadata:

question = "what is the blue planet?"
docs = vectordb.similarity_search(question, k=5)

for doc in docs:
    print(doc.metadata)
print(docs[4].page_content)
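
If you also want to know how close each match is, the Chroma wrapper has a similarity_search_with_score method that returns (document, score) pairs. For Chroma the score is a distance, so lower means more similar (a quick sketch using the same question):

results = vectordb.similarity_search_with_score(question, k=3)
for doc, score in results:
    # Lower distance = closer match.
    print(f"{score:.4f}  {doc.metadata}")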

Persistence

Because we set a persist_directory, Chroma writes the database to disk so we can reuse it later without recomputing the embeddings. Calling persist() makes sure everything is flushed to disk:

vectordb.persist()
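
Because the data lives on disk, a later session can reload the vectorstore instead of re-embedding all the chunks. Here's a minimal sketch; it assumes the same persist_directory and embedding model as above:

# Reload the persisted database in a new session (no re-embedding required).
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding,
)
print(vectordb._collection.count())  # should match the number of splits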

Why This Matters

By combining embeddings and vectorstores, we've created a system that can:

  • Understand Meaning: Capture the semantic meaning of text.

  • Search Efficiently: Quickly find relevant information in a large dataset.

  • Power RAG: Enable LLMs to access and utilize external knowledge.

This is a key step towards building truly intelligent applications that can answer questions, generate summaries, and engage in meaningful conversations based on real-world data.

Keep Exploring

LangChain offers many options for embedding models and vectorstores. Experiment with different combinations to see how they affect performance. As you continue your journey, you'll unlock even more powerful ways to work with LLMs and build amazing applications.
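
For example, swapping Chroma for FAISS is mostly a one-line change. This is just a sketch: it assumes the faiss-cpu package is installed, and note that a FAISS index lives in memory unless you save it explicitly:

# !pip install faiss-cpu

from langchain.vectorstores import FAISS

# Build an in-memory FAISS index from the same splits and embedding model.
faiss_db = FAISS.from_documents(splits, embedding)
docs = faiss_db.similarity_search("What is the universe?", k=3)
print(len(docs))

# FAISS is in-memory; save/load explicitly if you want persistence.
faiss_db.save_local("learn/faiss_index")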
