Document Splitting

LangChain document splitting

Alright everyone, now that we've successfully learned how to gather our knowledge – loading data from all sorts of places like PDFs, websites, and even YouTube videos – it's time to talk about how we prepare that knowledge for our intelligent language models. Think of it like this: we've collected all the ingredients for our AI chef, but some of them are way too big to work with directly. That's where Document Splitting comes in!


Imagine trying to stuff an entire book into a chatbot and expecting it to understand everything. It's like trying to swallow a watermelon whole – not going to work! Language models have a limit to how much text they can process at once. So, our job is to take these large documents and carefully break them down into smaller, more digestible chunks.

This process is absolutely crucial for building effective Retrieval Augmented Generation (RAG) systems. It ensures that our language model can actually process and understand the relevant pieces of information when you ask it a question.

Now, LangChain gives us some really smart tools for this chopping process. Let's meet a couple of the key players: RecursiveCharacterTextSplitter and CharacterTextSplitter.

The RecursiveCharacterTextSplitter is like our intelligent sous chef. It tries to break the text at natural boundaries, looking for paragraphs and sentences first. If a chunk is still too big, it recursively falls back to smaller delimiters. It's all about keeping the meaning intact as much as possible.

On the other hand, the CharacterTextSplitter is a bit more straightforward. It simply splits the text based on specific characters you tell it, like newlines or spaces. It's a more direct approach, but sometimes you need that level of control.

We also touched on TokenTextSplitter. This one is interesting because language models don't see text quite the way we do: they break it down into smaller units called "tokens." Splitting by tokens aligns with how the model actually processes text, which makes it easier to stay within the model's context limits.

You will see in the code how easy it is to use these tools. You just tell LangChain how big you want your chunks to be (chunk_size) and how much overlap you want between them (chunk_overlap). That overlap is super important because it helps the language model maintain context from one chunk to the next, so it doesn't miss important connections in the information.
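To make the overlap idea concrete, here is a toy fixed-window splitter. This is not LangChain's actual algorithm (which prefers natural boundaries like paragraphs and sentences); it's just a sketch of the sliding-window concept, showing how neighbouring chunks share text:

```python
def naive_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Toy splitter: each chunk starts chunk_size - chunk_overlap characters
    after the previous one, so neighbours share chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "The quick brown fox jumps over the lazy dog."
for chunk in naive_split(text, chunk_size=20, chunk_overlap=5):
    print(repr(chunk))
```

Notice that the last few characters of each chunk reappear at the start of the next one. That shared sliver is what lets the model carry context across chunk boundaries.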

And just like we applied our loading techniques to different data sources, we can easily apply these splitting techniques to the documents we've loaded. LangChain makes it seamless with the split_documents() method.

LangChain provides several tools to split text effectively. Here are a few key ones:

RecursiveCharacterTextSplitter

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,   # Maximum characters per chunk
    chunk_overlap=50, # Characters shared between consecutive chunks
)

long_text = """
Artificial intelligence (AI) is the imitation of human intelligence in machines that are programmed to think and act like humans.
The term can apply to any machine that exhibits characteristics associated with the human mind, such as learning and problem-solving.
The typical characteristics of artificial intelligence are its ability to reason and act in a way that best achieves specific goals.
A subset of artificial intelligence is machine learning (ML), which refers to the ability of computer programs to automatically learn and adapt from new data.
Deep learning techniques enable this automatic learning by exploiting vast amounts of unstructured data, such as text, images, or video.
"""

chunks = text_splitter.split_text(long_text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

CharacterTextSplitter

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\\n\\n",  # Split by double newlines (paragraphs)
    chunk_size=200,
    chunk_overlap=50,
)

long_text = """
Artificial intelligence (AI) is the imitation of human intelligence in machines that are programmed to think and act like humans.
The term can apply to any machine that exhibits characteristics associated with the human mind, such as learning and problem-solving.
The typical characteristics of artificial intelligence are its ability to reason and act in a way that best achieves specific goals.
A subset of artificial intelligence is machine learning (ML), which refers to the ability of computer programs to automatically learn and adapt from new data.
Deep learning techniques enable this automatic learning by exploiting vast amounts of unstructured data, such as text, images, or video.
"""

chunks = text_splitter.split_text(long_text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

TokenTextSplitter

from langchain.text_splitter import TokenTextSplitter

text1 = "Artificial intelligence (AI) refers to the simulation of human intelligence in machines"
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

chunks = text_splitter.split_text(text1)
print(chunks)

Applying Splitting to Documents

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("English_THE_CREATION_OF_THE_UNIVERSE.pdf")
docs = loader.load()

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
splits = text_splitter.split_documents(docs)

print(f"Total chunks: {len(splits)}")

# Example of accessing a split
# for i, split in enumerate(splits):
#     print(f"Chunk {i+1}:\n{split.page_content[:200]}...\n")

Document splitting is a critical step in preparing data for LLMs. By breaking down large documents into smaller, more manageable chunks, we enable LLMs to process information effectively and accurately. LangChain provides a suite of powerful text splitters to handle various document types and splitting strategies.

In the next section, we'll explore how to make these text chunks searchable using embeddings and vector stores.
