Document Loading

LangChain document loading

Now we learn how to load data from multiple sources. In the real world, data comes in many formats, such as PDF, DOCX, websites, etc.

See the image below to get a sense of the types of documents we can load with LangChain. My suggestion: the best structured data format is JSON, and the best unstructured format is PDF.

Always try JSON or PDF data first, because embedding models and vector stores tend to handle these formats better than others. In our experience, JSON and PDF give noticeably better semantic search results than other formats.
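
The rest of this section focuses on PDFs and web pages, but for reference, here is a minimal sketch of loading JSON with LangChain's JSONLoader. The items.json file and the .[].content jq expression are assumptions for illustration, and JSONLoader depends on the jq package (pip install jq):

from langchain.document_loaders import JSONLoader
loader = JSONLoader(
    file_path="items.json",   # hypothetical JSON file: a list of objects with a "content" field
    jq_schema=".[].content",  # jq expression selecting the text to load from each object
)
json_docs = loader.load()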

LangChain Document Loaders

So, let's write some code. First, we learn how to load a PDF.

PDF Loading

First, we install the two libraries below:

 pip install langchain
 pip install pypdf

Now we load a PDF with the code below:

from langchain.document_loaders import PyPDFLoader # Import the PDF loader
loader = PyPDFLoader("English_THE_CREATION_OF_THE_UNIVERSE.pdf")
pages = loader.load() # Load the PDF into the pages variable, one Document per page

You can find the PDF that I am using at this link: Universe PDF

We can load multiple PDFs with the code below:

from langchain.document_loaders import PyPDFLoader
# Load multiple PDFs
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("English_THE_CREATION_OF_THE_UNIVERSE.pdf"),
    PyPDFLoader("universe.pdf") # add as many documents as you need
]
docs = []
for loader in loaders:
    docs.extend(loader.load()) # collect the pages of every PDF in the docs variable

When you load a PDF, there are two things you need to know: each page becomes one Document, and each Document contains the page text (page_content) and metadata (such as the source file and page number).
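
For example, here is a quick sketch of inspecting the first page, assuming the pages variable from the single-PDF load above:

page = pages[0]
print(page.metadata)           # e.g. {'source': 'English_THE_CREATION_OF_THE_UNIVERSE.pdf', 'page': 0}
print(page.page_content[:100]) # the first 100 characters of the page text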

Now we can count how many documents we loaded from the PDF with the code below:

len(pages)
# Out: 210

We can print page 100 with the code below:

page100 = pages[100]
print(page100.page_content[0:1000])
''' 
out: mosphere of Earth is specially created to support life in a number of cru-
cial ways.
The atmosphere of Earth is composed of 77% nitrogen, 21% oxygen,
and 1% carbon dioxide. Let's start with the most important gas: oxygen.Oxygen is vitally important to life because it enters into most of the chem-ical reactions that release the energy that all complex life-forms require.
Carbon compounds react with oxygen. As a result of these reactions,
water, carbon dioxide, and energy are produced. Small "bundles" of ener-gy that are called ATP (adenosine triphosphate) and are used in living cellsare generated by these reactions. This is why we constantly need oxygento live and why we breathe to satisfy that need.
The interesting aspect of this business is that the percentage of oxygen
in the air we breathe is very precisely determined. Michael Denton writeson this point:
Could your atmosphere contain more oxygen and still support life? No!Oxygen is a very reactive element. Even the current percenta
'''

URL Loading

Now, we learn how to load text from a URL. 🔗

Think about copying text from a webpage by hand and then turning the copied text into a PDF so you can load it: that takes a lot of time. LangChain provides a URL loader that loads all of a page's text directly, with no PDF step. So now let's explore LangChain's URL loader.
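
Note that WebBaseLoader parses HTML with BeautifulSoup under the hood, so you may also need to install it:

pip install beautifulsoup4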

If you want to load all text from a URL, you can use the code below:

from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://www.deeplearning.ai/the-batch/issue-244/")
docs = loader.load() # Fetch the page and load its text as Documents

If you need to load all text from multiple URLs, you can use the code below:

# multiple URLs loading
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader(["https://www.deeplearning.ai/the-batch/issue-244/","https://www.deeplearning.ai/the-batch/issue-245/"])
docs = loader.load()
print(docs[0].page_content[:1000])
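
Each returned Document keeps its source URL in its metadata, which is handy when you load several pages at once. A small sketch, assuming the docs list from the code above:

for doc in docs:
    print(doc.metadata.get("source"), "->", len(doc.page_content), "characters")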

We have learned how to load a PDF and a URL. Now we learn how to load the text of an entire website from its base URL.

Recursive Website Loading

RecursiveUrlLoader in LangChain automates website crawling, extracting content from multiple pages by following links up to a specified depth. This allows you to efficiently gather information from entire websites, building comprehensive knowledge bases for your LLM applications. It handles link discovery and content extraction, simplifying the process of incorporating web data into your RAG pipelines.

First, you need to install the library below:

pip install html2text

Now you can load the entire website text with the code below:

from langchain.document_loaders import RecursiveUrlLoader
from langchain.document_transformers import Html2TextTransformer

url = "https://modelcontextprotocol.io/"
loader = RecursiveUrlLoader(
    url=url, max_depth=2, exclude_dirs=["excluded_directory"]  # follow links up to 2 levels deep
)
docs = loader.load()

# Convert each page's raw HTML into plain text
html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)

print(f"Loaded {len(docs_transformed)} documents from the website.")

At this point, you can load PDFs, load individual URLs, and crawl entire websites recursively.

YouTube Transcript Loading

LangChain's YoutubeLoader simplifies fetching and loading the transcript (closed captions) of a YouTube video. By providing a video URL, you can easily access the spoken content as LangChain Document objects, making it readily available for analysis, semantic search, and integration into your RAG pipelines. This unlocks the valuable information contained within video content.
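
Depending on your environment, you may first need to install the transcript dependency (and pytube, which add_video_info=True relies on):

pip install youtube-transcript-api pytube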

First, we load the transcript of a single video from its URL, as shown below:

from langchain_community.document_loaders import YoutubeLoader
loader = YoutubeLoader.from_youtube_url(
    "https://youtu.be/2SMaVPl7nV8", add_video_info=True)
data = loader.load()
print(data[0].page_content[:2000]) # first 2000 characters of the transcript

This code imports the YoutubeLoader, creates an instance for the specified YouTube URL, and loads the transcript into the data variable as a list of Documents. The final line prints the first 2000 characters of the transcript.

Multiple URL Loading and Saving to Files:

from langchain_community.document_loaders import YoutubeLoader

urls = [
    "https://www.youtube.com/watch?time_continue=1&v=CO6tSmZgfH4",
    "https://www.youtube.com/watch?v=eIBPZfls2sA",
]
for url in urls:
    loader = YoutubeLoader.from_youtube_url(url)
    docs = loader.load()
    # Extract the video ID (the value of the "v" parameter) to use as the filename
    name = url.split("v=")[1].split("&")[0]
    # The with block closes the file automatically
    with open(name + ".txt", "w", encoding="utf-8") as f1:
        for doc in docs:
            f1.write(doc.page_content)
print(doc)

This part iterates through a list of YouTube URLs. For each URL, it loads the transcript, extracts the video ID from the URL to create a filename, and then saves the entire transcript content into a separate .txt file. The final print(doc) will output the last loaded Document object.
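
As a quick sanity check (a sketch, assuming the loop above ran and that the first video's ID is CO6tSmZgfH4), you can read a saved transcript back:

with open("CO6tSmZgfH4.txt", encoding="utf-8") as f:
    text = f.read()
print(len(text), "characters saved")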

With these loaders in your toolkit, you're well on your way to building intelligent systems that can reason over vast amounts of information. You've seen how LangChain abstracts away the complexities of handling different data formats, allowing you to focus on the core logic of your applications.

But remember, this is just the beginning! LangChain boasts an even wider array of document loaders, ready to tackle almost any data source you can imagine. Think about:

  • Loading from various file types: .csv, .docx, .epub, .markdown, .odt, .xlsx, and many more! (See the CSVLoader sketch after this list.)

  • Integrating with databases: Load data directly from SQL databases, NoSQL stores, and graph databases.
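
For instance, here is a minimal sketch of the first bullet using CSVLoader. The data.csv file is a hypothetical example; each row of the CSV becomes one Document:

from langchain.document_loaders import CSVLoader

loader = CSVLoader(file_path="data.csv")  # hypothetical CSV file
csv_docs = loader.load()                  # one Document per row
print(len(csv_docs))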

The possibilities are truly limitless! Each loader is designed to seamlessly integrate with the rest of the LangChain ecosystem, making it easy to transform and process your data for your specific needs.

Keep exploring the LangChain documentation and experimenting with these different loaders. The more data sources you can effectively connect to, the more knowledgeable and capable your RAG applications will become. You're building the foundation for truly intelligent and context-aware systems. Keep up the great work, and let's continue to unlock the power of language models with the right data!
