Document Loading

LangChain document loading

Now we learn how to load data from multiple sources. In the real world, data comes in many formats, such as PDF, Docx, websites, and more.

See the image below to get a sense of the document types we can load with LangChain. My suggestion: the best format for structured data is JSON, and for unstructured data it is PDF.

Try JSON or PDF first, because embedding models and vector stores understand these formats better than others.

In our semantic-search experiments, we found that JSON and PDF performed very well compared to other formats.

LangChain Document Loaders

So, let's write some code. First, we learn how to load a PDF.

PDF Loading

First, we install the two libraries below:
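A minimal setup (assuming pypdf as the PDF parsing backend, which is what LangChain's PyPDFLoader uses):

```bash
pip install langchain-community pypdf
```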

Now we load a PDF with the code below:
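A minimal sketch, assuming the file is saved locally as universe.pdf (the filename is a placeholder):

```python
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF; each page becomes one Document
loader = PyPDFLoader("universe.pdf")
docs = loader.load()
```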

You can find the PDF that I am using at this link: Universe PDF

We can load multiple PDFs with the code below:
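One way to do it, looping over a list of (placeholder) file paths and collecting every page into a single list:

```python
from langchain_community.document_loaders import PyPDFLoader

# Placeholder filenames — replace with your own PDFs
pdf_files = ["universe.pdf", "galaxies.pdf"]

docs = []
for path in pdf_files:
    loader = PyPDFLoader(path)
    docs.extend(loader.load())  # append every page from each PDF
```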

When you load a PDF, there are two things to know: each page becomes its own Document, and each Document contains text (page_content) and metadata.
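For example, inspecting the first loaded page (continuing from the docs list above):

```python
first_page = docs[0]
print(first_page.page_content[:200])  # the page's text
print(first_page.metadata)            # e.g. {'source': 'universe.pdf', 'page': 0}
```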

Now we can count how many documents we loaded from the PDF with the code below:
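Since each page is one Document, counting documents is just:

```python
print(len(docs))  # number of pages loaded across all PDFs
```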

We can print page 100 with the code below:
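A sketch (pages are zero-indexed, so the 100th page sits at index 99):

```python
page = docs[99]
print(page.page_content)  # the text of page 100
print(page.metadata)      # its source file and page number
```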

URL Loading

Now, we learn how to load text from a URL. 🔗

Think about the text you see on a webpage. If you copied it from the website and made a PDF out of it by hand, it would take a lot of time. LangChain provides a URL loader that loads all of a page's text directly, with no PDF step in between. So let's explore LangChain's URL loader.

If you want to load all text from a URL, you can use the code below:
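A minimal sketch using WebBaseLoader, which relies on the beautifulsoup4 package under the hood (the URL is a placeholder — swap in the page you want):

```python
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://python.langchain.com/docs/introduction/")
docs = loader.load()
print(docs[0].page_content[:500])  # preview the first 500 characters
```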

If you need to load all text from multiple URLs, you can use the code below:
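WebBaseLoader also accepts a list of URLs (placeholders here), returning one Document per page:

```python
from langchain_community.document_loaders import WebBaseLoader

urls = [
    "https://python.langchain.com/docs/introduction/",
    "https://python.langchain.com/docs/tutorials/",
]
loader = WebBaseLoader(urls)
docs = loader.load()
print(len(docs))  # one Document per URL
```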

We have learned how to load a PDF and a URL. Now we learn how to load an entire website's text from its base URL.

Recursive Website Loading

RecursiveUrlLoader in LangChain automates website crawling, extracting content from multiple pages by following links up to a specified depth. This allows you to efficiently gather information from entire websites, building comprehensive knowledge bases for your LLM applications. It handles link discovery and content extraction, simplifying the process of incorporating web data into your RAG pipelines.

First, you need to install the library below:
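A likely candidate, assuming BeautifulSoup is used to extract readable text from the crawled HTML (langchain-community provides the loader itself):

```bash
pip install beautifulsoup4
```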

Now you can load the entire website text with the code below:
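A minimal sketch (the base URL and max_depth are placeholders — tune them for the site you're crawling):

```python
from bs4 import BeautifulSoup
from langchain_community.document_loaders import RecursiveUrlLoader

def extract_text(html: str) -> str:
    # Strip HTML tags, keeping only the readable text
    return BeautifulSoup(html, "html.parser").get_text()

loader = RecursiveUrlLoader(
    "https://python.langchain.com/docs/",  # base URL to start crawling from
    max_depth=2,                           # follow links up to 2 levels deep
    extractor=extract_text,
)
docs = loader.load()
print(len(docs))  # one Document per crawled page
```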

You can now load PDFs, individual URLs, and entire websites recursively.

YouTube Transcript Loading

LangChain's YoutubeLoader simplifies fetching and loading the transcript (closed captions) of YouTube videos. By providing a video URL, you can easily access the spoken content as LangChain Document objects, making it readily available for analysis, semantic search, and integration into your RAG pipelines, thus unlocking the valuable information contained within video content.

First, we fetch the transcript of a video from its URL:
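A sketch, assuming the youtube-transcript-api package is installed and using a placeholder video URL:

```python
from langchain_community.document_loaders import YoutubeLoader

# Placeholder URL — replace with the video you want to transcribe
url = "https://www.youtube.com/watch?v=VIDEO_ID"
loader = YoutubeLoader.from_youtube_url(url)
data = loader.load()

# Preview the first 2000 characters of the transcript
print(data[0].page_content[:2000])
```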

This code imports YoutubeLoader, creates a loader instance for the specified YouTube URL, and loads the transcript into the data variable. The print(data[0].page_content[:2000]) line displays the first 2000 characters of the loaded transcript.

Multiple URL Loading and Saving to Files
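One way to implement it (the URLs are placeholders; the video ID is taken from everything after v=):

```python
from langchain_community.document_loaders import YoutubeLoader

# Placeholder URLs — replace with your own videos
urls = [
    "https://www.youtube.com/watch?v=VIDEO_ID_1",
    "https://www.youtube.com/watch?v=VIDEO_ID_2",
]

for url in urls:
    loader = YoutubeLoader.from_youtube_url(url)
    docs = loader.load()
    video_id = url.split("v=")[-1]  # use the video ID as the filename
    with open(f"{video_id}.txt", "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(doc.page_content)

print(doc)  # the last loaded Document object
```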

This part iterates through a list of YouTube URLs. For each URL, it loads the transcript, extracts the video ID from the URL to create a filename, and then saves the entire transcript content into a separate .txt file. The final print(doc) will output the last loaded Document object.

With these loaders in your toolkit, you're well on your way to building intelligent systems that can reason over vast amounts of information. You've seen how LangChain abstracts away the complexities of handling different data formats, allowing you to focus on the core logic of your applications.

But remember, this is just the beginning! LangChain boasts an even wider array of document loaders, ready to tackle almost any data source you can imagine. Think about:

  • Loading from various file types: .csv, .docx, .epub, .markdown, .odt, .xlsx, and many more!

  • Integrating with databases: Load data directly from SQL databases, NoSQL stores, and graph databases.

The possibilities are truly limitless! Each loader is designed to seamlessly integrate with the rest of the LangChain ecosystem, making it easy to transform and process your data for your specific needs.

Keep exploring the LangChain documentation and experimenting with these different loaders. The more data sources you can effectively connect to, the more knowledgeable and capable your RAG applications will become. You're building the foundation for truly intelligent and context-aware systems. Keep up the great work, and let's continue to unlock the power of language models with the right data!
