
Overview

The PyPDF loader extracts text from PDF documents using the pypdf library. It reads the digital text layer directly and can apply OCR to scanned documents. It is best suited to standard PDF files that have a text layer.

Loader Class: PdfLoader
Config Class: PdfLoaderConfig

Dependencies

pip install "upsonic[loaders]"
For OCR functionality:
pip install rapidocr-onnxruntime
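
Because OCR depends on an optional package, it can help to check for it before choosing an extraction mode. The sketch below is illustrative only: the helper names ocr_available and pick_extraction_mode are hypothetical and not part of Upsonic, and falling back to "text_only" when OCR is missing is an assumed policy, not documented loader behavior. It relies on the rapidocr-onnxruntime distribution being importable as rapidocr_onnxruntime.

```python
import importlib.util


def ocr_available() -> bool:
    """Return True if the optional OCR package can be imported."""
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec("rapidocr_onnxruntime") is not None


def pick_extraction_mode() -> str:
    """Choose an extraction_mode value based on OCR availability.

    Assumption: "hybrid" needs the OCR package, so "text_only" is used
    as a fallback when it is missing.
    """
    return "hybrid" if ocr_available() else "text_only"


if __name__ == "__main__":
    print(pick_extraction_mode())
```

The returned string can then be passed as extraction_mode when building PdfLoaderConfig.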

Examples

from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders import PdfLoader, PdfLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter import RecursiveChunker, RecursiveChunkingConfig
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure loader
loader_config = PdfLoaderConfig(
    extraction_mode="hybrid",
    start_page=1,
    end_page=10
)
loader = PdfLoader(loader_config)

# Setup KnowledgeBase
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
chunker = RecursiveChunker(RecursiveChunkingConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="pdf_docs",
    vector_size=1536,
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["document.pdf"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[chunker]
)

# Query with Agent
agent = Agent("openai/gpt-4o")
task = Task("What is the main topic?", context=[kb])
result = agent.do(task)
print(result)

Parameters

Parameter                 Type                                  Description                               Default       Source
encoding                  str | None                            File encoding (auto-detected if None)     None          Base
error_handling            "ignore" | "warn" | "raise"           How to handle loading errors              "warn"        Base
include_metadata          bool                                  Whether to include file metadata          True          Base
custom_metadata           dict                                  Additional metadata to include            (not listed)  Base
max_file_size             int | None                            Maximum file size in bytes                None          Base
skip_empty_content        bool                                  Skip documents with empty content         True          Base
extraction_mode           "hybrid" | "text_only" | "ocr_only"   Content extraction strategy               "hybrid"      Specific
start_page                int | None                            First page to process (1-indexed)         None          Specific
end_page                  int | None                            Last page to process (inclusive)          None          Specific
clean_page_numbers        bool                                  Remove page numbers from headers/footers  True          Specific
page_num_start_format     str | None                            Format string for page start markers      None          Specific
page_num_end_format       str | None                            Format string for page end markers        None          Specific
extra_whitespace_removal  bool                                  Normalize whitespace                      True          Specific
pdf_password              str | None                            Password for encrypted PDFs               None          Specific
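
Since start_page is 1-indexed and end_page is inclusive, the range is easy to get off by one when mapping it to 0-based page indices (as used by pypdf's page list). The following is a minimal sketch of that mapping, not the loader's actual implementation. The function select_page_indices is hypothetical, and two behaviors are assumptions: None is taken to mean "from the first page" or "to the last page", and an end_page beyond the document length is clamped.

```python
from typing import List, Optional


def select_page_indices(total_pages: int,
                        start_page: Optional[int] = None,
                        end_page: Optional[int] = None) -> List[int]:
    """Map a 1-indexed, inclusive page range to 0-based page indices."""
    # Assumption: None means "no bound" on that side of the range.
    first = 1 if start_page is None else start_page
    last = total_pages if end_page is None else end_page

    if first < 1:
        raise ValueError("start_page is 1-indexed and must be >= 1")
    if last < first:
        raise ValueError("end_page must be >= start_page")

    # Assumption: a range running past the end of the file is clamped.
    last = min(last, total_pages)

    # Page N (1-indexed) lives at index N - 1; range() excludes its stop
    # value, so stopping at `last` keeps the end page inclusive.
    return list(range(first - 1, last))


if __name__ == "__main__":
    # start_page=1, end_page=10 on a 12-page file selects the first ten pages.
    print(select_page_indices(12, start_page=1, end_page=10))
```

With start_page=1 and end_page=10, as in the example configuration, ten pages are selected (indices 0 through 9).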