Overview

The PyMuPDF loader provides high-performance PDF processing with structured text extraction, image handling, and annotation extraction, which makes it well suited to large-scale document processing.

Loader Class: PyMuPDFLoader
Config Class: PyMuPDFLoaderConfig

Dependencies

pip install "upsonic[loaders]"
For OCR functionality:
pip install rapidocr-onnxruntime
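
Because OCR is an optional extra, it can help to check whether it is installed before choosing an extraction mode. The following is a minimal sketch using only the standard library; `has_module` and `pick_extraction_mode` are hypothetical helpers, not part of upsonic, and whether the loader falls back on its own when OCR is missing is not stated here.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


def pick_extraction_mode(ocr_available: bool) -> str:
    """Suggest an extraction_mode value: "hybrid" with OCR, "text_only" without."""
    return "hybrid" if ocr_available else "text_only"


# rapidocr_onnxruntime is the import name of the rapidocr-onnxruntime package.
ocr_available = has_module("rapidocr_onnxruntime")
mode = pick_extraction_mode(ocr_available)
print(f"OCR available: {ocr_available}; suggested extraction_mode: {mode}")
```

The suggested value can then be passed as extraction_mode when building PyMuPDFLoaderConfig.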

Examples

from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders import PyMuPDFLoader, PyMuPDFLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter import RecursiveChunker, RecursiveChunkingConfig
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure loader
loader_config = PyMuPDFLoaderConfig(
    extraction_mode="hybrid",
    text_extraction_method="dict",
    preserve_layout=True
)
loader = PyMuPDFLoader(loader_config)

# Setup KnowledgeBase
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
chunker = RecursiveChunker(RecursiveChunkingConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="pymupdf_docs",
    vector_size=1536,
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["document.pdf"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[chunker]
)

# Query with Agent
agent = Agent("openai/gpt-4o")
task = Task("Summarize the document", context=[kb])
result = agent.do(task)
print(result)

Parameters

Parameter                 Type                                  Description                               Default   Source
------------------------  ------------------------------------  ----------------------------------------  --------  --------
encoding                  str | None                            File encoding (auto-detected if None)     None      Base
error_handling            "ignore" | "warn" | "raise"           How to handle loading errors              "warn"    Base
include_metadata          bool                                  Whether to include file metadata          True      Base
custom_metadata           dict                                  Additional metadata to include                      Base
max_file_size             int | None                            Maximum file size in bytes                None      Base
skip_empty_content        bool                                  Skip documents with empty content         True      Base
extraction_mode           "hybrid" | "text_only" | "ocr_only"   Content extraction strategy               "hybrid"  Specific
start_page                int | None                            First page to process (1-indexed)         None      Specific
end_page                  int | None                            Last page to process (inclusive)          None      Specific
clean_page_numbers        bool                                  Remove page numbers from headers/footers  True      Specific
page_num_start_format     str | None                            Format string for page start markers      None      Specific
page_num_end_format       str | None                            Format string for page end markers        None      Specific
extra_whitespace_removal  bool                                  Normalize whitespace                      True      Specific
pdf_password              str | None                            Password for encrypted PDFs               None      Specific
text_extraction_method    "text" | "dict" | "html" | "xml"      Text extraction method                    "text"    Specific
include_images            bool                                  Extract and include image information     False     Specific
image_dpi                 int                                   DPI for image rendering (72-600)          150       Specific
preserve_layout           bool                                  Preserve text layout and positioning      True      Specific
extract_annotations       bool                                  Extract annotations and comments          False     Specific
annotation_format         "text" | "json"                       Format for extracted annotations          "text"    Specific
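
The documented constraints (the allowed values for extraction_mode and text_extraction_method, the 72-600 range for image_dpi, and the 1-indexed, inclusive page bounds) can be checked before a config is built. The sketch below is illustrative only: `check_loader_options` and `zero_based_pages` are hypothetical helpers, not part of upsonic, and clamping end_page to the document length is an assumption rather than documented behavior.

```python
from typing import Optional

# Allowed values copied from the parameter list.
EXTRACTION_MODES = {"hybrid", "text_only", "ocr_only"}
TEXT_METHODS = {"text", "dict", "html", "xml"}


def check_loader_options(options: dict) -> dict:
    """Raise ValueError if an option falls outside the documented values."""
    mode = options.get("extraction_mode", "hybrid")
    if mode not in EXTRACTION_MODES:
        raise ValueError(f"unknown extraction_mode: {mode!r}")
    method = options.get("text_extraction_method", "text")
    if method not in TEXT_METHODS:
        raise ValueError(f"unknown text_extraction_method: {method!r}")
    dpi = options.get("image_dpi", 150)
    if not 72 <= dpi <= 600:
        raise ValueError(f"image_dpi must be within 72-600, got {dpi}")
    start = options.get("start_page")
    end = options.get("end_page")
    if start is not None and start < 1:
        raise ValueError("start_page is 1-indexed and must be >= 1")
    if start is not None and end is not None and end < start:
        raise ValueError("end_page must not precede start_page")
    return options


def zero_based_pages(
    start_page: Optional[int], end_page: Optional[int], page_count: int
) -> range:
    """Map 1-indexed, inclusive page bounds to zero-based page indices."""
    first = 0 if start_page is None else start_page - 1
    # Assumption: an end_page past the last page is clamped to the document length.
    last = page_count if end_page is None else min(end_page, page_count)
    return range(first, last)


options = check_loader_options(
    {"extraction_mode": "text_only", "start_page": 2, "end_page": 4, "image_dpi": 200}
)
print(list(zero_based_pages(options["start_page"], options["end_page"], page_count=10)))
# prints [1, 2, 3]
```

Since the loader config is built from keyword arguments, a validated dict like this could presumably be passed as PyMuPDFLoaderConfig(**options).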