Overview

The PdfPlumber loader extracts structured content from PDFs and is best suited to documents with tables and complex layouts. It provides superior table detection and preserves document structure better than standard PDF loaders.

Loader Class: PdfPlumberLoader
Config Class: PdfPlumberLoaderConfig

Dependencies

pip install "upsonic[loaders]"
For OCR functionality:
pip install rapidocr-onnxruntime

Examples

from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders import PdfPlumberLoader, PdfPlumberLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter import RecursiveChunker, RecursiveChunkingConfig
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure loader with table extraction
loader_config = PdfPlumberLoaderConfig(
    extraction_mode="hybrid",
    extract_tables=True,
    table_format="markdown"
)
loader = PdfPlumberLoader(loader_config)

# Setup KnowledgeBase
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
chunker = RecursiveChunker(RecursiveChunkingConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="pdf_tables",
    vector_size=1536,
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["report.pdf"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[chunker]
)

# Query with Agent
agent = Agent("openai/gpt-4o")
task = Task("Extract all table data", context=[kb])
result = agent.do(task)
print(result)
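
The example above requests tables with table_format="markdown". As a rough illustration of what a markdown-rendered table looks like, the sketch below converts rows (lists of cells, where an empty cell may be None) into a markdown table. The helper name rows_to_markdown is hypothetical and is not part of Upsonic; the loader's actual output may differ in spacing, escaping, or header handling.

```python
from typing import Optional, Sequence


def rows_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render table rows as a markdown table, treating the first row as the header.

    Hypothetical helper for illustration only; not the loader's implementation.
    """
    if not rows:
        return ""

    def clean(cell: Optional[str]) -> str:
        # Empty cells become "", internal whitespace collapses, pipes are escaped.
        text = "" if cell is None else str(cell)
        return " ".join(text.split()).replace("|", "\\|")

    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of columns.
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, body = padded[0], padded[1:]

    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


rows = [["Quarter", "Revenue"], ["Q1", "1,200"], ["Q2", None]]
print(rows_to_markdown(rows))
```

Keeping tables as markdown text means each chunk that reaches the embedding step still carries its column headers alongside the values, which is presumably why the example pairs it with a KnowledgeBase.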

Parameters

| Parameter | Type | Description | Default | Source |
| --- | --- | --- | --- | --- |
| encoding | str \| None | File encoding (auto-detected if None) | None | Base |
| error_handling | "ignore" \| "warn" \| "raise" | How to handle loading errors | "warn" | Base |
| include_metadata | bool | Whether to include file metadata | True | Base |
| custom_metadata | dict | Additional metadata to include | | Base |
| max_file_size | int \| None | Maximum file size in bytes | None | Base |
| skip_empty_content | bool | Skip documents with empty content | True | Base |
| extraction_mode | "hybrid" \| "text_only" \| "ocr_only" | Content extraction strategy | "hybrid" | Specific |
| start_page | int \| None | First page to process (1-indexed) | None | Specific |
| end_page | int \| None | Last page to process (inclusive) | None | Specific |
| clean_page_numbers | bool | Remove page numbers from headers/footers | True | Specific |
| page_num_start_format | str \| None | Format string for page start markers | None | Specific |
| page_num_end_format | str \| None | Format string for page end markers | None | Specific |
| extra_whitespace_removal | bool | Normalize whitespace | True | Specific |
| pdf_password | str \| None | Password for encrypted PDFs | None | Specific |
| extract_tables | bool | Extract and include tables | True | Specific |
| table_format | "text" \| "markdown" \| "csv" \| "grid" | Format for extracted tables | "markdown" | Specific |
| table_settings | dict | Advanced table detection settings | Default dict | Specific |
| extract_images | bool | Extract image information | False | Specific |
| layout_mode | "default" \| "layout" \| "simple" | Text extraction layout mode | "layout" | Specific |
| use_text_flow | bool | Use text flow analysis | True | Specific |
| char_margin | float | Minimum distance between characters | 3.0 | Specific |
| line_margin | float | Minimum distance between lines | 0.5 | Specific |
| word_margin | float | Minimum distance between words | 0.1 | Specific |
| extract_page_dimensions | bool | Include page dimensions in metadata | False | Specific |
| crop_box | tuple[float, float, float, float] \| None | Crop box (x0, y0, x1, y1) | None | Specific |
| extract_annotations | bool | Extract annotations and hyperlinks | False | Specific |
| keep_blank_chars | bool | Preserve blank characters | False | Specific |
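
The start_page and end_page parameters are described as 1-indexed and inclusive, which differs from Python's zero-based, half-open slicing. The sketch below shows one way that range could map to zero-based page indices. The function select_pages is hypothetical, and two behaviors are assumptions rather than documented facts: that None means "from the first page" or "through the last page", and that an end_page beyond the document is clamped rather than rejected.

```python
from typing import Optional


def select_pages(
    total_pages: int,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
) -> range:
    """Return zero-based page indices for a 1-indexed, inclusive page range.

    Hypothetical helper for illustration; the loader may validate differently.
    """
    # Assumption: None for start_page means "start at the first page".
    first = 1 if start_page is None else start_page
    if first < 1:
        raise ValueError("start_page is 1-indexed and must be >= 1")
    if end_page is not None and end_page < first:
        raise ValueError("end_page must be >= start_page")

    # Assumption: None means "through the last page"; larger values are clamped.
    last = total_pages if end_page is None else min(end_page, total_pages)

    # 1-indexed inclusive [first, last] -> zero-based half-open [first - 1, last)
    return range(first - 1, last)


print(list(select_pages(10, start_page=2, end_page=4)))  # [1, 2, 3]
print(list(select_pages(3)))                              # [0, 1, 2]
```

Under these assumptions, start_page=2 with end_page=4 covers three pages (2, 3, and 4), not two.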