PdfPlumber Loader

Overview

The PdfPlumber loader extracts structured content from PDFs and is best suited to tables and complex layouts. It focuses on table detection and preserves document structure better than the standard PDF loaders.

Loader Class: PdfPlumberLoader
Config Class: PdfPlumberLoaderConfig

Install

Install the PdfPlumber loader optional dependency group:

uv pip install "upsonic[pdfplumber-loader]"

Examples

from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders.pdfplumber import PdfPlumberLoader
from upsonic.loaders.config import PdfPlumberLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter.recursive import RecursiveChunker, RecursiveChunkingConfig
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure loader with table extraction
loader_config = PdfPlumberLoaderConfig(
    extraction_mode="hybrid",
    extract_tables=True,
    table_format="markdown"
)
loader = PdfPlumberLoader(loader_config)

# Setup KnowledgeBase
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
chunker = RecursiveChunker(RecursiveChunkingConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="pdf_tables",
    vector_size=1536,
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["report.pdf"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[chunker]
)

# Query with Agent
agent = Agent("anthropic/claude-sonnet-4-5")
task = Task("Extract all table data", context=[kb])
result = agent.do(task)
print(result)
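
The example above sets `table_format="markdown"`. As a rough illustration of what that rendering step involves, here is a minimal, standalone sketch that turns table rows (a list of lists, with `None` for empty cells, which is the shape pdfplumber's table extraction typically returns) into a markdown table. The function name `rows_to_markdown` is hypothetical and is not part of Upsonic; the loader's actual output may differ in spacing or escaping.

```python
from typing import Optional, Sequence


def rows_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render table rows as a markdown table; the first row is the header.

    Hypothetical helper for illustration only, not the loader's own code.
    """
    if not rows:
        return ""

    def clean(cell: Optional[str]) -> str:
        # Empty cells arrive as None; newlines and pipes would break a row.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of columns.
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]

    header, *body = padded
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


sample = [["Region", "Q1", "Q2"], ["North", "120", None], ["South", "95", "110"]]
print(rows_to_markdown(sample))
```

Keeping tables in markdown form means the header row stays attached to the values in the text itself, which tends to help when the content is later chunked and embedded.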

Parameters

| Parameter | Type | Description | Default | Source |
| --- | --- | --- | --- | --- |
| `encoding` | `str \| None` | File encoding (auto-detected if None) | None | Base |
| `error_handling` | `"ignore" \| "warn" \| "raise"` | How to handle loading errors | "warn" | Base |
| `include_metadata` | `bool` | Whether to include file metadata | True | Base |
| `custom_metadata` | `dict` | Additional metadata to include |  | Base |
| `max_file_size` | `int \| None` | Maximum file size in bytes | None | Base |
| `skip_empty_content` | `bool` | Skip documents with empty content | True | Base |
| `extraction_mode` | `"hybrid" \| "text_only" \| "ocr_only"` | Content extraction strategy | "hybrid" | Specific |
| `start_page` | `int \| None` | First page to process (1-indexed) | None | Specific |
| `end_page` | `int \| None` | Last page to process (inclusive) | None | Specific |
| `clean_page_numbers` | `bool` | Remove page numbers from headers/footers | True | Specific |
| `page_num_start_format` | `str \| None` | Format string for page start markers | None | Specific |
| `page_num_end_format` | `str \| None` | Format string for page end markers | None | Specific |
| `extra_whitespace_removal` | `bool` | Normalize whitespace | True | Specific |
| `pdf_password` | `str \| None` | Password for encrypted PDFs | None | Specific |
| `extract_tables` | `bool` | Extract and include tables | True | Specific |
| `table_format` | `"text" \| "markdown" \| "csv" \| "grid"` | Format for extracted tables | "markdown" | Specific |
| `table_settings` | `dict` | Advanced table detection settings | Default dict | Specific |
| `extract_images` | `bool` | Extract image information | False | Specific |
| `layout_mode` | `"default" \| "layout" \| "simple"` | Text extraction layout mode | "layout" | Specific |
| `use_text_flow` | `bool` | Use text flow analysis | True | Specific |
| `char_margin` | `float` | Minimum distance between characters | 3.0 | Specific |
| `line_margin` | `float` | Minimum distance between lines | 0.5 | Specific |
| `word_margin` | `float` | Minimum distance between words | 0.1 | Specific |
| `extract_page_dimensions` | `bool` | Include page dimensions in metadata | False | Specific |
| `crop_box` | `tuple[float, float, float, float] \| None` | Crop box (x0, y0, x1, y1) | None | Specific |
| `extract_annotations` | `bool` | Extract annotations and hyperlinks | False | Specific |
| `keep_blank_chars` | `bool` | Preserve blank characters | False | Specific |
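
The table describes `start_page` as 1-indexed and `end_page` as inclusive, which is easy to get off by one when working with 0-based page lists. The sketch below shows one way such a range could map onto 0-based indices. The helper `page_indices` is hypothetical, and its handling of edge cases (treating `end_page` as 1-indexed, clamping it to the document length, returning an empty range when the bounds cross) is an assumption rather than confirmed loader behavior.

```python
from typing import Optional


def page_indices(total_pages: int,
                 start_page: Optional[int] = None,
                 end_page: Optional[int] = None) -> range:
    """Map a 1-indexed, inclusive page range onto 0-based page indices.

    Hypothetical helper for illustration; not part of the Upsonic API.
    """
    first = 1 if start_page is None else start_page
    # Assumption: an end_page past the last page is clamped, not an error.
    last = total_pages if end_page is None else min(end_page, total_pages)
    if first < 1:
        raise ValueError("start_page is 1-indexed and must be >= 1")
    if last < first:
        return range(0)  # assumption: crossed bounds select nothing
    return range(first - 1, last)


print(list(page_indices(10, start_page=2, end_page=4)))  # [1, 2, 3]
print(list(page_indices(3)))                             # [0, 1, 2]
```

Under these assumptions, `start_page=2, end_page=4` selects the second, third, and fourth pages, and leaving both as `None` selects the whole document.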
