Overview
The PdfPlumber loader extracts structured content from PDFs, particularly tables and complex layouts. Its table detection and preservation of document structure go beyond what standard PDF loaders offer.
Loader Class: PdfPlumberLoader
Config Class: PdfPlumberLoaderConfig
Install
Install the PdfPlumber loader optional dependency group:
Examples
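The sketch below illustrates the kind of work described in the overview: pulling page text and tables out of a PDF and rendering each table as Markdown. It calls the underlying `pdfplumber` library directly rather than `PdfPlumberLoader`, because this section does not show the loader's constructor or method signatures. The helper names `table_to_markdown` and `extract_text_and_tables` are hypothetical and are not part of the loader.

```python
from typing import List, Optional, Sequence


def table_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render extracted rows (first row treated as header) as a Markdown table."""
    if not rows:
        return ""

    def clean(cell: Optional[str]) -> str:
        # pdfplumber reports empty cells as None; pipes would break the table.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    width = max(len(row) for row in rows)
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, *body = padded
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def extract_text_and_tables(path: str, password: Optional[str] = None) -> str:
    """Hypothetical helper: concatenate page text and Markdown tables."""
    import pdfplumber  # third-party; imported lazily so the helper above works without it

    parts: List[str] = []
    with pdfplumber.open(path, password=password) as pdf:
        for page in pdf.pages:
            parts.append(page.extract_text() or "")
            for table in page.extract_tables():
                parts.append(table_to_markdown(table))
    return "\n\n".join(part for part in parts if part)
```

Treating the first row as the header is an assumption of this sketch; the loader's own Markdown output may differ.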
Parameters
| Parameter | Type | Description | Default | Source |
|---|---|---|---|---|
| `encoding` | `str \| None` | File encoding (auto-detected if None) | `None` | Base |
| `error_handling` | `"ignore" \| "warn" \| "raise"` | How to handle loading errors | `"warn"` | Base |
| `include_metadata` | `bool` | Whether to include file metadata | `True` | Base |
| `custom_metadata` | `dict` | Additional metadata to include | | Base |
| `max_file_size` | `int \| None` | Maximum file size in bytes | `None` | Base |
| `skip_empty_content` | `bool` | Skip documents with empty content | `True` | Base |
| `extraction_mode` | `"hybrid" \| "text_only" \| "ocr_only"` | Content extraction strategy | `"hybrid"` | Specific |
| `start_page` | `int \| None` | First page to process (1-indexed) | `None` | Specific |
| `end_page` | `int \| None` | Last page to process (inclusive) | `None` | Specific |
| `clean_page_numbers` | `bool` | Remove page numbers from headers/footers | `True` | Specific |
| `page_num_start_format` | `str \| None` | Format string for page start markers | `None` | Specific |
| `page_num_end_format` | `str \| None` | Format string for page end markers | `None` | Specific |
| `extra_whitespace_removal` | `bool` | Normalize whitespace | `True` | Specific |
| `pdf_password` | `str \| None` | Password for encrypted PDFs | `None` | Specific |
| `extract_tables` | `bool` | Extract and include tables | `True` | Specific |
| `table_format` | `"text" \| "markdown" \| "csv" \| "grid"` | Format for extracted tables | `"markdown"` | Specific |
| `table_settings` | `dict` | Advanced table detection settings | Default dict | Specific |
| `extract_images` | `bool` | Extract image information | `False` | Specific |
| `layout_mode` | `"default" \| "layout" \| "simple"` | Text extraction layout mode | `"layout"` | Specific |
| `use_text_flow` | `bool` | Use text flow analysis | `True` | Specific |
| `char_margin` | `float` | Minimum distance between characters | `3.0` | Specific |
| `line_margin` | `float` | Minimum distance between lines | `0.5` | Specific |
| `word_margin` | `float` | Minimum distance between words | `0.1` | Specific |
| `extract_page_dimensions` | `bool` | Include page dimensions in metadata | `False` | Specific |
| `crop_box` | `tuple[float, float, float, float] \| None` | Crop box (x0, y0, x1, y1) | `None` | Specific |
| `extract_annotations` | `bool` | Extract annotations and hyperlinks | `False` | Specific |
| `keep_blank_chars` | `bool` | Preserve blank characters | `False` | Specific |
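
Because `start_page` is 1-indexed and `end_page` is inclusive, both differ from Python's 0-indexed, end-exclusive slicing. The sketch below shows one way such a range could map onto a list of pages. `page_slice` is a hypothetical helper, and clamping `end_page` to the page count and raising on an empty range are assumptions of this sketch, not documented loader behavior.

```python
from typing import Optional


def page_slice(start_page: Optional[int], end_page: Optional[int], page_count: int) -> slice:
    """Convert a 1-indexed, inclusive page range into a Python slice."""
    first = 1 if start_page is None else start_page
    # Assumption: an end_page past the last page is clamped rather than rejected.
    last = page_count if end_page is None else min(end_page, page_count)
    if first < 1:
        raise ValueError("start_page is 1-indexed and must be >= 1")
    if last < first:
        raise ValueError("page range selects no pages")
    # Shift the start down by one; the inclusive end already equals the exclusive stop.
    return slice(first - 1, last)


pages = list(range(1, 11))               # stand-in for a 10-page document
selected = pages[page_slice(2, 4, 10)]   # pages 2, 3 and 4
```

Leaving both values as `None` selects every page, which matches the defaults listed in the table.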

