Overview
The PyPDF loader extracts text from PDF documents using the pypdf library. It supports digital text extraction as well as OCR for scanned documents, and is best suited to standard PDF files that have a text layer.
Loader Class: PdfLoader
Config Class: PdfLoaderConfig
Install
Install the PyPDF loader optional dependency group:
Examples
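The import path for `PdfLoader` and `PdfLoaderConfig` is not shown on this page, so the following is only a minimal sketch of how the documented options fit together. `PdfLoaderConfigSketch` is a hypothetical stand-in dataclass that mirrors the names and defaults listed under Parameters; it is not the real config class. The empty-dict default for `custom_metadata` is an assumption, since the table does not state one.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass
class PdfLoaderConfigSketch:
    """Hypothetical stand-in mirroring the documented parameters (not the real PdfLoaderConfig)."""

    # Base options
    encoding: Optional[str] = None
    error_handling: Literal["ignore", "warn", "raise"] = "warn"
    include_metadata: bool = True
    custom_metadata: dict = field(default_factory=dict)  # assumed default; not stated in the table
    max_file_size: Optional[int] = None
    skip_empty_content: bool = True

    # PDF-specific options
    extraction_mode: Literal["hybrid", "text_only", "ocr_only"] = "hybrid"
    start_page: Optional[int] = None
    end_page: Optional[int] = None
    clean_page_numbers: bool = True
    page_num_start_format: Optional[str] = None
    page_num_end_format: Optional[str] = None
    extra_whitespace_removal: bool = True
    pdf_password: Optional[str] = None


# Process only pages 2 through 5 of a text-layer PDF and raise on loading errors.
config = PdfLoaderConfigSketch(
    extraction_mode="text_only",
    start_page=2,
    end_page=5,
    error_handling="raise",
)
print(config.extraction_mode, config.start_page, config.end_page)
```

With the real classes, the same keyword arguments would presumably be passed to `PdfLoaderConfig`; check the library's own reference for the exact import and constructor.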
Parameters
| Parameter | Type | Description | Default | Source |
|---|---|---|---|---|
| encoding | str \| None | File encoding (auto-detected if None) | None | Base |
| error_handling | "ignore" \| "warn" \| "raise" | How to handle loading errors | "warn" | Base |
| include_metadata | bool | Whether to include file metadata | True | Base |
| custom_metadata | dict | Additional metadata to include | | Base |
| max_file_size | int \| None | Maximum file size in bytes | None | Base |
| skip_empty_content | bool | Skip documents with empty content | True | Base |
| extraction_mode | "hybrid" \| "text_only" \| "ocr_only" | Content extraction strategy | "hybrid" | Specific |
| start_page | int \| None | First page to process (1-indexed) | None | Specific |
| end_page | int \| None | Last page to process (inclusive) | None | Specific |
| clean_page_numbers | bool | Remove page numbers from headers/footers | True | Specific |
| page_num_start_format | str \| None | Format string for page start markers | None | Specific |
| page_num_end_format | str \| None | Format string for page end markers | None | Specific |
| extra_whitespace_removal | bool | Normalize whitespace | True | Specific |
| pdf_password | str \| None | Password for encrypted PDFs | None | Specific |

