Overview
The PyMuPDF loader provides high-performance PDF processing, including structured text extraction, image handling, and annotation extraction. It is well suited to large-scale document processing.
Loader Class: PyMuPDFLoader
Config Class: PyMuPDFLoaderConfig
Install
Install the PyMuPDF loader optional dependency group:
Examples
Parameters
| Parameter | Type | Description | Default | Source |
|---|---|---|---|---|
| `encoding` | `str \| None` | File encoding (auto-detected if `None`) | `None` | Base |
| `error_handling` | `"ignore" \| "warn" \| "raise"` | How to handle loading errors | `"warn"` | Base |
| `include_metadata` | `bool` | Whether to include file metadata | `True` | Base |
| `custom_metadata` | `dict` | Additional metadata to include | | Base |
| `max_file_size` | `int \| None` | Maximum file size in bytes | `None` | Base |
| `skip_empty_content` | `bool` | Skip documents with empty content | `True` | Base |
| `extraction_mode` | `"hybrid" \| "text_only" \| "ocr_only"` | Content extraction strategy | `"hybrid"` | Specific |
| `start_page` | `int \| None` | First page to process (1-indexed) | `None` | Specific |
| `end_page` | `int \| None` | Last page to process (inclusive) | `None` | Specific |
| `clean_page_numbers` | `bool` | Remove page numbers from headers/footers | `True` | Specific |
| `page_num_start_format` | `str \| None` | Format string for page start markers | `None` | Specific |
| `page_num_end_format` | `str \| None` | Format string for page end markers | `None` | Specific |
| `extra_whitespace_removal` | `bool` | Normalize whitespace | `True` | Specific |
| `pdf_password` | `str \| None` | Password for encrypted PDFs | `None` | Specific |
| `text_extraction_method` | `"text" \| "dict" \| "html" \| "xml"` | Text extraction method | `"text"` | Specific |
| `include_images` | `bool` | Extract and include image information | `False` | Specific |
| `image_dpi` | `int` | DPI for image rendering (72-600) | `150` | Specific |
| `preserve_layout` | `bool` | Preserve text layout and positioning | `True` | Specific |
| `extract_annotations` | `bool` | Extract annotations and comments | `False` | Specific |
| `annotation_format` | `"text" \| "json"` | Format for extracted annotations | `"text"` | Specific |
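
The page-window and DPI parameters carry numeric conventions that are easy to get wrong, so the following minimal sketch shows one way they could be interpreted. It is not the loader's actual implementation: `resolve_page_range` and `dpi_to_zoom` are hypothetical helpers, and the sketch assumes that `end_page` is 1-indexed like `start_page`, that an `end_page` beyond the last page is clamped to the document length, and that `None` means "no limit". The zoom calculation relies on the fact that PDF coordinates use 72 points per inch.

```python
from typing import Optional


def resolve_page_range(
    start_page: Optional[int], end_page: Optional[int], page_count: int
) -> range:
    """Map a 1-indexed, inclusive page window to zero-based page indices.

    Hypothetical helper: the clamping and None handling are assumptions,
    not documented loader behavior.
    """
    first = 1 if start_page is None else start_page
    last = page_count if end_page is None else min(end_page, page_count)
    if first < 1:
        raise ValueError("start_page is 1-indexed and must be >= 1")
    if last < first:
        return range(0)  # window lies outside the document, nothing to process
    # Shift the start to zero-based; the inclusive end becomes an exclusive stop.
    return range(first - 1, last)


def dpi_to_zoom(image_dpi: int = 150) -> float:
    """Convert a DPI value into a scale factor relative to 72 points per inch."""
    if not 72 <= image_dpi <= 600:
        raise ValueError("image_dpi must be between 72 and 600")
    return image_dpi / 72


if __name__ == "__main__":
    # Pages 2 through 4 of a 10-page document, as zero-based indices.
    print(list(resolve_page_range(2, 4, 10)))
    # Scale factor for rendering at 144 DPI.
    print(dpi_to_zoom(144))
```

Under these assumptions, `start_page=2, end_page=4` selects the second, third, and fourth pages, and a DPI of 144 renders each page at twice its native point size.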

