encoding | str | None | File encoding (auto-detected if None) | None | Base |
error_handling | "ignore" | "warn" | "raise" | How to handle loading errors | "warn" | Base |
include_metadata | bool | Whether to include file metadata | True | Base |
custom_metadata | dict | Additional metadata to include | | Base |
max_file_size | int | None | Maximum file size in bytes | None | Base |
skip_empty_content | bool | Skip documents with empty content | True | Base |
extraction_mode | "hybrid" | "text_only" | "ocr_only" | Content extraction strategy | "hybrid" | Specific |
start_page | int | None | First page to process (1-indexed) | None | Specific |
end_page | int | None | Last page to process (inclusive) | None | Specific |
clean_page_numbers | bool | Remove page numbers from headers/footers | True | Specific |
page_num_start_format | str | None | Format string for page start markers | None | Specific |
page_num_end_format | str | None | Format string for page end markers | None | Specific |
extra_whitespace_removal | bool | Normalize whitespace | True | Specific |
pdf_password | str | None | Password for encrypted PDFs | None | Specific |
extract_tables | bool | Extract and include tables | True | Specific |
table_format | "text" | "markdown" | "csv" | "grid" | Format for extracted tables | "markdown" | Specific |
table_settings | dict | Advanced table detection settings | Default dict | Specific |
extract_images | bool | Extract image information | False | Specific |
layout_mode | "default" | "layout" | "simple" | Text extraction layout mode | "layout" | Specific |
use_text_flow | bool | Use text flow analysis | True | Specific |
char_margin | float | Minimum distance between characters | 3.0 | Specific |
line_margin | float | Minimum distance between lines | 0.5 | Specific |
word_margin | float | Minimum distance between words | 0.1 | Specific |
extract_page_dimensions | bool | Include page dimensions in metadata | False | Specific |
crop_box | tuple[float, float, float, float] | None | Crop box (x0, y0, x1, y1) | None | Specific |
extract_annotations | bool | Extract annotations and hyperlinks | False | Specific |
keep_blank_chars | bool | Preserve blank characters | False | Specific |
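
The sketch below shows one way the parameters above could be grouped and validated. It is illustrative only: the class name `PDFLoaderConfig` and the helper `page_indices` are hypothetical and not taken from the library, only a subset of the parameters is included, and the empty-dict default for `custom_metadata` is an assumption because the table does not state one. It also shows how the 1-indexed, inclusive `start_page`/`end_page` pair might map to 0-based page indices.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass
class PDFLoaderConfig:
    """Hypothetical container mirroring a subset of the documented parameters."""

    # Base options (defaults copied from the table)
    encoding: Optional[str] = None
    error_handling: Literal["ignore", "warn", "raise"] = "warn"
    include_metadata: bool = True
    # Assumption: the table gives no default for custom_metadata.
    custom_metadata: dict = field(default_factory=dict)
    max_file_size: Optional[int] = None
    skip_empty_content: bool = True

    # PDF-specific options (defaults copied from the table)
    extraction_mode: Literal["hybrid", "text_only", "ocr_only"] = "hybrid"
    start_page: Optional[int] = None
    end_page: Optional[int] = None
    extract_tables: bool = True
    table_format: Literal["text", "markdown", "csv", "grid"] = "markdown"
    layout_mode: Literal["default", "layout", "simple"] = "layout"

    def __post_init__(self) -> None:
        # Page numbers are 1-indexed, so 0 or negative values are rejected.
        if self.start_page is not None and self.start_page < 1:
            raise ValueError("start_page is 1-indexed and must be >= 1")
        if self.end_page is not None and self.end_page < 1:
            raise ValueError("end_page is 1-indexed and must be >= 1")
        if (
            self.start_page is not None
            and self.end_page is not None
            and self.end_page < self.start_page
        ):
            raise ValueError("end_page must not be smaller than start_page")

    def page_indices(self, total_pages: int) -> range:
        """Translate the 1-indexed, inclusive page range to 0-based indices."""
        first = 0 if self.start_page is None else self.start_page - 1
        last = total_pages if self.end_page is None else min(self.end_page, total_pages)
        return range(first, last)


# Usage: pages 2 through 4 of a 10-page document.
config = PDFLoaderConfig(start_page=2, end_page=4, table_format="csv")
print(list(config.page_indices(total_pages=10)))  # [1, 2, 3]
```

Validating the page range at construction time surfaces a bad configuration before any file is opened, which fits the `error_handling` options described in the table.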