Base Configuration

LoaderConfig

Base configuration class for all document loaders.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| encoding | Optional[str] | None | File encoding (auto-detected if None) |
| error_handling | Literal["ignore", "warn", "raise"] | "warn" | How to handle loading errors |
| include_metadata | bool | True | Whether to include file metadata |
| custom_metadata | Dict[str, Any] | {} | Additional metadata to include |
| max_file_size | Optional[int] | None | Maximum file size in bytes |
| skip_empty_content | bool | True | Skip documents with empty content |
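
As a rough illustration of how these fields and defaults fit together, the sketch below mirrors the table in a standard-library dataclass. `LoaderConfigSketch` is a hypothetical stand-in written for this example, not the library's class; the real `LoaderConfig` may validate values or behave differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Literal, Optional


@dataclass
class LoaderConfigSketch:
    """Hypothetical stand-in mirroring the documented LoaderConfig fields."""

    encoding: Optional[str] = None  # None means auto-detect
    error_handling: Literal["ignore", "warn", "raise"] = "warn"
    include_metadata: bool = True
    custom_metadata: Dict[str, Any] = field(default_factory=dict)
    max_file_size: Optional[int] = None  # bytes; None presumably means no limit
    skip_empty_content: bool = True


# Override only the fields that differ from the documented defaults.
config = LoaderConfigSketch(
    encoding="utf-8",
    error_handling="raise",
    custom_metadata={"source": "invoices"},
    max_file_size=10 * 1024 * 1024,  # 10 MiB expressed in bytes
)
print(config)
```

Fields that are not passed keep the defaults listed in the table, so `include_metadata` and `skip_empty_content` stay `True` here.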

Text Loader Configuration

TextLoaderConfig

An enhanced configuration for loading and processing plain text files.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| strip_whitespace | bool | True | If True, removes leading/trailing whitespace from each chunk |
| min_chunk_length | int | 1 | The minimum character length for a chunk to be kept after cleaning |

CSV Loader Configuration

CSVLoaderConfig

Configuration for CSV file loading.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| content_synthesis_mode | Literal["concatenated", "json"] | "concatenated" | How to create document content from rows |
| split_mode | Literal["single_document", "per_row", "per_chunk"] | "single_document" | How to split CSV into documents |
| rows_per_chunk | int | 100 | Number of rows per document when split_mode='per_chunk' |
| include_columns | Optional[List[str]] | None | Only include these columns |
| exclude_columns | Optional[List[str]] | None | Exclude these columns |
| delimiter | str | "," | CSV delimiter |
| quotechar | str | '"' | CSV quote character |
| has_header | bool | True | Whether CSV has a header row |
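
To make the three `split_mode` values concrete, here is a minimal sketch using only the standard library. `split_rows` and `synthesize` are hypothetical helpers written for this example; they show one plausible reading of the options and are not the loader's actual implementation. In particular, the exact text produced by "concatenated" mode is an assumption.

```python
import csv
import io
from typing import Dict, List

Row = Dict[str, str]


def split_rows(rows: List[Row], split_mode: str, rows_per_chunk: int = 100) -> List[List[Row]]:
    """Group parsed rows into documents according to split_mode."""
    if split_mode == "single_document":
        return [rows] if rows else []
    if split_mode == "per_row":
        return [[row] for row in rows]
    if split_mode == "per_chunk":
        return [rows[i:i + rows_per_chunk] for i in range(0, len(rows), rows_per_chunk)]
    raise ValueError(f"unknown split_mode: {split_mode!r}")


def synthesize(row: Row) -> str:
    """Assumed 'concatenated' rendering: 'column: value' pairs joined by commas."""
    return ", ".join(f"{key}: {value}" for key, value in row.items())


sample = "id,name\n1,Ada\n2,Grace\n3,Edsger\n"
# delimiter and quotechar correspond to the config fields of the same name.
rows = list(csv.DictReader(io.StringIO(sample), delimiter=",", quotechar='"'))

groups = split_rows(rows, "per_chunk", rows_per_chunk=2)
for group in groups:
    print([synthesize(row) for row in group])
```

With three rows and `rows_per_chunk=2`, the last group holds the single leftover row.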

PDF Loader Configuration

PdfLoaderConfig

An advanced configuration model for loading and processing PDF documents.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| extraction_mode | Literal["hybrid", "text_only", "ocr_only"] | "hybrid" | The core strategy for content extraction |
| start_page | Optional[int] | None | The first page number to process (1-indexed) |
| end_page | Optional[int] | None | The last page number to process (inclusive) |
| clean_page_numbers | bool | True | If True, intelligently identifies and removes page numbers |
| page_num_start_format | Optional[str] | None | A Python f-string to prepend to each page's content if page numbers are cleaned |
| page_num_end_format | Optional[str] | None | A Python f-string to append to each page's content if page numbers are cleaned |
| extra_whitespace_removal | bool | True | If True, normalizes whitespace by collapsing multiple newlines and spaces |
| pdf_password | Optional[str] | None | Password to use for decrypting protected PDF files |
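
Because `start_page` is 1-indexed and `end_page` is inclusive, the range does not map directly onto Python's 0-indexed, end-exclusive slices. The sketch below shows the conversion on a plain list of page strings. `select_pages` is a hypothetical helper for illustration only, and treating `None` as "from the first page" or "to the last page" is an assumption based on the defaults in the table.

```python
from typing import List, Optional, Sequence


def select_pages(
    pages: Sequence[str],
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
) -> List[str]:
    """Return pages start_page..end_page, 1-indexed and inclusive."""
    first = 1 if start_page is None else start_page
    last = len(pages) if end_page is None else end_page
    if first < 1:
        raise ValueError("start_page is 1-indexed and must be >= 1")
    if end_page is not None and last < first:
        raise ValueError("end_page must be >= start_page")
    # The 1-indexed inclusive range [first, last] is the 0-indexed slice [first - 1, last).
    return list(pages[first - 1:last])


pages = ["p1", "p2", "p3", "p4", "p5"]
print(select_pages(pages, start_page=2, end_page=4))
```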

PyMuPDF Loader Configuration

PyMuPDFLoaderConfig

Advanced configuration for PyMuPDF-based PDF document loading.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| extraction_mode | Literal["hybrid", "text_only", "ocr_only"] | "hybrid" | The core strategy for content extraction |
| start_page | Optional[int] | None | The first page number to process (1-indexed) |
| end_page | Optional[int] | None | The last page number to process (inclusive) |
| clean_page_numbers | bool | True | If True, intelligently identifies and removes page numbers |
| page_num_start_format | Optional[str] | None | A Python f-string to prepend to each page's content if page numbers are cleaned |
| page_num_end_format | Optional[str] | None | A Python f-string to append to each page's content if page numbers are cleaned |
| extra_whitespace_removal | bool | True | If True, normalizes whitespace by collapsing multiple newlines and spaces |
| pdf_password | Optional[str] | None | Password to use for decrypting protected PDF files |
| text_extraction_method | Literal["text", "dict", "html", "xml"] | "text" | Method for text extraction from pages |
| include_images | bool | False | If True, extracts and includes image information in metadata |
| image_dpi | int | 150 | DPI for image rendering when performing OCR |
| preserve_layout | bool | True | If True, preserves text layout and positioning information |
| extract_annotations | bool | False | If True, extracts annotations and comments from the PDF |
| annotation_format | Literal["text", "json"] | "text" | Format for extracted annotations |

DOCX Loader Configuration

DOCXLoaderConfig

Configuration for DOCX file loading.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| include_tables | bool | True | Include table content |
| include_headers | bool | True | Include header content |
| include_footers | bool | True | Include footer content |
| table_format | Literal["text", "markdown", "html"] | "text" | How to format tables |

JSON Loader Configuration

JSONLoaderConfig

Advanced configuration for loading and mapping structured JSON and JSONL files.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| mode | Literal["single", "multi"] | "single" | Processing mode: 'single' for one document per file, 'multi' to extract multiple documents from records within a file |
| record_selector | Optional[str] | None | A JQ query to select a list of records from the JSON object |
| content_mapper | str | "." | A JQ query to extract the content from a single record |
| metadata_mapper | Optional[Dict[str, str]] | None | A dictionary mapping metadata keys to JQ queries for extracting metadata from each record |
| content_synthesis_mode | Literal["json", "text"] | "json" | How to format the extracted content |
| json_lines | bool | False | Set to True if the file is in JSON Lines format (.jsonl) |
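
The interplay of `json_lines`, `content_mapper`, and `metadata_mapper` can be sketched without a JQ engine. The `resolve` function below is a hypothetical toy that understands only `.` and dotted key paths such as `.author.name`, a very small subset of JQ. It stands in for whatever JQ implementation the loader uses, and treating each JSON Lines row as one record is an assumption.

```python
import json
from typing import Any, Dict, List, Tuple


def resolve(path: str, value: Any) -> Any:
    """Evaluate a toy subset of JQ: '.' or a dotted key path like '.author.name'."""
    for key in filter(None, path.split(".")):
        value = value[key]
    return value


JSONL = (
    '{"body": "First post", "author": {"name": "Ada"}}\n'
    '{"body": "Second post", "author": {"name": "Grace"}}\n'
)

# Values as they might be passed to the config fields of the same name.
content_mapper = ".body"
metadata_mapper = {"author": ".author.name"}

# json_lines=True: one JSON value per non-blank line.
records = [json.loads(line) for line in JSONL.splitlines() if line.strip()]

documents: List[Tuple[Any, Dict[str, Any]]] = [
    (
        resolve(content_mapper, record),
        {key: resolve(query, record) for key, query in metadata_mapper.items()},
    )
    for record in records
]
print(documents)
```

With the default `content_mapper` of `"."`, the whole record would become the content, which is why `resolve(".", record)` returns the record unchanged.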

XML Loader Configuration

XMLLoaderConfig

An advanced configuration for parsing and structuring XML files.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split_by_xpath | str | "//*[not(*)] \| //item \| //product \| //book" | An XPath expression that identifies the elements to be treated as individual documents |
| content_xpath | Optional[str] | None | An optional, relative XPath from a split element to select the specific content |
| content_synthesis_mode | Literal["smart_text", "xml_snippet"] | "smart_text" | Defines the content for the Document |
| include_attributes | bool | True | If True, automatically includes the attributes of the split element in the metadata |
| metadata_xpaths | Optional[Dict[str, str]] | None | A dictionary mapping metadata keys to XPath expressions to extract targeted metadata |
| strip_namespaces | bool | True | If True, removes all XML namespaces from the document |
| recover_mode | bool | False | If True, attempts to parse malformed or broken XML files instead of raising an error |
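
The following sketch illustrates the idea behind `split_by_xpath`, `include_attributes`, and `metadata_xpaths` using the standard library's `xml.etree.ElementTree`. `split_xml` is a hypothetical helper, not the loader's code. ElementTree supports only a subset of XPath, so the example uses the simple expression `.//book` instead of the documented default, and the text-joining step is only a guess at what "smart_text" might produce.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

XML = """<catalog>
  <book id="b1" lang="en"><title>Dune</title><author>Frank Herbert</author></book>
  <book id="b2" lang="pl"><title>Solaris</title><author>Stanislaw Lem</author></book>
</catalog>"""


def split_xml(
    source: str,
    split_by_xpath: str,
    metadata_xpaths: Dict[str, str],
) -> List[Tuple[str, Dict[str, str]]]:
    """Return one (content, metadata) pair per element matched by split_by_xpath."""
    root = ET.fromstring(source)
    documents = []
    for element in root.findall(split_by_xpath):
        # Assumed "smart_text"-like content: the element's non-empty text nodes, joined.
        content = " ".join(t.strip() for t in element.itertext() if t.strip())
        # include_attributes=True: start the metadata from the element's attributes.
        metadata = dict(element.attrib)
        # metadata_xpaths: each expression is evaluated relative to the split element.
        for key, xpath in metadata_xpaths.items():
            metadata[key] = element.findtext(xpath, default="")
        documents.append((content, metadata))
    return documents


docs = split_xml(XML, ".//book", {"title": "title"})
for content, metadata in docs:
    print(content, metadata)
```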

YAML Loader Configuration

YAMLLoaderConfig

An advanced configuration for parsing and structuring YAML files.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split_by_jq_query | str | "." | A jq-style query to select objects to be treated as individual documents |
| handle_multiple_docs | bool | True | If True, processes YAML files containing multiple documents separated by '---' |
| content_synthesis_mode | Literal["canonical_yaml", "json", "smart_text"] | "canonical_yaml" | Defines the content for the Document |
| yaml_indent | int | 2 | The indentation level to use when content_synthesis_mode is 'canonical_yaml' |
| json_indent | Optional[int] | 2 | The indentation level for JSON output |
| flatten_metadata | bool | True | If True, flattens the nested structure of the selected YAML object into the metadata |
| metadata_jq_queries | Optional[Dict[str, str]] | None | A dictionary mapping metadata keys to jq queries to extract targeted metadata |

Markdown Loader Configuration

MarkdownLoaderConfig

Configuration for Markdown file loading.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| parse_front_matter | bool | True | Parse YAML front matter from the top of the file |
| include_code_blocks | bool | True | Include code block content in the document |
| code_block_language_metadata | bool | True | Add code block language as metadata |
| heading_metadata | bool | True | Extract headings and add them to the metadata |
| split_by_heading | Optional[Literal["h1", "h2", "h3"]] | None | If set, splits the file into multiple documents based on the specified heading level |
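
One way to picture `split_by_heading` is as a split at ATX headings of exactly the chosen level. The sketch below does this with a regular expression. `split_by_heading` here is a hypothetical function written for illustration, not the loader's implementation, and it is deliberately simple: it ignores text before the first matching heading and would misread `#` lines inside fenced code blocks.

```python
import re
from typing import List, Tuple


def split_by_heading(text: str, level: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) pairs at the given level ('h1' to 'h3')."""
    hashes = "#" * int(level[1])
    # Match lines that start with exactly this many '#' followed by whitespace.
    pattern = re.compile(rf"^{hashes}[ \t]+(.+?)[ \t]*$", re.MULTILINE)
    matches = list(pattern.finditer(text))
    sections = []
    for index, match in enumerate(matches):
        # A section runs until the next heading of the same level, or the end of the text.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        sections.append((match.group(1), text[match.end():end].strip()))
    return sections


MD = (
    "# Guide\n\nIntro.\n\n"
    "## Install\n\npip install example\n\n### Notes\n\nOptional.\n\n"
    "## Usage\n\nRun it.\n"
)

for heading, body in split_by_heading(MD, "h2"):
    print(heading, "->", repr(body))
```

Deeper headings such as `### Notes` stay inside the body of the enclosing `h2` section.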

HTML Loader Configuration

HTMLLoaderConfig

Configuration for HTML file and URL loading.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| extract_text | bool | True | Extract text content from HTML |
| preserve_structure | bool | True | Preserve document structure in output |
| include_links | bool | True | Include links in extracted content |
| include_images | bool | False | Include image information |
| remove_scripts | bool | True | Remove script tags |
| remove_styles | bool | True | Remove style tags |
| extract_metadata | bool | True | Extract metadata from HTML head |
| clean_whitespace | bool | True | Clean up whitespace in output |
| extract_headers | bool | True | Extract heading elements |
| extract_paragraphs | bool | True | Extract paragraph content |
| extract_lists | bool | True | Extract list content |
| extract_tables | bool | True | Extract table content |
| table_format | Literal["text", "markdown", "html"] | "text" | How to format extracted tables |
| user_agent | str | "Upsonic HTML Loader 1.0" | User agent for web requests |

Docling Loader Configuration

DoclingLoaderConfig

Advanced configuration for Docling-based document loading.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| extraction_mode | Literal["markdown", "chunks"] | "chunks" | Content extraction strategy |
| chunker_type | Literal["hybrid", "hierarchical"] | "hybrid" | Chunking algorithm when extraction_mode='chunks' |
| allowed_formats | Optional[List[str]] | None | Restrict input formats |
| markdown_image_placeholder | str | "" | Placeholder text for images in markdown export |
| ocr_enabled | bool | True | Enable OCR for scanned documents and images |
| ocr_force_full_page | bool | False | Force full-page OCR instead of hybrid mode |
| ocr_backend | Literal["rapidocr", "tesseract"] | "rapidocr" | OCR engine to use |
| ocr_lang | List[str] | ["english"] | OCR languages |
| ocr_backend_engine | Literal["onnxruntime", "openvino", "paddle", "torch"] | "onnxruntime" | Backend engine for RapidOCR |
| ocr_text_score | float | 0.5 | Minimum confidence score for OCR text (0.0-1.0) |
| enable_table_structure | bool | True | Enable table structure detection and parsing |
| table_structure_cell_matching | bool | True | Enable cell-level matching in tables for better structure preservation |
| max_pages | Optional[int] | None | Maximum number of pages to process per document |
| page_range | Optional[tuple[int, int]] | None | Specific page range to process as (start, end); 1-indexed, inclusive |
| parallel_processing | bool | True | Enable parallel processing for batch operations |
| batch_size | int | 10 | Number of documents to process in parallel during batch operations |
| extract_document_metadata | bool | True | Extract document properties (title, author, creation date, etc.) |
| confidence_threshold | float | 0.5 | Minimum confidence score for extracted chunks (0.0-1.0) |
| support_urls | bool | True | Allow loading documents from HTTP/HTTPS URLs |
| url_timeout | int | 30 | Timeout in seconds for URL downloads |

Configuration Factory

LoaderConfigFactory

Factory for creating loader configurations. Functions:

create_config

Create a configuration for the specified loader type. Parameters:
  • loader_type (str): Type of loader to create config for
  • **kwargs: Additional keyword arguments to pass to config constructor
Returns:
  • LoaderConfig: Configuration object for the specified loader type

get_supported_types

Get list of supported loader types. Returns:
  • List[str]: List of supported loader type names

simple_config

Create a simple configuration with defaults. Parameters:
  • loader_type (str): Type of loader to create config for
Returns:
  • LoaderConfig: Simple configuration with default values

advanced_config

Create an advanced configuration with custom settings. Parameters:
  • loader_type (str): Type of loader to create config for
  • **kwargs: Custom settings to override defaults
Returns:
  • LoaderConfig: Advanced configuration with custom settings
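
The factory functions described above follow a common registry pattern, sketched below with stand-in classes. Everything here is hypothetical: `ConfigFactorySketch`, the two small config classes, and the type names "csv" and "markdown" are assumptions made for illustration, and the real `LoaderConfigFactory` may register different names or validate arguments differently.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Type


@dataclass
class BaseConfig:
    include_metadata: bool = True


@dataclass
class CSVConfig(BaseConfig):
    delimiter: str = ","
    has_header: bool = True


@dataclass
class MarkdownConfig(BaseConfig):
    parse_front_matter: bool = True


class ConfigFactorySketch:
    """Hypothetical registry-based factory; not the library's implementation."""

    _registry: Dict[str, Type[BaseConfig]] = {
        "csv": CSVConfig,
        "markdown": MarkdownConfig,
    }

    @classmethod
    def get_supported_types(cls) -> List[str]:
        return sorted(cls._registry)

    @classmethod
    def create_config(cls, loader_type: str, **kwargs: Any) -> BaseConfig:
        try:
            config_class = cls._registry[loader_type.lower()]
        except KeyError:
            raise ValueError(f"unsupported loader type: {loader_type!r}") from None
        # Keyword arguments override the config class's defaults.
        return config_class(**kwargs)


print(ConfigFactorySketch.get_supported_types())
print(ConfigFactorySketch.create_config("csv", delimiter=";"))
```

In this sketch, a `simple_config` equivalent would be `create_config(loader_type)` with no keyword arguments, and an `advanced_config` equivalent would pass the overrides through `**kwargs`.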