Skip to main content

Overview

Docling loader provides enterprise-grade document processing with support for PDF, DOCX, XLSX, PPTX, HTML, Markdown, CSV, and images. Features intelligent chunking and advanced layout understanding. Loader Class: DoclingLoader Config Class: DoclingLoaderConfig

Dependencies

pip install docling

Examples

from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders import DoclingLoader, DoclingLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter import RecursiveChunker, RecursiveChunkingConfig
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure loader for semantic chunks
loader_config = DoclingLoaderConfig(
    extraction_mode="chunks",
    chunker_type="hybrid",
    ocr_enabled=True
)
loader = DoclingLoader(loader_config)

# Setup KnowledgeBase
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
chunker = RecursiveChunker(RecursiveChunkingConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="docling_docs",
    vector_size=1536,
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["report.pdf"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[chunker]
)

# Query with Agent
agent = Agent("openai/gpt-4o")
task = Task("Extract key insights", context=[kb])
result = agent.do(task)
print(result)

Parameters

ParameterTypeDescriptionDefaultSource
encodingstr | NoneFile encoding (auto-detected if None)NoneBase
error_handling"ignore" | "warn" | "raise"How to handle loading errors”warn”Base
include_metadataboolWhether to include file metadataTrueBase
custom_metadatadictAdditional metadata to includeBase
max_file_sizeint | NoneMaximum file size in bytesNoneBase
skip_empty_contentboolSkip documents with empty contentTrueBase
extraction_mode"markdown" | "chunks"Content extraction strategy”chunks”Specific
chunker_type"hybrid" | "hierarchical"Chunking algorithm (for chunks mode)“hybrid”Specific
allowed_formatslist[str] | NoneRestrict input formatsNoneSpecific
markdown_image_placeholderstrPlaceholder text for images""Specific
ocr_enabledboolEnable OCR for scanned documentsTrueSpecific
ocr_force_full_pageboolForce full-page OCRFalseSpecific
ocr_backend"rapidocr" | "tesseract"OCR engine to use”rapidocr”Specific
ocr_langlist[str]OCR languages[“english”]Specific
ocr_backend_engine"onnxruntime" | "openvino" | "paddle" | "torch"Backend engine for RapidOCR”onnxruntime”Specific
ocr_text_scorefloatMinimum confidence score (0.0-1.0)0.5Specific
enable_table_structureboolEnable table structure detectionTrueSpecific
table_structure_cell_matchingboolEnable cell-level matchingTrueSpecific
max_pagesint | NoneMaximum pages to processNoneSpecific
page_rangetuple[int, int] | NonePage range to process (start, end)NoneSpecific
parallel_processingboolEnable parallel processingTrueSpecific
batch_sizeintBatch size for parallel processing (1-100)10Specific
extract_document_metadataboolExtract document propertiesTrueSpecific
confidence_thresholdfloatMinimum confidence for chunks (0.0-1.0)0.5Specific
support_urlsboolAllow loading from URLsTrueSpecific
url_timeoutintTimeout for URL downloads (seconds)30Specific