Overview

The HTML splitter parses the HTML DOM and groups content into semantic blocks. It follows a multi-stage pipeline: parse and sanitize, segment by tags, chunk text within blocks, and merge small chunks. It preserves document structure and extracts rich metadata.

Splitter Class: HTMLChunker
Config Class: HTMLChunkingConfig
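To make the first two pipeline stages (parse and sanitize, then segment by tags) more concrete, here is a minimal sketch. It is not the HTMLChunker implementation: it uses only Python's standard-library `html.parser` rather than BeautifulSoup, and the names `BlockSegmenter`, `segment_html`, `SPLIT_ON_TAGS`, and `TAGS_TO_IGNORE` are hypothetical, chosen to mirror the `split_on_tags` and `tags_to_ignore` options.

```python
from html.parser import HTMLParser

# Illustrative subsets only; the real defaults are listed under Parameters.
SPLIT_ON_TAGS = {"h1", "h2", "h3", "p"}      # tags that start a new block
TAGS_TO_IGNORE = {"script", "style", "nav"}  # tags whose content is dropped


class BlockSegmenter(HTMLParser):
    """Hypothetical helper: collect text into blocks split at boundary tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished (tag, text) pairs
        self._tag = None     # boundary tag of the block being built
        self._parts = []     # text fragments of the block being built
        self._ignored = 0    # nesting depth inside ignored tags

    def _flush(self):
        # Join fragments, collapse whitespace, and keep non-empty blocks only.
        text = " ".join(" ".join(self._parts).split())
        if text:
            self.blocks.append((self._tag, text))
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in TAGS_TO_IGNORE:
            self._ignored += 1
        elif tag in SPLIT_ON_TAGS and not self._ignored:
            self._flush()    # a boundary tag closes the previous block
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in TAGS_TO_IGNORE and self._ignored:
            self._ignored -= 1

    def handle_data(self, data):
        if not self._ignored:
            self._parts.append(data)

    def close(self):
        super().close()
        self._flush()        # emit the final block


def segment_html(html):
    """Return a list of (tag, text) blocks for the given HTML string."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.blocks


sample = (
    "<nav>Menu</nav><h1>Title</h1><p>First para.</p>"
    "<script>var x = 1;</script><p>Second para.</p>"
)
blocks = segment_html(sample)
for tag, text in blocks:
    print(tag, "->", text)
```

In this sketch, the `nav` and `script` content is discarded and each `h1` or `p` starts a new block. The later stages (chunking oversized blocks and merging small ones) would then operate on the resulting list.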

Dependencies

pip install beautifulsoup4 lxml

Examples

from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders import HTMLLoader, HTMLLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter import HTMLChunker, HTMLChunkingConfig
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure splitter
splitter_config = HTMLChunkingConfig(
    chunk_size=512,
    chunk_overlap=50,
    split_on_tags=["h1", "h2", "h3", "p"],
    preserve_whole_tags=["table", "pre"]
)
splitter = HTMLChunker(splitter_config)

# Setup KnowledgeBase
loader = HTMLLoader(HTMLLoaderConfig())
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="html_docs",
    vector_size=1536,
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["https://example.com/article"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[splitter]
)

# Query with Agent
agent = Agent("openai/gpt-4o")
task = Task("Extract main content", context=[kb])
result = agent.do(task)
print(result)

Parameters

| Parameter | Type | Description | Default | Source |
| --- | --- | --- | --- | --- |
| chunk_size | int | Target size of each chunk | 1024 | Base |
| chunk_overlap | int | Overlapping units between chunks | 200 | Base |
| min_chunk_size | int \| None | Minimum size for a chunk | None | Base |
| length_function | Callable[[str], int] | Function to measure text length | len | Base |
| strip_whitespace | bool | Strip leading/trailing whitespace | False | Base |
| split_on_tags | list[str] | HTML tags that signify boundaries | ["h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "table"] | Specific |
| tags_to_ignore | list[str] | Tags to remove before processing | ["script", "style", "nav", "footer", "aside", "header", "form", "head", "meta", "link"] | Specific |
| tags_to_extract | list[str] \| None | Allowlist of tags to process | None | Specific |
| preserve_whole_tags | list[str] | Indivisible tag types | ["table", "pre", "code", "ul", "ol"] | Specific |
| extract_link_info | bool | Transform links to Markdown format | True | Specific |
| preserve_html_content | bool | Preserve original HTML content | False | Specific |
| text_chunker_to_use | BaseChunker | Chunker for oversized blocks | RecursiveChunker | Specific |
| merge_small_chunks | bool | Merge small chunks with adjacent | True | Specific |
| min_chunk_size_ratio | float | Minimum ratio for merging (0.0-1.0) | 0.3 | Specific |
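
The interaction between `merge_small_chunks`, `min_chunk_size_ratio`, `chunk_size`, and `length_function` can be illustrated with a small sketch. This is an assumption about how such merging might work, not the library's actual logic: it treats a chunk as "small" when its length is below `chunk_size * min_chunk_size_ratio` and merges it with its neighbor only if the result still fits within `chunk_size`. The function name `merge_small` and the newline separator are hypothetical.

```python
def merge_small(chunks, chunk_size=1024, min_chunk_size_ratio=0.3,
                length_function=len):
    """Hypothetical sketch: merge undersized chunks into an adjacent chunk."""
    # Assumed threshold: chunks shorter than this count as "small".
    threshold = chunk_size * min_chunk_size_ratio
    merged = []
    for chunk in chunks:
        if merged:
            prev = merged[-1]
            too_small = (length_function(chunk) < threshold
                         or length_function(prev) < threshold)
            combined = prev + "\n" + chunk
            # Merge only when the combined chunk stays within chunk_size.
            if too_small and length_function(combined) <= chunk_size:
                merged[-1] = combined
                continue
        merged.append(chunk)
    return merged


chunks = ["a" * 10, "b" * 50, "c" * 5, "d" * 60]
result = merge_small(chunks, chunk_size=100, min_chunk_size_ratio=0.3)
print([len(c) for c in result])
```

Because `length_function` is a parameter, the same sketch works with other size measures, for example `lambda s: len(s.split())` to count words instead of characters.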