# HTML Splitter

> Split HTML documents using structure-aware semantic segmentation

## Overview

The HTML splitter parses the HTML DOM and groups content into semantic blocks. It follows a four-stage pipeline: parse and sanitize the markup, segment it by tag, chunk the text within each block, and merge small chunks. It preserves document structure and extracts rich metadata.
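To make the first two stages concrete, here is a minimal sketch of sanitizing and tag-based segmentation using only the Python standard library. It is not Upsonic's implementation: the names `BlockCollector`, `segment_html`, `BOUNDARY_TAGS`, and `IGNORED_TAGS` are hypothetical, chosen to mirror the roles of `split_on_tags` and `tags_to_ignore`.

```python
from html.parser import HTMLParser

# Hypothetical stand-ins for split_on_tags and tags_to_ignore.
BOUNDARY_TAGS = {"h1", "h2", "h3", "p", "li"}
IGNORED_TAGS = {"script", "style", "nav", "footer"}


class BlockCollector(HTMLParser):
    """Collect the text of each boundary tag as one block, skipping ignored tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []         # finished (tag, text) pairs
        self._ignore_depth = 0   # > 0 while inside an ignored tag
        self._current_tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self._ignore_depth += 1
        elif tag in BOUNDARY_TAGS and self._ignore_depth == 0:
            self._flush()            # close any block that is still open
            self._current_tag = tag

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS and self._ignore_depth > 0:
            self._ignore_depth -= 1
        elif tag == self._current_tag:
            self._flush()

    def handle_data(self, data):
        # Keep text only when it sits inside an open, non-ignored block.
        if self._ignore_depth == 0 and self._current_tag is not None:
            self._buffer.append(data)

    def _flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append((self._current_tag, text))
        self._buffer = []
        self._current_tag = None


def segment_html(html):
    """Return a list of (tag, text) blocks found in the HTML string."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


sample = """
<html><head><style>p {color: red}</style></head>
<body>
  <nav><p>Home | About</p></nav>
  <h1>Title</h1>
  <p>First <b>paragraph</b>.</p>
  <script>var x = 1;</script>
  <p>Second paragraph.</p>
</body></html>
"""
blocks = segment_html(sample)
for tag, text in blocks:
    print(tag, "->", text)
```

Each resulting block would then go to the text-chunking stage if it exceeds `chunk_size`, and to the merge stage if it is too small.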

**Splitter Class:** `HTMLChunker`

**Config Class:** `HTMLChunkingConfig`

## Dependencies

```bash
uv pip install beautifulsoup4 lxml
```

## Examples

```python
from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders.html import HTMLLoader
from upsonic.loaders.config import HTMLLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter.html_chunker import HTMLChunker, HTMLChunkingConfig
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure splitter
splitter_config = HTMLChunkingConfig(
    chunk_size=512,
    chunk_overlap=50,
    split_on_tags=["h1", "h2", "h3", "p"],
    preserve_whole_tags=["table", "pre"]
)
splitter = HTMLChunker(splitter_config)

# Setup KnowledgeBase
loader = HTMLLoader(HTMLLoaderConfig())
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="html_docs",
    vector_size=1536,
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["https://example.com/article"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[splitter]
)

# Query with Agent
agent = Agent("anthropic/claude-sonnet-4-5")
task = Task("Extract main content", context=[kb])
result = agent.do(task)
print(result)
```

## Parameters

| Parameter               | Type                   | Description                         | Default                                                                                   | Source   |
| ----------------------- | ---------------------- | ----------------------------------- | ----------------------------------------------------------------------------------------- | -------- |
| `chunk_size`            | `int`                  | Target size of each chunk           | 1024                                                                                      | Base     |
| `chunk_overlap`         | `int`                  | Overlapping units between chunks    | 200                                                                                       | Base     |
| `min_chunk_size`        | `int \| None`          | Minimum size for a chunk            | None                                                                                      | Base     |
| `length_function`       | `Callable[[str], int]` | Function to measure text length     | `len`                                                                                     | Base     |
| `strip_whitespace`      | `bool`                 | Strip leading/trailing whitespace   | False                                                                                     | Base     |
| `split_on_tags`         | `list[str]`            | HTML tags that signify boundaries   | `["h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "table"]`                                | Specific |
| `tags_to_ignore`        | `list[str]`            | Tags to remove before processing    | `["script", "style", "nav", "footer", "aside", "header", "form", "head", "meta", "link"]` | Specific |
| `tags_to_extract`       | `list[str] \| None`    | Allowlist of tags to process        | None                                                                                      | Specific |
| `preserve_whole_tags`   | `list[str]`            | Indivisible tag types               | `["table", "pre", "code", "ul", "ol"]`                                                    | Specific |
| `extract_link_info`     | `bool`                 | Transform links to Markdown format  | True                                                                                      | Specific |
| `preserve_html_content` | `bool`                 | Preserve original HTML content      | False                                                                                     | Specific |
| `text_chunker_to_use`   | `BaseChunker`          | Chunker for oversized blocks        | RecursiveChunker                                                                          | Specific |
| `merge_small_chunks`    | `bool`                 | Merge small chunks with adjacent    | True                                                                                      | Specific |
| `min_chunk_size_ratio`  | `float`                | Minimum ratio for merging (0.0-1.0) | 0.3                                                                                       | Specific |
