Overview

The Markdown splitter parses Markdown syntax to identify structural boundaries such as headers, code blocks, tables, and lists. It segments content by semantic block and preserves the document hierarchy through header tracking.

Splitter Class: MarkdownChunker
Config Class: MarkdownChunkingConfig
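To make the idea of header-based boundaries and header tracking concrete, here is a minimal standard-library sketch. It is not the MarkdownChunker implementation; the helper name `split_markdown_by_headers`, the `levels` argument, and the returned dictionary shape are all assumptions made for illustration. It splits on ATX headers only, remembers which headers are in effect for each chunk, and ignores `#` lines inside fenced code blocks so that code stays whole.

```python
import re

# Illustrative sketch only; not the library's MarkdownChunker.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker, built here to avoid a literal fence


def split_markdown_by_headers(text, levels=(1, 2, 3)):
    """Hypothetical helper: split on ATX headers and track the header path."""
    chunks = []
    path = {}        # header level -> title currently in effect
    buffer = []      # lines collected for the current chunk
    in_fence = False

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            headers = [path[level] for level in sorted(path)]
            chunks.append({"headers": headers, "text": body})
        buffer.clear()

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            # Toggle fence state; the fence line stays with its code block.
            in_fence = not in_fence
            buffer.append(line)
            continue
        match = None if in_fence else HEADER_RE.match(line)
        if match and len(match.group(1)) in levels:
            flush()  # close the previous chunk under the old header path
            level = len(match.group(1))
            # A new header replaces any header at the same or deeper level.
            for old in [lvl for lvl in path if lvl >= level]:
                del path[old]
            path[level] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks


SAMPLE = "\n".join([
    "# Guide", "Intro text.",
    "## Install", FENCE, "# not a header", "pip install x", FENCE,
    "## Usage", "Run it.",
])
chunks = split_markdown_by_headers(SAMPLE)
for chunk in chunks:
    print(chunk["headers"], repr(chunk["text"]))
```

Each chunk carries the list of headers above it, which is one simple way to keep hierarchy available to downstream retrieval. A real splitter also has to handle tables, lists, horizontal rules, and oversized blocks, which this sketch leaves out.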

Dependencies

No additional dependencies are required. The splitter uses only the Python standard library.

Examples

from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders import MarkdownLoader, MarkdownLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter import MarkdownChunker, MarkdownChunkingConfig
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure splitter
splitter_config = MarkdownChunkingConfig(
    chunk_size=512,
    chunk_overlap=50,
    split_on_elements=["h1", "h2", "h3"],
    preserve_whole_elements=["code_block", "table"]
)
splitter = MarkdownChunker(splitter_config)

# Setup KnowledgeBase
loader = MarkdownLoader(MarkdownLoaderConfig())
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="markdown_docs",
    vector_size=1536,
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["document.md"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[splitter]
)

# Query with Agent
agent = Agent("openai/gpt-4o")
task = Task("Extract all code examples", context=[kb])
result = agent.do(task)
print(result)

Parameters

| Parameter | Type | Description | Default | Source |
| --- | --- | --- | --- | --- |
| chunk_size | int | Target size of each chunk | 1024 | Base |
| chunk_overlap | int | Overlapping units between chunks | 200 | Base |
| min_chunk_size | int \| None | Minimum size for a chunk | None | Base |
| length_function | Callable[[str], int] | Function to measure text length | len | Base |
| strip_whitespace | bool | Strip leading/trailing whitespace | False | Base |
| split_on_elements | list[str] | Elements that signify boundaries | ["h1", "h2", "h3", "code_block", "table", "horizontal_rule"] | Specific |
| preserve_whole_elements | list[str] | Indivisible element types | ["code_block", "table"] | Specific |
| strip_elements | bool | Strip Markdown syntax characters | True | Specific |
| preserve_original_content | bool | Preserve original markdown content | False | Specific |
| text_chunker_to_use | BaseChunker | Chunker for oversized blocks | RecursiveChunker | Specific |
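
The interplay of chunk_size, chunk_overlap, and length_function can be sketched with a plain sliding window. This is an assumption-laden illustration, not how the library splits text: the helpers `word_count` and `sliding_window_chunks` are hypothetical, and the sketch measures length in whitespace-separated words, whereas the default length_function (len) counts characters.

```python
def word_count(text: str) -> int:
    """Hypothetical length function: measure text in words, not characters."""
    return len(text.split())


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int):
    """Illustrative only: fixed-size word windows that share `chunk_overlap` words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


text = "one two three four five six seven eight nine ten"
chunks = sliding_window_chunks(text, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(word_count(chunk), chunk)
```

With a chunk size of 4 and an overlap of 1, each window repeats the last word of the previous one. Whatever unit length_function counts is the unit in which chunk_size and chunk_overlap should be read.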