Overview

The recursive splitter divides text using a prioritized list of separators. It splits on the first separator and, if any resulting segment is still larger than the chunk size, recursively applies the next separator to that segment. This works well for structured text such as code and Markdown, because logical units (paragraphs, lines, sentences) stay together wherever they fit.

Splitter Class: RecursiveChunker

Config Class: RecursiveChunkingConfig
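The separator fallback described above can be sketched in plain Python. This is an illustrative sketch only, not Upsonic's implementation: the function name `recursive_split` is hypothetical, lengths are measured with `len`, and the sketch ignores `chunk_overlap`, `min_chunk_size`, and `keep_separator` (separators that fall on a chunk boundary are simply dropped).

```python
def recursive_split(text, separators, chunk_size):
    """Hypothetical sketch: split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Nothing left to try (or "split between characters"): hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur here: fall through to the next one.
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging neighbours
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: recurse with the next separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "Intro paragraph.\n\n"
    "Second paragraph is a bit longer. It has two sentences.\n\n"
    "End."
)
chunks = recursive_split(text, ["\n\n", "\n", ". ", " "], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

Short paragraphs survive intact, while the long middle paragraph is broken down first by sentence and then by word, which is the behavior the separator priority list is meant to produce.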

Dependencies

No additional dependencies are required. The splitter uses only the standard library.

Examples

from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders import TextLoader, TextLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter import RecursiveChunker, RecursiveChunkingConfig
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure splitter
splitter_config = RecursiveChunkingConfig(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)
splitter = RecursiveChunker(splitter_config)

# Setup KnowledgeBase
loader = TextLoader(TextLoaderConfig())
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="recursive_docs",
    vector_size=1536,
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["document.txt"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[splitter]
)

# Query with Agent
agent = Agent("openai/gpt-4o")
task = Task("Summarize the main points", context=[kb])
result = agent.do(task)
print(result)

Parameters

Parameter           Type                  Description                         Default                                    Source
chunk_size          int                   Target size of each chunk           1024                                       Base
chunk_overlap       int                   Overlapping units between chunks    200                                        Base
min_chunk_size      int | None            Minimum size for a chunk            None                                       Base
length_function     Callable[[str], int]  Function to measure text length     len                                        Base
strip_whitespace    bool                  Strip leading/trailing whitespace   False                                      Base
separators          list[str]             Prioritized list of separators      ["\n\n", "\n", ". ", "? ", "! ", " ", ""]  Specific
keep_separator      bool                  Keep separator in chunks            True                                       Specific
is_separator_regex  bool                  Treat separators as regex patterns  False                                      Specific
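
To make the `keep_separator` and `is_separator_regex` options more concrete, here is a small standalone sketch of what such flags typically mean for a single split step. The helper `split_on_separator` is hypothetical and is not part of Upsonic. In particular, attaching the kept separator to the end of the preceding piece is an assumption, and the library may place it differently.

```python
import re


def split_on_separator(text, separator, keep_separator=True, is_separator_regex=False):
    """Hypothetical helper: split text on one separator, literal or regex."""
    # A literal separator is escaped so characters like "." or "?" match themselves.
    pattern = separator if is_separator_regex else re.escape(separator)

    pieces, start = [], 0
    for match in re.finditer(pattern, text):
        if match.end() == match.start():
            continue  # ignore empty matches
        # Assumption: a kept separator stays at the end of the preceding piece.
        end = match.end() if keep_separator else match.start()
        pieces.append(text[start:end])
        start = match.end()
    pieces.append(text[start:])
    return [p for p in pieces if p]


# Literal ".": only the dot itself is a split point.
print(split_on_separator("1.5 kg", ".", keep_separator=False))

# Regex r"\s": any whitespace character is a split point.
print(split_on_separator("1.5 kg", r"\s", keep_separator=False, is_separator_regex=True))

# One regex can cover several sentence endings, keeping the punctuation.
print(split_on_separator("a. b? c", r"[.?!] ", is_separator_regex=True))
```

Escaping matters because some literal separators, such as "? ", are not valid regular expressions on their own, so they only work as plain strings unless rewritten as a pattern.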