Overview

The character splitter splits text on a single, specified separator. It is best suited to documents with clear, consistent delimiters. It uses a direct “Split and Merge” process, which keeps it efficient and preserves positional integrity.

Splitter Class: CharacterChunker
Config Class: CharacterChunkingConfig
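To make the “Split and Merge” idea concrete, here is a minimal standard-library sketch. It is not Upsonic's implementation: the function name `split_and_merge` and its return shape (a list of `(start_offset, chunk_text)` pairs) are hypothetical, overlap is left out for brevity, and length is measured with plain `len`. The text is first split on one separator, then adjacent pieces are greedily merged until the next piece would push the chunk past `chunk_size`.

```python
def split_and_merge(text: str, separator: str = "\n\n", chunk_size: int = 40):
    """Hypothetical sketch: split on one separator, then merge pieces greedily.

    Returns (start_offset, chunk_text) pairs. Every chunk is a slice of the
    original text, so offsets always point at the right place in the source.
    A single piece longer than chunk_size is kept whole rather than cut.
    """
    chunks = []
    start = end = None  # span of the chunk currently being built
    offset = 0          # offset of the piece being examined
    for piece in text.split(separator):
        piece_start, piece_end = offset, offset + len(piece)
        offset = piece_end + len(separator)
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        if start is None:
            start, end = piece_start, piece_end
        elif piece_end - start <= chunk_size:
            end = piece_end  # merge: extend the current chunk
        else:
            chunks.append((start, text[start:end]))
            start, end = piece_start, piece_end
    if start is not None:
        chunks.append((start, text[start:end]))
    return chunks


text = "Intro paragraph.\n\nSecond paragraph here.\n\nThird one."
chunks = split_and_merge(text, separator="\n\n", chunk_size=40)
for start, chunk in chunks:
    print(start, repr(chunk))
```

Because each chunk is taken as a slice of the source rather than rebuilt by joining pieces, `text[start:start + len(chunk)] == chunk` holds for every result, which is one plausible reading of "positional integrity".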

Dependencies

No additional dependencies are required; the splitter uses only the standard library.

Examples

from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders.text import TextLoader
from upsonic.loaders.config import TextLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter.character import CharacterChunker, CharacterChunkingConfig
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure splitter
splitter_config = CharacterChunkingConfig(
    chunk_size=512,
    chunk_overlap=50,
    separator="\n\n"
)
splitter = CharacterChunker(splitter_config)

# Setup KnowledgeBase
loader = TextLoader(TextLoaderConfig())
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="character_docs",
    vector_size=1536,
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["document.txt"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[splitter]
)

# Query with Agent
agent = Agent("openai/gpt-4o")
task = Task("What are the key sections?", context=[kb])
result = agent.do(task)
print(result)

Parameters

| Parameter | Type | Description | Default | Source |
| --- | --- | --- | --- | --- |
| chunk_size | int | Target size of each chunk | 1024 | Base |
| chunk_overlap | int | Overlapping units between chunks | 200 | Base |
| min_chunk_size | int \| None | Minimum size for a chunk | None | Base |
| length_function | Callable[[str], int] | Function to measure text length | len | Base |
| strip_whitespace | bool | Strip leading/trailing whitespace | False | Base |
| separator | str | Single separator string or regex | "\n\n" | Specific |
| is_separator_regex | bool | Treat separator as regex | False | Specific |
| keep_separator | bool | Keep separator in chunks | True | Specific |
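
The three splitter-specific parameters interact, so a small standard-library sketch may help. This is an illustration only, not Upsonic's code: the helper `split_on_separator` is hypothetical, it assumes the separator pattern contains no capturing groups of its own, and attaching a kept separator to the start of the following piece is an assumption about placement that the table above does not specify.

```python
import re


def split_on_separator(text: str, separator: str = "\n\n",
                       is_separator_regex: bool = False,
                       keep_separator: bool = True) -> list[str]:
    """Hypothetical sketch of literal vs. regex splitting with optional separators."""
    # A literal separator is escaped so characters like "." are not treated as regex.
    pattern = separator if is_separator_regex else re.escape(separator)
    if not keep_separator:
        return [p for p in re.split(pattern, text) if p]
    # Wrapping the pattern in a capturing group makes re.split return the
    # matched separators too: [text, sep, text, sep, ..., text].
    parts = re.split(f"({pattern})", text)
    pieces = [parts[0]] if parts[0] else []
    for i in range(1, len(parts) - 1, 2):
        # Assumption: each separator is kept at the start of the next piece.
        pieces.append(parts[i] + parts[i + 1])
    return pieces


kept = split_on_separator("a\n\nb\n\nc")
dropped = split_on_separator("a\n\nb\n\nc", keep_separator=False)
by_regex = split_on_separator("one, two;three", r"[,;]\s*",
                              is_separator_regex=True, keep_separator=False)
print(kept, dropped, by_regex)
```

With `keep_separator=True`, joining the pieces reproduces the input exactly, which is useful when chunk positions need to map back to the source text.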