Overview

The semantic splitter identifies chunk boundaries by embedding each sentence and locating points of high cosine distance between adjacent sentences, which indicate topic changes. It uses a statistical method to determine the breakpoint threshold and requires an embedding provider.

Splitter Class: SemanticChunker
Config Class: SemanticChunkingConfig
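The boundary detection described above can be sketched in a few lines. This is a minimal illustration, not Upsonic's actual implementation: the function names (`cosine_distance`, `percentile`, `find_breakpoints`, `group_sentences`) are hypothetical, the embeddings are hand-written toy vectors rather than output from a real provider, and the strict "greater than the percentile" comparison is an assumption. It is written in pure Python to stay dependency-free; a real implementation would likely use numpy.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally distant (assumption)
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation between the two closest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def find_breakpoints(embeddings: Sequence[Sequence[float]],
                     threshold_percentile: float = 95.0) -> List[int]:
    """Return indices i where the distance between sentence i and i+1
    exceeds the given percentile of all adjacent distances."""
    distances = [cosine_distance(embeddings[i], embeddings[i + 1])
                 for i in range(len(embeddings) - 1)]
    if not distances:
        return []
    threshold = percentile(distances, threshold_percentile)
    return [i for i, d in enumerate(distances) if d > threshold]


def group_sentences(sentences: Sequence[str], breakpoints: Sequence[int]) -> List[str]:
    """Join sentences into chunks, starting a new chunk after each breakpoint."""
    chunks, start = [], 0
    for bp in breakpoints:
        chunks.append(" ".join(sentences[start:bp + 1]))
        start = bp + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


if __name__ == "__main__":
    sentences = ["Cats purr when content.", "Kittens purr too.",
                 "GPUs multiply matrices.", "TPUs do as well."]
    # Toy 2-D embeddings: the first two point one way, the last two another.
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    breaks = find_breakpoints(toy_embeddings, threshold_percentile=95.0)
    print(breaks)                              # [1]
    print(group_sentences(sentences, breaks))  # two chunks, split after sentence 2
```

In this sketch a higher percentile yields fewer, larger chunks, because fewer adjacent distances exceed the threshold; that is the role `breakpoint_threshold_amount` appears to play when the threshold type is `PERCENTILE`.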

Dependencies

pip install numpy
Also requires an embedding provider (e.g., OpenAIEmbedding).

Examples

from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders import TextLoader, TextLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter import SemanticChunker, SemanticChunkingConfig, BreakpointThresholdType
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure splitter with embedding provider
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
splitter_config = SemanticChunkingConfig(
    chunk_size=512,
    chunk_overlap=50,
    embedding_provider=embedding,
    breakpoint_threshold_type=BreakpointThresholdType.PERCENTILE,
    breakpoint_threshold_amount=95.0
)
splitter = SemanticChunker(splitter_config)

# Setup KnowledgeBase
loader = TextLoader(TextLoaderConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="semantic_docs",
    vector_size=1536,  # must match the embedding model's output dimension
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["document.txt"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[splitter]
)

# Query with Agent
agent = Agent("openai/gpt-4o")
task = Task("Identify different topics", context=[kb])
result = agent.do(task)
print(result)

Parameters

| Parameter | Type | Description | Default | Source |
| --- | --- | --- | --- | --- |
| chunk_size | int | Target size of each chunk | 1024 | Base |
| chunk_overlap | int | Overlapping units between chunks | 200 | Base |
| min_chunk_size | int \| None | Minimum size for a chunk | None | Base |
| length_function | Callable[[str], int] | Function to measure text length | len | Base |
| strip_whitespace | bool | Strip leading/trailing whitespace | False | Base |
| embedding_provider | EmbeddingProvider | Required embedding provider instance | Required | Specific |
| breakpoint_threshold_type | BreakpointThresholdType | Statistical method for breakpoints | PERCENTILE | Specific |
| breakpoint_threshold_amount | float | Numeric value for threshold type | 95.0 | Specific |
| sentence_splitter | Callable[[str], list[str]] | Function to split text into sentences | Default regex | Specific |
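Two of the parameters above, `sentence_splitter` and `length_function`, accept plain callables. The sketch below shows what such callables might look like, matching the signatures `Callable[[str], list[str]]` and `Callable[[str], int]`. The names `simple_sentence_splitter` and `word_count` are hypothetical, and the regex is an illustrative choice, not the library's default pattern (which this page does not show).

```python
import re
from typing import List

# Split on whitespace that follows sentence-ending punctuation.
# Illustrative only: it will mis-split abbreviations such as "e.g. this".
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def simple_sentence_splitter(text: str) -> List[str]:
    """Split text into sentences, dropping empty fragments."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def word_count(text: str) -> int:
    """Measure length in whitespace-separated words instead of characters."""
    return len(text.split())


if __name__ == "__main__":
    print(simple_sentence_splitter("Cats purr. Dogs bark! Do fish sing?"))
    print(word_count("Cats purr when content."))
```

Assuming the config accepts them as the table indicates, these would be passed as `sentence_splitter=simple_sentence_splitter` and `length_function=word_count`. Note that with a word-based `length_function`, `chunk_size` and `chunk_overlap` would presumably be interpreted in words rather than characters.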