Overview

The JSON splitter operates on parsed JSON data. It traverses the JSON graph to create chunks that are valid, self-contained JSON objects, and it adds JSON paths to each chunk's metadata so that every chunk can be traced back to its location in the source document. If JSON parsing fails, it falls back to text chunking.

Splitter Class: JSONChunker
Config Class: JSONChunkingConfig
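
To make the traversal idea concrete, here is a minimal standard-library sketch of path-aware JSON chunking with a text fallback. It is not the JSONChunker implementation: the helper names (`iter_leaves`, `set_path`, `sketch_json_chunks`, `sketch_split`) are hypothetical, lists are treated as indivisible leaves, and `chunk_overlap` and `max_depth` are ignored for brevity.

```python
import copy
import json
from typing import Any, Iterator


def iter_leaves(node: Any, path: tuple = ()) -> Iterator[tuple]:
    """Yield (path, value) pairs; non-empty dicts are branches, all else is a leaf."""
    if isinstance(node, dict) and node:
        for key, value in node.items():
            yield from iter_leaves(value, path + (key,))
    else:
        yield path, node


def set_path(target: dict, path: tuple, value: Any) -> None:
    """Store value under the nested keys in path, creating parent dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def sketch_json_chunks(data: dict, chunk_size: int) -> list:
    """Greedily pack leaves into dicts whose serialized length stays within
    chunk_size (a single oversized leaf still gets a chunk of its own)."""
    if not data:
        return []
    chunks, current, paths = [], {}, []
    for path, value in iter_leaves(data):
        candidate = copy.deepcopy(current)
        set_path(candidate, path, value)
        if paths and len(json.dumps(candidate)) > chunk_size:
            # Adding this leaf would overflow: emit the chunk and start a new one.
            chunks.append({"text": json.dumps(current), "paths": paths})
            current, paths = {}, []
            set_path(current, path, value)
        else:
            current = candidate
        # Record where this leaf lives so the chunk stays traceable.
        paths.append("$." + ".".join(path))
    if paths:
        chunks.append({"text": json.dumps(current), "paths": paths})
    return chunks


def sketch_split(raw: str, chunk_size: int) -> list:
    """Parse raw as JSON and chunk it; fall back to fixed-width text windows."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return [{"text": raw[i:i + chunk_size], "paths": []}
                for i in range(0, len(raw), chunk_size)]
    if not isinstance(data, dict) or not data:
        # Scalars, lists, and empty objects are returned whole in this sketch.
        return [{"text": json.dumps(data), "paths": ["$"]}]
    return sketch_json_chunks(data, chunk_size)
```

Each returned item carries a `text` field holding a JSON object that parses on its own, plus the list of paths it covers. This mirrors the idea of storing JSON paths in chunk metadata, though the real metadata layout may differ.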

Dependencies

No additional dependencies are required. The splitter uses only the Python standard library.

Examples

```python
from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders.json import JSONLoader
from upsonic.loaders.config import JSONLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter.json_chunker import JSONChunker, JSONChunkingConfig
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure splitter
splitter_config = JSONChunkingConfig(
    chunk_size=512,
    chunk_overlap=50,
    convert_lists_to_dicts=True,
    max_depth=50
)
splitter = JSONChunker(splitter_config)

# Setup KnowledgeBase
loader = JSONLoader(JSONLoaderConfig())
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="json_docs",
    vector_size=1536,
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["data.json"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[splitter]
)

# Query with Agent
agent = Agent("openai/gpt-4o")
task = Task("Find all user records", context=[kb])
result = agent.do(task)
print(result)
```

Parameters

| Parameter | Type | Description | Default | Source |
| --- | --- | --- | --- | --- |
| chunk_size | int | Target size of each chunk | 1024 | Base |
| chunk_overlap | int | Overlapping units between chunks | 200 | Base |
| min_chunk_size | int \| None | Minimum size for a chunk | None | Base |
| length_function | Callable[[str], int] | Function to measure text length | len | Base |
| strip_whitespace | bool | Strip leading/trailing whitespace | False | Base |
| convert_lists_to_dicts | bool | Convert lists to dict-like objects | True | Specific |
| max_depth | int \| None | Maximum recursion depth | 50 | Specific |
| json_encoder_options | dict | Options for json.dumps | | Specific |
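
As a rough illustration of the two JSON-specific options, the sketch below shows one plausible reading of `convert_lists_to_dicts` (each list becomes a dict keyed by its string indices, so every element gets an addressable path) and passes a dict of options through to `json.dumps`. The helper `lists_to_dicts` is hypothetical, and the index-keyed conversion is an assumption rather than confirmed JSONChunker behavior. `ensure_ascii` and `sort_keys` are standard `json.dumps` keyword arguments.

```python
import json
from typing import Any


def lists_to_dicts(node: Any) -> Any:
    """Recursively replace each list with a dict keyed by its string indices.
    (Assumed behavior, for illustration only.)"""
    if isinstance(node, dict):
        return {key: lists_to_dicts(value) for key, value in node.items()}
    if isinstance(node, list):
        return {str(i): lists_to_dicts(value) for i, value in enumerate(node)}
    return node


# Example options of the kind json_encoder_options might carry.
encoder_options = {"ensure_ascii": False, "sort_keys": True}

converted = lists_to_dicts({"users": [{"name": "Zoë"}, {"name": "Li"}]})
text = json.dumps(converted, **encoder_options)
print(text)
```

With lists converted this way, an element such as the first user can be addressed by a path like `users.0.name`, which is what makes per-element traceability possible when a list is split across chunks.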