Overview

The HTML loader extracts content from local HTML files and web URLs. It parses documents with BeautifulSoup4 and offers options to extract text, tables, links, and images while preserving document structure.

Loader Class: HTMLLoader
Config Class: HTMLLoaderConfig

Dependencies

pip install "upsonic[loaders]"

Examples

from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders import HTMLLoader, HTMLLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter import RecursiveChunker, RecursiveChunkingConfig
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure loader
loader_config = HTMLLoaderConfig(
    extract_text=True,
    extract_tables=True,
    table_format="markdown",
    include_links=True
)
loader = HTMLLoader(loader_config)

# Setup KnowledgeBase
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
chunker = RecursiveChunker(RecursiveChunkingConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="html_docs",
    vector_size=1536,
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["https://example.com/article"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[chunker]
)

# Query with Agent
agent = Agent("openai/gpt-4o")
task = Task("Summarize the article", context=[kb])
result = agent.do(task)
print(result)
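
To make options such as extract_text, include_links, remove_scripts, and remove_styles more concrete, here is a minimal, dependency-free sketch of that kind of extraction. It uses Python's standard html.parser rather than BeautifulSoup4, and it is not the loader's actual implementation: the names TextAndLinkExtractor and extract are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class TextAndLinkExtractor(HTMLParser):
    """Hypothetical helper: collects visible text and link targets,
    skipping the contents of <script> and <style> elements."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        cleaned = " ".join(data.split())  # collapse runs of whitespace
        if cleaned and self._skip_depth == 0:
            self.text_parts.append(cleaned)


def extract(html):
    """Return (text, links) for an HTML string."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


sample = """
<html><head><style>p { color: red; }</style></head>
<body>
  <h1>Release   notes</h1>
  <script>console.log("ignored");</script>
  <p>Details are in the <a href="https://example.com/changelog">changelog</a></p>
</body></html>
"""

text, links = extract(sample)
print(text)   # Release notes Details are in the changelog
print(links)  # ['https://example.com/changelog']
```

The script and style contents never reach the output, which is the effect the remove_scripts and remove_styles options describe. How HTMLLoader formats links or preserves structure in its own output may differ from this sketch.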

Parameters

| Parameter | Type | Description | Default | Source |
| --- | --- | --- | --- | --- |
| encoding | str \| None | File encoding (auto-detected if None) | None | Base |
| error_handling | "ignore" \| "warn" \| "raise" | How to handle loading errors | "warn" | Base |
| include_metadata | bool | Whether to include file metadata | True | Base |
| custom_metadata | dict | Additional metadata to include | | Base |
| max_file_size | int \| None | Maximum file size in bytes | None | Base |
| skip_empty_content | bool | Skip documents with empty content | True | Base |
| extract_text | bool | Extract text content from HTML | True | Specific |
| preserve_structure | bool | Preserve document structure in output | True | Specific |
| include_links | bool | Include links in extracted content | True | Specific |
| include_images | bool | Include image information | False | Specific |
| remove_scripts | bool | Remove script tags | True | Specific |
| remove_styles | bool | Remove style tags | True | Specific |
| extract_metadata | bool | Extract metadata from HTML head | True | Specific |
| clean_whitespace | bool | Clean up whitespace in output | True | Specific |
| extract_headers | bool | Extract heading elements | True | Specific |
| extract_paragraphs | bool | Extract paragraph content | True | Specific |
| extract_lists | bool | Extract list content | True | Specific |
| extract_tables | bool | Extract table content | True | Specific |
| table_format | "text" \| "markdown" \| "html" | How to format extracted tables | "text" | Specific |
| user_agent | str | User agent for web requests | "Upsonic HTML Loader 1.0" | Specific |
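
The table_format parameter accepts "text", "markdown", or "html". As a rough illustration of how three such renderings of the same table might differ, here is a small standalone sketch. The function render_table is hypothetical, and the exact layout of each format (for example, tab-separated cells for "text") is an assumption; the output HTMLLoader actually produces may look different.

```python
from html import escape


def render_table(rows, table_format="text"):
    """Hypothetical helper: render rows (first row is the header)
    in one of three formats."""
    header, *body = rows
    if table_format == "text":
        # Assumed plain-text layout: one line per row, tab-separated cells.
        return "\n".join("\t".join(row) for row in rows)
    if table_format == "markdown":
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)
    if table_format == "html":
        head = "<tr>" + "".join(f"<th>{escape(c)}</th>" for c in header) + "</tr>"
        body_rows = "".join(
            "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
            for row in body
        )
        return "<table>" + head + body_rows + "</table>"
    raise ValueError(f"unsupported table_format: {table_format!r}")


rows = [["Name", "Qty"], ["Apples", "3"]]
for fmt in ("text", "markdown", "html"):
    print(f"--- {fmt} ---")
    print(render_table(rows, fmt))
```

Markdown tends to suit downstream chunking and LLM prompts because the column structure survives as plain text, while "html" keeps the original markup semantics; which one works best depends on the splitter and model used.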