Overview
The HTML loader extracts content from local HTML files and web URLs. It uses BeautifulSoup4 for parsing, with options to extract text, tables, links, and images while preserving document structure.
Loader Class: HTMLLoader
Config Class: HTMLLoaderConfig
Install
Install the HTML loader optional dependency group:
Examples
Parameters
| Parameter | Type | Description | Default | Source |
|---|---|---|---|---|
| encoding | str \| None | File encoding (auto-detected if None) | None | Base |
| error_handling | "ignore" \| "warn" \| "raise" | How to handle loading errors | "warn" | Base |
| include_metadata | bool | Whether to include file metadata | True | Base |
| custom_metadata | dict | Additional metadata to include | | Base |
| max_file_size | int \| None | Maximum file size in bytes | None | Base |
| skip_empty_content | bool | Skip documents with empty content | True | Base |
| extract_text | bool | Extract text content from HTML | True | Specific |
| preserve_structure | bool | Preserve document structure in output | True | Specific |
| include_links | bool | Include links in extracted content | True | Specific |
| include_images | bool | Include image information | False | Specific |
| remove_scripts | bool | Remove script tags | True | Specific |
| remove_styles | bool | Remove style tags | True | Specific |
| extract_metadata | bool | Extract metadata from HTML head | True | Specific |
| clean_whitespace | bool | Clean up whitespace in output | True | Specific |
| extract_headers | bool | Extract heading elements | True | Specific |
| extract_paragraphs | bool | Extract paragraph content | True | Specific |
| extract_lists | bool | Extract list content | True | Specific |
| extract_tables | bool | Extract table content | True | Specific |
| table_format | "text" \| "markdown" \| "html" | How to format extracted tables | "text" | Specific |
| user_agent | str | User agent for web requests | "Upsonic HTML Loader 1.0" | Specific |
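
To make a few of the options above concrete, the following is a minimal sketch of what remove_scripts, remove_styles, include_links, and clean_whitespace could mean in practice. It is not the HTMLLoader implementation and does not use BeautifulSoup4: it relies only on Python's standard html.parser module, and the names SketchConfig, SketchExtractor, and extract_text_sketch are hypothetical, introduced here for illustration. How the real loader renders links or normalizes whitespace may differ.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class SketchConfig:
    """Hypothetical stand-in that mirrors four option names from the table."""
    remove_scripts: bool = True
    remove_styles: bool = True
    include_links: bool = True
    clean_whitespace: bool = True


class SketchExtractor(HTMLParser):
    """Collects text, optionally skipping <script>/<style> and annotating links."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.parts = []
        self._skip_depth = 0      # > 0 while inside a skipped element
        self._href_stack = []     # href values of currently open <a> tags

    def _is_skipped(self, tag):
        return (tag == "script" and self.config.remove_scripts) or (
            tag == "style" and self.config.remove_styles
        )

    def handle_starttag(self, tag, attrs):
        if self._is_skipped(tag):
            self._skip_depth += 1
        elif tag == "a":
            self._href_stack.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if self._is_skipped(tag) and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "a" and self._href_stack:
            href = self._href_stack.pop()
            if href and self.config.include_links:
                # Assumption: links are rendered as "text (url)".
                self.parts.append(f" ({href})")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_text_sketch(html, config=None):
    """Return the visible text of an HTML string under the given options."""
    extractor = SketchExtractor(config or SketchConfig())
    extractor.feed(html)
    extractor.close()
    text = "".join(extractor.parts)
    if extractor.config.clean_whitespace:
        # Collapse runs of whitespace into single spaces.
        text = re.sub(r"\s+", " ", text).strip()
    return text


SAMPLE_HTML = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Title</h1>\n"
    '<p>See   <a href="https://example.com">the docs</a>.</p>'
    "</body></html>"
)

print(extract_text_sketch(SAMPLE_HTML))
print(extract_text_sketch(SAMPLE_HTML, SketchConfig(include_links=False)))
```

With the defaults, script and style content is dropped and the link target is appended after its anchor text. Setting include_links to False keeps only the anchor text, and setting remove_scripts to False lets the script source through as plain text, which is usually why that option defaults to True.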

