# PyPDF Loader

> Load PDF documents using the `pypdf` library

## Overview

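The PyPDF loader extracts text from PDF documents using the `pypdf` library. It can read the embedded text layer of digital PDFs and run OCR on scanned documents, with the strategy selected through `extraction_mode`. It is best suited to standard PDF files that already have a text layer.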

**Loader Class:** `PdfLoader`

**Config Class:** `PdfLoaderConfig`

## Install

<Note>
  Install the PyPDF loader optional dependency group:

  ```bash theme={null}
  uv pip install "upsonic[pdf-loader]"
  ```
</Note>

## Examples

```python theme={null}
from upsonic import Agent, Task, KnowledgeBase
from upsonic.loaders.pdf import PdfLoader
from upsonic.loaders.config import PdfLoaderConfig
from upsonic.embeddings import OpenAIEmbedding, OpenAIEmbeddingConfig
from upsonic.text_splitter.recursive import RecursiveChunker, RecursiveChunkingConfig
from upsonic.vectordb import ChromaProvider, ChromaConfig, ConnectionConfig, Mode

# Configure the loader: hybrid extraction over pages 1 through 10 (1-indexed, inclusive)
loader_config = PdfLoaderConfig(
    extraction_mode="hybrid",
    start_page=1,
    end_page=10
)
loader = PdfLoader(loader_config)

# Set up the KnowledgeBase components: embeddings, chunking, and vector storage
embedding = OpenAIEmbedding(OpenAIEmbeddingConfig())
chunker = RecursiveChunker(RecursiveChunkingConfig())
vectordb = ChromaProvider(ChromaConfig(
    collection_name="pdf_docs",
    vector_size=1536,  # must match the embedding model's output dimension
    connection=ConnectionConfig(mode=Mode.IN_MEMORY)
))

kb = KnowledgeBase(
    sources=["document.pdf"],
    embedding_provider=embedding,
    vectordb=vectordb,
    loaders=[loader],
    splitters=[chunker]
)

# Query with Agent
agent = Agent("anthropic/claude-sonnet-4-5")
task = Task("What is the main topic?", context=[kb])
result = agent.do(task)
print(result)
```
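The `start_page` and `end_page` values in the example are 1-indexed and inclusive, while Python sequences are zero-based. The sketch below shows how such a range could map onto zero-based page indices. `resolve_page_indices` is a hypothetical helper written for illustration, not part of Upsonic, and clamping `end_page` to the document length is an assumption about how an out-of-range value might be handled.

```python
from typing import Optional


def resolve_page_indices(
    total_pages: int,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
) -> range:
    """Hypothetical helper: map 1-indexed, inclusive bounds to zero-based indices."""
    # None means "from the first page" / "through the last page".
    first = 1 if start_page is None else start_page
    if first < 1:
        raise ValueError("start_page is 1-indexed and must be >= 1")
    if end_page is not None and end_page < first:
        raise ValueError("end_page must not be smaller than start_page")
    # Assumption: an end_page past the end of the document is clamped.
    last = total_pages if end_page is None else min(end_page, total_pages)
    # range() excludes its upper bound, so the inclusive `last` is used as-is.
    return range(first - 1, last)


# Pages 1 through 10 of a 25-page document become indices 0 through 9.
print(list(resolve_page_indices(25, start_page=1, end_page=10)))
# → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With both bounds left as `None`, the helper selects every page, which mirrors the `None` defaults listed in the parameters table.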

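## Parameters

The **Source** column indicates whether a parameter comes from the base loader configuration (`Base`) or is specific to `PdfLoaderConfig` (`Specific`).
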
| Parameter                  | Type                                    | Description                              | Default    | Source   |
| -------------------------- | --------------------------------------- | ---------------------------------------- | ---------- | -------- |
| `encoding`                 | `str \| None`                           | File encoding (auto-detected if `None`)  | `None`     | Base     |
| `error_handling`           | `"ignore" \| "warn" \| "raise"`         | How to handle loading errors             | `"warn"`   | Base     |
| `include_metadata`         | `bool`                                  | Whether to include file metadata         | `True`     | Base     |
| `custom_metadata`          | `dict`                                  | Additional metadata to include           | `{}`       | Base     |
| `max_file_size`            | `int \| None`                           | Maximum file size in bytes               | `None`     | Base     |
| `skip_empty_content`       | `bool`                                  | Skip documents with empty content        | `True`     | Base     |
| `extraction_mode`          | `"hybrid" \| "text_only" \| "ocr_only"` | Content extraction strategy              | `"hybrid"` | Specific |
| `start_page`               | `int \| None`                           | First page to process (1-indexed)        | `None`     | Specific |
| `end_page`                 | `int \| None`                           | Last page to process (inclusive)         | `None`     | Specific |
| `clean_page_numbers`       | `bool`                                  | Remove page numbers from headers/footers | `True`     | Specific |
| `page_num_start_format`    | `str \| None`                           | Format string for page start markers     | `None`     | Specific |
| `page_num_end_format`      | `str \| None`                           | Format string for page end markers       | `None`     | Specific |
| `extra_whitespace_removal` | `bool`                                  | Normalize whitespace                     | `True`     | Specific |
| `pdf_password`             | `str \| None`                           | Password for encrypted PDFs              | `None`     | Specific |