Overview
What is Unified OCR?
The Unified OCR system in Upsonic provides a consistent interface for optical character recognition across multiple OCR engines. Instead of learning a different API for each OCR provider, you use a single OCR class that works with EasyOCR, RapidOCR, Tesseract, DeepSeek, and PaddleOCR.
The OCR class serves as a high-level orchestrator that:
- Manages multiple OCR provider backends with a unified API
- Handles image preprocessing (rotation correction, contrast enhancement, noise reduction)
- Converts PDFs to images with configurable DPI
- Tracks confidence scores and bounding box detection
- Collects performance metrics and processing statistics
- Provides provider-specific features and optimizations
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR
# Create OCR instance
ocr = OCR(EasyOCR, languages=['en'], rotation_fix=True)
# Extract text
text = ocr.get_text('document.pdf')
print(text)
How Unified OCR Works
The OCR system follows a clear processing pipeline:
- File Preparation: Validates file existence and format (supports .png, .jpg, .jpeg, .bmp, .tiff, .tif, .webp, .pdf)
- PDF Conversion: If the file is a PDF, converts each page to images at the specified DPI
- Image Preprocessing: Optionally applies rotation correction, contrast enhancement, and noise reduction
- OCR Processing: Processes each image through the selected provider’s engine
- Result Aggregation: Combines results from multiple pages, calculates average confidence scores
- Metrics Tracking: Updates processing statistics for performance analysis
from upsonic.ocr import OCR
from upsonic.ocr.rapidocr import RapidOCR
# Create OCR with preprocessing
ocr = OCR(
    RapidOCR,
    languages=['en'],
    rotation_fix=True,
    enhance_contrast=True,
    pdf_dpi=300
)
# Process file - returns detailed results
result = ocr.process_file('document.pdf')
print(f"Text: {result.text}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Pages: {result.page_count}")
print(f"Processing time: {result.processing_time_ms:.2f}ms")
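The File Preparation and Result Aggregation steps above can be pictured with a small standard-library sketch. It is not Upsonic's implementation: `PageResult`, `is_supported`, and `aggregate` are hypothetical names used only for illustration, and averaging per-page confidence with equal weight per page is an assumption.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List

# Extensions listed in the File Preparation step.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif", ".webp", ".pdf"}


@dataclass
class PageResult:
    """Hypothetical per-page result; not an Upsonic class."""
    text: str
    confidence: float


def is_supported(path: str) -> bool:
    """Return True if the file extension is one of the supported formats."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def aggregate(pages: List[PageResult]) -> PageResult:
    """Join page texts and average the per-page confidence scores."""
    if not pages:
        return PageResult(text="", confidence=0.0)
    text = "\n\n".join(page.text for page in pages)
    confidence = sum(page.confidence for page in pages) / len(pages)
    return PageResult(text=text, confidence=confidence)


pages = [PageResult("Page one", 0.9), PageResult("Page two", 0.7)]
combined = aggregate(pages)
print(is_supported("scan.PDF"), is_supported("notes.docx"))
print(f"Average confidence: {combined.confidence:.2f}")
```

The extension check is case-insensitive here so that files such as `scan.PDF` are accepted; whether the library behaves the same way is not stated in this page.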
Attributes
The OCR system is configured through OCRConfig, which provides the following attributes:
| Attribute | Type | Default | Description |
|---|---|---|---|
| languages | List[str] | ['en'] | Languages to detect (e.g., ['en', 'zh', 'ja']) |
| confidence_threshold | float | 0.0 | Minimum confidence threshold (0.0-1.0) for accepting OCR results |
| rotation_fix | bool | False | Enable automatic rotation correction for skewed images |
| enhance_contrast | bool | False | Enhance image contrast before OCR processing |
| remove_noise | bool | False | Apply noise reduction filter to improve text clarity |
| pdf_dpi | int | 300 | DPI resolution for PDF rendering (higher = better quality, slower) |
| preserve_formatting | bool | True | Try to preserve text formatting (line breaks, spacing) |
Configuration Example:
from upsonic.ocr import OCR
from upsonic.ocr.base import OCRConfig
from upsonic.ocr.tesseract import TesseractOCR
# Method 1: Using OCRConfig
config = OCRConfig(
    languages=['eng', 'fra'],
    confidence_threshold=0.6,
    rotation_fix=True,
    enhance_contrast=True,
    remove_noise=True,
    pdf_dpi=300,
    preserve_formatting=True
)
ocr = OCR(TesseractOCR, config=config)
# Method 2: Direct parameters
ocr = OCR(
    TesseractOCR,
    languages=['eng', 'fra'],
    confidence_threshold=0.6,
    rotation_fix=True,
    enhance_contrast=True
)
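To make the pdf_dpi trade-off concrete, the arithmetic below shows how the rendered image size grows with DPI. It is a plain calculation, not a call into Upsonic; the US Letter page size (8.5 x 11 inches) and the helper name `rendered_size` are assumptions chosen for illustration.

```python
from typing import Tuple


def rendered_size(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter is 8.5 x 11 inches (assumed page size for this example).
for dpi in (150, 300, 600):
    width_px, height_px = rendered_size(8.5, 11.0, dpi)
    megapixels = width_px * height_px / 1_000_000
    print(f"{dpi} DPI: {width_px} x {height_px} px ({megapixels:.1f} MP)")
```

Doubling the DPI quadruples the number of pixels per page, which is why higher values tend to cost noticeably more time and memory.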
Providers
EasyOCR
Ready-to-use OCR with 80+ supported languages using deep learning models. Best for multi-language support with high accuracy.
Usage:
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR
# Create OCR with EasyOCR
ocr = OCR(EasyOCR, languages=['en', 'zh'], gpu=True, rotation_fix=True)
# Extract text
text = ocr.get_text('document.pdf')
print(text)
# Get detailed results
result = ocr.process_file('image.png')
print(f"Confidence: {result.confidence:.2%}")
for block in result.blocks:
    print(f"Text: {block.text}, Confidence: {block.confidence:.2%}")
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| languages | List[str] | ['en'] | List of language codes to detect |
| gpu | bool | False | Enable GPU acceleration for faster processing |
| rotation_fix | bool | False | Auto-detect and fix image rotation |
| enhance_contrast | bool | False | Enhance image contrast |
| remove_noise | bool | False | Apply noise reduction |
| confidence_threshold | float | 0.0 | Minimum confidence for text blocks |
| paragraph | bool | False | Group text into paragraphs |
| min_size | int | 10 | Minimum text region size |
| text_threshold | float | 0.7 | Text detection threshold |
Supported Languages: 80+ languages including English, Chinese, Japanese, Korean, Thai, Vietnamese, Arabic, Russian, and most European languages.
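The effect of confidence_threshold on text blocks can be sketched as a simple filter. This is an illustration only: `Block` and `filter_blocks` are hypothetical names, and treating the threshold as inclusive (`>=`) is an assumption, since the exact comparison used by the library is not documented here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical stand-in for a recognized text block."""
    text: str
    confidence: float


def filter_blocks(blocks: List[Block], threshold: float) -> List[Block]:
    """Keep only blocks whose confidence meets the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [block for block in blocks if block.confidence >= threshold]


blocks = [Block("Invoice", 0.95), Block("T0tal", 0.41), Block("42.00", 0.78)]
kept = filter_blocks(blocks, threshold=0.6)
print([block.text for block in kept])
```

With the default threshold of 0.0 every block passes, so raising it is a trade between dropping misreads and losing faint but correct text.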
RapidOCR
Lightweight OCR based on ONNX Runtime for fast inference. Best for speed and lightweight deployment.
Usage:
from upsonic.ocr import OCR
from upsonic.ocr.rapidocr import RapidOCR
# Create OCR with RapidOCR
ocr = OCR(RapidOCR, languages=['en', 'ch'], confidence_threshold=0.5)
# Extract text from image
text = ocr.get_text('invoice.png')
print(text)
# Process PDF
result = ocr.process_file('document.pdf')
print(f"Extracted {len(result.text)} characters from {result.page_count} pages")
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| languages | List[str] | ['en'] | List of language codes (primarily 'en' and 'ch') |
| confidence_threshold | float | 0.0 | Minimum confidence for text blocks |
| rotation_fix | bool | False | Auto-detect and fix image rotation |
| enhance_contrast | bool | False | Enhance image contrast |
| remove_noise | bool | False | Apply noise reduction |
| pdf_dpi | int | 300 | DPI for PDF rendering |
Supported Languages: English, Chinese (simplified and traditional), Japanese, Korean, and several other scripts including Tamil, Telugu, Arabic, Cyrillic, and Devanagari.
Tesseract
Open-source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google, with 100+ language support. Best for traditional OCR with extensive language coverage.
Usage:
from upsonic.ocr import OCR
from upsonic.ocr.tesseract import TesseractOCR
# Create OCR with Tesseract
ocr = OCR(TesseractOCR, languages=['eng', 'fra'], enhance_contrast=True)
# Extract text
text = ocr.get_text('receipt.jpg')
print(text)
# Custom Tesseract configuration
result = ocr.process_file('document.pdf', psm=3, oem=3)
print(f"Text: {result.text}")
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| languages | List[str] | ['eng'] | List of Tesseract language codes |
| tesseract_cmd | str | None | Path to tesseract executable |
| confidence_threshold | float | 0.0 | Minimum confidence for text blocks |
| rotation_fix | bool | False | Auto-detect and fix image rotation |
| enhance_contrast | bool | False | Enhance image contrast |
| remove_noise | bool | False | Apply noise reduction |
| preserve_formatting | bool | True | Preserve text layout and formatting |
| psm | int | 3 | Page segmentation mode (0-13) |
| oem | int | 3 | OCR Engine Mode (0-3) |
| custom_config | str | '' | Additional Tesseract configuration string |
Supported Languages: 100+ languages including all major languages. Requires language packs to be installed separately.
Installation Note: Tesseract must be installed on the system:
- Ubuntu/Debian: sudo apt-get install tesseract-ocr
- macOS: brew install tesseract
- Windows: Download installer from GitHub
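The psm, oem, and custom_config parameters correspond to Tesseract's own command-line options `--psm`, `--oem`, and free-form extras such as `-c VAR=VALUE`. How the Upsonic provider assembles them internally is not shown on this page, so the sketch below is an assumption about one reasonable way to do it; `build_tesseract_config` is a hypothetical helper. It also uses `shutil.which` to check whether the tesseract executable is on the PATH.

```python
import shutil


def build_tesseract_config(psm: int = 3, oem: int = 3, custom_config: str = "") -> str:
    """Compose a Tesseract option string from psm, oem, and extra options."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be in the range 0-13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be in the range 0-3")
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if custom_config.strip():
        parts.append(custom_config.strip())
    return " ".join(parts)


# shutil.which returns None when the executable is not on the PATH.
tesseract_path = shutil.which("tesseract")
print("tesseract found at:", tesseract_path or "not installed")

# psm=6 treats the image as a single uniform block of text.
print(build_tesseract_config(psm=6, custom_config="-c preserve_interword_spaces=1"))
```

If the executable is installed somewhere outside the PATH, the tesseract_cmd parameter in the table above is the place to point at it.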
PaddleOCR
Comprehensive OCR with multiple specialized pipelines for advanced document understanding.
Usage:
from upsonic.ocr import OCR
from upsonic.ocr.paddleocr import PaddleOCR, PPStructureV3, PPChatOCRv4, PaddleOCRVL
# General OCR (PP-OCRv5)
ocr = OCR(PaddleOCR, lang='en', ocr_version='PP-OCRv5')
text = ocr.get_text('document.pdf')
# Advanced document structure recognition
ocr_structure = OCR(
    PPStructureV3,
    use_table_recognition=True,
    use_formula_recognition=True
)
result = ocr_structure.process_file('research_paper.pdf')
# Chat-based document understanding
ocr_chat = OCR(
    PPChatOCRv4,
    use_table_recognition=True,
    use_seal_recognition=True
)
# Vision-Language document understanding
ocr_vl = OCR(
    PaddleOCRVL,
    use_layout_detection=True,
    use_chart_recognition=True,
    format_block_content=True
)
Parameters:
PaddleOCR (General OCR):
| Parameter | Type | Default | Description |
|---|---|---|---|
| lang | str | 'en' | Language code |
| ocr_version | str | 'PP-OCRv5' | OCR version ('PP-OCRv3', 'PP-OCRv4', 'PP-OCRv5') |
| use_doc_orientation_classify | bool | None | Enable document orientation classification |
| use_doc_unwarping | bool | None | Enable document unwarping |
| use_textline_orientation | bool | None | Enable text line orientation detection |
| text_det_limit_side_len | int | None | Limit on detection input side length |
| text_rec_score_thresh | float | None | Text recognition score threshold |
| return_word_box | bool | None | Return word-level bounding boxes |
PPStructureV3 (Document Structure):
| Parameter | Type | Default | Description |
|---|---|---|---|
| use_table_recognition | bool | None | Enable table recognition |
| use_formula_recognition | bool | None | Enable formula recognition |
| use_seal_recognition | bool | None | Enable seal text recognition |
| use_chart_recognition | bool | None | Enable chart recognition |
| layout_threshold | float | None | Layout detection score threshold |
| lang | str | 'en' | Language code |
PPChatOCRv4 (Chat-based OCR):
| Parameter | Type | Default | Description |
|---|---|---|---|
| use_table_recognition | bool | None | Enable table recognition |
| use_seal_recognition | bool | None | Enable seal recognition |
| mllm_chat_bot_config | dict | None | Multimodal LLM configuration |
| retriever_config | dict | None | Retriever configuration for vector search |
PaddleOCRVL (Vision-Language):
| Parameter | Type | Default | Description |
|---|---|---|---|
| use_layout_detection | bool | None | Enable layout detection |
| use_chart_recognition | bool | None | Enable chart recognition |
| format_block_content | bool | None | Format content as Markdown |
| vl_rec_backend | str | 'local' | VL recognition backend |
| temperature | float | None | Sampling temperature for VLM |
Supported Languages: 40+ languages for PP-OCRv5, with extensive support in PP-OCRv3 for Asian, European, and Middle Eastern languages.
Enabling Metrics
The OCR system automatically tracks metrics for all operations. Metrics include files processed, pages, characters, confidence scores, and processing time.
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR
# Create OCR instance
ocr = OCR(EasyOCR, languages=['en'])
# Process multiple files
ocr.get_text('document1.pdf')
ocr.get_text('document2.pdf')
ocr.get_text('image.png')
# Get metrics
metrics = ocr.get_metrics()
print(f"Files processed: {metrics.files_processed}")
print(f"Total pages: {metrics.total_pages}")
print(f"Total characters: {metrics.total_characters}")
print(f"Average confidence: {metrics.average_confidence:.2%}")
print(f"Total processing time: {metrics.processing_time_ms:.2f}ms")
print(f"Provider: {metrics.provider}")
# Reset metrics for new batch
ocr.reset_metrics()
Use metrics to analyze and optimize OCR performance across different providers and configurations.
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR
from upsonic.ocr.rapidocr import RapidOCR
from upsonic.ocr.tesseract import TesseractOCR
def benchmark_providers(file_path):
    """Compare performance of different OCR providers."""
    providers = [
        ('EasyOCR', EasyOCR, {'languages': ['en'], 'gpu': False}),
        ('RapidOCR', RapidOCR, {'languages': ['en']}),
        ('Tesseract', TesseractOCR, {'languages': ['eng']})
    ]
    results = {}
    for name, provider_class, params in providers:
        ocr = OCR(provider_class, **params)
        ocr.reset_metrics()
        # Process file
        result = ocr.process_file(file_path)
        # Get metrics
        metrics = ocr.get_metrics()
        results[name] = {
            'confidence': result.confidence,
            'processing_time_ms': result.processing_time_ms,
            'characters': len(result.text)
        }
    # Print comparison
    print("Provider Performance Comparison:")
    for name, data in results.items():
        print(f"\n{name}:")
        print(f" Confidence: {data['confidence']:.2%}")
        print(f" Time: {data['processing_time_ms']:.2f}ms")
        print(f" Characters: {data['characters']}")
    return results
# Run benchmark
benchmark_providers('test_document.pdf')
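The metrics described in this section behave like running totals that grow with each processed file. The sketch below shows one way such an accumulator could work. It is not Upsonic's metrics class: `MetricsSketch` and `record` are hypothetical, the field names simply mirror those printed above, and averaging confidence per file (rather than per page or per character) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MetricsSketch:
    """Hypothetical running totals; field names mirror those printed above."""
    files_processed: int = 0
    total_pages: int = 0
    total_characters: int = 0
    processing_time_ms: float = 0.0
    _confidence_sum: float = 0.0

    def record(self, pages: int, text: str, confidence: float, elapsed_ms: float) -> None:
        """Fold one processed file into the running totals."""
        self.files_processed += 1
        self.total_pages += pages
        self.total_characters += len(text)
        self.processing_time_ms += elapsed_ms
        self._confidence_sum += confidence

    @property
    def average_confidence(self) -> float:
        """Mean confidence per file, or 0.0 before anything is recorded."""
        if self.files_processed == 0:
            return 0.0
        return self._confidence_sum / self.files_processed


metrics = MetricsSketch()
metrics.record(pages=3, text="hello world", confidence=0.5, elapsed_ms=120.0)
metrics.record(pages=1, text="ok", confidence=1.0, elapsed_ms=30.0)
print(metrics.files_processed, metrics.total_pages, metrics.total_characters)
print(f"Average confidence: {metrics.average_confidence:.2%}")
```

Resetting such an accumulator between batches, as reset_metrics() does for the real object, keeps one batch's totals from leaking into the next comparison.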
Advanced Features
Provider Selection Helper
Use the infer_provider function to create OCR instances by provider name without importing provider classes.
from upsonic.ocr import infer_provider
# Create OCR by provider name
ocr = infer_provider('easyocr', languages=['en'], rotation_fix=True)
text = ocr.get_text('document.pdf')
# Available provider names:
# 'easyocr', 'rapidocr', 'tesseract', 'deepseek', 'deepseek_ocr'
# 'paddleocr', 'paddle', 'ppstructurev3', 'ppchatocrv4', 'paddleocrvl'
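A name-based helper like this is commonly built on a lookup table that maps provider names, including aliases such as 'paddleocr' and 'paddle', to classes. The sketch below illustrates that pattern with placeholder classes. It is not the source of infer_provider: `PROVIDERS`, `resolve_provider`, and the stub classes are hypothetical, and the case-insensitive matching is an assumption.

```python
from typing import Dict, Type


class EasyOCRStub:
    """Placeholder class standing in for a real provider."""


class PaddleOCRStub:
    """Placeholder class standing in for a real provider."""


# Several names may point at the same provider class (aliases).
PROVIDERS: Dict[str, Type] = {
    "easyocr": EasyOCRStub,
    "paddleocr": PaddleOCRStub,
    "paddle": PaddleOCRStub,
}


def resolve_provider(name: str) -> Type:
    """Look up a provider class by name, ignoring case and outer whitespace."""
    key = name.strip().lower()
    if key not in PROVIDERS:
        known = ", ".join(sorted(PROVIDERS))
        raise ValueError(f"Unknown provider '{name}'. Known providers: {known}")
    return PROVIDERS[key]


print(resolve_provider("EasyOCR").__name__)
print(resolve_provider("paddle") is resolve_provider("paddleocr"))
```

Listing the known names in the error message makes a typo in the provider name easy to spot.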
Batch Processing with DeepSeek
DeepSeek OCR provides optimized batch processing for multi-page PDFs, processing all pages in a single batch for better performance.
from upsonic.ocr import OCR
from upsonic.ocr.deepseek import DeepSeekOCR
# Create DeepSeek OCR
ocr = OCR(
    DeepSeekOCR,
    model_name="deepseek-ai/DeepSeek-OCR",
    temperature=0.0,
    max_tokens=8192
)
# Automatically uses batch processing for PDFs
result = ocr.process_file('multi_page_document.pdf')
print(f"Processed {result.page_count} pages")
Advanced PaddleOCR Features
PaddleOCR providers offer specialized features for complex document understanding.
Structure Recognition with PPStructureV3:
from upsonic.ocr import OCR
from upsonic.ocr.paddleocr import PPStructureV3
# Create structure-aware OCR
ocr = OCR(
    PPStructureV3,
    use_table_recognition=True,
    use_formula_recognition=True,
    use_chart_recognition=True
)
# Extract structured content
result = ocr.provider.predict('research_paper.pdf')
# Get markdown representation
markdown_text = ocr.provider.concatenate_markdown_pages(result)
print(markdown_text)
Information Extraction with PPChatOCRv4:
from upsonic.ocr import OCR
from upsonic.ocr.paddleocr import PPChatOCRv4
# Create chat-based OCR
ocr = OCR(
    PPChatOCRv4,
    use_table_recognition=True,
    use_seal_recognition=True
)
# Extract visual information
visual_result = ocr.provider.visual_predict('invoice.pdf')
# Build vector embeddings for retrieval
vector_info = ocr.provider.build_vector(
    visual_result,
    min_characters=3500,
    block_size=300
)
# Extract specific fields using chat interface
invoice_data = ocr.provider.chat(
    key_list=['invoice_number', 'date', 'total_amount', 'vendor_name'],
    visual_info=visual_result,
    use_vector_retrieval=True,
    vector_info=vector_info
)
print(f"Invoice Number: {invoice_data.get('invoice_number')}")
print(f"Date: {invoice_data.get('date')}")
print(f"Total: {invoice_data.get('total_amount')}")
Image Preprocessing
Apply preprocessing to improve OCR accuracy for low-quality images.
from upsonic.ocr import OCR
from upsonic.ocr.tesseract import TesseractOCR
# Create OCR with all preprocessing enabled
ocr = OCR(
    TesseractOCR,
    languages=['eng'],
    rotation_fix=True,       # Fix skewed/rotated images
    enhance_contrast=True,   # Improve text clarity
    remove_noise=True,       # Remove background noise
    pdf_dpi=300              # High quality PDF rendering
)
# Process low-quality image
text = ocr.get_text('skewed_noisy_image.jpg')
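As an intuition for what contrast enhancement does, the sketch below applies linear contrast stretching to a row of grayscale values. This is one common technique, shown on plain Python lists; the algorithm Upsonic actually applies when enhance_contrast=True is not documented here, and `stretch_contrast` is a hypothetical helper.

```python
from typing import List


def stretch_contrast(pixels: List[int]) -> List[int]:
    """Linearly rescale grayscale values so they span the full 0-255 range."""
    if not pixels:
        return []
    low, high = min(pixels), max(pixels)
    if low == high:
        return list(pixels)  # flat image: nothing to stretch
    return [round((value - low) * 255 / (high - low)) for value in pixels]


# A washed-out scan line: values cluster between 100 and 150.
faded = [100, 110, 130, 150]
print(stretch_contrast(faded))  # [0, 51, 153, 255]
```

Spreading the values apart widens the gap between ink and paper, which tends to help the recognition step separate characters from the background.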
Examples
Basic OCR Example
About the Example:
This example demonstrates a complete document processing pipeline using the Unified OCR system. It processes all PDF documents in a directory, extracts text with preprocessing, saves results and metadata, and generates a processing summary with metrics.
Unified OCR Configuration:
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR
# Configure OCR with preprocessing for best results
ocr = OCR(
    EasyOCR,
    languages=['en'],
    confidence_threshold=0.6,
    rotation_fix=True,
    enhance_contrast=True,
    remove_noise=True,
    pdf_dpi=250,
    gpu=True
)
Full Code:
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR
from upsonic.ocr.exceptions import OCRError
from pathlib import Path
import json
def process_documents(directory: str, output_dir: str):
    """Process all PDF documents in a directory."""
    # Create OCR instance with preprocessing enabled
    ocr = OCR(
        EasyOCR,
        languages=['en'],
        confidence_threshold=0.6,
        rotation_fix=True,
        enhance_contrast=True,
        remove_noise=True,
        pdf_dpi=250,
        gpu=True
    )
    # Create output directory (including any missing parent directories)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    # Process each PDF
    input_path = Path(directory)
    for pdf_file in input_path.glob('*.pdf'):
        try:
            print(f"Processing {pdf_file.name}...")
            # Extract text with detailed results
            result = ocr.process_file(pdf_file)
            # Save extracted text (UTF-8 so non-ASCII characters survive)
            text_file = output_path / f"{pdf_file.stem}.txt"
            text_file.write_text(result.text, encoding='utf-8')
            # Save metadata
            metadata_file = output_path / f"{pdf_file.stem}_metadata.json"
            metadata_file.write_text(json.dumps(result.to_dict(), indent=2), encoding='utf-8')
            # Log results
            print(f" ✓ Extracted {len(result.text)} characters")
            print(f" ✓ Confidence: {result.confidence:.2%}")
            print(f" ✓ Pages: {result.page_count}")
            print(f" ✓ Time: {result.processing_time_ms:.0f}ms")
            print(f" ✓ Blocks: {len(result.blocks)}")
        except OCRError as e:
            # Report the failure and move on to the next file
            print(f" ✗ Error: {e}")
            continue
    # Print summary
    metrics = ocr.get_metrics()
    print("\n=== Summary ===")
    print(f"Files processed: {metrics.files_processed}")
    print(f"Total pages: {metrics.total_pages}")
    print(f"Total characters: {metrics.total_characters}")
    print(f"Average confidence: {metrics.average_confidence:.2%}")
    print(f"Total time: {metrics.processing_time_ms / 1000:.2f}s")


if __name__ == "__main__":
    process_documents('input_pdfs', 'output_text')
Multi-Language Document Processing
About the Example:
Process documents containing multiple languages using EasyOCR’s multi-language support. This example shows how to handle mixed-language content and analyze confidence scores per text block.
Unified OCR Configuration:
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR
# Configure for multiple languages
ocr = OCR(
    EasyOCR,
    languages=['en', 'zh', 'ja', 'ko'],
    gpu=True,
    confidence_threshold=0.5
)
Full Code:
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR
# Create multi-language OCR
ocr = OCR(
    EasyOCR,
    languages=['en', 'zh', 'ja', 'ko'],
    gpu=True,
    confidence_threshold=0.5
)
# Process mixed-language document
result = ocr.process_file('multilingual_doc.pdf')
# Analyze results
print(f"Extracted text:\n{result.text}\n")
print(f"Overall confidence: {result.confidence:.2%}")
# Check per-block confidence
low_confidence_blocks = [
    block for block in result.blocks
    if block.confidence < 0.6
]
print(f"Low confidence blocks: {len(low_confidence_blocks)}")
# Show detailed block analysis
for i, block in enumerate(result.blocks[:5], 1):
    print(f"\nBlock {i}:")
    print(f" Text: {block.text[:50]}...")
    print(f" Confidence: {block.confidence:.2%}")
    if block.bbox:
        print(f" Position: ({block.bbox.x:.0f}, {block.bbox.y:.0f})")
Invoice Processing with PPChatOCRv4
About the Example:
Extract structured information from invoices using PPChatOCRv4’s advanced features including table recognition, seal recognition, and key-value extraction.
Unified OCR Configuration:
from upsonic.ocr import OCR
from upsonic.ocr.paddleocr import PPChatOCRv4
# Configure for invoice processing
ocr = OCR(
    PPChatOCRv4,
    use_table_recognition=True,
    use_seal_recognition=True,
    lang='en'
)
Full Code:
from upsonic.ocr import OCR
from upsonic.ocr.paddleocr import PPChatOCRv4
# Create OCR with table and seal recognition
ocr = OCR(
    PPChatOCRv4,
    use_table_recognition=True,
    use_seal_recognition=True,
    lang='en'
)
# Extract visual information
visual_result = ocr.provider.visual_predict('invoice.pdf')
# Build vector index for retrieval
vector_info = ocr.provider.build_vector(
    visual_result,
    min_characters=3500,
    block_size=300
)
# Extract specific fields
invoice_data = ocr.provider.chat(
    key_list=[
        'invoice_number',
        'invoice_date',
        'vendor_name',
        'total_amount',
        'tax_amount',
        'line_items'
    ],
    visual_info=visual_result,
    use_vector_retrieval=True,
    vector_info=vector_info
)
# Display extracted information
print(f"Invoice Number: {invoice_data.get('invoice_number')}")
print(f"Date: {invoice_data.get('invoice_date')}")
print(f"Vendor: {invoice_data.get('vendor_name')}")
print(f"Total: {invoice_data.get('total_amount')}")
print(f"Tax: {invoice_data.get('tax_amount')}")
print(f"\nLine Items: {invoice_data.get('line_items')}")