
Overview

What is Unified OCR?

The Unified OCR system in Upsonic provides a consistent interface for optical character recognition across multiple OCR engines. Instead of learning different APIs for each OCR provider, you use a single OCR class that works seamlessly with EasyOCR, RapidOCR, Tesseract, DeepSeek, and PaddleOCR. The OCR class serves as a high-level orchestrator that:
  • Manages multiple OCR provider backends with a unified API
  • Handles image preprocessing (rotation correction, contrast enhancement, noise reduction)
  • Converts PDFs to images with configurable DPI
  • Tracks confidence scores and bounding box detection
  • Collects performance metrics and processing statistics
  • Provides provider-specific features and optimizations
Quick start:
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR

# Create OCR instance
ocr = OCR(EasyOCR, languages=['en'], rotation_fix=True)

# Extract text
text = ocr.get_text('document.pdf')
print(text)

How Unified OCR Works

The OCR system follows a clear processing pipeline:
  1. File Preparation: Validates file existence and format (supports .png, .jpg, .jpeg, .bmp, .tiff, .tif, .webp, .pdf)
  2. PDF Conversion: If the file is a PDF, converts each page to images at the specified DPI
  3. Image Preprocessing: Optionally applies rotation correction, contrast enhancement, and noise reduction
  4. OCR Processing: Processes each image through the selected provider’s engine
  5. Result Aggregation: Combines results from multiple pages, calculates average confidence scores
  6. Metrics Tracking: Updates processing statistics for performance analysis
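The six steps above can be sketched as a plain Python function. This is an illustrative outline only, not Upsonic's internal code: the stub "provider" callback and the two-page PDF stand-in are hypothetical, real PDF conversion renders each page as an image at the configured DPI, and metrics tracking is omitted for brevity.

```python
from dataclasses import dataclass
from pathlib import Path

# Formats accepted by step 1 (matches the list in the docs above)
SUPPORTED = {'.png', '.jpg', '.jpeg', '.bmp', '.tiff', '.tif', '.webp', '.pdf'}

@dataclass
class PageResult:
    text: str
    confidence: float

@dataclass
class OCRResult:
    text: str
    confidence: float
    page_count: int

def run_pipeline(path, recognize_page, preprocess=None):
    """Sketch of the pipeline: validate -> split pages -> preprocess -> OCR -> aggregate."""
    p = Path(path)
    # 1. File preparation: validate the format (existence check omitted here)
    if p.suffix.lower() not in SUPPORTED:
        raise ValueError(f"Unsupported format: {p.suffix}")
    # 2. PDF conversion: one "image" per page (stubbed as two pages for PDFs)
    pages = [path] if p.suffix.lower() != '.pdf' else [f"{path}#page{i}" for i in (1, 2)]
    results = []
    for page in pages:
        # 3. Optional preprocessing (rotation fix, contrast, denoise)
        if preprocess:
            page = preprocess(page)
        # 4. OCR processing via the selected provider
        results.append(recognize_page(page))
    # 5. Result aggregation: join page texts, average the confidence scores
    text = "\n".join(r.text for r in results)
    confidence = sum(r.confidence for r in results) / len(results)
    # 6. Metrics tracking would update counters here (omitted)
    return OCRResult(text, confidence, len(results))
```

The real OCR class follows this shape while delegating step 4 to whichever provider backend you selected.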
from upsonic.ocr import OCR
from upsonic.ocr.rapidocr import RapidOCR

# Create OCR with preprocessing
ocr = OCR(
    RapidOCR,
    languages=['en'],
    rotation_fix=True,
    enhance_contrast=True,
    pdf_dpi=300
)

# Process file - returns detailed results
result = ocr.process_file('document.pdf')

print(f"Text: {result.text}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Pages: {result.page_count}")
print(f"Processing time: {result.processing_time_ms:.2f}ms")

Attributes

The OCR system is configured through OCRConfig, which provides the following attributes:
| Attribute | Type | Default | Description |
|---|---|---|---|
| languages | List[str] | ['en'] | Languages to detect (e.g., ['en', 'zh', 'ja']) |
| confidence_threshold | float | 0.0 | Minimum confidence threshold (0.0-1.0) for accepting OCR results |
| rotation_fix | bool | False | Enable automatic rotation correction for skewed images |
| enhance_contrast | bool | False | Enhance image contrast before OCR processing |
| remove_noise | bool | False | Apply noise reduction filter to improve text clarity |
| pdf_dpi | int | 300 | DPI resolution for PDF rendering (higher = better quality, slower) |
| preserve_formatting | bool | True | Try to preserve text formatting (line breaks, spacing) |
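Conceptually, confidence_threshold acts as a post-filter over recognized text blocks: anything scoring below it is dropped before the results are aggregated. A minimal illustration of that behavior (not the library's internals):

```python
def filter_blocks(blocks, confidence_threshold=0.0):
    """Keep only (text, confidence) pairs at or above the threshold."""
    return [(text, conf) for text, conf in blocks if conf >= confidence_threshold]

blocks = [("Invoice #42", 0.95), ("smudged line", 0.31), ("Total: $10", 0.88)]
kept = filter_blocks(blocks, confidence_threshold=0.6)
# Only the two high-confidence blocks survive the filter
```

Raising the threshold trades recall for precision: you lose marginal text but keep fewer OCR hallucinations.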
Configuration Example:
from upsonic.ocr import OCR
from upsonic.ocr.base import OCRConfig
from upsonic.ocr.tesseract import TesseractOCR

# Method 1: Using OCRConfig
config = OCRConfig(
    languages=['eng', 'fra'],
    confidence_threshold=0.6,
    rotation_fix=True,
    enhance_contrast=True,
    remove_noise=True,
    pdf_dpi=300,
    preserve_formatting=True
)
ocr = OCR(TesseractOCR, config=config)

# Method 2: Direct parameters
ocr = OCR(
    TesseractOCR,
    languages=['eng', 'fra'],
    confidence_threshold=0.6,
    rotation_fix=True,
    enhance_contrast=True
)

Providers

EasyOCR

Ready-to-use OCR with 80+ supported languages using deep learning models. Best for multi-language support with high accuracy.
Usage:
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR

# Create OCR with EasyOCR
ocr = OCR(EasyOCR, languages=['en', 'zh'], gpu=True, rotation_fix=True)

# Extract text
text = ocr.get_text('document.pdf')
print(text)

# Get detailed results
result = ocr.process_file('image.png')
print(f"Confidence: {result.confidence:.2%}")
for block in result.blocks:
    print(f"Text: {block.text}, Confidence: {block.confidence:.2%}")
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| languages | List[str] | ['en'] | List of language codes to detect |
| gpu | bool | False | Enable GPU acceleration for faster processing |
| rotation_fix | bool | False | Auto-detect and fix image rotation |
| enhance_contrast | bool | False | Enhance image contrast |
| remove_noise | bool | False | Apply noise reduction |
| confidence_threshold | float | 0.0 | Minimum confidence for text blocks |
| paragraph | bool | False | Group text into paragraphs |
| min_size | int | 10 | Minimum text region size |
| text_threshold | float | 0.7 | Text detection threshold |
Supported Languages: 80+ languages including English, Chinese, Japanese, Korean, Thai, Vietnamese, Arabic, Russian, and most European languages.

RapidOCR

Lightweight OCR based on ONNX Runtime for fast inference. Best for speed and lightweight deployment.
Usage:
from upsonic.ocr import OCR
from upsonic.ocr.rapidocr import RapidOCR

# Create OCR with RapidOCR
ocr = OCR(RapidOCR, languages=['en', 'ch'], confidence_threshold=0.5)

# Extract text from image
text = ocr.get_text('invoice.png')
print(text)

# Process PDF
result = ocr.process_file('document.pdf')
print(f"Extracted {len(result.text)} characters from {result.page_count} pages")
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| languages | List[str] | ['en'] | List of language codes (primarily 'en' and 'ch') |
| confidence_threshold | float | 0.0 | Minimum confidence for text blocks |
| rotation_fix | bool | False | Auto-detect and fix image rotation |
| enhance_contrast | bool | False | Enhance image contrast |
| remove_noise | bool | False | Apply noise reduction |
| pdf_dpi | int | 300 | DPI for PDF rendering |
Supported Languages: English, Chinese (simplified and traditional), Japanese, Korean, and several other scripts including Tamil, Telugu, Arabic, Cyrillic, and Devanagari.

Tesseract

Google’s open-source OCR engine with 100+ language support. Best for traditional OCR with extensive language coverage.
Usage:
from upsonic.ocr import OCR
from upsonic.ocr.tesseract import TesseractOCR

# Create OCR with Tesseract
ocr = OCR(TesseractOCR, languages=['eng', 'fra'], enhance_contrast=True)

# Extract text
text = ocr.get_text('receipt.jpg')
print(text)

# Custom Tesseract configuration
result = ocr.process_file('document.pdf', psm=3, oem=3)
print(f"Text: {result.text}")
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| languages | List[str] | ['eng'] | List of Tesseract language codes |
| tesseract_cmd | str | None | Path to tesseract executable |
| confidence_threshold | float | 0.0 | Minimum confidence for text blocks |
| rotation_fix | bool | False | Auto-detect and fix image rotation |
| enhance_contrast | bool | False | Enhance image contrast |
| remove_noise | bool | False | Apply noise reduction |
| preserve_formatting | bool | True | Preserve text layout and formatting |
| psm | int | 3 | Page segmentation mode (0-13) |
| oem | int | 3 | OCR Engine Mode (0-3) |
| custom_config | str | '' | Additional Tesseract configuration string |
Supported Languages: 100+ languages including all major languages. Requires language packs to be installed separately.
Installation Note: Tesseract must be installed on the system:
  • Ubuntu/Debian: sudo apt-get install tesseract-ocr
  • macOS: brew install tesseract
  • Windows: Download installer from GitHub
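Because Tesseract is an external binary, a missing installation surfaces as a confusing runtime error. A generic way to fail fast (plain Python, not part of the Upsonic API) is to resolve the executable up front:

```python
import shutil

def ensure_tesseract(tesseract_cmd=None):
    """Return a usable tesseract path: an explicit override wins,
    otherwise fall back to whatever is on PATH (None if nothing is found)."""
    return tesseract_cmd or shutil.which("tesseract")
```

If this returns None, install the engine as shown above or pass tesseract_cmd explicitly when constructing the provider.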

PaddleOCR

Comprehensive OCR with multiple specialized pipelines for advanced document understanding.
Usage:
from upsonic.ocr import OCR
from upsonic.ocr.paddleocr import PaddleOCR, PPStructureV3, PPChatOCRv4, PaddleOCRVL

# General OCR (PP-OCRv5)
ocr = OCR(PaddleOCR, lang='en', ocr_version='PP-OCRv5')
text = ocr.get_text('document.pdf')

# Advanced document structure recognition
ocr_structure = OCR(
    PPStructureV3,
    use_table_recognition=True,
    use_formula_recognition=True
)
result = ocr_structure.process_file('research_paper.pdf')

# Chat-based document understanding
ocr_chat = OCR(
    PPChatOCRv4,
    use_table_recognition=True,
    use_seal_recognition=True
)

# Vision-Language document understanding
ocr_vl = OCR(
    PaddleOCRVL,
    use_layout_detection=True,
    use_chart_recognition=True,
    format_block_content=True
)
Parameters:

PaddleOCR (General OCR):
| Parameter | Type | Default | Description |
|---|---|---|---|
| lang | str | 'en' | Language code |
| ocr_version | str | 'PP-OCRv5' | OCR version ('PP-OCRv3', 'PP-OCRv4', 'PP-OCRv5') |
| use_doc_orientation_classify | bool | None | Enable document orientation classification |
| use_doc_unwarping | bool | None | Enable document unwarping |
| use_textline_orientation | bool | None | Enable text line orientation detection |
| text_det_limit_side_len | int | None | Limit on detection input side length |
| text_rec_score_thresh | float | None | Text recognition score threshold |
| return_word_box | bool | None | Return word-level bounding boxes |
PPStructureV3 (Document Structure):
| Parameter | Type | Default | Description |
|---|---|---|---|
| use_table_recognition | bool | None | Enable table recognition |
| use_formula_recognition | bool | None | Enable formula recognition |
| use_seal_recognition | bool | None | Enable seal text recognition |
| use_chart_recognition | bool | None | Enable chart recognition |
| layout_threshold | float | None | Layout detection score threshold |
| lang | str | 'en' | Language code |
PPChatOCRv4 (Chat-based OCR):
| Parameter | Type | Default | Description |
|---|---|---|---|
| use_table_recognition | bool | None | Enable table recognition |
| use_seal_recognition | bool | None | Enable seal recognition |
| mllm_chat_bot_config | dict | None | Multimodal LLM configuration |
| retriever_config | dict | None | Retriever configuration for vector search |
PaddleOCRVL (Vision-Language):
| Parameter | Type | Default | Description |
|---|---|---|---|
| use_layout_detection | bool | None | Enable layout detection |
| use_chart_recognition | bool | None | Enable chart recognition |
| format_block_content | bool | None | Format content as Markdown |
| vl_rec_backend | str | 'local' | VL recognition backend |
| temperature | float | None | Sampling temperature for VLM |
Supported Languages: 40+ languages for PP-OCRv5, with extensive support in PP-OCRv3 for Asian, European, and Middle Eastern languages.

Metrics and Performance

Enabling Metrics

The OCR system automatically tracks metrics for all operations. Metrics include files processed, pages, characters, confidence scores, and processing time.
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR

# Create OCR instance
ocr = OCR(EasyOCR, languages=['en'])

# Process multiple files
ocr.get_text('document1.pdf')
ocr.get_text('document2.pdf')
ocr.get_text('image.png')

# Get metrics
metrics = ocr.get_metrics()

print(f"Files processed: {metrics.files_processed}")
print(f"Total pages: {metrics.total_pages}")
print(f"Total characters: {metrics.total_characters}")
print(f"Average confidence: {metrics.average_confidence:.2%}")
print(f"Total processing time: {metrics.processing_time_ms:.2f}ms")
print(f"Provider: {metrics.provider}")

# Reset metrics for new batch
ocr.reset_metrics()

Analyzing Performance

Use metrics to analyze and optimize OCR performance across different providers and configurations.
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR
from upsonic.ocr.rapidocr import RapidOCR
from upsonic.ocr.tesseract import TesseractOCR

def benchmark_providers(file_path):
    """Compare performance of different OCR providers."""
    providers = [
        ('EasyOCR', EasyOCR, {'languages': ['en'], 'gpu': False}),
        ('RapidOCR', RapidOCR, {'languages': ['en']}),
        ('Tesseract', TesseractOCR, {'languages': ['eng']})
    ]
    
    results = {}
    
    for name, provider_class, params in providers:
        ocr = OCR(provider_class, **params)
        ocr.reset_metrics()
        
        # Process file
        result = ocr.process_file(file_path)
        
        results[name] = {
            'confidence': result.confidence,
            'processing_time_ms': result.processing_time_ms,
            'characters': len(result.text)
        }
    
    # Print comparison
    print("Provider Performance Comparison:")
    for name, data in results.items():
        print(f"\n{name}:")
        print(f"  Confidence: {data['confidence']:.2%}")
        print(f"  Time: {data['processing_time_ms']:.2f}ms")
        print(f"  Characters: {data['characters']}")
    
    return results

# Run benchmark
benchmark_providers('test_document.pdf')

Advanced Features

Provider Selection Helper

Use the infer_provider function to create OCR instances by provider name without importing provider classes.
from upsonic.ocr import infer_provider

# Create OCR by provider name
ocr = infer_provider('easyocr', languages=['en'], rotation_fix=True)
text = ocr.get_text('document.pdf')

# Available provider names:
# 'easyocr', 'rapidocr', 'tesseract', 'deepseek', 'deepseek_ocr'
# 'paddleocr', 'paddle', 'ppstructurev3', 'ppchatocrv4', 'paddleocrvl'
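A name-based factory like infer_provider is typically just a registry lookup. A minimal sketch of the pattern (stub classes and names here are hypothetical, not Upsonic's actual implementation):

```python
class _StubProvider:
    """Stand-in for a real provider class; stores kwargs such as languages."""
    def __init__(self, **kwargs):
        self.kwargs = kwargs

class EasyOCRStub(_StubProvider): pass
class TesseractStub(_StubProvider): pass

# Map lowercase names (and aliases, if any) to provider classes
_REGISTRY = {"easyocr": EasyOCRStub, "tesseract": TesseractStub}

def make_provider(name, **kwargs):
    """Resolve a provider class by case-insensitive name and instantiate it."""
    try:
        cls = _REGISTRY[name.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown provider {name!r}; choose from {sorted(_REGISTRY)}"
        ) from None
    return cls(**kwargs)
```

The benefit of this pattern is that callers can select a backend from configuration (a string) without importing every provider module up front.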

Batch Processing with DeepSeek

DeepSeek OCR provides optimized batch processing for multi-page PDFs, processing all pages in a single batch for better performance.
from upsonic.ocr import OCR
from upsonic.ocr.deepseek import DeepSeekOCR

# Create DeepSeek OCR
ocr = OCR(
    DeepSeekOCR,
    model_name="deepseek-ai/DeepSeek-OCR",
    temperature=0.0,
    max_tokens=8192
)

# Automatically uses batch processing for PDFs
result = ocr.process_file('multi_page_document.pdf')
print(f"Processed {result.page_count} pages")

Advanced PaddleOCR Features

PaddleOCR providers offer specialized features for complex document understanding.
Structure Recognition with PPStructureV3:
from upsonic.ocr import OCR
from upsonic.ocr.paddleocr import PPStructureV3

# Create structure-aware OCR
ocr = OCR(
    PPStructureV3,
    use_table_recognition=True,
    use_formula_recognition=True,
    use_chart_recognition=True
)

# Extract structured content
result = ocr.provider.predict('research_paper.pdf')

# Get markdown representation
markdown_text = ocr.provider.concatenate_markdown_pages(result)
print(markdown_text)
Information Extraction with PPChatOCRv4:
from upsonic.ocr import OCR
from upsonic.ocr.paddleocr import PPChatOCRv4

# Create chat-based OCR
ocr = OCR(
    PPChatOCRv4,
    use_table_recognition=True,
    use_seal_recognition=True
)

# Extract visual information
visual_result = ocr.provider.visual_predict('invoice.pdf')

# Build vector embeddings for retrieval
vector_info = ocr.provider.build_vector(
    visual_result,
    min_characters=3500,
    block_size=300
)

# Extract specific fields using chat interface
invoice_data = ocr.provider.chat(
    key_list=['invoice_number', 'date', 'total_amount', 'vendor_name'],
    visual_info=visual_result,
    use_vector_retrieval=True,
    vector_info=vector_info
)

print(f"Invoice Number: {invoice_data.get('invoice_number')}")
print(f"Date: {invoice_data.get('date')}")
print(f"Total: {invoice_data.get('total_amount')}")

Image Preprocessing

Apply preprocessing to improve OCR accuracy for low-quality images.
from upsonic.ocr import OCR
from upsonic.ocr.tesseract import TesseractOCR

# Create OCR with all preprocessing enabled
ocr = OCR(
    TesseractOCR,
    languages=['eng'],
    rotation_fix=True,        # Fix skewed/rotated images
    enhance_contrast=True,    # Improve text clarity
    remove_noise=True,        # Remove background noise
    pdf_dpi=300              # High quality PDF rendering
)

# Process low-quality image
text = ocr.get_text('skewed_noisy_image.jpg')
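The preprocessing steps themselves are standard image operations. For instance, contrast enhancement is often a simple linear stretch of pixel intensities; a pure-Python illustration on a row of grayscale values (not Upsonic's internal code):

```python
def stretch_contrast(pixels, lo=0, hi=255):
    """Linearly rescale grayscale values so the darkest maps to lo and the brightest to hi."""
    pmin, pmax = min(pixels), max(pixels)
    if pmax == pmin:                 # flat image: nothing to stretch
        return list(pixels)
    scale = (hi - lo) / (pmax - pmin)
    return [round(lo + (p - pmin) * scale) for p in pixels]

# Faint text on a grey background (values 100..140) becomes full-range,
# giving the OCR engine sharper edges to detect.
print(stretch_contrast([100, 120, 140]))   # [0, 128, 255]
```

Rotation correction and noise reduction follow the same spirit: cheap, generic transforms that make the downstream recognition step's job easier.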

Examples

Basic OCR Example

About the Example: This example demonstrates a complete document processing pipeline using the Unified OCR system. It processes all PDF documents in a directory, extracts text with preprocessing, saves results and metadata, and generates a processing summary with metrics.
Unified OCR Configuration:
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR

# Configure OCR with preprocessing for best results
ocr = OCR(
    EasyOCR,
    languages=['en'],
    confidence_threshold=0.6,
    rotation_fix=True,
    enhance_contrast=True,
    remove_noise=True,
    pdf_dpi=250,
    gpu=True
)
Full Code:
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR
from upsonic.ocr.exceptions import OCRError
from pathlib import Path
import json

def process_documents(directory: str, output_dir: str):
    """Process all PDF documents in a directory."""
    
    # Create OCR instance with optimal configuration
    ocr = OCR(
        EasyOCR,
        languages=['en'],
        confidence_threshold=0.6,
        rotation_fix=True,
        enhance_contrast=True,
        remove_noise=True,
        pdf_dpi=250,
        gpu=True
    )
    
    # Create output directory
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)
    
    # Process each PDF
    input_path = Path(directory)
    for pdf_file in input_path.glob('*.pdf'):
        try:
            print(f"Processing {pdf_file.name}...")
            
            # Extract text with detailed results
            result = ocr.process_file(pdf_file)
            
            # Save extracted text
            text_file = output_path / f"{pdf_file.stem}.txt"
            text_file.write_text(result.text)
            
            # Save metadata
            metadata_file = output_path / f"{pdf_file.stem}_metadata.json"
            metadata_file.write_text(json.dumps(result.to_dict(), indent=2))
            
            # Log results
            print(f"  ✓ Extracted {len(result.text)} characters")
            print(f"  ✓ Confidence: {result.confidence:.2%}")
            print(f"  ✓ Pages: {result.page_count}")
            print(f"  ✓ Time: {result.processing_time_ms:.0f}ms")
            print(f"  ✓ Blocks: {len(result.blocks)}")
            
        except OCRError as e:
            print(f"  ✗ Error: {e}")
            continue
    
    # Print summary
    metrics = ocr.get_metrics()
    print(f"\n=== Summary ===")
    print(f"Files processed: {metrics.files_processed}")
    print(f"Total pages: {metrics.total_pages}")
    print(f"Total characters: {metrics.total_characters}")
    print(f"Average confidence: {metrics.average_confidence:.2%}")
    print(f"Total time: {metrics.processing_time_ms / 1000:.2f}s")

if __name__ == "__main__":
    process_documents('input_pdfs', 'output_text')

Multi-Language Document Processing

About the Example: Process documents containing multiple languages using EasyOCR’s multi-language support. This example shows how to handle mixed-language content and analyze confidence scores per text block.
Unified OCR Configuration:
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR

# Configure for multiple languages
ocr = OCR(
    EasyOCR,
    languages=['en', 'zh', 'ja', 'ko'],
    gpu=True,
    confidence_threshold=0.5
)
Full Code:
from upsonic.ocr import OCR
from upsonic.ocr.easyocr import EasyOCR

# Create multi-language OCR
ocr = OCR(
    EasyOCR,
    languages=['en', 'zh', 'ja', 'ko'],
    gpu=True,
    confidence_threshold=0.5
)

# Process mixed-language document
result = ocr.process_file('multilingual_doc.pdf')

# Analyze results
print(f"Extracted text:\n{result.text}\n")
print(f"Overall confidence: {result.confidence:.2%}")

# Check per-block confidence
low_confidence_blocks = [
    block for block in result.blocks 
    if block.confidence < 0.6
]
print(f"Low confidence blocks: {len(low_confidence_blocks)}")

# Show detailed block analysis
for i, block in enumerate(result.blocks[:5], 1):
    print(f"\nBlock {i}:")
    print(f"  Text: {block.text[:50]}...")
    print(f"  Confidence: {block.confidence:.2%}")
    if block.bbox:
        print(f"  Position: ({block.bbox.x:.0f}, {block.bbox.y:.0f})")

Invoice Data Extraction with PaddleOCR

About the Example: Extract structured information from invoices using PPChatOCRv4’s advanced features including table recognition, seal recognition, and key-value extraction.
Unified OCR Configuration:
from upsonic.ocr import OCR
from upsonic.ocr.paddleocr import PPChatOCRv4

# Configure for invoice processing
ocr = OCR(
    PPChatOCRv4,
    use_table_recognition=True,
    use_seal_recognition=True,
    lang='en'
)
Full Code:
from upsonic.ocr import OCR
from upsonic.ocr.paddleocr import PPChatOCRv4

# Create OCR with table and seal recognition
ocr = OCR(
    PPChatOCRv4,
    use_table_recognition=True,
    use_seal_recognition=True,
    lang='en'
)

# Extract visual information
visual_result = ocr.provider.visual_predict('invoice.pdf')

# Build vector index for retrieval
vector_info = ocr.provider.build_vector(
    visual_result,
    min_characters=3500,
    block_size=300
)

# Extract specific fields
invoice_data = ocr.provider.chat(
    key_list=[
        'invoice_number',
        'invoice_date',
        'vendor_name',
        'total_amount',
        'tax_amount',
        'line_items'
    ],
    visual_info=visual_result,
    use_vector_retrieval=True,
    vector_info=vector_info
)

# Display extracted information
print(f"Invoice Number: {invoice_data.get('invoice_number')}")
print(f"Date: {invoice_data.get('invoice_date')}")
print(f"Vendor: {invoice_data.get('vendor_name')}")
print(f"Total: {invoice_data.get('total_amount')}")
print(f"Tax: {invoice_data.get('tax_amount')}")
print(f"\nLine Items: {invoice_data.get('line_items')}")