docTR OCR
How docTR Works
docTR is a modern OCR (Optical Character Recognition) library that uses deep learning to extract text from documents and images. It operates using a two-stage process that separates detection from recognition:
Detection Stage
The detection model examines a document image and identifies where text appears. It predicts bounding boxes around text regions, such as individual words, complete lines, or entire paragraphs. This stage handles complex layouts like multi-column documents, tables, and mixed formatting by precisely locating text zones while filtering out decorative elements, images, and visual noise.
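As a minimal sketch, docTR also exposes the detection stage on its own through a standalone detection predictor. The example below assumes a local image file ("sample.png" is a placeholder path) and uses the db_resnet50 architecture; the exact shape of the detection output can vary between docTR versions.

```python
# Minimal sketch: run only the detection stage on one image.
# "sample.png" is a placeholder path; swap in your own document image.
from doctr.io import DocumentFile
from doctr.models import detection_predictor

pages = DocumentFile.from_images("sample.png")               # list of page images
det_model = detection_predictor(arch="db_resnet50", pretrained=True)

boxes = det_model(pages)   # one entry per page, containing relative box coordinates
print(boxes[0])            # inspect the raw detection output for the first page
```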
Recognition Stage
Once text regions are located, the recognition model analyzes the visual patterns within each detected area. Using neural networks trained to understand character shapes, fonts, and writing styles, it converts these visual patterns into actual letters, numbers, and words. This stage excels at handling distorted characters, unusual fonts, and faint or degraded text.
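The recognition stage can likewise be run on its own against pre-cropped word images. A minimal sketch, assuming a cropped word image saved as "word_crop.png" (a placeholder name) and docTR's default crnn_vgg16_bn recognition architecture, which is not one of the presets described below:

```python
# Minimal sketch: run only the recognition stage on word crops.
# "word_crop.png" is a placeholder for an image already cropped to a single word.
from doctr.io import DocumentFile
from doctr.models import recognition_predictor

crops = DocumentFile.from_images("word_crop.png")
reco_model = recognition_predictor(arch="crnn_vgg16_bn", pretrained=True)

words = reco_model(crops)  # list of (text, confidence) pairs, one per crop
print(words)
```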
This separation allows docTR to achieve higher accuracy and flexibility than single-stage systems. Developers can mix and match different detection and recognition models to optimize for specific tasks, hardware constraints, or document types.
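This mix-and-match flexibility is exposed through docTR's ocr_predictor, which accepts a detection architecture and a recognition architecture independently. A sketch, assuming a local file ("report.pdf" is a placeholder path) and docTR's default crnn_vgg16_bn recognition model:

```python
# End-to-end OCR with an explicitly chosen detection + recognition pairing.
# "report.pdf" is a placeholder path.
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

doc = DocumentFile.from_pdf("report.pdf")
model = ocr_predictor(
    det_arch="db_resnet50",        # detection stage
    reco_arch="crnn_vgg16_bn",     # recognition stage
    pretrained=True,
)

result = model(doc)
print(result.render())  # plain-text rendering of the recognized document
```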
Model Presets in Assistivox AI
Assistivox AI provides three carefully selected preset combinations that balance performance, accuracy, and resource requirements:
Preset F (Fast - Low Resource)
Models: db_mobilenet_v3_large (detection) + crnn_mobilenet_v3_small (recognition)
Tagline: "Quickest, uses least memory; ideal for slow/old machines. May miss complex or small/faint text."
This combination prioritizes speed and minimal resource usage, making it ideal for older computers, low-memory systems, or situations where processing many documents quickly is more important than perfect accuracy. The trade-off is that it may struggle with small fonts, faint text, or complex layouts like dense tables or intricate formatting.
Preset B (Balanced - Recommended)
Models: db_resnet34 (detection) + sar_resnet31 (recognition)
Tagline: "Good speed and accuracy; best for daily use, textbooks, columns, and newsletters."
This balanced approach provides excellent reliability for most everyday documents while maintaining reasonable processing speed. It handles typical business documents, academic papers, multi-column layouts, and varied fonts effectively. This is the recommended choice for general use where both quality and efficiency matter.
Preset A (Accurate - High Detail)
Models: db_resnet50 (detection) + master (recognition)
Tagline: "Best for complex, messy, or academic documents. Slowest, needs more memory/CPU/GPU, but highest OCR accuracy."
This high-accuracy combination excels on challenging documents with complex layouts, degraded image quality, unusual fonts, or dense technical content. It requires more processing power and memory but delivers the most comprehensive text extraction possible, making it ideal for archival work, research documents, or projects where missing any text would be problematic.
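For illustration only, the snippet below sketches how these three presets could map onto docTR model pairs. The preset letters and model names come from this page; the PRESETS table and build_predictor helper are hypothetical and not part of docTR or its API, and model availability may depend on the installed docTR version.

```python
# Hypothetical preset table mapping Assistivox AI presets to docTR model pairs.
# The preset letters and architectures come from this document; the helper is illustrative.
from doctr.models import ocr_predictor

PRESETS = {
    "F": ("db_mobilenet_v3_large", "crnn_mobilenet_v3_small"),  # Fast / low resource
    "B": ("db_resnet34", "sar_resnet31"),                       # Balanced (recommended)
    "A": ("db_resnet50", "master"),                             # Accurate / high detail
}

def build_predictor(preset: str = "B"):
    """Return a docTR predictor for the chosen preset (illustrative helper)."""
    det_arch, reco_arch = PRESETS[preset]
    return ocr_predictor(det_arch=det_arch, reco_arch=reco_arch, pretrained=True)
```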
Performance and Quality
docTR demonstrates state-of-the-art performance among open-source OCR solutions, often matching or exceeding commercial services for many document types.
Accuracy Metrics
- Character-level accuracy: Typically 98-99% on clean documents
- Recall: Around 73-74% of the text regions in a document are detected
Understanding These Numbers
The high character-level accuracy means that nearly every letter and number in the detected text regions is transcribed correctly. The lower recall rate, however, means that some text areas may be missed entirely, particularly text that is small, faint, or located in page margins. In practical terms, the text you receive is highly accurate for what docTR detects, but some content may not be captured at all.
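To see this distinction on your own documents, you can inspect the per-word confidence scores docTR attaches to whatever it did detect. A minimal sketch, assuming a local image ("scan.png" is a placeholder path) and an arbitrary example threshold of 0.5:

```python
# Inspect what was detected and how confident the recognition was.
# "scan.png" is a placeholder path; the 0.5 threshold is an arbitrary example value.
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

doc = DocumentFile.from_images("scan.png")
result = ocr_predictor(pretrained=True)(doc)

data = result.export()  # nested dict: pages -> blocks -> lines -> words
for page in data["pages"]:
    for block in page["blocks"]:
        for line in block["lines"]:
            for word in line["words"]:
                if word["confidence"] < 0.5:
                    print(f"low-confidence word: {word['value']!r} ({word['confidence']:.2f})")
```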
Training Foundation
docTR models are trained on large, diverse benchmark datasets including FUNSD, CORD, SROIE, IIIT-5k, SynthText, Street View Text, and COCO-Text. This extensive training foundation encompasses millions of text samples across varied document types, layouts, fonts, and real-world conditions, enabling docTR to generalize well to documents it has never seen before.
docTR vs. Tesseract
Understanding when to use docTR versus Tesseract helps you choose the right tool for your specific needs:
Performance Comparison
docTR consistently outperforms Tesseract on:
- Scanned documents with artifacts or image degradation
- Complex multi-column layouts and tables
- Screenshots and digital document captures
- Documents with unusual fonts or formatting
- Images with mixed text orientations or sizes
Tesseract remains competitive for:
- Simple, clean printed documents
- Scenarios requiring extensive language support
- Systems with severe hardware constraints
Accuracy Differences
- docTR: 98-99% character accuracy, 73-74% recall
- Tesseract: 95%+ character accuracy on clean text, 65-85% recall depending on document complexity
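If you want to gauge the difference on your own material, a rough side-by-side comparison is easy to run. The sketch below assumes pytesseract and Pillow are installed alongside docTR; "sample.png" is a placeholder path, and this compares raw text output rather than formal accuracy metrics.

```python
# Rough side-by-side: run the same image through Tesseract and docTR.
# "sample.png" is a placeholder; this compares raw text output, not benchmark scores.
import pytesseract
from PIL import Image
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

image_path = "sample.png"

# Tesseract via pytesseract
tesseract_text = pytesseract.image_to_string(Image.open(image_path))

# docTR with its default pretrained models
doctr_result = ocr_predictor(pretrained=True)(DocumentFile.from_images(image_path))
doctr_text = doctr_result.render()

print("--- Tesseract ---\n", tesseract_text)
print("--- docTR ---\n", doctr_text)
```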
Training Approaches
docTR leverages millions of diverse document samples from public benchmarks, providing broad coverage of real-world document variance. Tesseract's training focuses on hundreds of thousands of text lines primarily from printed materials, offering excellent font diversity but less coverage of challenging, real-world scenarios.