Faster Whisper engine settings
About Faster Whisper
Whisper is an automatic speech recognition system from OpenAI that transcribes spoken language into text using a transformer-based encoder-decoder architecture (in contrast to Vosk, which is built on the Kaldi toolkit and uses conventional acoustic and language models rather than a transformer). It processes audio by converting it into a log-Mel spectrogram, which is passed through the network's layers to extract features and generate text output. The Whisper models are trained on a large and diverse dataset, which lets them transcribe speech across many languages and accents and cope with noisy audio conditions.
Faster Whisper is a reimplementation of Whisper that uses the CTranslate2 inference engine to deliver up to 4 times faster performance with significantly reduced memory usage. Both implementations produce equivalent transcription results, but Faster Whisper is better suited to real-time dictation and resource-constrained environments.
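For readers who want to see how the engine is typically driven, here is a minimal sketch using the faster-whisper Python package; the model name, device, and audio file are illustrative placeholders, and the dictation application wires this up internally rather than requiring any code from the user.

```python
from faster_whisper import WhisperModel

# Load a Faster Whisper model; "small" can be replaced by any size listed below.
# compute_type="int8" reduces memory use on CPU-only machines.
model = WhisperModel("small", device="cpu", compute_type="int8")

# Transcribe an audio file; segments are generated lazily as they are decoded.
segments, info = model.transcribe("dictation.wav")

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```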
Model Sizes for Dictation
Tiny Model
The most lightweight option for instant dictation with minimal system requirements.
Parameters: 39 million
Storage: 155 MB
Training: OpenAI's multilingual Whisper dataset (shared across all model sizes)
System Requirements for Real-Time Dictation:
- RAM: Less than 1 GB during operation
- CPU: Any modern dual-core processor
- GPU: Optional; any GPU with minimal VRAM
The Tiny model delivers the fastest dictation response with virtually no delay, making it ideal for quick note-taking and simple speech input. Best suited for clear speech in quiet environments, though accuracy may be reduced with background noise or varied speaking styles.
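As a rough idea of how such a low-footprint setup maps onto the engine, a CPU-only configuration might look like the sketch below; the int8 quantization and greedy decoding shown here are assumptions chosen to keep memory and latency low, not settings exposed on this page.

```python
from faster_whisper import WhisperModel

# Tiny model on CPU with int8 quantization to keep RAM use minimal.
model = WhisperModel("tiny", device="cpu", compute_type="int8", cpu_threads=2)

# beam_size=1 (greedy decoding) favors speed over accuracy, matching the
# quick note-taking use case described above.
segments, _ = model.transcribe("quick_note.wav", beam_size=1)
print(" ".join(segment.text.strip() for segment in segments))
```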
Base Model
A balanced option offering improved accuracy while maintaining fast dictation speeds.
Parameters: 74 million
Storage: 280 MB
Training: OpenAI's multilingual Whisper dataset (shared across all model sizes)
System Requirements for Real-Time Dictation:
- RAM: 1-2 GB during operation
- CPU: Dual-core processor or better
- GPU: Optional; a basic consumer GPU is sufficient
The Base model provides noticeably better accuracy than Tiny while remaining fast enough for smooth real-time dictation. Handles casual conversation and varied speaking paces with fewer transcription errors.
Small Model
The recommended choice for most dictation tasks, offering strong accuracy with reasonable resource requirements.
Parameters: 244 million
Storage: 900 MB
Training: OpenAI's multilingual Whisper dataset (shared across all model sizes)
System Requirements for Real-Time Dictation:
- RAM: 2-4 GB during operation
- CPU: Quad-core processor recommended
- GPU: Optional; a modern consumer GPU with 2+ GB VRAM
The Small model delivers significantly higher transcription precision than Base, making it suitable for professional writing, meetings, and dictation in moderately noisy environments. Provides an excellent balance between accuracy and speed for daily use.
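To illustrate what a typical everyday configuration with this model could look like, the sketch below enables voice activity filtering and a modest beam search; both are options of the faster-whisper library and are shown here as reasonable defaults rather than the application's actual internals.

```python
from faster_whisper import WhisperModel

# Small model: the recommended balance of accuracy and speed for daily dictation.
model = WhisperModel("small", device="cpu", compute_type="int8")

# vad_filter skips silent stretches, which helps in moderately noisy recordings;
# beam_size=5 trades a little speed for higher transcription precision.
segments, _ = model.transcribe("meeting.wav", beam_size=5, vad_filter=True)
for segment in segments:
    print(segment.text)
```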
Medium Model
Advanced accuracy for challenging dictation scenarios with higher hardware requirements.
Parameters: 769 million
Storage: 3 GB
Training: OpenAI's multilingual Whisper dataset (shared across all model sizes)
System Requirements for Real-Time Dictation:
- RAM: 2-6 GB during operation
- CPU: Powerful multi-core processor
- GPU: CUDA-capable GPU with 6+ GB VRAM recommended
The Medium model excels in noisy environments, with accented speech, and during rapid dictation. Its greater capacity improves handling of complex audio conditions while maintaining practical real-time performance on capable hardware.
Large Model (Large-v3)
Maximum accuracy for professional-grade dictation requiring the highest precision.
Parameters: 1.55 billion
Storage: 6 GB
Training: OpenAI's multilingual Whisper dataset, extended for Large-v3 with additional weakly labeled and pseudo-labeled audio
System Requirements for Real-Time Dictation:
- RAM: 3-10 GB during operation
- CPU: High-end multi-core processor
- GPU: Modern GPU with 8-10 GB VRAM strongly recommended
The Large model provides the best possible transcription accuracy, especially for challenging conditions like heavy accents, technical terminology, or poor audio quality. Requires powerful hardware to maintain real-time dictation speeds, but delivers professional-grade results for critical transcription needs.
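For completeness, a GPU-backed configuration for this model might look like the following; the float16 compute type is an assumption that roughly halves VRAM use compared with float32, and the file name is a placeholder.

```python
from faster_whisper import WhisperModel

# Large-v3 on a CUDA GPU; float16 reduces VRAM needs while keeping
# accuracy essentially unchanged compared with float32.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, _ = model.transcribe("interview.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:7.2f}s] {segment.text}")
```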
Settings
The Faster Whisper dictation settings page provides the following options:
- Available Models: All five model sizes (Tiny, Base, Small, Medium, Large) are listed with selection options
- Model Selection: Click on any installed model to select it as the active dictation engine
- Download Option: If a model is not installed, a "Download" button appears next to it for easy installation
- GPU Acceleration: Toggle GPU usage on or off when a CUDA-compatible GPU is detected
- Auto Sentence Format: Enable automatic capitalization and punctuation formatting for dictated text
Note: GPU acceleration requires a CUDA-compatible NVIDIA graphics card. The system will automatically detect compatible hardware and enable the GPU option when available.
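A sketch of how such a detection-and-fallback step could work is shown below; it relies on ctranslate2.get_cuda_device_count() from the CTranslate2 package that Faster Whisper is built on, though the application's own detection logic may differ.

```python
import ctranslate2
from faster_whisper import WhisperModel

# Use the GPU when CTranslate2 reports at least one usable CUDA device,
# otherwise fall back to CPU with int8 quantization.
if ctranslate2.get_cuda_device_count() > 0:
    model = WhisperModel("medium", device="cuda", compute_type="float16")
else:
    model = WhisperModel("medium", device="cpu", compute_type="int8")
```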