Kokoro engine settings

About Kokoro TTS

Kokoro TTS is a neural text-to-speech engine built on a decoder-only StyleTTS 2 architecture, which uses transformer-based text processing. Audio is produced with iSTFTNet, a vocoder that turns predicted speech features into natural-sounding audio. Kokoro runs locally and does not need an Internet connection.

System Requirements

The following is recommended:

A dual-core CPU
8 GB of RAM
500 MB of storage
CUDA-compatible GPU for optional GPU acceleration
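
The CPU and storage recommendations above can be checked with a short script. This is a minimal sketch rather than anything Assistivox AI runs itself: the function name check_requirements and the two constants are hypothetical, 500 MB is taken as 500 * 10^6 bytes, and RAM is left out because the Python standard library has no portable way to read it.

```python
import os
import shutil

# Thresholds taken from the recommendations above (names are illustrative).
MIN_CORES = 2                    # "A dual-core CPU"
MIN_FREE_BYTES = 500 * 10 ** 6   # "500 MB of storage", assuming decimal megabytes


def check_requirements(cores, free_bytes):
    """Return a dict mapping each checked requirement to True or False."""
    return {
        # os.cpu_count() may return None when the count cannot be determined.
        "cpu": cores is not None and cores >= MIN_CORES,
        "storage": free_bytes >= MIN_FREE_BYTES,
    }


if __name__ == "__main__":
    cores = os.cpu_count()
    # Free space on the drive that holds the current working directory.
    free = shutil.disk_usage(os.getcwd()).free
    print(check_requirements(cores, free))
```

The check is kept as a pure function of two numbers so it can be tried with any values, not only those of the current machine.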

Settings

Voice Groups

The Kokoro TTS settings page displays voices organized into 4 voice groups with 41 voices total. All voices come pre-installed with Assistivox AI and do not require downloading.

Voice Group Interface

Expanding Groups: Click on a voice group to expand and show the voices within that group
Voice Selection: Click on any voice to select it as the active speaking voice
Voice Preview: Clicking on a voice plays a test message in that voice for preview

Docker Configuration

Assistivox AI uses Docker for running the Kokoro TTS engine. You may configure the Docker port for Kokoro TTS (default: 8880).
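
If speech does not play, one quick check is whether anything is listening on the configured port. The sketch below assumes the Kokoro container publishes its port on the local machine (127.0.0.1), which this page does not state; the function name is_port_open is hypothetical. A successful connection only shows that some service is accepting connections on that port, not that it is Kokoro.

```python
import socket

DEFAULT_KOKORO_PORT = 8880  # default port from the Docker Configuration setting


def is_port_open(host="127.0.0.1", port=DEFAULT_KOKORO_PORT, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("Port open:", is_port_open())
```

If you changed the port in the settings, pass the new value as the port argument.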

GPU Settings

When a CUDA-compatible GPU is detected, you can enable or disable GPU acceleration.

Note: GPU acceleration requires a CUDA-compatible NVIDIA graphics card. The system will automatically detect compatible hardware and enable the GPU option when available.
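
To see for yourself whether an NVIDIA GPU is visible to the system, you can query the nvidia-smi tool that ships with the NVIDIA driver. This is a best-effort sketch and an assumption about one way to check; it is not necessarily how Assistivox AI performs its own detection, and the function name nvidia_gpu_available is hypothetical.

```python
import shutil
import subprocess


def nvidia_gpu_available(smi_path=None):
    """Best-effort check: True if nvidia-smi runs and lists at least one GPU."""
    smi = smi_path or shutil.which("nvidia-smi")
    if smi is None:
        return False  # the tool is not on PATH, so no usable NVIDIA driver was found
    try:
        # "nvidia-smi -L" prints one line per GPU, e.g. "GPU 0: <name> (UUID: ...)".
        result = subprocess.run(
            [smi, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    print("NVIDIA GPU detected:", nvidia_gpu_available())
```

A False result means only that this check found no GPU; the GPU option on the settings page remains the authoritative indicator.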

Definitions

Transformer-based text processing: An AI method that helps the model understand the structure and context of text.
Decoder-only: The model directly generates audio features from text, simplifying and speeding up the process.
Vocoder: A component that turns intermediate audio features into actual sound waves. In Kokoro, this is iSTFTNet, which is optimized for quality and efficiency.