OmniVoice

last release: April 28, 2026

goblin vibe check:

solid if you need tts in languages the big services ignore, but expect some setup time

massively multilingual zero-shot tts model for generating natural speech and cloning voices across hundreds of languages.

speed

40x realtime
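As a rough reading of the speed figure: 40x realtime means a clip takes about one fortieth of its playback length to generate. The sketch below is plain arithmetic on that number, not a benchmark; `generation_seconds` is a hypothetical helper, and actual throughput will depend on hardware and settings.

```python
def generation_seconds(audio_seconds: float, speedup: float = 40.0) -> float:
    """Estimated synthesis time if the model runs `speedup` times faster than realtime."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# One minute of speech at the listed 40x figure
one_minute = generation_seconds(60.0)

# One hour of speech at the same figure
one_hour = generation_seconds(3600.0)

print(one_minute, one_hour)
```

Under that assumption, a minute of audio lands in about 1.5 seconds and an hour in about 90 seconds.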

key features

600+ language zero-shot TTS

3-second voice cloning

Instruction-based voice design

Non-verbal expression tags like [laughter] and [sigh]
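The bracketed expression tags are easy to mistype in a long script, so a small pre-check can help. This is a minimal sketch, assuming tags are lowercase words in square brackets as in the two examples above. Only [laughter] and [sigh] are named in this listing, so `KNOWN_TAGS` is deliberately incomplete, and `find_tags` and `unknown_tags` are hypothetical helpers, not part of OmniVoice.

```python
import re

# Only these two tags are named in the listing; the full supported set is unknown here.
KNOWN_TAGS = {"laughter", "sigh"}

# Assumed tag shape: lowercase letters or underscores inside square brackets.
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def find_tags(text: str) -> list[str]:
    """Return every bracketed tag name in the order it appears."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text: str) -> list[str]:
    """Return tags that are not in KNOWN_TAGS, so they can be checked by hand."""
    return [tag for tag in find_tags(text) if tag not in KNOWN_TAGS]


script = "Well [sigh] that went better than expected [laughter] [cough]"
print(find_tags(script))
print(unknown_tags(script))
```

A tag flagged here is not necessarily invalid; it only means this listing does not mention it.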

spec & usage

Diffusion-style non-autoregressive architecture initialized from a pretrained language model

Trained on a 581,000-hour multilingual dataset and outputs at 24 kHz

Apache 2.0 release that can run locally on consumer GPUs such as RTX 4090 and 5090 cards

Supports explicit phonetic overrides with Pinyin or CMU dictionary controls
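To put two of the spec numbers in perspective, the sketch below works out how large raw audio is at the stated 24 kHz output rate and how long the 581,000-hour dataset would take to play back. The 16-bit mono PCM format is an assumption for illustration only; the listing does not say what sample format the model emits, and `pcm_bytes` is a hypothetical helper.

```python
SAMPLE_RATE_HZ = 24_000   # output rate stated in the listing
DATASET_HOURS = 581_000   # training data size stated in the listing


def pcm_bytes(seconds: float, bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Uncompressed PCM size. 16-bit mono is an assumption, not a documented format."""
    return int(seconds * SAMPLE_RATE_HZ) * bytes_per_sample * channels


# Ten seconds of speech as raw 16-bit mono samples
ten_seconds = pcm_bytes(10)

# Continuous playback time of the training set, in years of 365 days
dataset_years = DATASET_HOURS / (24 * 365)

print(ten_seconds, round(dataset_years, 1))
```

At those assumptions, ten seconds of output is 480,000 bytes uncompressed, and the training set is a little over 66 years of continuous audio.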

limitations

Full inference still wants roughly 6.5 GB of VRAM

Fine-tuning for bespoke voices is weaker than its zero-shot performance

scope:

audio, voice, research, local, open-source, free, real-time

launch: early April 2026
