Supertonic Review: 99M-Parameter On-Device TTS in 31 Languages via ONNX (2026)
Supertonic (9.9K+ GitHub stars) by Supertone Inc. is a lightning-fast multilingual text-to-speech model that runs locally on CPU via ONNX Runtime — no cloud, no API, no GPU required. 99M parameters, 31 languages including Korean/Japanese/Vietnamese/Chinese, 44.1kHz studio audio, 10 expression tags, runtimes for Python, Node.js, browser (WebGPU/WASM), iOS, Android, Rust, Flutter. Full feature breakdown, install, code example, and 2026 on-device TTS landscape comparison.
- ⭐ 9900
- MIT (code) / OpenRAIL-M (model)
- Updated 2026-05-23
The On-Device TTS Problem
For years, “good” multilingual text-to-speech meant calling someone else’s cloud API — Google Cloud TTS, Amazon Polly, ElevenLabs, OpenAI Voice. The voice was natural, the latency was reasonable on broadband, and the per-character cost was small enough that nobody noticed until invoice day.
The cracks showed up in four places. Privacy: sending every script to a third party isn’t an option for healthcare, legal, or anything regulated. Latency variance: when the network blips, the voice stutters. Cost at scale: once you’re synthesizing more than ~100 hours of audio a month, the per-character bills add up. Offline use: anything in a car, on a flight, in a remote facility, or in a kiosk needs local inference, full stop.
Open-source on-device TTS has been catching up, but the trade-offs were stark: either small single-language voice models (Piper, Coqui’s smaller variants) or massive multilingual models that needed a GPU to be practical (XTTS-v2, Bark). Nothing hit the sweet spot of “fast, multilingual, lightweight, truly open weights.”
Supertonic (GitHub: supertone-inc/supertonic, 9,900+ stars) by Korean speech-AI company Supertone Inc. is the most credible 2026 candidate to close that gap. 99M parameters, 31 languages, ONNX runtime, runs comfortably on a CPU — including, the README claims, a 0.3× real-time factor on an e-reader in airplane mode.
What Supertonic Is
A flow-matching text-to-latent module paired with a speech autoencoder, exported to ONNX. Concretely:
- 99M parameters total — small enough to load in seconds and run real-time on a modest CPU. For reference, XTTS-v2 is ~1.5B and Bark is ~900M.
- 31 languages out of the box: Arabic, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese.
- 44.1kHz audio output — true studio sample rate, not the 22kHz that most “good enough” TTS settles for.
- 10 expression tags: `<laugh>`, `<breath>`, `<sigh>`, etc. Embed them inline in the text to coax more natural delivery without retraining a voice clone.
- `lang="na"` mode: language-agnostic generation when you don’t want to pick a language code.
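To put the 99M-parameter figure in concrete terms, here is a back-of-the-envelope sketch of the raw weight size at a few numeric precisions. The precision of the shipped ONNX files is not stated in this review, so the byte widths below are assumptions, and the helper name `weight_footprint_mb` is made up for illustration.

```python
def weight_footprint_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


SUPERTONIC_PARAMS = 99_000_000  # parameter count quoted in this review

# The storage precision is an assumption, so print a few common options.
for label, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    size = weight_footprint_mb(SUPERTONIC_PARAMS, nbytes)
    print(f"{label}: ~{size:.0f} MB")
```

This counts weights only. Runtime memory also includes activations and the ONNX Runtime itself, so treat the result as a lower bound.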
License: MIT for the code, OpenRAIL-M for the model weights. The split matters: OpenRAIL-M is a “responsible AI” license that restricts certain harmful uses but otherwise allows commercial deployment. Read the model card before shipping a product.
Performance Claims
The numbers Supertone Inc. cites in their benchmarks and README:
| Metric | Supertonic | Typical baseline |
|---|---|---|
| Parameter count | 99M | 0.7B–2B |
| Reading accuracy (WER/CER on Minimax-MLS-test) | Competitive vs much larger | — |
| Memory at runtime | Substantially less than GPU baselines | — |
| RTF on Onyx Boox Go 6 e-reader (airplane mode) | 0.3× | n/a (not runnable) |
| Latency (CPU) | Competitive with A100 GPU baselines | — |
The e-reader benchmark is the headline number: it’s the kind of figure that signals “yes, this really does run anywhere.” If an e-reader manages 0.3×, a modern phone CPU should handle the model with room to spare.
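For readers new to the metric, real-time factor is conventionally synthesis time divided by the duration of the audio produced, so anything below 1.0 is faster than real time. The sketch below assumes the README uses that convention. `measure_rtf` is a hypothetical helper that times any callable returning `(wav, duration_in_seconds)`, which is the shape this article's own snippet suggests but does not confirm.

```python
import time


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / length of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, text: str) -> float:
    """Time one call to `synthesize`, assumed to return (wav, duration_seconds)."""
    start = time.perf_counter()
    _wav, duration = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, duration)


# At an RTF of 0.3, ten seconds of speech take about three seconds to generate.
print(real_time_factor(3.0, 10.0))
```

A single timed call is noisy. Averaging several runs after a warm-up call gives a fairer number, especially on first load when the model is still being read from disk.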
Runtime Coverage
Supertonic is one of the few open TTS projects that ships actual SDK bindings rather than just “you can probably wrap it.” As of v2.0.0:
- Python (`pip install supertonic`): primary integration
- Node.js: server and Electron apps
- Browser — WebGPU when available, WebAssembly as fallback
- Java — Android and JVM backends
- C++, C#, Go, Rust — systems integration
- Swift / iOS — first-party native binding
- Flutter — cross-platform mobile
That covers basically every place an application developer in 2026 might want to embed TTS. The ONNX runtime is doing the heavy lifting; Supertonic adds the model-specific glue.
Quick Setup (Python)
```shell
pip install supertonic
```
That’s it for the dependency. The model downloads on first call:
```python
from supertonic import TTS

# The model is downloaded on the first call.
tts = TTS(auto_download=True)

# Pick one of the bundled voice styles.
style = tts.get_voice_style(voice_name="M1")

text = "Supertonic is a lightning fast, on-device TTS system."

# Returns the waveform and its duration.
wav, duration = tts.synthesize(
    text=text,
    lang="en",
    voice_style=style,
    total_steps=8,
    speed=1.05,
)

tts.save_audio(wav, "output.wav")
```
For Korean, swap `lang="en"` for `lang="ko"`. The same goes for `ja`, `vi`, and `zh`. The voice style (`M1` here) is consistent across languages, which is useful if you’re building a multilingual character voice.
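To make the "one voice, many languages" idea concrete, here is a minimal sketch that renders one file per language with a shared voice style. It only uses the calls shown in the snippet above (`get_voice_style`, `synthesize`, `save_audio`) with the same keyword arguments. The function name `synthesize_batch` and the `greeting_<lang>.wav` file naming are made up for illustration, and the `tts` object is passed in so the helper can be exercised without the model installed.

```python
from pathlib import Path


def synthesize_batch(tts, lines: dict[str, str], voice_name: str = "M1",
                     out_dir: str = "out") -> list[Path]:
    """Render one WAV per language code, reusing a single voice style."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    style = tts.get_voice_style(voice_name=voice_name)

    written = []
    for lang, text in lines.items():
        # Same arguments as the article's snippet; only `lang` and `text` vary.
        wav, _duration = tts.synthesize(
            text=text,
            lang=lang,
            voice_style=style,
            total_steps=8,
            speed=1.05,
        )
        path = Path(out_dir) / f"greeting_{lang}.wav"
        tts.save_audio(wav, str(path))
        written.append(path)
    return written


# Usage, assuming the package behaves as in the snippet above:
# from supertonic import TTS
# synthesize_batch(TTS(auto_download=True), {"en": "Hello.", "ko": "안녕하세요."})
```

Fetching the voice style once, outside the loop, keeps the same character voice across every language.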
For expression tags:
```python
text = "I can't believe it. <laugh> That's incredible. <breath> Let me explain."
```
The model interprets the tags inline and produces the expression in audio.
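Since a mistyped tag could end up handled as literal text, a small pre-flight check can help. The sketch below is plain string handling and does not call Supertonic. It assumes tags are lowercase names in angle brackets, as in the example above. Only three of the ten tags are named in this review, so the `KNOWN_TAGS` set is deliberately incomplete and should be extended from the model card.

```python
import re

TAG_PATTERN = re.compile(r"<([a-z_]+)>")

# Only these three tags are named in this review; the other seven are not
# listed here, so extend this set from the model card before relying on it.
KNOWN_TAGS = {"laugh", "breath", "sigh"}


def unknown_tags(text: str, known: set[str] = KNOWN_TAGS) -> list[str]:
    """Return tag names found in `text` that are not in `known`, in order."""
    return [name for name in TAG_PATTERN.findall(text) if name not in known]


def strip_tags(text: str) -> str:
    """Remove tags and collapse leftover whitespace (e.g. for captions)."""
    return re.sub(r"\s+", " ", TAG_PATTERN.sub("", text)).strip()


text = "I can't believe it. <laugh> That's incredible. <breath> Let me explain."
print(unknown_tags(text))                  # no unknown tags in this sentence
print(unknown_tags("Hi <giggle> there."))  # flags the unrecognized tag
print(strip_tags(text))
```

`strip_tags` is handy when the same script also feeds subtitles or a transcript, where the tags should not appear.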
How It Compares
The 2026 on-device TTS landscape, compared by what each option actually delivers:
vs. Piper (40K+ stars)
Piper is the longstanding on-device favorite. Piper wins: smaller models per voice (a few MB), simpler deployment for English-only use cases. Supertonic wins: many more languages, much better expression control, higher sample rate, single model handles all languages instead of one per language.
vs. XTTS-v2 (Coqui)
XTTS-v2 has voice cloning, which Supertonic doesn’t market. XTTS-v2 wins: voice cloning quality. Supertonic wins: practicality on CPU, multi-runtime SDKs, model size, license clarity.
vs. Bark (Suno)
Bark is impressive for non-speech audio (music, sound effects). Bark wins: stylistic range beyond speech. Supertonic wins: speed, deployability, and 31 languages vs Bark’s English focus.
vs. ElevenLabs / OpenAI / Google Cloud
Cloud TTS still wins on voice cloning fidelity and on pure naturalness of the top-tier voices. Supertonic wins: no API key, no per-character bill, no network dependency, full privacy.
What Supertonic Doesn’t Do
To set expectations:
- No voice cloning from a sample. You pick from the included voice styles. If you need cloning, look at XTTS-v2 or commercial APIs.
- No streaming token-by-token synthesis in the public release — synthesis is segment-level.
- Limited fine-tuning tooling. The model weights are open under OpenRAIL-M, but the training pipeline isn’t fully public.
- No 22kHz fallback. Always 44.1kHz output. If you need lower bandwidth, you resample yourself.
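For the resampling point, halving 44.1 kHz to 22.05 kHz is the easy case because the ratio is exactly 2. The sketch below averages adjacent sample pairs, which acts as a crude low-pass filter before decimation. It assumes the waveform is available as a flat sequence of mono float samples, which this review does not confirm. For production audio, a proper polyphase resampler such as SciPy's `scipy.signal.resample_poly(wav, 1, 2)` is likely the better choice.

```python
def halve_sample_rate(samples: list[float]) -> list[float]:
    """Downsample by 2 by averaging each adjacent pair of samples."""
    usable = len(samples) - (len(samples) % 2)  # drop a trailing odd sample
    return [(samples[i] + samples[i + 1]) / 2 for i in range(0, usable, 2)]


SOURCE_RATE = 44_100
target_rate = SOURCE_RATE // 2  # 22_050 Hz

# One second of a synthetic alternating signal, just to check the lengths.
one_second = [0.0, 1.0] * (SOURCE_RATE // 2)
half = halve_sample_rate(one_second)
print(len(one_second), "->", len(half), "samples at", target_rate, "Hz")
```

Pair averaging attenuates high frequencies unevenly, so it is fine for bandwidth-constrained previews but not for anything where audio quality is the selling point.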
Real Use Cases Where Supertonic Shines
- Mobile apps with voice features — onboarding narration, accessibility readouts, language learning. Ship a single ONNX file, support 31 languages, no API key in the binary.
- Healthcare and legal tools — voice readouts of sensitive documents without anything leaving the device.
- In-car and in-flight systems: full offline support, so there is no degraded mode to design for when connectivity drops.
- Korean / Japanese / Vietnamese / Chinese localization — the open-source TTS gap for Asian languages has been painful; Supertonic closes a big chunk of it in one model.
- Edge IoT devices — kiosks, signage, smart speakers without cloud connectivity.
Who Should Use This
Install Supertonic if you:
- Ship an app that needs voice output and you don’t want a cloud bill that scales with usage.
- Need privacy (regulated industries) or offline (mobile, in-flight, edge).
- Localize for non-English markets and would rather have one model than thirty.
- Want studio-quality 44.1kHz audio without a GPU.
Stick with cloud TTS if you:
- Need voice cloning from a 30-second sample.
- Produce hyper-realistic single-voice content where the top of the ElevenLabs lineup is still ahead.
- Need streaming partial audio (the public Supertonic release doesn’t expose this yet).
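If segment-level synthesis is otherwise acceptable, one partial workaround is to split the script into sentences and hand each finished segment to the audio player while the next one is being generated. The sketch below shows only the chunking side. `split_sentences` and `synthesize_in_chunks` are hypothetical helpers, and `synthesize_one` stands in for whatever per-segment call the application wraps around the TTS engine. This is not true streaming: the first sound still waits for the first full sentence.

```python
import re
from typing import Callable, Iterator


def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' followed by whitespace (a simple heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def synthesize_in_chunks(
    text: str, synthesize_one: Callable[[str], object]
) -> Iterator[object]:
    """Yield one synthesized segment per sentence, in order."""
    for sentence in split_sentences(text):
        yield synthesize_one(sentence)


script = "Welcome aboard. Please fasten your seatbelt! Are you ready?"
for segment in synthesize_in_chunks(script, lambda s: f"[audio for: {s}]"):
    print(segment)
```

The splitter is tuned for Latin punctuation followed by a space. Scripts that use full-width marks such as `。` or that contain abbreviations like "Dr." would need a language-aware segmenter.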
Verdict
Supertonic is the most credible “one model for everywhere” open TTS released in 2026. The combination of 99M-parameter footprint, 31 languages, multi-runtime SDKs, and a 0.3× RTF on an e-reader puts it firmly in the “yes, you can ship this in a mobile app” category that almost no prior open TTS quite hit.
For developers in Korea, Japan, Vietnam, or any other under-served TTS language market, the bigger story is that the open-source TTS quality gap with the cloud APIs has narrowed dramatically. Five years ago, English was the only language where open TTS was production-viable. In 2026, with Supertonic, that list now genuinely includes most of the world.
Pair it with an on-device LLM runtime for the prompt side, and you have a fully local voice agent stack with zero cloud dependency.
GitHub: supertone-inc/supertonic · License: MIT (code) / OpenRAIL-M (weights) · Latest: v2.0.0 (2026-01-06) · Stars: 9.9K+ · Maintainer: Supertone Inc.