Supertonic Review: 99M-Parameter On-Device TTS in 31 Languages via ONNX (2026)

Supertonic (9.9K+ GitHub stars) by Supertone Inc. is a lightning-fast multilingual text-to-speech model that runs locally on CPU via ONNX Runtime — no cloud, no API, no GPU required. 99M parameters, 31 languages including Korean/Japanese/Vietnamese/Chinese, 44.1kHz studio audio, 10 expression tags, runtimes for Python, Node.js, browser (WebGPU/WASM), iOS, Android, Rust, Flutter. Full feature breakdown, install, code example, and 2026 on-device TTS landscape comparison.

  • ⭐ 9900
  • MIT (code) / OpenRAIL-M (model)
  • Updated 2026-05-23

The On-Device TTS Problem #

For years, “good” multilingual text-to-speech meant calling someone else’s cloud API — Google Cloud TTS, Amazon Polly, ElevenLabs, OpenAI Voice. The voice was natural, the latency was reasonable on broadband, and the per-character cost was small enough that nobody noticed until invoice day.

The cracks showed up in four places. Privacy: sending every script to a third party isn’t an option for healthcare, legal, or anything regulated. Latency variance: when the network blips, the voice stutters. Cost at scale: once you’re synthesizing more than ~100 hours of audio a month, the per-character bills add up. Offline use: anything in a car, on a flight, in a remote facility, or in a kiosk needs local inference, full stop.

Open-source on-device TTS has been catching up, but the trade-offs were stark: either small models that cover one language per voice (Piper, Coqui’s smaller variants) or massive multilingual models that needed a GPU to be practical (XTTS-v2, Bark). Nothing hit the sweet spot of “fast, multilingual, lightweight, true open weights.”

Supertonic (GitHub: supertone-inc/supertonic, 9,900+ stars) by Korean speech-AI company Supertone Inc. is the most credible 2026 candidate to close that gap. 99M parameters, 31 languages, ONNX runtime, runs comfortably on a CPU — including, the README claims, a 0.3× real-time factor on an e-reader in airplane mode.


What Supertonic Is #

Supertonic is a flow-matching text-to-latent module paired with a speech autoencoder, exported to ONNX. Concretely:

  • 99M parameters total — small enough to load in seconds and run real-time on a modest CPU. For reference, XTTS-v2 is ~1.5B and Bark is ~900M.
  • 31 languages out of the box: Arabic, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese.
  • 44.1kHz audio output — true studio sample rate, not the 22kHz that most “good enough” TTS settles for.
  • 10 expression tags: <laugh>, <breath>, <sigh>, etc. Embed them inline in the text to coax more natural delivery without retraining a voice clone.
  • lang="na" mode — language-agnostic generation when you don’t want to pick a language code.

License: MIT for the code, OpenRAIL-M for the model weights. The split matters: OpenRAIL-M is a “responsible AI” license that restricts certain harmful uses but otherwise allows commercial deployment. Read the model card before shipping a product.


Performance Claims #

The numbers Supertone Inc. cites in their benchmarks and README:

| Metric | Supertonic | Typical baseline |
| --- | --- | --- |
| Parameter count | 99M | 0.7B–2B |
| Reading accuracy (WER/CER on Minimax-MLS-test) | Competitive vs. much larger models | |
| Memory at runtime | Substantially less than GPU baselines | |
| RTF on Onyx Boox Go 6 e-reader (airplane mode) | 0.3× | n/a (not runnable) |
| Latency (CPU) | Competitive with A100 GPU baselines | |

The e-reader benchmark is the headline number: it’s the kind of figure that signals “yes, this really does run anywhere.” Running on a modern phone CPU should be effortless by comparison.
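
To make the 0.3× figure concrete, here is a minimal sketch of how a real-time factor is usually computed. It assumes the README uses the conventional definition (synthesis time divided by the duration of the audio produced), under which 0.3× means roughly 3 seconds of compute for 10 seconds of speech. The helper names `real_time_factor` and `measure_rtf` are illustrative and not part of Supertonic.

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced.

    Values below 1.0 mean synthesis runs faster than playback.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize_fn):
    """Time one call to synthesize_fn and return its RTF.

    Assumption: synthesize_fn returns (audio, duration_in_seconds),
    mirroring the (wav, duration) pair in the article's example.
    """
    start = time.perf_counter()
    _audio, audio_seconds = synthesize_fn()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


# 3 seconds of compute for 10 seconds of speech.
print(real_time_factor(3.0, 10.0))  # 0.3
```

On your own hardware you could wrap the real synthesis call in a zero-argument function and pass it to `measure_rtf`. Averaging over several runs after a warm-up call gives a steadier number than a single measurement.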


Runtime Coverage #

Supertonic is one of the few open TTS projects that ships actual SDK bindings rather than just “you can probably wrap it.” As of v2.0.0:

  • Python (pip install supertonic) — primary integration
  • Node.js — server and Electron apps
  • Browser — WebGPU when available, WebAssembly as fallback
  • Java — Android and JVM backends
  • C++, C#, Go, Rust — systems integration
  • Swift / iOS — first-party native binding
  • Flutter — cross-platform mobile

That covers basically every place an application developer in 2026 might want to embed TTS. The ONNX runtime is doing the heavy lifting; Supertonic adds the model-specific glue.


Quick Setup (Python) #

pip install supertonic

That’s it for the dependency. The model downloads on first call:

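from supertonic import TTS

# auto_download=True fetches the model on first use.
tts = TTS(auto_download=True)
# "M1" is one of the included voice styles.
style = tts.get_voice_style(voice_name="M1")

text = "Supertonic is a lightning fast, on-device TTS system."

# synthesize() returns the audio and its duration.
wav, duration = tts.synthesize(
    text=text,
    lang="en",
    voice_style=style,
    total_steps=8,
    speed=1.05,
)
# Write the result to a WAV file.
tts.save_audio(wav, "output.wav")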

For Korean, swap lang="en" for lang="ko". The same goes for ja, vi, and zh. The voice style (M1 here) is consistent across languages, which is useful if you’re building a multilingual character voice.
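
Reusing one voice style across languages can be sketched as a small loop. This version relies only on the three calls shown in the example above (`get_voice_style`, `synthesize`, `save_audio`) and takes the `tts` object as a parameter. The function name `synthesize_languages` and the output file naming are assumptions made for illustration.

```python
def synthesize_languages(tts, lines_by_lang, voice_name="M1", out_prefix="line"):
    """Synthesize one line per language with a single shared voice style.

    tts is expected to expose get_voice_style, synthesize and save_audio
    as used in the article's example. Returns the list of files written.
    """
    style = tts.get_voice_style(voice_name=voice_name)
    written = []
    for lang, text in lines_by_lang.items():
        wav, _duration = tts.synthesize(
            text=text,
            lang=lang,
            voice_style=style,
            total_steps=8,
            speed=1.05,
        )
        path = f"{out_prefix}_{lang}.wav"  # hypothetical naming scheme
        tts.save_audio(wav, path)
        written.append(path)
    return written


# Usage with the real package (not executed here):
# from supertonic import TTS
# tts = TTS(auto_download=True)
# synthesize_languages(tts, {"en": "Hello.", "ko": "안녕하세요."})
```

Passing the engine in rather than constructing it inside the function keeps the model loaded once for all languages, and makes the loop easy to exercise with a stand-in object.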

For expression tags:

text = "I can't believe it. <laugh> That's incredible. <breath> Let me explain."

The model interprets the tags inline and produces the expression in audio.
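
Because the tags are plain inline text, it can help to check a script before sending it to the model. The sketch below is a hypothetical pre-processing helper, not part of Supertonic. `KNOWN_TAGS` holds only the three tags named in this article, since the full set of ten isn't listed here, so extend it from the project's documentation.

```python
import re

# Matches tags such as <laugh>; the exact tag syntax is assumed from the example above.
TAG_PATTERN = re.compile(r"<([a-z_]+)>")

# Only the tags named in this article; the remaining seven are not listed here.
KNOWN_TAGS = {"laugh", "breath", "sigh"}


def find_tags(text):
    """Return the tag names in the order they appear."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text):
    """Return tag names that are not in KNOWN_TAGS (likely typos)."""
    return sorted(set(find_tags(text)) - KNOWN_TAGS)


def strip_tags(text):
    """Remove tags and collapse the whitespace they leave behind."""
    return re.sub(r"\s+", " ", TAG_PATTERN.sub("", text)).strip()


script = "I can't believe it. <laugh> That's incredible. <breath> Let me explain."
print(find_tags(script))   # ['laugh', 'breath']
print(strip_tags(script))  # I can't believe it. That's incredible. Let me explain.
```

`strip_tags` is handy when the same script also feeds subtitles or logs, where the tags should not be shown.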


How It Compares #

How Supertonic compares with the main alternatives in the 2026 TTS landscape, based on what each actually delivers:

vs. Piper (40K+ stars) #

Piper is the longstanding on-device favorite. Piper wins: smaller models per voice (a few MB), simpler deployment for English-only use cases. Supertonic wins: many more languages, much better expression control, higher sample rate, single model handles all languages instead of one per language.

vs. XTTS-v2 (Coqui) #

XTTS-v2 has voice cloning, which Supertonic doesn’t market. XTTS-v2 wins: voice cloning quality. Supertonic wins: practicality on CPU, multi-runtime SDKs, model size, license clarity.

vs. Bark (Suno) #

Bark is impressive for non-speech audio (music, sound effects). Bark wins: stylistic range beyond speech. Supertonic wins: speed, deployability, and 31 languages vs Bark’s English focus.

vs. ElevenLabs / OpenAI / Google Cloud #

Cloud TTS still wins on voice cloning fidelity and on pure naturalness of the top-tier voices. Supertonic wins: no API key, no per-character bill, no network dependency, full privacy.


What Supertonic Doesn’t Do #

To set expectations:

  • No voice cloning from a sample. You pick from the included voice styles. If you need cloning, look at XTTS-v2 or commercial APIs.
  • No streaming token-by-token synthesis in the public release — synthesis is segment-level.
  • Limited fine-tuning tooling. The model weights are open under OpenRAIL-M, but the training pipeline isn’t fully public.
  • No 22kHz fallback. Always 44.1kHz output. If you need lower bandwidth, you resample yourself.
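
For the last point, resampling yourself can be as simple as a 2:1 decimation, since 22.05kHz is exactly half of 44.1kHz. The sketch below is a deliberately crude, dependency-free illustration: it smooths with a three-tap kernel and keeps every second sample. It assumes the audio is available as a flat sequence of float samples, which may not match how Supertonic returns it, and `halve_sample_rate` is an illustrative name.

```python
import math

SOURCE_RATE = 44_100
TARGET_RATE = SOURCE_RATE // 2  # 22_050


def halve_sample_rate(samples):
    """Crude 2:1 decimation of a mono signal.

    Each kept sample is a [0.25, 0.5, 0.25] weighted average of itself and
    its neighbours (a very light low-pass), then every second sample is kept.
    """
    n = len(samples)
    out = []
    for i in range(0, n, 2):
        prev = samples[i - 1] if i > 0 else samples[i]
        nxt = samples[i + 1] if i + 1 < n else samples[i]
        out.append(0.25 * prev + 0.5 * samples[i] + 0.25 * nxt)
    return out


# Stand-in input: one second of a 440 Hz tone at 44.1 kHz.
tone = [math.sin(2 * math.pi * 440 * t / SOURCE_RATE) for t in range(SOURCE_RATE)]
downsampled = halve_sample_rate(tone)
print(len(tone), len(downsampled))  # 44100 22050
```

The three-tap filter only weakly suppresses aliasing. For production audio, a proper polyphase resampler such as `scipy.signal.resample_poly(samples, up=1, down=2)` is a better choice.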

Real Use Cases Where Supertonic Shines #

  • Mobile apps with voice features — onboarding narration, accessibility readouts, language learning. Ship a single ONNX file, support 31 languages, no API key in the binary.
  • Healthcare and legal tools — voice readouts of sensitive documents without anything leaving the device.
  • In-car and in-flight systems — full offline support, no graceful degradation needed.
  • Korean / Japanese / Vietnamese / Chinese localization — the open-source TTS gap for Asian languages has been painful; Supertonic closes a big chunk of it in one model.
  • Edge IoT devices — kiosks, signage, smart speakers without cloud connectivity.

Who Should Use This #

Install Supertonic if you:

  • Ship an app that needs voice output and you don’t want a cloud bill that scales with usage.
  • Need privacy (regulated industries) or offline (mobile, in-flight, edge).
  • Localize for non-English markets and would rather have one model than thirty.
  • Want studio-quality 44.1kHz audio without a GPU.

Stick with cloud TTS if you:

  • Need voice cloning from a 30-second sample.
  • Produce hyper-realistic single-voice content where the top of the ElevenLabs lineup is still ahead.
  • Need streaming partial audio (the public Supertonic release doesn’t expose this yet).
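
If segment-level synthesis is the only blocker, one common workaround is to split long text into short segments and synthesize them one at a time, so playback can start after the first segment. The sketch below is an assumption-laden illustration rather than a Supertonic feature. The splitter recognizes only `.`, `!` and `?` as sentence endings (so it won't split on CJK punctuation), and `synthesize_incrementally` reuses the `synthesize` call from the Quick Setup example.

```python
import re


def split_into_segments(text, max_chars=200):
    """Group whole sentences into segments of at most max_chars where possible.

    A single sentence longer than max_chars is kept intact as its own segment.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        segments.append(current)
    return segments


def synthesize_incrementally(tts, text, lang, voice_style, max_chars=200):
    """Yield (segment_text, wav) pairs as each segment finishes."""
    for segment in split_into_segments(text, max_chars):
        wav, _duration = tts.synthesize(
            text=segment,
            lang=lang,
            voice_style=voice_style,
            total_steps=8,
            speed=1.05,
        )
        yield segment, wav


print(split_into_segments("One. Two! Three?", max_chars=8))
# ['One.', 'Two!', 'Three?']
```

This is not true streaming: time to first audio is still bounded by how long the first segment takes, and prosody may not carry across segment boundaries.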

Verdict #

Supertonic is the most credible “one model for everywhere” open TTS released in 2026. The combination of 99M-parameter footprint, 31 languages, multi-runtime SDKs, and a 0.3× RTF on an e-reader puts it firmly in the “yes, you can ship this in a mobile app” category that almost no prior open TTS quite hit.

For developers in Korea, Japan, Vietnam, or any other under-served TTS language market, the bigger story is that the open-source TTS quality gap with the cloud APIs has narrowed dramatically. Five years ago, English was the only language where open TTS was production-viable. In 2026, with Supertonic, that list now genuinely includes most of the world.

Pair it with an on-device LLM runtime for the prompt side, and you have a fully local voice agent stack with zero cloud dependency.


GitHub: supertone-inc/supertonic · License: MIT (code) / OpenRAIL-M (weights) · Latest: v2.0.0 (2026-01-06) · Stars: 9.9K+ · Maintainer: Supertone Inc.
