How many languages does Supertonic support?

Supertonic supports 31 languages out of the box, including English, Korean, Japanese, Vietnamese, Arabic, Hindi, French, German, Spanish, Russian, and Ukrainian. A single 99M-parameter model handles all of them, and a lang="na" mode allows language-agnostic generation.

Does Supertonic require a GPU or internet connection to run?

No. Supertonic runs entirely on-device on a CPU via ONNX Runtime, with no cloud, no API key, and no GPU required. The README claims a 0.3x real-time factor on an Onyx Boox Go 6 e-reader in airplane mode, so it runs fully offline.

Can Supertonic clone a voice from an audio sample?

No. Supertonic does not offer voice cloning from a sample; you choose from its included voice styles. If you need voice cloning, the article recommends XTTS-v2 or a commercial cloud API instead.

Which programming languages and platforms can run Supertonic?

As of v2.0.0, Supertonic ships SDK bindings for Python, Node.js, the browser (WebGPU with WebAssembly fallback), Java (Android/JVM), C++, C#, Go, Rust, Swift/iOS, and Flutter. Python is the primary integration via 'pip install supertonic'.


What is Supertonic's license and can it be used commercially?

Supertonic uses a split license: MIT for the code and OpenRAIL-M for the model weights. OpenRAIL-M is a responsible-AI license that restricts certain harmful uses but otherwise permits commercial deployment, so you should read the model card before shipping a product.

Supertonic Review: 99M-Parameter On-Device TTS in 31 Languages via ONNX (2026) — dibi8.com

The On-Device TTS Problem

For years, “good” multilingual text-to-speech meant calling someone else’s cloud API — Google Cloud TTS, Amazon Polly, ElevenLabs, OpenAI Voice. The voice was natural, the latency was reasonable on broadband, and the per-character cost was small enough that nobody noticed until invoice day.

The cracks showed up in four places. Privacy: sending every script to a third party isn’t an option for healthcare, legal, or anything regulated. Latency variance: when the network blips, the voice stutters. Cost at scale: once you’re synthesizing more than ~100 hours of audio a month, the per-character bills add up. Offline use: anything in a car, on a flight, in a remote facility, or in a kiosk needs local inference, full stop.

Open-source on-device TTS has been catching up, but the trade-offs were stark: either small single-language models (Piper, Coqui’s smaller variants) or massive multilingual models that needed a GPU to be practical (XTTS-v2, Bark). Nothing hit the sweet spot of “fast, multilingual, lightweight, true open weights.”

Supertonic (GitHub: supertone-inc/supertonic, 11,551+ stars) by Korean speech-AI company Supertone Inc. is the most credible 2026 candidate to close that gap. It has 99M parameters, covers 31 languages, runs on ONNX Runtime, and works comfortably on a CPU. The README even claims a 0.3× real-time factor on an e-reader in airplane mode.

What Supertonic Is

Supertonic is a flow-matching text-to-latent module paired with a speech autoencoder, exported to ONNX. Concretely:

99M parameters total — small enough to load in seconds and run real-time on a modest CPU. For reference, XTTS-v2 is ~1.5B and Bark is ~900M.
31 languages out of the box: Arabic, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese.
44.1kHz audio output — true studio sample rate, not the 22kHz that most “good enough” TTS settles for.
10 expression tags — <laugh>, <breath>, <sigh>, etc. Embed them inline in the text to coax more natural delivery without retraining a voice clone.
lang="na" mode — language-agnostic generation when you don’t want to pick a language code.
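To put the 99M-parameter figure in perspective, here is a rough estimate of raw weight size at a few common numeric precisions. The precisions are generic assumptions for illustration; this review does not say how the Supertonic ONNX files are actually stored, and runtime memory will be higher than the weights alone.

```python
# Back-of-the-envelope weight size for a 99M-parameter model.
# Assumption: the bytes-per-parameter values below are generic numeric
# precisions, not a statement about how Supertonic's ONNX files are stored.
PARAMS = 99_000_000

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(params: int, bytes_per_param: int) -> float:
    """Return the raw weight size in megabytes (1 MB = 10**6 bytes)."""
    return params * bytes_per_param / 1_000_000


for precision, width in BYTES_PER_PARAM.items():
    print(f"{precision}: ~{weight_megabytes(PARAMS, width):.0f} MB")
```

Even at full 32-bit precision the weights stay under 400 MB, which is consistent with the "loads in seconds" claim above.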

License: MIT for the code, OpenRAIL-M for the model weights. The split matters: OpenRAIL-M is a “responsible AI” license that restricts certain harmful uses but otherwise allows commercial deployment. Read the model card before shipping a product.

Performance Claims

The numbers Supertone Inc. cites in their benchmarks and README:

Metric	Supertonic	Typical baseline
Parameter count	99M	0.7B–2B
Reading accuracy (WER/CER on Minimax-MLS-test)	Competitive vs much larger	—
Memory at runtime	Substantially less than GPU baselines	—
RTF on Onyx Boox Go 6 e-reader (airplane mode)	0.3×	n/a (not runnable)
Latency (CPU)	Competitive with A100 GPU baselines	—

The e-reader benchmark is the headline number: it signals that the model really does run almost anywhere. Running it on a modern phone CPU should be effortless by comparison.
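For readers unfamiliar with the metric, real-time factor (RTF) is commonly defined as synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below assumes the README uses that same convention; the function names are illustrative, not part of Supertonic.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimate wall-clock synthesis time for a clip at a given RTF."""
    return audio_seconds * rtf


# At the quoted 0.3x RTF, one minute of speech needs roughly 18 s of compute.
print(f"{synthesis_time(60.0, 0.3):.1f} s")
```

Under that definition, a 0.3× RTF means the e-reader produces audio a little over three times faster than it plays back.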

Runtime Coverage

Supertonic is one of the few open TTS projects that ships actual SDK bindings rather than just “you can probably wrap it.” As of v2.0.0:

Python (pip install supertonic) — primary integration
Node.js — server and Electron apps
Browser — WebGPU when available, WebAssembly as fallback
Java — Android and JVM backends
C++, C#, Go, Rust — systems integration
Swift / iOS — first-party native binding
Flutter — cross-platform mobile

That covers basically every place an application developer in 2026 might want to embed TTS. The ONNX runtime is doing the heavy lifting; Supertonic adds the model-specific glue.

Quick Setup (Python)

pip install supertonic

That’s it for the dependency. The model downloads on first call:

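from supertonic import TTS

# Download the model on first use if it is not already available locally
tts = TTS(auto_download=True)
# Pick one of the included voice styles
style = tts.get_voice_style(voice_name="M1")

text = "Supertonic is a lightning fast, on-device TTS system."

# Synthesize the text; returns the waveform and its duration
wav, duration = tts.synthesize(
    text=text,
    lang="en",
    voice_style=style,
    total_steps=8,
    speed=1.05,
)
# Write the waveform to disk
tts.save_audio(wav, "output.wav")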

For Korean, change lang="en" to lang="ko"; the same applies to ja and vi. The voice style (M1 here) stays consistent across languages, which is useful if you’re building a multilingual character voice.
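The language swap can be wrapped in a small loop that reuses one voice style for every language. This is a minimal sketch that relies only on the calls shown in the quick-setup snippet (get_voice_style, synthesize, save_audio); the helper name, the output file naming, and the out_dir parameter are assumptions made for illustration.

```python
from pathlib import Path


def synthesize_per_language(tts, texts_by_lang, voice_name="M1", out_dir="out"):
    """Render one WAV per language code with a single shared voice style.

    `tts` is expected to behave like the `supertonic.TTS` object from the
    quick-setup snippet (get_voice_style, synthesize, save_audio).
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    # Look the style up once so every language uses the same voice.
    style = tts.get_voice_style(voice_name=voice_name)
    written = {}
    for lang, text in texts_by_lang.items():
        wav, duration = tts.synthesize(
            text=text,
            lang=lang,
            voice_style=style,
            total_steps=8,
            speed=1.05,
        )
        # Hypothetical naming scheme: <voice>_<lang>.wav
        target = out_path / f"{voice_name}_{lang}.wav"
        tts.save_audio(wav, str(target))
        written[lang] = (str(target), duration)
    return written
```

With the package installed, a call might look like synthesize_per_language(TTS(auto_download=True), {"en": "Hello.", "ko": "안녕하세요."}).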

For expression tags:

text = "I can't believe it. <laugh> That's incredible. <breath> Let me explain."

The model interprets the tags inline and produces the expression in audio.
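If the same script also feeds captions or logs, you will probably want a tag-free copy of the text. The helper below is a small sketch for that; it assumes tags are lowercase words in angle brackets, like the three named in this review, and it is not part of the Supertonic package.

```python
import re

# Assumption: tags are lowercase words in angle brackets, like the
# <laugh>, <breath>, and <sigh> examples named in this review.
TAG_PATTERN = re.compile(r"<[a-z_]+>")


def find_expression_tags(text: str) -> list[str]:
    """Return the expression tags in the order they appear."""
    return TAG_PATTERN.findall(text)


def strip_expression_tags(text: str) -> str:
    """Remove tags and collapse leftover whitespace, e.g. for captions."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


tagged = "I can't believe it. <laugh> That's incredible. <breath> Let me explain."
print(find_expression_tags(tagged))
print(strip_expression_tags(tagged))
```

The tagged string goes to the synthesizer unchanged, while the stripped version is what a user would read on screen.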

How It Compares

The 2026 on-device TTS landscape, compared by what each option actually delivers:

vs. Piper (40K+ stars)

Piper is the longstanding on-device favorite. Piper wins: smaller models per voice (a few MB), simpler deployment for English-only use cases. Supertonic wins: many more languages, much better expression control, higher sample rate, single model handles all languages instead of one per language.

vs. XTTS-v2 (Coqui)

XTTS-v2 has voice cloning, which Supertonic doesn’t offer. XTTS-v2 wins: voice cloning quality. Supertonic wins: practicality on CPU, multi-runtime SDKs, model size, license clarity.

vs. Bark (Suno)

Bark is impressive for non-speech audio (music, sound effects). Bark wins: stylistic range beyond speech. Supertonic wins: speed, deployability, and 31 languages vs Bark’s English focus.

vs. ElevenLabs / OpenAI / Google Cloud

Cloud TTS still wins on voice cloning fidelity and on pure naturalness of the top-tier voices. Supertonic wins: no API key, no per-character bill, no network dependency, full privacy.

What Supertonic Doesn’t Do

To set expectations:

No voice cloning from a sample. You pick from the included voice styles. If you need cloning, look at XTTS-v2 or commercial APIs.
No streaming token-by-token synthesis in the public release — synthesis is segment-level.
Limited fine-tuning tooling. The model weights are open under OpenRAIL-M, but the training pipeline isn’t fully public.
No 22kHz fallback. Always 44.1kHz output. If you need lower bandwidth, you resample yourself.
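For the resampling point, halving 44.1kHz to 22.05kHz is the simplest case. The sketch below averages adjacent sample pairs in pure Python. It assumes the waveform can be treated as a flat sequence of mono float samples, which this review does not confirm, and the averaging is only a crude low-pass filter. For production audio a polyphase resampler such as scipy.signal.resample_poly is a better choice.

```python
def downsample_by_two(samples):
    """Halve the sample rate by averaging each adjacent pair of samples.

    Averaging acts as a crude low-pass filter before decimation.
    A trailing unpaired sample is dropped.
    """
    return [
        (samples[i] + samples[i + 1]) / 2.0
        for i in range(0, len(samples) - 1, 2)
    ]


SOURCE_RATE = 44_100
TARGET_RATE = SOURCE_RATE // 2  # 22_050 Hz

# One second of silence at 44.1kHz becomes one second at 22.05kHz.
one_second = [0.0] * SOURCE_RATE
print(len(downsample_by_two(one_second)), "samples at", TARGET_RATE, "Hz")
```

Remember to write the result with the new sample rate in the file header, otherwise playback will run at double speed.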

Real Use Cases Where Supertonic Shines

Mobile apps with voice features — onboarding narration, accessibility readouts, language learning. Ship a single ONNX file, support 31 languages, no API key in the binary.
Healthcare and legal tools — voice readouts of sensitive documents without anything leaving the device.
In-car and in-flight systems — full offline support, no graceful degradation needed.
Korean / Japanese / Vietnamese localization: the open-source TTS gap for Asian languages has been painful, and Supertonic closes a big chunk of it in one model.
Edge IoT devices — kiosks, signage, smart speakers without cloud connectivity.

Who Should Use This

Install Supertonic if you:

Ship an app that needs voice output and you don’t want a cloud bill that scales with usage.
Need privacy (regulated industries) or offline (mobile, in-flight, edge).
Localize for non-English markets and would rather have one model than thirty.
Want studio-quality 44.1kHz audio without a GPU.

Stick with cloud TTS if you:

Need voice cloning from a 30-second sample.
Produce hyper-realistic single-voice content where the top of the ElevenLabs lineup is still ahead.
Need streaming partial audio (the public Supertonic release doesn’t expose this yet).
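If sentence-level latency is acceptable, segment-level synthesis can approximate streaming: split the text into sentences and hand each finished segment to the player while the next one is synthesized. The sketch below assumes the synthesize call from the quick-setup snippet; the helper names and the naive sentence-splitting rule are illustrative, not part of Supertonic.

```python
import re

# Split after ., !, or ? when followed by whitespace. This naive rule
# mishandles abbreviations ("Dr. Kim") and is only meant as a sketch.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Break text into sentence-sized segments, dropping empty pieces."""
    return [part.strip() for part in SENTENCE_BOUNDARY.split(text) if part.strip()]


def synthesize_in_segments(tts, text, lang="en", voice_name="M1"):
    """Yield (wav, duration) per sentence so playback can start early.

    `tts` is expected to behave like the `supertonic.TTS` object from the
    quick-setup snippet.
    """
    style = tts.get_voice_style(voice_name=voice_name)
    for sentence in split_sentences(text):
        yield tts.synthesize(
            text=sentence,
            lang=lang,
            voice_style=style,
            total_steps=8,
            speed=1.05,
        )
```

This does not give token-level streaming, but time to first audio drops to roughly the synthesis time of the first sentence.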

Verdict

Supertonic is the most credible “one model for everywhere” open TTS released in 2026. The combination of 99M-parameter footprint, 31 languages, multi-runtime SDKs, and a 0.3× RTF on an e-reader puts it firmly in the “yes, you can ship this in a mobile app” category that almost no prior open TTS quite hit.

For developers in Korea, Japan, Vietnam, or any other under-served TTS language market, the bigger story is that the open-source TTS quality gap with the cloud APIs has narrowed dramatically. Five years ago, English was the only language where open TTS was production-viable. In 2026, with Supertonic, that list now genuinely includes most of the world.

Pair it with an on-device LLM runtime for the prompt side, and you have a fully local voice agent stack with zero cloud dependency.

GitHub: supertone-inc/supertonic · License: MIT (code) / OpenRAIL-M (weights) · Latest: v2.0.0 (2026-01-06) · Stars: 9.9K+ · Maintainer: Supertone Inc.

