What is ChatTTS and how is it different from other open-source TTS?

ChatTTS is an open-source generative speech model by 2noise (39.3k GitHub stars) purpose-built for dialogue rather than narration. Unlike older neural TTS like Tacotron or VITS that sound monotone, it uses autoregressive generation with explicit prosodic control tokens for laughter, pauses, and conversational style.

Can I use ChatTTS commercially?

Not directly with the open weights. ChatTTS code is AGPL-3.0 and the model weights are CC BY-NC 4.0 (non-commercial only). For commercial production you must contact 2noise for commercial licensing or train your own model from scratch on the architecture; hobby, research, and internal tooling use is fine.

How much VRAM does ChatTTS need and how fast is it?

ChatTTS needs 4 GB of VRAM minimum for 30-second clips and runs comfortably at 8 GB. It achieves a Real-Time Factor of about 0.3 on an RTX 4090 (faster than real-time), generating a 30-second clip in roughly 3 seconds; CPU-only takes 3-5 minutes and is impractical.

How do I keep the same voice across multiple ChatTTS generations?

By default ChatTTS produces a different speaker on every call. To keep a stable voice, generate a speaker embedding once with chat.sample_random_speaker(), save it with torch.save, then load it and pass it via InferCodeParams(spk_emb=spk) on later runs. Pre-generating 5-10 distinct embeddings during setup is the recommended pattern.

What are the prosody control tokens in ChatTTS?

ChatTTS supports inline tags including [laugh] (and [laugh_0] to [laugh_2] for intensity), [uv_break] for umm-style filler pauses, [lbreak] for longer sentence pauses, [oral_0] to [oral_9] for conversational casualness, [speed_0] to [speed_9] for speed, and [break_0] to [break_7] for discrete pause durations. Use them sparingly, since over-tagging sounds rehearsed.

ChatTTS 2026: 39.3k-Star Open-Source Dialogue TTS with Laughter

Most open-source TTS in 2026 still sounds like “1990s GPS narrator with extra reverb.” ChatTTS is the first widely-adopted exception — a 39.3k-star generative speech model specifically trained for dialogue, not narration, with token-level control over laughter, pauses, interjections, and prosody that finally crosses the “doesn’t make you wince” threshold.

If you’re building voice agents, AI podcasts, multi-character TTS for games, or any voice product where flat narration kills the experience — ChatTTS is the default open-source pick in 2026.

TL;DR #

What: Open-source dialogue-optimized TTS by 2noise (Chinese + English)
GitHub: 39.3k stars
License: AGPL-3.0 (code) + CC BY-NC 4.0 (model) — non-commercial use only on the open weights
VRAM: 4 GB minimum for 30-second clips; comfortable at 8 GB
Speed: RTF (Real-Time Factor) ~0.3 on RTX 4090 (faster than real-time)
Training data: 40,000+ hours open-source / 100,000+ hours full model

1. Why ChatTTS Beats Traditional TTS for Dialogue #

The legacy split:

Concatenative TTS (festival, etc.) — mechanical, no prosody, dying
Neural TTS (Tacotron / FastSpeech / VITS) — fluid but monotone, optimized for narration
Commercial APIs (ElevenLabs / OpenAI TTS) — natural but $0.18-0.50/1000 chars and closed

ChatTTS sits in a new fourth category: autoregressive generative TTS with explicit prosodic control tokens. You don’t just type text — you can mark up [laugh], [uv_break] (umm), [lbreak] (long pause), and the model produces a vocal performance, not just speech.

For dialogue use cases (voice agents, AI podcasts, NPC dialogue in games), this is the difference between “obvious robot” and “could be a real person on a bad phone line.” For narration, classical neural TTS is still often better.

2. Hardware Requirements (Realistic Numbers) #

| Hardware | 30s clip generation time | Practical use | |||| | 4 GB GPU (GTX 1650 / 3050) | ~25 sec | Hobby, single-clip | | 8 GB GPU (RTX 3060 / 4060) | ~10 sec | Solo dev, batch jobs | | 12 GB GPU (RTX 3060 12GB / 4070) | ~5 sec | Production, low concurrent load | | 24 GB GPU (RTX 4090 / A5000) | ~3 sec | Production, high concurrent | | CPU-only | ~3-5 min | Not practical for interactive use |

For self-hosted production, the entry point is a $0.30-0.50/hr GPU cloud (Vast.ai, RunPod, or a DigitalOcean GPU droplet ) — much cheaper than ElevenLabs at any meaningful volume.

3. Quick Install (10 minutes on a GPU machine) #

git clone https://github.com/2noise/ChatTTS
cd ChatTTS
pip install -r requirements.txt
# Or via pip:
pip install ChatTTS

Hello world:

import ChatTTS
import torchaudio
import torch

chat = ChatTTS.Chat()
chat.load(compile=False)  # Set True for ~30% speedup after first run

texts = ["Hello, this is a dialogue TTS test [uv_break] does it sound natural?"]
wavs = chat.infer(texts)

torchaudio.save("out.wav", torch.from_numpy(wavs[0]), 24000)

First run downloads ~2 GB of model weights. Subsequent runs are instant.

4. The Prosody Control Tokens (the killer feature) #

The reason ChatTTS feels alive — these tags work mid-text:

| Tag | Effect | ||| | [laugh] | Inserts laughter | | [laugh_0] to [laugh_2] | Laughter intensity levels | | [uv_break] | Umm-style filler pause | | [lbreak] | Longer pause (sentence-style) | | [oral_0] to [oral_9] | Conversational style intensity (higher = more casual) | | [speed_0] to [speed_9] | Speech speed (5 = normal) | | [break_0] to [break_7] | Discrete pause durations |

Example:

text = "So I told him [uv_break] there's no way that's true [laugh] [lbreak] but he kept insisting."
wavs = chat.infer([text])

This is what closes the “robot vs human” gap. Use sparingly; over-tagging sounds rehearsed.

5. Multi-Speaker — Stable Voices Across Sessions #

ChatTTS generates a different “speaker” each invocation by default. For consistent characters (NPC voice, persistent agent personality), seed a speaker once and reuse:

# Generate and save a stable speaker
rand_spk = chat.sample_random_speaker()
torch.save(rand_spk, "speaker_alice.pt")

# Use in subsequent runs
spk = torch.load("speaker_alice.pt")
params_infer_code = ChatTTS.Chat.InferCodeParams(spk_emb=spk)
wavs = chat.infer(texts, params_infer_code=params_infer_code)

Pattern: pre-generate 5-10 distinct speaker embeddings during setup. Pick the one matching each character. Voice stays stable across the entire production.

6. Licensing Caveat (Read This Before Production) #

Two licenses, two different obligations:

Code: AGPL-3.0 — copyleft, derivative works must open-source under AGPL
Model weights: CC BY-NC 4.0 — non-commercial use only

For commercial production: contact 2noise for commercial model licensing, OR fine-tune your own model from scratch on the open ChatTTS architecture (significant effort but legally clean).

For hobby, research, internal tooling, and most agent prototyping: the NC license is fine.

This is the only friction point on adoption. If your product directly monetizes generated voices, factor in licensing cost (or pick commercial-friendly alternatives like Coqui XTTS-v2).

7. Production Pattern #

For agent voice / podcast pipeline:

   Text input (from LLM agent / script generator)
            │
            ▼
   Light preprocessing (add prosody tags via rules)
            │
            ▼
   ChatTTS inference (GPU-backed FastAPI service)
            │
            ▼
   Output audio (WAV / Opus / streaming)
            │
            ▼
   Optional post-processing (loudness normalize, denoise)

Stand it up on a GPU-equipped HTStack Hong Kong VPS with GPU or a Vast.ai instance, expose via FastAPI, and your stack has voice for ~$0.001 per minute generated (vs ElevenLabs at ~$0.30/min).

8. When to Use ChatTTS vs Alternatives #

| Use case | Pick | ||| | Dialogue / multi-character / agent voice | ChatTTS | | Audiobook narration (single voice, long-form) | Coqui XTTS-v2 or commercial | | Voice cloning from a 30-second sample | OpenVoice or Coqui XTTS-v2 | | Lowest possible latency (<200ms) for live agents | Commercial (Deepgram Aura, ElevenLabs Flash) | | Highest quality narration regardless of cost | ElevenLabs | | Commercial use, must own the model | Fine-tuned Coqui XTTS-v2 |

9. Pitfalls #

Over-tagging prosody — sprinkling [laugh] and [uv_break] everywhere sounds rehearsed. Less is more
Forgetting to seed speakers — every call without spk_emb is a different voice. Always pre-generate
Running on CPU and complaining about speed — RTF is 30-100× worse without GPU. Just rent a GPU
Ignoring the NC license — using ChatTTS for paid voice products is a legal risk

TL;DR #

ChatTTS = first open-source TTS that handles dialogue convincingly. 39.3k stars, 4 GB VRAM minimum, prosody-aware via control tokens, stable speakers via embedding seeds. Use for voice agents, AI podcasts, NPCs. Watch the CC BY-NC license for commercial use.

Spin up a GPU instance, run the 10-line install in section 3, and you’ll hear in 5 minutes why this displaced every other open-source TTS in the discussion.

Part of dibi8’s multi-modal content stack — see the upcoming Multi-Modal Content Pipeline collection for ChatTTS + Whisper + Stable Diffusion + ComfyUI as a full audio/visual creator pipeline.

ChatTTS 2026: 39.3k-Star Open-Source Dialogue TTS with Laughter

TL;DR #

1. Why ChatTTS Beats Traditional TTS for Dialogue #

2. Hardware Requirements (Realistic Numbers) #

3. Quick Install (10 minutes on a GPU machine) #

4. The Prosody Control Tokens (the killer feature) #

5. Multi-Speaker — Stable Voices Across Sessions #

6. Licensing Caveat (Read This Before Production) #

7. Production Pattern #

8. When to Use ChatTTS vs Alternatives #

9. Pitfalls #

TL;DR #

References & Sources #

📦 Featured in collections

💬 Discussion

TL;DR #

1. Why ChatTTS Beats Traditional TTS for Dialogue #

2. Hardware Requirements (Realistic Numbers) #

3. Quick Install (10 minutes on a GPU machine) #

4. The Prosody Control Tokens (the killer feature) #

5. Multi-Speaker — Stable Voices Across Sessions #

6. Licensing Caveat (Read This Before Production) #

7. Production Pattern #

8. When to Use ChatTTS vs Alternatives #

9. Pitfalls #

TL;DR #

References & Sources #

🔗 Related Resources

📦 Featured in collections

💬 Discussion