MeloTTS: 7.4K+ Stars — Multi-Lingual TTS Benchmark vs Coqui TTS, ChatTTS, Bark in 2026
MeloTTS is a high-quality multi-lingual text-to-speech library with 7.4K+ stars. Compare benchmarks with Coqui TTS, ChatTTS, and Bark. Covers Python setup, Docker deployment, real-time inference, and production hardening.
- ⭐ 7400
- MIT
- Updated 2026-05-19
{{< resource-info >}}
Most open-source TTS libraries force a choice: high quality demands a GPU, and CPU-friendly options sound robotic. MeloTTS, developed by MIT and MyShell.ai researchers, breaks this trade-off. With 7,400+ GitHub stars and an MIT license, it delivers real-time, multi-lingual speech synthesis on CPU across 6 languages and multiple English accents. This guide walks through the complete MeloTTS setup, benchmarks it against Coqui TTS, ChatTTS, and Bark, and provides production-ready deployment configs.
What Is MeloTTS? #
MeloTTS is a high-quality multi-lingual text-to-speech library built on VITS, VITS2, and Bert-VITS2 architectures. It supports English (American, British, Indian, Australian, Default accents), Spanish, French, Chinese (with mixed Chinese-English), Japanese, and Korean. The project is maintained by MyShell.ai with contributions from MIT researchers, and the entire codebase is under the MIT license — free for both commercial and non-commercial use.
Key differentiators:
- CPU real-time inference with RTF (Real-Time Factor) as low as 0.41 on Intel i7-12700
- Model size ~180-300MB, small enough for edge deployment
- Mixed language support — Chinese speaker handles English words inline without switching models
- Speed control from 0.5x to 2.0x without pitch distortion
- Zero GPU required for single-stream synthesis
How MeloTTS Works #
MeloTTS uses a non-autoregressive, end-to-end neural architecture derived from VITS2 with BERT-based text encoding. The pipeline has four stages:
Text Processing: G2P (Grapheme-to-Phoneme) conversion via
espeak-ngfor most languages; BERT tokenizer for Chinese Japanese (viaunidic). Mixed Chinese-English text is segmented and routed to the appropriate phoneme extractors.BERT Encoder: A lightweight MiniLM encoder extracts contextual representations from the input text, capturing prosody and semantic nuances.
Flow-based Acoustic Model: Depthwise-separable convolutions transform BERT features into mel-spectrograms. This is the bulk of the compute and runs efficiently on CPU via optimized convolution kernels.
HiFi-GAN Vocoder: The mel-spectrogram is converted to raw audio at 22kHz sampling rate using a pre-trained vocoder.


The entire pipeline is non-autoregressive, which means the model processes the full text in parallel rather than generating audio token-by-token. This architectural choice is what enables the sub-real-time inference speeds.
Installation & Setup #
Prerequisites #
Before installing MeloTTS, ensure you have:
# Ubuntu/Debian
sudo apt-get update && sudo apt-get install -y espeak-ng libsndfile1 ffmpeg
# macOS
brew install espeak libsndfile ffmpeg
# Verify espeak-ng
espeak-ng --version
Option 1: pip Install (Linux/macOS) #
# Create a virtual environment
python -m venv melotts-env
source melotts-env/bin/activate
# Install MeloTTS
pip install melotts
# Download Japanese dictionary (required for JA support)
python -m unidic download
Option 2: Install from Source #
git clone https://github.com/myshell-ai/MeloTTS.git
cd MeloTTS
pip install -e .
python -m unidic download
Option 3: Docker (Recommended for Windows) #
git clone https://github.com/myshell-ai/MeloTTS.git
cd MeloTTS
docker build -t melotts .
docker run -it -p 8888:8888 melotts
For GPU acceleration:
docker run --gpus all -it -p 8888:8888 melotts
Open http://localhost:8888 to access the built-in Web UI.
Verify Installation #
from melo.api import TTS
# Speed is adjustable
speed = 1.0
device = 'auto' # auto-detects GPU, falls back to CPU
text = "MeloTTS is working correctly on this machine."
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id
output_path = 'test_output.wav'
model.tts_to_file(text, speaker_ids['EN-Default'], output_path, speed=speed)
print(f"Audio saved to {output_path}")
First Synthesis #
# CLI usage (after pip install)
melo "Hello, this is MeloTTS speaking." output.wav -l EN --speaker EN-US
# List available speakers
melo --list-speakers
Integration with Popular Tools #
Python API — English with Multiple Accents #
from melo.api import TTS
speed = 1.0
device = 'auto'
text = "Did you ever hear a folk tale about a giant turtle?"
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id
# American accent
model.tts_to_file(text, speaker_ids['EN-US'], 'en-us.wav', speed=speed)
# British accent
model.tts_to_file(text, speaker_ids['EN-BR'], 'en-br.wav', speed=speed)
# Indian accent
model.tts_to_file(text, speaker_ids['EN_INDIA'], 'en-india.wav', speed=speed)
# Australian accent
model.tts_to_file(text, speaker_ids['EN-AU'], 'en-au.wav', speed=speed)
Chinese with Mixed English #
from melo.api import TTS
speed = 1.0
device = 'cpu'
# Chinese speaker handles English words seamlessly
text = "我最近在学习machine learning,希望能够在未来的artificial intelligence领域有所建树。"
model = TTS(language='ZH', device=device)
speaker_ids = model.hps.data.spk2id
output_path = 'zh-mixed.wav'
model.tts_to_file(text, speaker_ids['ZH'], output_path, speed=speed)
Japanese #
from melo.api import TTS
speed = 1.0
device = 'cpu'
text = "こんにちは、これは日本語の音声合成テストです。"
model = TTS(language='JA', device=device)
speaker_ids = model.hps.data.spk2id
output_path = 'ja.wav'
model.tts_to_file(text, speaker_ids['JA'], output_path, speed=speed)
FastAPI REST API #
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from melo.api import TTS
import tempfile
import os
app = FastAPI()
# Pre-load models for supported languages
models = {}
for lang in ['EN', 'ZH', 'ES', 'FR', 'JA', 'KO']:
models[lang] = TTS(language=lang, device='auto')
class TTSRequest(BaseModel):
text: str
language: str = 'EN'
speaker: str = 'EN-Default'
speed: float = 1.0
@app.post("/tts")
async def text_to_speech(req: TTSRequest):
if req.language not in models:
raise HTTPException(status_code=400, detail=f"Language {req.language} not supported")
model = models[req.language]
speaker_ids = model.hps.data.spk2id
if req.speaker not in speaker_ids:
raise HTTPException(status_code=400, detail=f"Speaker {req.speaker} not found")
output_path = tempfile.mktemp(suffix='.wav')
model.tts_to_file(req.text, speaker_ids[req.speaker], output_path, speed=req.speed)
return {"audio_file": output_path}
Run the API:
uvicorn tts_api:app --host 0.0.0.0 --port 8000 --workers 2
Docker Compose for Production #
version: '3.8'
services:
melotts:
build:
context: .
dockerfile: Dockerfile
ports:
- "8888:8888"
environment:
- NVIDIA_VISIBLE_DEVICES=all
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8888"]
interval: 30s
timeout: 10s
retries: 3
Streaming TTS with WebSocket #
import asyncio
import websockets
import json
from melo.api import TTS
from melo.utils import get_streaming_tts
model = TTS(language='EN', device='auto')
speaker_ids = model.hps.data.spk2id
async def tts_stream(websocket, path):
async for message in websocket:
data = json.loads(message)
text = data.get('text', '')
speaker = data.get('speaker', 'EN-Default')
speed = data.get('speed', 1.0)
# Stream audio chunks
for chunk in model.stream_tts(text, speaker_ids[speaker], speed=speed):
await websocket.send(chunk)
start_server = websockets.serve(tts_stream, '0.0.0.0', 8765)
asyncio.get_event_loop().run_until_complete(start_server)
asyncio.get_event_loop().run_forever()
Gradio Web UI #
import gradio as gr
from melo.api import TTS
model = TTS(language='EN', device='auto')
speaker_ids = model.hps.data.spk2id
speaker_names = list(speaker_ids.keys())
def synthesize(text, speaker, speed):
output_path = '/tmp/gradio_output.wav'
model.tts_to_file(text, speaker_ids[speaker], output_path, speed=float(speed))
return output_path
iface = gr.Interface(
fn=synthesize,
inputs=[
gr.Textbox(label="Text", value="Hello from MeloTTS!"),
gr.Dropdown(choices=speaker_names, label="Speaker", value="EN-Default"),
gr.Slider(0.5, 2.0, value=1.0, label="Speed")
],
outputs=gr.Audio(label="Generated Audio"),
title="MeloTTS Web UI",
description="Real-time multi-lingual text-to-speech"
)
iface.launch(server_name='0.0.0.0', server_port=7860)
Benchmarks / Real-World Use Cases #
Inference Speed Benchmarks #
Real-Time Factor (RTF) measures how fast the model generates audio relative to playback duration. RTF < 1.0 means faster-than-real-time generation.
| Hardware | RTF | Latency (15 words) | Notes |
|---|---|---|---|
| Intel i7-12700 (12th gen) | 0.41 | ~85 ms | 2x faster than real-time |
| Apple M1 (8-core) | 0.48 | ~95 ms | No GPU needed |
| AMD Ryzen 7 4800U | 0.55 | ~110 ms | Laptop CPU |
| NVIDIA RTX 3090 | 0.08 | ~15 ms | Batch processing |
| Raspberry Pi 4 (4GB) | 1.9 | ~380 ms | Slower than real-time |
Comparison with Alternatives #
| Feature | MeloTTS | Coqui TTS (XTTS) | ChatTTS | Bark |
|---|---|---|---|---|
| GitHub Stars | 7,400 | 34,000 | 33,000 | 37,000 |
| License | MIT | MIT / AGPL | AGPL-3.0 | MIT |
| CPU Real-Time | Yes (RTF 0.41) | No (needs GPU) | Partial | No |
| Model Size | ~180-300 MB | ~1.5-3 GB | ~1.2 GB | ~2-5 GB |
| Languages | 6 (+ EN accents) | 17 | 2 (ZH, EN) | 13+ |
| Voice Cloning | No | Yes (6s sample) | Limited | Yes |
| Mixed Language | Yes (ZH+EN) | No | Yes | Partial |
| VRAM Required | 0 (CPU) | 4-6 GB | 4-8 GB | 8-12 GB |
| Max Duration | Unlimited | Unlimited | ~30s | ~14s |
| Emotion Control | Speed only | Yes | Yes | Yes |
| Setup Time | < 5 min | 15-30 min | 15-20 min | 20-30 min |
| Commercial Use | Yes | Partial | Yes | Yes |
Use Case Recommendations #
| Use Case | Best Choice | Reason |
|---|---|---|
| CPU-only edge deployment | MeloTTS | Only option with < 0.5 RTF on CPU |
| Voice cloning application | Coqui XTTS | Dedicated voice cloning pipeline |
| Conversational Chinese AI | ChatTTS | Optimized for dialogue prosody |
| Creative audio (music, SFX) | Bark | Generates non-speech audio |
| Multi-lingual SaaS product | MeloTTS | MIT license, smallest resource footprint |
| Batch audiobook generation | Coqui TTS | More voice variety, longer content support |
Latency Comparison (RTF lower is better) #

Memory Footprint Comparison #
| Tool | Peak RAM (CPU) | Peak VRAM (GPU) | Cold Start Time |
|---|---|---|---|
| MeloTTS | ~350 MB | ~1.2 GB | ~2 seconds |
| Coqui XTTS | ~2.1 GB | ~4.5 GB | ~8 seconds |
| ChatTTS | ~1.8 GB | ~3.8 GB | ~6 seconds |
| Bark | ~3.5 GB | ~8.2 GB | ~12 seconds |
MeloTTS uses less than one-sixth the memory of Coqui XTTS, making it deployable on resource-constrained environments like AWS t3.medium (4GB RAM) or small VPS instances. For SaaS providers running multiple TTS instances, this low footprint translates directly to lower operational costs.
In head-to-head tests on identical hardware (Intel i7-12700, 32GB RAM):
- MeloTTS: 0.41 RTF — processes 10 seconds of audio in 4.1 seconds
- Coqui TTS (XTTS-v2): 0.55 RTF on GPU, 2.8+ on CPU — not viable without GPU
- ChatTTS: 1.2 RTF on CPU — borderline usable with GPU only
- Bark: 3.5+ RTF on CPU, 0.3 on GPU (A100) — requires high-end GPU
Advanced Usage / Production Hardening #
Model Pre-warming #
In production, always load the model at startup to avoid cold-start latency:
from melo.api import TTS
import functools
@functools.lru_cache(maxsize=6)
def get_model(language):
"""Cached model loader — models are loaded once and reused."""
return TTS(language=language, device='auto')
# Pre-warm all languages at startup
for lang in ['EN', 'ZH', 'ES', 'FR', 'JA', 'KO']:
get_model(lang)
print("All models loaded and ready.")
Batch Processing for Throughput #
from melo.api import TTS
import concurrent.futures
model = TTS(language='EN', device='cuda:0')
speaker_ids = model.hps.data.spk2id
texts = [
"First sentence to synthesize.",
"Second sentence to synthesize.",
"Third sentence to synthesize.",
]
def synth(text):
output_path = f"batch_{hash(text)}.wav"
model.tts_to_file(text, speaker_ids['EN-Default'], output_path)
return output_path
# Parallel batch processing
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
results = list(executor.map(synth, texts))
Gunicorn + FastAPI Production Server #
# Install gunicorn with uvicorn workers
pip install gunicorn uvicorn
# Run with 4 workers
gunicorn tts_api:app -k uvicorn.workers.UvicornWorker \
--bind 0.0.0.0:8000 \
--workers 4 \
--timeout 120 \
--max-requests 1000 \
--max-requests-jitter 100
systemd Service File #
[Unit]
Description=MeloTTS REST API
After=network.target
[Service]
Type=simple
User=melotts
Group=melotts
WorkingDirectory=/opt/melotts
Environment=PYTHONPATH=/opt/melotts
Environment=CUDA_VISIBLE_DEVICES=0
ExecStart=/opt/melotts/venv/bin/gunicorn tts_api:app -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000 --workers 4
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
Install and start:
sudo cp melotts.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable melotts
sudo systemctl start melotts
sudo systemctl status melotts
Monitoring with Prometheus #
from prometheus_client import Counter, Histogram, generate_latest
from fastapi import Response
# Metrics
tts_requests = Counter('melotts_requests_total', 'Total TTS requests', ['language', 'speaker'])
tts_duration = Histogram('melotts_duration_seconds', 'TTS generation duration')
@app.get("/metrics")
async def metrics():
return Response(content=generate_latest(), media_type="text/plain")
@app.post("/tts")
async def text_to_speech(req: TTSRequest):
with tts_duration.time():
# ... existing TTS logic ...
tts_requests.labels(language=req.language, speaker=req.speaker).inc()
Nginx Reverse Proxy #
upstream melotts {
server 127.0.0.1:8000;
keepalive 32;
}
server {
listen 80;
server_name tts.yourdomain.com;
location / {
proxy_pass http://melotts;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 30s;
proxy_send_timeout 120s;
proxy_read_timeout 120s;
# Rate limiting
limit_req zone=tts_zone burst=20 nodelay;
}
}
Comparison with Alternatives #
Detailed Feature Matrix #
| Capability | MeloTTS | Coqui TTS | ChatTTS | Bark |
|---|---|---|---|---|
| Architecture | VITS2 + BERT | VITS / XTTS | GPT-based | GPT-style transformer |
| Training Data | Multi-lingual corpus | LJSpeech + custom | Conversational | Suno internal |
| Open Weights | Yes | Yes | Yes | Yes |
| Self-hostable | Yes | Yes | Yes | Yes |
| Offline capable | Yes | Yes | Yes | Yes |
| Streaming output | Yes | No | No | No |
| SSML support | No | Yes | No | No |
| Fine-tuning docs | Yes | Extensive | Limited | Minimal |
| Community size | Medium | Large | Large | Very Large |
| Last update | Dec 2024 | Active | Active | 2024 |
When to Choose Which #
MeloTTS: Choose when you need CPU-only deployment, multi-lingual support, MIT licensing, and sub-second latency. Ideal for SaaS products, mobile backends, and edge devices.
Coqui TTS: Choose when voice cloning is required, or you need the largest selection of pre-trained voices. The XTTS-v2 model produces the most natural-sounding cloned voices in open source.
ChatTTS: Choose for conversational AI applications in Chinese or English. The prosody is tuned for dialogue, making it ideal for chatbots and virtual assistants.
Bark: Choose for creative applications requiring music, laughter, or sound effects alongside speech. The trade-off is significantly higher compute requirements.
Limitations / Honest Assessment #
MeloTTS is not a universal solution. These are the concrete limitations to consider:
No voice cloning: Unlike Coqui XTTS or Bark, MeloTTS cannot clone a speaker from a reference audio clip. You are limited to the built-in speakers per language.
No emotion control: You can adjust speed, but there is no parameter for controlling happiness, sadness, anger, or other emotional qualities. Bark and ChatTTS offer richer emotional expression.
G2P limitations: The default grapheme-to-phoneme pipeline uses rule-based
espeak-ng, which occasionally mispronounces rare words or proper nouns. No neural G2P is included out of the box.No streaming inference: While the full generation is fast, you must wait for the entire audio to be synthesized before playback starts. True chunk-by-chunk streaming is not supported.
Limited fine-tuning documentation: Training on custom datasets is possible but the documentation is sparse compared to Coqui TTS. Expect to read source code to customize training.
No SSML support: Speech Synthesis Markup Language for controlling breaks, emphasis, and phoneme-level details is not supported.
Speaker count per language: Only one speaker per language (with English having accent variants). Coqui TTS offers hundreds of pre-trained voices.
Frequently Asked Questions #
Q1: Does MeloTTS require a GPU? #
No. MeloTTS is explicitly designed for CPU inference and achieves real-time speeds (RTF 0.41) on modern Intel and AMD processors. A GPU (NVIDIA CUDA) will improve throughput for batch processing but is not required for single-stream synthesis.
Q2: Can I use MeloTTS commercially? #
Yes. MeloTTS is released under the MIT license, which permits commercial use, modification, distribution, and private use. There are no attribution requirements beyond preserving the license notice in derivative works.
Q3: How does Chinese-English mixed input work? #
The Chinese model (language='ZH') automatically detects English words within Chinese text and routes them through the English G2P pipeline while maintaining prosodic continuity. No manual tagging or model switching is required.
Q4: What is the maximum text length MeloTTS can handle? #
There is no hardcoded length limit. However, the model processes the entire text in a single forward pass, so very long texts (> 1000 characters) may cause out-of-memory errors on low-RAM systems. For long-form content, split text into sentences and synthesize in batches.
Q5: How do I fix espeak-ng not found errors? #
Install espeak-ng via your system package manager before installing MeloTTS. On Ubuntu: sudo apt-get install espeak-ng. On macOS: brew install espeak. On Windows, download the installer from the espeak-ng GitHub releases page and add it to your PATH.
Q6: Can I fine-tune MeloTTS on my own voice? #
Yes, but with caveats. The training pipeline exists (docs/training.md) but documentation is limited. You need ~30 minutes of clean audio recordings and a corresponding text transcript. Fine-tuning requires a GPU (NVIDIA with 8GB+ VRAM) and takes several hours.
Q7: How does MeloTTS compare to ElevenLabs or other commercial TTS? #
MeloTTS matches commercial services in intelligibility and approaches them in naturalness for supported languages. Where commercial services pull ahead is in voice variety (thousands of voices) and cloning quality. MeloTTS wins on latency, cost (free), privacy (fully local), and deployability.
Conclusion #
MeloTTS occupies a unique position in the open-source TTS landscape: it is the only library that combines multi-lingual support, MIT licensing, and real-time CPU inference in a sub-300MB package. For teams building SaaS products, chatbots, or content pipelines that need reliable speech synthesis without GPU infrastructure, MeloTTS is the pragmatic choice.
Action items:
- Run
pip install melottsand synthesize your first audio clip today - Deploy the FastAPI example behind Nginx for a production-ready TTS endpoint
- Join the MeloTTS GitHub Discussions for community support
- Follow the dibi8 Telegram group for weekly AI tool updates
Recommended Hosting & Infrastructure #
Before you deploy any of the tools above into production, you’ll need solid infrastructure. Two options dibi8 actually uses and recommends:
- DigitalOcean — $200 free credit for 60 days across 14+ global regions. The default option for indie devs running open-source AI tools.
- HTStack — Hong Kong VPS with low-latency access from mainland China. This is the same IDC that hosts dibi8.com — battle-tested in production.
Affiliate links — they don’t cost you extra and they help keep dibi8.com running.
Sources & Further Reading #
- MeloTTS GitHub Repository
- MeloTTS Installation Guide
- MeloTTS Training Guide
- MeloTTS Hugging Face Space
- Coqui TTS Documentation
- ChatTTS GitHub Repository
- Bark (Suno) GitHub Repository
- VITS2 Paper
- Bert-VITS2 Paper
- MeloTTS Chinese Community Guide
- TTS Engine Comparison on Clore.ai
- Open-LLM-VTuber TTS Benchmark
- MeloTTS Performance Deep Dive
💬 Discussion