What is Coqui TTS and is it still maintained?

Coqui TTS is an open-source deep learning toolkit for text-to-speech, forked from Mozilla TTS and licensed under MPL-2.0. Since the Coqui AI company closed in December 2023, the project has been community-maintained by the Idiap Research Institute through the fork at github.com/idiap/coqui-ai-TTS.

What is the difference between VITS and XTTS v2 in Coqui TTS?

VITS is an end-to-end single-speaker model optimized for speed, reaching about 67x real-time (0.08 RTF) on GPU. XTTS v2 is a GPT-based multi-speaker model that adds zero-shot voice cloning across 17 languages with sub-200 ms streaming, at the cost of higher VRAM (around 4.1 GB).
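
To make the speed figures easier to compare, here is a minimal sketch of how real-time factor (RTF) relates to an "N x real-time" speed-up. It assumes the common convention that RTF is synthesis time divided by the duration of the audio produced; the function names are illustrative and not part of Coqui TTS. Under that convention a 67x speed-up corresponds to an RTF of about 0.015, while an RTF of 0.08 corresponds to 12.5x, so the two numbers quoted above may come from different measurements or a different definition.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced (assumed convention)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_from_rtf(rtf):
    """Convert an RTF into an 'N x real-time' speed-up (the reciprocal)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Example: 1 second of compute producing 67 seconds of audio.
rtf_67x = real_time_factor(1.0, 67.0)
print(f"67x real-time  -> RTF {rtf_67x:.3f}")
print(f"RTF 0.08       -> {speedup_from_rtf(0.08):.1f}x real-time")
```

When benchmarking your own deployment, measuring both wall-clock synthesis time and output duration lets you report either figure unambiguously.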

How much reference audio does Coqui XTTS v2 need for voice cloning?

XTTS v2 performs zero-shot voice cloning with as little as 3 seconds of reference audio, and achieves 85-95% speaker similarity (measured via ECAPA-TDNN cosine similarity) with about 6 seconds. Passing multiple reference files improves cloning consistency.
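
As a rough illustration, the sketch below filters reference clips by duration and then passes the remaining files to XTTS v2 through the `TTS.api.TTS` Python interface. It assumes the Coqui TTS package is installed and that the clips are PCM WAV files; the helper names (`wav_duration_seconds`, `usable_references`, `clone_voice`) and the default output path are hypothetical, chosen only for this example.

```python
import wave
from pathlib import Path

MIN_REFERENCE_SECONDS = 3.0  # lower bound quoted above; about 6 s is the suggested target


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file using only the standard library."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def usable_references(paths, min_seconds=MIN_REFERENCE_SECONDS):
    """Keep only the reference clips that are at least `min_seconds` long."""
    return [str(p) for p in paths if wav_duration_seconds(p) >= min_seconds]


def clone_voice(text, reference_paths, out_path="cloned.wav", language="en", device="cpu"):
    """Synthesize `text` in the voice of the reference clips with XTTS v2."""
    from TTS.api import TTS  # imported lazily: needs the Coqui TTS package and a model download

    refs = usable_references(reference_paths)
    if not refs:
        raise ValueError("no reference clip meets the minimum duration")
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
    # speaker_wav accepts a list, so several clips of the same speaker can be combined.
    tts.tts_to_file(text=text, speaker_wav=refs, language=language, file_path=out_path)
    return Path(out_path)
```

A call such as `clone_voice("Hello there.", ["ref_a.wav", "ref_b.wav"], device="cuda")` would write `cloned.wav`; the file names here are placeholders.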

Can I use Coqui TTS commercially?

The framework itself is MPL-2.0 and can be used commercially. However, the XTTS v2 model is released under the Coqui Public Model License (CPML), which permits commercial use but adds attribution and redistribution restrictions, so the model license should be reviewed separately before shipping.

What hardware does Coqui TTS need to run XTTS v2?

XTTS v2 inference runs comfortably on an 8 GB VRAM GPU such as an RTX 3060 Ti or better, with roughly 4 GB VRAM budgeted per active model instance for concurrent serving. Lighter models like VITS and FastSpeech2 can run CPU-only, though synthesis is roughly 5-10x slower than on GPU.
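
The per-instance budget above translates into a simple capacity estimate. The following sketch is back-of-the-envelope arithmetic only; the `reserve_gb` headroom parameter is an assumption added for illustration (for the CUDA context, activations, and other processes) and should be replaced with measurements from your own hardware.

```python
import math


def max_instances(total_vram_gb, per_instance_gb=4.0, reserve_gb=0.0):
    """Estimate how many model instances fit in GPU memory.

    per_instance_gb defaults to the ~4 GB budget quoted above;
    reserve_gb is hypothetical headroom kept free for everything else.
    """
    if per_instance_gb <= 0:
        raise ValueError("per_instance_gb must be positive")
    usable = total_vram_gb - reserve_gb
    return max(0, math.floor(usable / per_instance_gb))


print(max_instances(8.0))                  # 8 GB card, no headroom
print(max_instances(8.0, reserve_gb=1.0))  # same card with 1 GB kept free
```

With any headroom reserved, an 8 GB card drops from two concurrent XTTS v2 instances to one, which is worth knowing before sizing a deployment.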

Coqui TTS: 45.3K+ Stars. The Deep Learning TTS Toolkit Benchmarked Against ChatTTS, MeloTTS, and Bark in 2026

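## Introduction

Choosing a text-to-speech engine for production is a minefield. Most demos sound great on a desktop GPU but fall over under concurrent load, bloat the Docker image to 10 GB, or fail when switching from English to Mandarin. This Coqui TTS tutorial walks through a production-hardened text-to-speech setup, benchmarked against ChatTTS, MeloTTS, and Bark, and shares the configuration files we use to handle more than 5,000 requests per day.

After evaluating six open-source TTS frameworks for a multilingual customer-service deployment, we found Coqui TTS to be the only toolkit that covered every base: more than 1,100 languages through Fairseq, streaming in under 200 ms with XTTS v2, and a Coqui TTS Docker image that actually starts within 30 seconds.

## What is Coqui TTS?

Coqui TTS is an open-source deep learning toolkit for text-to-speech synthesis, forked from Mozilla TTS and community-maintained since the original Coqui AI company closed in December 2023. With 45,300 stars on GitHub, it is one of the most widely adopted neural TTS libraries. The project bundles training recipes, pretrained models, and an inference API in a single Python package, supporting architectures from Tacotron2 to VITS to the flagship XTTS v2 model, which handles voice cloning across 17 languages.

## How Coqui TTS works

Coqui TTS splits the synthesis pipeline into three interchangeable stages: a text-to-spectrogram model, a speaker encoder, and a vocoder. This modular design lets you swap components without retraining the whole stack.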
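
The three-stage split can be pictured with a small conceptual sketch. Everything here is a toy stand-in: the `Pipeline` class, the type aliases, and the stage functions are hypothetical and are not Coqui TTS classes. The point is only that each stage sits behind a narrow interface, so one stage (here the vocoder) can be replaced while the others stay untouched.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

# Hypothetical data types for illustration only.
Embedding = List[float]
Spectrogram = List[List[float]]
Waveform = List[float]


@dataclass(frozen=True)
class Pipeline:
    speaker_encoder: Callable[[Waveform], Embedding]
    text_to_spec: Callable[[str, Embedding], Spectrogram]
    vocoder: Callable[[Spectrogram], Waveform]

    def synthesize(self, text: str, reference_audio: Waveform) -> Waveform:
        embedding = self.speaker_encoder(reference_audio)  # stage: speaker encoder
        spectrogram = self.text_to_spec(text, embedding)   # stage: text to spectrogram
        return self.vocoder(spectrogram)                   # stage: vocoder


# Toy stages standing in for real neural models.
def mean_encoder(audio: Waveform) -> Embedding:
    return [sum(audio) / len(audio)]


def char_frames(text: str, embedding: Embedding) -> Spectrogram:
    return [[float(ord(ch)), embedding[0]] for ch in text]


def flatten_vocoder(spec: Spectrogram) -> Waveform:
    return [value for frame in spec for value in frame]


def first_bin_vocoder(spec: Spectrogram) -> Waveform:
    return [frame[0] for frame in spec]


pipeline = Pipeline(mean_encoder, char_frames, flatten_vocoder)
# Swap only the vocoder; the encoder and spectrogram stages are reused as-is.
swapped = replace(pipeline, vocoder=first_bin_vocoder)
```

In the real toolkit the stages are trained networks rather than plain functions, but the same separation is what allows a vocoder to be exchanged without retraining the spectrogram model.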

[Architecture diagram: data flow from raw text to audio output]

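Core concepts:

- Spectrogram models: Tacotron2, Glow-TTS, FastSpeech2, and VITS convert raw text into mel spectrograms. VITS is end-to-end and skips the separate vocoder step, which is why it reaches about 67x real-time on GPU.
- Speaker encoder: computes a speaker embedding from reference audio. XTTS v2 uses this for zero-shot voice cloning from as little as 3 seconds of reference audio.
- Vocoders: HiFi-GAN, MelGAN, and ParallelWaveGAN convert mel spectrograms into raw audio waveforms. HiFi-GAN is the default for production deployments because it balances speed and quality.
- XTTS v2: the flagship GPT-based architecture, which unifies text parsing, speaker conditioning, and audio generation in a single forward pass. It supports 17 languages and streaming with first-chunk latency under 200 ms.

Available model categories:

| Category | Models | Use case |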
