RVC: Deploy AI Voice Conversion with 35K+ Stars — 10-Minute Training Setup for 2026
RVC (Retrieval-based Voice Conversion) is a VITS-based voice conversion framework compatible with GPT-SoVITS, Coqui TTS, and demucs. This tutorial covers Docker deployment, training pipelines, API integration, and production hardening.
- ⭐ 35700
- MIT
- Updated 2026-05-19
{{< resource-info >}}

Introduction #
You need a voice conversion pipeline that trains in 10 minutes, runs on a single GPU, and produces broadcast-quality output. The open-source ecosystem has produced dozens of voice cloning tools, but most require hours of training, massive datasets, or cloud APIs with per-minute billing. RVC (Retrieval-based Voice Conversion), a VITS-based framework with 35,700+ GitHub stars, cuts training time to under 10 minutes with as little as 10 minutes of clean audio. This guide walks through a production-ready RVC setup — Docker deployment, training pipelines, API integration, and the hardening steps you need before shipping to users.
What Is RVC? #
RVC is an open-source voice conversion framework that converts one person’s voice into another’s while preserving speech content, intonation, and rhythm. Built on VITS with a retrieval-based feature matching module, it achieves training times under 10 minutes on consumer GPUs and supports real-time inference with latencies as low as 90ms.
How RVC Works #
RVC’s architecture combines four core modules:
Content Feature Extraction — Uses ContentVec (a disentangled variant of HuBERT) to extract speaker-invariant phonetic and linguistic features from source audio. ContentVec strips speaker identity while preserving content information, making it ideal for voice conversion tasks.
Pitch Extraction — Employs RMVPE (Robust Model for Vocal Pitch Estimation), presented at Interspeech 2023, to extract fundamental frequency (F0). RMVPE handles polyphonic audio and performs accurately even when source separation is imperfect.
Acoustic Modeling — Built on VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), a conditional VAE augmented with normalizing flows. VITS generates high-fidelity audio through adversarial training between a generator and multi-period discriminators.
Retrieval Module — RVC’s signature innovation. During training, content features are indexed in a Faiss vector database. During inference, source features are replaced with top-K nearest neighbors from the training set (K=8 by default), dramatically reducing timbre leakage from the source speaker. An index_rate parameter (α, typically 0.3) controls the blend between retrieved and source features.

Installation & Setup #
Prerequisites #
RVC runs on Linux, macOS, and Windows. For training, an NVIDIA GPU with at least 4GB VRAM is required (8GB+ recommended). For inference only, CPU works with acceptable latency.
Minimum hardware:
- GPU: NVIDIA GTX 1660 6GB / RTX 2060 8GB (training); 4GB VRAM (inference only)
- CPU: 4-core Intel/AMD processor
- RAM: 8GB minimum, 16GB recommended
- Storage: 10GB free space for models and dependencies
Method 1: Docker Deployment (Recommended for Production) #
The official Dockerfile uses CUDA 11.6.2 on Ubuntu 20.04 with Python 3.9:
# Clone the repository
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.git
cd Retrieval-based-Voice-Conversion-WebUI
# Build the Docker image
docker build -t rvc-webui:latest .
# Run with GPU support and volume mounts
docker run -d --name rvc \
--gpus all \
-p 7865:7865 \
-v $(pwd)/weights:/app/weights \
-v $(pwd)/opt:/app/opt \
rvc-webui:latest
For docker-compose users:
version: '3.8'
services:
rvc:
build: .
container_name: rvc-webui
runtime: nvidia
environment:
- NVIDIA_VISIBLE_DEVICES=all
ports:
- "7865:7865"
volumes:
- ./weights:/app/weights
- ./opt:/app/opt
- ./assets:/app/assets
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
restart: unless-stopped
# Start with docker-compose
docker-compose up -d
# Check logs
docker-compose logs -f rvc
Method 2: Local Python Setup #
# Clone repository
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.git
cd Retrieval-based-Voice-Conversion-WebUI
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Download pretrained models
python tools/download_models.py
# Or manually download from HuggingFace
wget https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/D40k.pth -P assets/pretrained_v2/
wget https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/G40k.pth -P assets/pretrained_v2/
wget https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/f0D40k.pth -P assets/pretrained_v2/
wget https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/pretrained_v2/f0G40k.pth -P assets/pretrained_v2/
wget https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt -P assets/hubert/
wget https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/rmvpe.pt -P assets/rmvpe/
Method 3: AMD GPU Setup (ROCm) #
# Install ROCm dependencies (Ubuntu/Debian)
sudo apt install rocm-hip-sdk rocm-opencl-sdk
# Set environment variables
export ROCM_PATH=/opt/rocm
export HSA_OVERRIDE_GFX_VERSION=10.3.0
# Add user to render and video groups
sudo usermod -aG render $USER
sudo usermod -aG video $USER
# Install AMD-specific requirements
pip install -r requirements-amd.txt
Starting the WebUI #
# Start the Gradio web interface
python infer-web.py
# The WebUI will be available at http://localhost:7865
Training Pipeline #
Step 1: Prepare Your Dataset #
RVC requires clean, monophonic audio. For best results:
- Duration: 10–30 minutes of clean speech (minimum 1 minute works)
- Format: WAV, 16-bit or 24-bit, 22050Hz or 40000Hz sampling rate
- Content: Single speaker, minimal background noise, no music or reverb
- Silence: Remove long silent segments (> 3 seconds)
Use UVR5 (included) for source separation:
# Separate vocals from background music
python tools/uvr5/uvr5_cli.py \
--input_path ./raw_audio/song_with_music.wav \
--output_path ./dataset/ \
--model_name "HP2-人声vocals+非人声instrumentals"
Step 2: Preprocess and Extract Features #
In the WebUI Train tab:
- Set Experiment Name (e.g.,
my_voice_v2) - Set Target Sampling Rate to 40kHz (recommended)
- Set RVC Version to v2
- Set Model Architecture to
rmvpe_gpu - Set Dataset Path to your audio folder
- Click One-Click Training
Or via the command line:
# Step 1: Preprocess (resample, slice, remove silence)
python trainset_preprocess_pipeline_print.py \
./dataset/my_voice \
40000 \
8 # number of CPU threads
# Step 2: Extract features with ContentVec
python extract_feature_print.py \
--model_name my_voice_v2 \
--sample_rate 40000 \
--pitch_extractor rmvpe \
--gpu 0
# Step 3: Train the model
python train_nsf_sim_cache_sid_load_pretrain.py \
--model_name my_voice_v2 \
--sample_rate 40000 \
--batch_size 8 \
--total_epoch 200 \
--save_every_epoch 5 \
--pretrained_G assets/pretrained_v2/f0G40k.pth \
--pretrained_D assets/pretrained_v2/f0D40k.pth \
--gpu 0
Step 3: Build the Feature Index #
# Generate the Faiss index for retrieval
python tools/infer/train_index.py \
--model_name my_voice_v2 \
--sample_rate 40000
Training output locations:
logs/
└── my_voice_v2/
├── added_IVF512_Flat_nprobe_1.index # Faiss retrieval index
├── G_*.pth # Generator checkpoints
├── D_*.pth # Discriminator checkpoints
└── config.json # Model configuration

Training Benchmarks #
| Hardware | Dataset Size | Epochs | Training Time | Output Quality |
|---|---|---|---|---|
| RTX 3090 (24GB) | 10 min audio | 200 | ~18 min | Excellent |
| RTX 4090 (24GB) | 10 min audio | 200 | ~12 min | Excellent |
| RTX 3060 (12GB) | 10 min audio | 200 | ~35 min | Very Good |
| GTX 1660 (6GB) | 10 min audio | 200 | ~90 min | Good |
| Colab T4 (16GB) | 10 min audio | 200 | ~40 min | Very Good |
Integration with Popular Tools #
Integration 1: GPT-SoVITS (TTS + RVC Pipeline) #
GPT-SoVITS generates speech from text; RVC converts it to a target voice. Together they form a complete text-to-speech cloning pipeline:
# gpt_sovits_rvc_pipeline.py
import subprocess
import requests
import os
def tts_then_convert(text: str, speaker_wav: str, rvc_model: str):
"""GPT-SoVITS TTS → RVC voice conversion pipeline"""
# Step 1: Generate speech with GPT-SoVITS
tts_response = requests.post("http://localhost:9880/tts", json={
"text": text,
"refer_wav_path": speaker_wav,
"prompt_text": "Reference prompt text",
"prompt_language": "en",
"text_language": "en"
})
with open("/tmp/tts_output.wav", "wb") as f:
f.write(tts_response.content)
# Step 2: Convert voice with RVC API
rvc_response = requests.post("http://localhost:7865/voice_conversion", json={
"input_audio": "/tmp/tts_output.wav",
"model_name": rvc_model,
"pitch_shift": 0,
"index_rate": 0.75,
"filter_radius": 3,
"volume_envelope": 0.25
})
return rvc_response.json()["output_path"]
result = tts_then_convert(
text="Hello, this is a cloned voice speaking.",
speaker_wav="./reference.wav",
rvc_model="my_voice_v2"
)
print(f"Converted audio saved to: {result}")
Integration 2: Coqui TTS #
# coqui_rvc_bridge.py
from TTS.api import TTS
import requests
def coqui_to_rvc(text: str, rvc_model: str, output_path: str):
# Generate with Coqui XTTS v2
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
tts.tts_to_file(
text=text,
speaker_wav="reference.wav",
language="en",
file_path="/tmp/coqui_out.wav"
)
# Convert through RVC
with open("/tmp/coqui_out.wav", "rb") as f:
files = {"file": f}
data = {
"model_name": rvc_model,
"pitch": 0,
"index_rate": 0.5
}
response = requests.post(
"http://localhost:7865/api/voice_conversion",
files=files,
data=data
)
with open(output_path, "wb") as f:
f.write(response.content)
return output_path
Integration 3: demucs (Advanced Source Separation) #
For production-grade vocal isolation before training:
# Install demucs
pip install demucs
# Separate vocals with demucs (better quality than UVR5 for complex mixes)
demucs --two-stems=vocals --mp3 --mp3-bitrate 320 input_song.mp3
# Use the separated vocal track for RVC training
mv separated/htdemucs/input_song/vocals.wav ./dataset/clean_voice.wav
Integration 4: Real-Time Voice Conversion GUI #

RVC includes a real-time voice conversion GUI for live applications:
# Start the real-time GUI
python gui_v1.py
# Or with DirectML for AMD/Intel GPUs
python gui_v1.py --dml
# Key parameters for low latency:
# - Block time: 0.25s (lower = less latency, more CPU)
# - Crossfade: 0.05s
# - Extra time: 2.5s
# - Pitch extractor: fcpe (fastest) or rmvpe (best quality)
Configuration for streaming (90ms end-to-end latency with ASIO):
# gui_config.py example
config = {
"block_time": 0.1, # 100ms blocks for lower latency
"crossfade_time": 0.04,
"extra_time": 2.0,
"f0method": "fcpe", # Fastest pitch extractor
"rms_mix_rate": 0.25,
"index_rate": 0.3,
"pitch": 0,
"I_noise_reduce": True,
"O_noise_reduce": False
}
Integration 5: API Server (FastAPI) #
RVC provides a FastAPI-based REST API for production deployments:
# Start the API server
python api_240604.py
# The API will be available at http://localhost:7865
# Client example for API inference
import requests
# Load model first
requests.post("http://localhost:7865/load_model", json={
"pth_path": "./weights/my_voice_v2.pth",
"index_path": "./logs/my_voice_v2/added_IVF512_Flat_nprobe_1.index"
})
# Perform voice conversion
with open("input_audio.wav", "rb") as f:
response = requests.post(
"http://localhost:7865/voice_conversion",
files={"file": f},
data={
"pitch": 0,
"index_rate": 0.75,
"filter_radius": 3,
"volume_envelope": 0.25,
"protect": 0.33
}
)
with open("converted_output.wav", "wb") as f:
f.write(response.content)
Benchmarks / Real-World Use Cases #
Objective Quality Metrics #
| Metric | RVC v2 | So-VITS-SVC 4.1 | GPT-SoVITS (SVC) | DDSP-SVC |
|---|---|---|---|---|
| Speaker Similarity (cosine) | 0.85 | 0.79 | 0.82 | 0.71 |
| PESQ (quality, /4.5) | 3.6 | 3.3 | 3.4 | 2.8 |
| UTMOS (naturalness, /5) | 4.19 | 3.95 | 4.05 | 3.45 |
| Training Time (10min data, RTX 3090) | ~18 min | ~2 hours | ~45 min | ~15 min |
| Min Training Data | 1 min | 10 min | 1 min | 5 min |
| Real-time Factor (inference) | 0.05x | 0.15x | 0.08x | 0.02x |
Production Use Cases #
AI Music Covers — Converting vocal tracks to mimic specific artist voices for entertainment content. RVC’s RMVPE pitch extraction maintains accurate melody reproduction even through complex vocal runs.
Live Streaming Voice Modification — Real-time voice changing for content creators. With the GUI and ASIO drivers, end-to-end latency reaches 90ms, imperceptible to most audiences.
Privacy Protection — Anonymizing speaker identity in recorded interviews, call center audio, and sensitive voice data while preserving emotional content and intelligibility.
Content Localization — Dubbing video content while maintaining the original speaker’s vocal characteristics across languages when combined with TTS pipelines.
Voice Dataset Augmentation — Generating synthetic training data for ASR systems in low-resource languages, as demonstrated in academic research achieving KL divergence loss of 0.68 after 200 epochs.
Advanced Usage / Production Hardening #
Security Considerations #
# api_production.py — Hardened API wrapper
from fastapi import FastAPI, HTTPException, Depends
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
import hashlib
security = HTTPBearer()
def verify_token(credentials: HTTPAuthorizationCredentials):
"""Verify API token for production deployments"""
expected = hashlib.sha256(TOKEN.encode()).hexdigest()
if credentials.credentials != expected:
raise HTTPException(status_code=401, detail="Invalid token")
return True
@app.post("/voice_conversion")
async def secure_convert(
file: UploadFile,
credentials: HTTPAuthorizationCredentials = Depends(security)
):
verify_token(credentials)
# ... conversion logic
return {"output_url": signed_url}
Model Management #
# Organize multiple voice models
models/
├── celeb_voice_a/
│ ├── model.pth
│ ├── index.faiss
│ └── config.json
├── celeb_voice_b/
│ ├── model.pth
│ ├── index.faiss
│ └── config.json
└── custom_brand_voice/
├── model.pth
├── index.faiss
└── config.json
# Dynamic model loader for multi-tenant deployments
import os
import glob
def list_available_models(models_dir="./models"):
"""List all available voice models"""
models = []
for model_dir in glob.glob(os.path.join(models_dir, "*/")):
name = os.path.basename(os.path.dirname(model_dir))
pth_files = glob.glob(os.path.join(model_dir, "*.pth"))
index_files = glob.glob(os.path.join(model_dir, "*.faiss")) + \
glob.glob(os.path.join(model_dir, "*.index"))
if pth_files and index_files:
models.append({
"name": name,
"pth": pth_files[0],
"index": index_files[0]
})
return models
Monitoring and Logging #
# monitoring.py — Prometheus-compatible metrics
from prometheus_client import Counter, Histogram, start_http_server
import time
conversion_count = Counter('rvc_conversions_total', 'Total conversions')
conversion_duration = Histogram('rvc_conversion_seconds', 'Conversion latency')
error_count = Counter('rvc_errors_total', 'Total errors', ['error_type'])
def monitored_convert(audio_path, model_name):
start = time.time()
try:
result = perform_conversion(audio_path, model_name)
conversion_count.inc()
return result
except Exception as e:
error_count.labels(error_type=type(e).__name__).inc()
raise
finally:
conversion_duration.observe(time.time() - start)
# Start metrics endpoint
start_http_server(9090)
ONNX Export for Faster Inference #
# Export trained model to ONNX for CPU/GPU-agnostic inference
python tools/export_onnx.py \
--checkpoint_path ./logs/my_voice_v2/G_12000.pth \
--output_path ./models/my_voice_v2/model.onnx \
--sample_rate 40000
Comparison with Alternatives #
| Feature | RVC v2 | GPT-SoVITS | So-VITS-SVC 4.1 | DDSP-SVC |
|---|---|---|---|---|
| Primary Purpose | Voice Conversion | TTS + Voice Cloning | Singing Voice Conversion | Singing Voice Conversion |
| Training Time (10min data) | ~18 min (RTX 3090) | ~45 min | ~2 hours | ~15 min |
| Min GPU VRAM (training) | 4GB | 8GB | 8GB | 4GB |
| Min Training Data | 1 min | 1 min | 10 min | 5 min |
| Pitch Extractor | RMVPE / FCPE | RMVPE | Harvest / Crepe | Harvest / Crepe |
| Content Encoder | ContentVec (768-dim) | HuBERT / ContentVec | ContentVec | ContentVec |
| Retrieval Module | Faiss top-K (K=8) | No retrieval | Faiss top-K (added in 4.1) | No retrieval |
| Real-time Inference | Yes (90ms latency) | No | Yes (with w-okada) | Yes |
| Vocoder | NSF-HiFi-GAN | NSF-HiFi-GAN | NSF-HiFi-GAN | DDSP + HiFi-GAN |
| Model Fusion | Yes (ckpt merge) | No | No | No |
| UVR5 Integration | Built-in | Separate | Separate | Separate |
| License | MIT | MIT | BSD-3-Clause | Apache-2.0 |
| GitHub Stars | 35,700+ | 43,000+ | 21,200+ | 2,800+ |
| Last Update | 2024-06 | 2025-04 | 2023-12 (archived) | 2024-03 |
When to choose RVC: You need fast training, real-time conversion, or retrieval-based feature matching for minimal timbre leakage. RVC is the pragmatic choice for production voice conversion pipelines.
When to choose GPT-SoVITS: You need text-to-speech with voice cloning (not audio-to-audio conversion). GPT-SoVITS excels at generating speech from text with 1-minute reference audio.
When to choose So-VITS-SVC: Legacy singing voice conversion with shallow diffusion post-processing. Note: the project is archived and no longer maintained.
When to choose DDSP-SVC: Lightweight singing conversion with minimal resource usage. Lower quality than RVC but runs on weaker hardware.
Limitations / Honest Assessment #
RVC is a capable tool, but it is not the right choice for every voice application:
No Text-to-Speech. RVC converts audio to audio. It cannot generate speech from text. Combine it with GPT-SoVITS, Coqui TTS, or Edge-TTS for a full TTS pipeline.
Speaker Similarity Ceiling. While RVC produces convincing conversions, it does not match the fidelity of commercial solutions like ElevenLabs Voice Cloning or Microsoft Azure Speech Studio. For enterprise-grade voice cloning, paid APIs still lead.
Emotional Control Limitations. RVC preserves the source audio’s emotion and prosody but cannot manipulate emotional expression independently. Tools like CosyVoice 3 or Seed-VC offer more granular prosody control.
Hardware Requirements for Training. Despite being more efficient than alternatives, training still requires a discrete NVIDIA GPU. CPU training is impractical (50x+ slower). Cloud GPU instances add operational cost.
Ethical and Legal Risks. Voice cloning technology can be misused for deepfake audio, fraud, and impersonation. RVC includes no built-in watermarking or safety mechanisms. Production deployments should implement consent verification, audit logging, and synthetic audio watermarking.
Language Support Gaps. RVC works well for major languages (English, Chinese, Japanese, Korean) but quality degrades for tonal languages (Thai, Vietnamese) and low-resource languages with limited pretrained model coverage.
Frequently Asked Questions #
How much training data does RVC need? #
RVC can train with as little as 1 minute of clean audio, but 10–30 minutes yields significantly better results. The key is audio quality, not quantity. A 10-minute studio recording outperforms 2 hours of noisy phone calls. Keep background noise, music, and reverb to a minimum.
Can RVC run on CPU only? #
Yes, for inference only. Training requires a CUDA-capable GPU. CPU inference is 10–20x slower than GPU but works for batch processing tasks where latency is not critical. For real-time conversion, a GPU with at least 4GB VRAM is strongly recommended.
What is the difference between RVC v1 and v2? #
RVC v2 changes the content encoder from 9-layer HuBERT with 256-dimensional features to 12-layer HuBERT with 768-dimensional features, and adds 3 period discriminators for improved audio quality. All new projects should use v2 models exclusively. The v2 pretrained models are not backward compatible with v1.
How do I reduce timbre leakage in converted audio? #
Adjust the index_rate parameter. A higher value (0.7–1.0) increases reliance on the retrieval index, pulling more features from the training set and less from the source. Start with 0.75 and adjust based on output quality. If the voice sounds artificial, lower it to 0.3–0.5.
Can I use RVC for real-time voice changing in Discord/Zoom/Game? #
Yes, through the RVC real-time GUI (gui_v1.py). Route your microphone through a virtual audio cable (VB-Cable on Windows, BlackHole on macOS, or PulseAudio on Linux), set RVC as the input device, and configure your application to use the virtual cable output. With ASIO drivers and a modern GPU, latency stays under 100ms.
What file formats does RVC support? #
RVC supports WAV, MP3, FLAC, OGG, and M4A for input. Output is always WAV at the target sampling rate (32kHz, 40kHz, or 48kHz). For best quality, use lossless WAV or FLAC as input and avoid re-encoding MP3 files multiple times.
How do I fuse two voice models together? #
Use the ckpt merge tab in the WebUI. Load two .pth model files and set the blend ratio (0.0 = 100% model A, 1.0 = 100% model B, 0.5 = equal blend). The merged model inherits characteristics from both voices. This works well for creating hybrid voices without retraining.
Why does my converted voice sound robotic or distorted? #
Common causes: (1) insufficient training data or epochs — train for at least 200 epochs; (2) noisy input audio — use UVR5 or demucs for source separation; (3) incorrect pitch extractor — use rmvpe for speech, harvest for singing; (4) mismatched sampling rate — ensure inference matches training sample rate.
Conclusion #
RVC delivers production-grade voice conversion with training times under 20 minutes on mid-range hardware. Its retrieval-based architecture produces cleaner voice separation than direct feature mapping, and the FastAPI server enables straightforward integration into existing audio pipelines. For developers building voice-enabled applications, RVC offers the best balance of quality, speed, and deployment flexibility in the open-source ecosystem.
Next steps: Clone the repository, run the Docker setup, and train your first model with 10 minutes of clean audio. Join the RVC community on Telegram to share models and get support: t.me/RVCSpace.
Recommended Hosting & Infrastructure #
Before you deploy any of the tools above into production, you’ll need solid infrastructure. Two options dibi8 actually uses and recommends:
- DigitalOcean — $200 free credit for 60 days across 14+ global regions. The default option for indie devs running open-source AI tools.
- HTStack — Hong Kong VPS with low-latency access from mainland China. This is the same IDC that hosts dibi8.com — battle-tested in production.
Affiliate links — they don’t cost you extra and they help keep dibi8.com running.
Sources & Further Reading #
- RVC GitHub Repository
- RVC Official Documentation
- Understanding RVC Architecture
- RVC Docker Setup Guide
- RVC API Reference (api_240604.py)
- GPT-SoVITS Repository
- ContentVec Paper
- RMVPE Pitch Extraction Paper
- VITS Paper
- DDSP-SVC Repository
- So-VITS-SVC 4.1 (Archived)
- Custom Data Augmentation for Low-Resource ASR Using RVC
- LLVC: Low-Latency Real-Time Voice Conversion on CPU
- RVC Inference Settings Reference
- PetVocalia: Zero-Shot SVC Benchmark (IJCAI 2025)
💬 Discussion