VoiceBox: The Open-Source AI Voice Studio #

VoiceBox is an open-source AI voice studio for voice cloning, speech generation, and dictation, all running locally on your machine. With 33,745 GitHub stars and an active development community, it has become a popular choice for developers, content creators, and privacy-conscious users who want voice AI without relying on cloud APIs.

This article covers installation, voice cloning, dictation mode, API usage, hardware requirements, and practical applications.

TL;DR #

VoiceBox provides a complete voice AI stack running entirely on your hardware. It supports voice cloning from as little as 3 seconds of audio, real-time dictation into any application, and high-quality text-to-speech generation. With support for both NVIDIA CUDA and Apple Silicon (MLX), it adapts to your hardware while maintaining privacy — your voice data never leaves your machine.

What Is VoiceBox? #

VoiceBox is a self-hosted voice AI platform that combines several cutting-edge technologies into a single, unified interface. Unlike commercial voice services that require uploading your audio to the cloud, VoiceBox processes everything locally, giving you complete control over your voice data.

The platform supports three primary modes of operation:

Voice Cloning: Record or upload a short audio sample and create a digital voice model that can generate speech in that voice
Dictation: Use your microphone to dictate text into any application on your system, with real-time transcription
Text-to-Speech: Generate natural-sounding speech from text using cloned voices or built-in voice models

Built on open-source models including Qwen3-TTS and Whisper, along with several voice cloning architectures, VoiceBox provides these capabilities with no licensing or usage fees.

Installation Guide #

Prerequisites #

VoiceBox supports multiple hardware configurations:

GPU Accelerated (Recommended):

NVIDIA GPU with 8GB+ VRAM (RTX 3060 or better)
CUDA 12.x toolkit installed
16GB system RAM
Linux (Ubuntu 22.04+) or Windows 11

Apple Silicon:

M1/M2/M3 chip with 16GB+ unified memory
macOS 14+ (Sonoma or newer)
MLX framework installed

CPU-Only (Slower but functional):

16GB+ system RAM
8+ CPU cores
Any modern operating system

Option 1: Quick Install with Pip #

# Install VoiceBox from PyPI
pip install voicebox-ai

# Verify installation
voicebox --version

# Initialize the application
voicebox init --model qwen3-tts

Option 2: From Source (Latest Features) #

# Clone the repository
git clone https://github.com/jamiepine/voicebox.git
cd voicebox

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install the package in development mode
pip install -e .

# Download the default voice models
voicebox download-models --all

Option 3: Docker Deployment #

# Pull the official image
docker pull jamiepine/voicebox:latest

# Run with GPU support (NVIDIA)
docker run -d \
  --name voicebox \
  --gpus all \
  -p 8000:8000 \
  -v ${HOME}/voicebox-data:/data \
  -e VOICEBOX_MODEL=qwen3-tts \
  jamiepine/voicebox:latest

# Run on Apple Silicon (omit --gpus; Docker containers on macOS generally cannot use the Apple GPU, so expect CPU speeds)
docker run -d \
  --name voicebox \
  -p 8000:8000 \
  -v ${HOME}/voicebox-data:/data \
  -e VOICEBOX_MODEL=qwen3-tts \
  jamiepine/voicebox:latest

Option 4: Windows Installation #

# Install Python 3.11+ from the Microsoft Store
# Then install VoiceBox
pip install voicebox-ai

# For GPU acceleration, install CUDA toolkit
# Download from: https://developer.nvidia.com/cuda-downloads

# Initialize VoiceBox
voicebox init --gpu cuda

Voice Cloning #

Recording a Voice Sample #

To clone a voice, you need at least 3 seconds of clear audio. For best results, provide 30-60 seconds of speech:

# Record audio using the built-in recorder
voicebox record --output sample.wav --duration 30

# Or upload an existing audio file
voicebox clone --audio my_voice_sample.mp3 --name "my-voice"

# VoiceBox automatically processes the audio and extracts voice characteristics
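
The length guidance above (3 seconds minimum, 30-60 seconds recommended) can be checked before a sample is submitted for cloning. The following is a minimal sketch using only Python's standard `wave` module. The helper names `wav_duration_seconds` and `classify_sample` are hypothetical and not part of VoiceBox, and the sketch handles uncompressed WAV files only.

```python
import wave

MIN_SECONDS = 3.0           # minimum length stated in this article
RECOMMENDED_SECONDS = 30.0  # lower end of the recommended 30-60 s range


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def classify_sample(seconds: float) -> str:
    """Label a sample length against the thresholds above (hypothetical helper)."""
    if seconds < MIN_SECONDS:
        return "too short"
    if seconds < RECOMMENDED_SECONDS:
        return "usable"
    return "recommended"
```

Calling `classify_sample(wav_duration_seconds("sample.wav"))` gives a quick verdict on a recording. MP3 and other compressed inputs would need a decoder first.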

Voice Processing Pipeline #

The voice cloning pipeline consists of several stages:

from voicebox.engine import VoiceCloner
from voicebox.audio import AudioProcessor

# Initialize the cloner
cloner = VoiceCloner(model="qwen3-tts-voice-clone")

# Load and preprocess the reference audio
processor = AudioProcessor()
reference = processor.load_audio("sample.wav")
reference = processor.normalize(reference, target_rms=-20)
reference = processor.remove_noise(reference, method="spectral")

# Extract voice embeddings
embeddings = cloner.extract_embeddings(reference)

# Create the voice model
voice_model = cloner.create_voice(
    embeddings=embeddings,
    name="my-voice",
    quality="high"
)

# Test the cloned voice
output = voice_model.synthesize(
    text="Hello, this is my cloned voice speaking.",
    speed=1.0,
    emotion="neutral"
)
voice_model.save(output, "test_output.wav")
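
The `normalize(reference, target_rms=-20)` step in the pipeline above does not state its unit. Assuming it means an RMS level in dBFS (decibels relative to full scale), the arithmetic behind such a step can be sketched as follows. This is an illustration of RMS normalization in general, not VoiceBox's own implementation, and the function names are hypothetical.

```python
import math


def rms_dbfs(samples: list[float]) -> float:
    """RMS level in dBFS for samples in [-1.0, 1.0]; -inf for pure silence."""
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)  # same as 20 * log10(rms)


def normalize_rms(samples: list[float], target_dbfs: float = -20.0) -> list[float]:
    """Scale samples to the target RMS level, clipping to [-1.0, 1.0]."""
    current = rms_dbfs(samples)
    if current == float("-inf"):
        return list(samples)  # nothing to scale in a silent signal
    # A level difference of d dB corresponds to a linear gain of 10^(d/20).
    gain = 10.0 ** ((target_dbfs - current) / 20.0)
    return [max(-1.0, min(1.0, s * gain)) for s in samples]
```

Normalizing every reference clip to the same level keeps loud and quiet recordings comparable before embeddings are extracted.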

Advanced Voice Parameters #

VoiceBox exposes fine-grained control over voice synthesis:

# Control speech rate
voicebox synthesize --input script.txt --output speech.wav --speed 0.8

# Add emotional inflection
voicebox synthesize --input script.txt --output emotional.wav --emotion happy

# Adjust pitch
voicebox synthesize --input script.txt --output pitched.wav --pitch +200

# Combine multiple parameters
voicebox synthesize \
  --input script.txt \
  --output natural.wav \
  --speed 1.1 \
  --pitch +100 \
  --emotion confident \
  --clarity high

Multi-Voice Support #

You can create and manage multiple voice clones simultaneously:

from voicebox.engine import VoiceManager

manager = VoiceManager()

# List all cloned voices
voices = manager.list_voices()
for v in voices:
    print(f"{v.name}: {v.quality} ({v.duration}s of training data)")

# Switch between voices
manager.set_active_voice("my-voice")
output = manager.synthesize("Hello from my cloned voice!")

# Blend two voices for hybrid speech
hybrid = manager.blend_voices(
    voice_a="my-voice",
    voice_b="partner-voice",
    weight_a=0.7,
    weight_b=0.3
)
output = hybrid.synthesize("Blended voice output")
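
The article does not describe how `blend_voices` combines two voices. One common approach, assumed here purely for illustration, is a weighted average of the two speakers' embedding vectors. The sketch below treats embeddings as plain lists of floats. The function name `blend_embeddings` is hypothetical.

```python
def blend_embeddings(
    a: list[float],
    b: list[float],
    weight_a: float = 0.7,
    weight_b: float = 0.3,
) -> list[float]:
    """Weighted average of two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    total = weight_a + weight_b
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize so the effective weights always sum to 1.
    wa, wb = weight_a / total, weight_b / total
    return [wa * x + wb * y for x, y in zip(a, b)]
```

With the 0.7 and 0.3 weights from the example above, the result sits closer to the first voice. Normalizing the weights means `weight_a=7, weight_b=3` behaves identically.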

Dictation Mode #

VoiceBox’s dictation mode provides real-time speech-to-text transcription that works with any application on your system.

System-Wide Dictation Setup #

# Enable system-wide dictation
voicebox dictation --enable

# Choose the recognition model
voicebox dictation --model whisper-large-v3

# Set the output language
voicebox dictation --language en

# Configure hotkey
voicebox dictation --hotkey "ctrl+space"

Dictation API Usage #

import asyncio

from voicebox.dictation import DictationEngine


async def main():
    # Initialize the dictation engine
    engine = DictationEngine(
        model="whisper-large-v3",
        language="auto",
        beam_size=5,
        vad_threshold=0.5
    )

    # Start listening
    engine.start_listening(
        hotkey="ctrl+shift+d",
        output_mode="clipboard",
        append_mode=True
    )

    # Process a dictation session (await is only valid inside a coroutine)
    result = await engine.listen_session(
        timeout=300,           # 5 minute session
        silence_threshold=1.5, # Stop after 1.5s of silence
        language="en"
    )

    print(f"Transcribed: {result.text}")
    print(f"Confidence: {result.confidence:.2%}")
    print(f"Words: {result.word_count}")


asyncio.run(main())
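
The `vad_threshold` and `silence_threshold` parameters above suggest that a session ends once speech has been absent for a set time. How VoiceBox implements this is not described, so the following is only a sketch of the general technique. It assumes a voice activity detector that emits one speech probability per fixed-length frame. The function name and the 100 ms frame length are hypothetical.

```python
from typing import Optional, Sequence


def find_session_end(
    speech_probs: Sequence[float],
    vad_threshold: float = 0.5,
    frame_ms: int = 100,
    silence_ms: int = 1500,
) -> Optional[int]:
    """Return the index of the frame that completes the silence window, or None."""
    frames_needed = -(-silence_ms // frame_ms)  # ceiling division
    silent_frames = 0
    for index, prob in enumerate(speech_probs):
        if prob < vad_threshold:
            silent_frames += 1
            if silent_frames >= frames_needed:
                return index
        else:
            silent_frames = 0  # speech resets the silence counter
    return None
```

Working in whole milliseconds avoids floating-point drift when counting frames. A short pause that is interrupted by speech does not end the session.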

Multi-Language Dictation #

VoiceBox supports simultaneous multi-language dictation with automatic language detection:

# Enable auto-detection
voicebox dictation --auto-detect

# Specify supported languages
voicebox dictation --languages en,zh,ko,ja,es,fr,de

# Set primary language (for better accuracy)
voicebox dictation --primary-language en

Text-to-Speech API #

VoiceBox exposes a full REST API for programmatic text-to-speech generation:

Basic TTS #

# Simple text-to-speech conversion
curl -X POST "https://your-voicebox/api/v1/tts" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is a test of VoiceBox text-to-speech.",
    "voice": "default",
    "speed": 1.0,
    "output_format": "wav"
  }' \
  --output speech.wav
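
The same request can be issued from Python without third-party packages. This sketch mirrors the curl call above using the standard library. The endpoint path and JSON fields are taken from that example, while `build_tts_request` and `save_speech` are hypothetical helper names and `https://your-voicebox` remains a placeholder.

```python
import json
import urllib.request


def build_tts_request(
    base_url: str,
    text: str,
    voice: str = "default",
    speed: float = 1.0,
    output_format: str = "wav",
) -> urllib.request.Request:
    """Build the POST request shown in the curl example."""
    payload = {
        "text": text,
        "voice": voice,
        "speed": speed,
        "output_format": output_format,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(request: urllib.request.Request, output_path: str, timeout: float = 60.0) -> None:
    """Send the request and write the returned audio bytes to a file."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(output_path, "wb") as out:
        out.write(audio)
```

Separating request construction from sending makes the payload easy to inspect or log before any network call is made.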

Streaming TTS #

For real-time audio streaming applications:

# Stream audio in chunks
curl -N -X POST "https://your-voicebox/api/v1/tts/stream" \
  -H "Content-Type: application/json" \
  -d '{"text": "This audio will stream in real-time...", "voice": "cloned-voice"}' \
  --output - | aplay
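
On the client side, streaming means consuming the response in small pieces instead of waiting for the full file. The sketch below shows that pattern for any file-like byte source, assuming the endpoint returns raw audio bytes. The response object returned by `urllib.request.urlopen` supports `read(n)` and could be passed as `source`. The function names are hypothetical.

```python
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield successive chunks from a byte stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def copy_stream(source: BinaryIO, sink: BinaryIO, chunk_size: int = 4096) -> int:
    """Copy source to sink chunk by chunk and return the number of bytes written."""
    total = 0
    for chunk in iter_chunks(source, chunk_size):
        sink.write(chunk)  # a player or file can consume audio as it arrives
        total += len(chunk)
    return total
```

Smaller chunks reduce the delay before playback can start, at the cost of more read calls.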

Batch Processing #

Process multiple texts simultaneously:

import asyncio

from voicebox.api import VoiceBoxClient


async def main():
    client = VoiceBoxClient("https://your-voicebox")

    texts = [
        "First sentence for processing.",
        "Second sentence with different content.",
        "Third sentence in another voice.",
    ]

    # await is only valid inside a coroutine, so the batch call lives in main()
    results = await client.tts.batch(
        texts=texts,
        voice="default",
        output_format="mp3",
        parallel_workers=4
    )

    for i, result in enumerate(results):
        print(f"Generated: speech_{i}.mp3 ({result.duration:.1f}s)")


asyncio.run(main())

Hardware Requirements and Performance #

GPU Performance Benchmarks #

Hardware	Model	Cloning Time	TTS Speed	Dictation Latency
RTX 4090	Qwen3-TTS	15 seconds	3x realtime	< 50ms
RTX 3060	Qwen3-TTS	45 seconds	2x realtime	< 80ms
M3 Max	Qwen3-TTS	30 seconds	2.5x realtime	< 60ms
M2 Base	Qwen3-TTS	90 seconds	1.2x realtime	< 150ms
CPU Only	Qwen3-TTS	5 minutes	0.3x realtime	< 500ms
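
The TTS Speed column can be turned into wall-clock estimates. The article does not define "Nx realtime", so this sketch assumes it means N seconds of audio produced per second of processing. The factors are copied from the table above and the function name is hypothetical.

```python
# Realtime factors copied from the benchmark table above.
REALTIME_FACTORS = {
    "RTX 4090": 3.0,
    "RTX 3060": 2.0,
    "M3 Max": 2.5,
    "M2 Base": 1.2,
    "CPU Only": 0.3,
}


def generation_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimate processing time for a given length of output audio."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


if __name__ == "__main__":
    for hardware, factor in REALTIME_FACTORS.items():
        seconds = generation_seconds(600, factor)
        print(f"{hardware}: about {seconds:.0f}s for 10 minutes of audio")
```

Under that assumption, a 10-minute narration takes about 200 seconds on an RTX 4090 and about 2,000 seconds (roughly 33 minutes) in CPU-only mode.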

Memory Requirements #

Operation	Minimum	Recommended
Basic TTS	4GB RAM	8GB RAM
Voice Cloning	8GB RAM	16GB RAM
Dictation	4GB RAM	8GB RAM
Multi-voice	12GB RAM	32GB RAM
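
The memory table can be read as a simple lookup: for a given amount of RAM, each operation either meets the recommended figure, meets only the minimum, or falls short. A small sketch of that check follows, with values copied from the table and a hypothetical function name.

```python
# (minimum GB, recommended GB) copied from the memory table above.
REQUIREMENTS_GB = {
    "Basic TTS": (4, 8),
    "Voice Cloning": (8, 16),
    "Dictation": (4, 8),
    "Multi-voice": (12, 32),
}


def rate_operations(ram_gb: float) -> dict[str, str]:
    """Rate each operation as recommended, minimum, or insufficient."""
    ratings = {}
    for operation, (minimum, recommended) in REQUIREMENTS_GB.items():
        if ram_gb >= recommended:
            ratings[operation] = "recommended"
        elif ram_gb >= minimum:
            ratings[operation] = "minimum"
        else:
            ratings[operation] = "insufficient"
    return ratings
```

For example, a machine with 8GB of RAM meets the recommended figure for basic TTS and dictation, only the minimum for voice cloning, and falls short for multi-voice work.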

Comparison: VoiceBox vs Commercial Alternatives #

Feature	VoiceBox	ElevenLabs	Amazon Polly	Google TTS
Price	Free	$5-50/mo	$4/million chars	$4/million chars
Voice Cloning	Yes (3s sample)	Yes (premium)	No	No
Local Processing	Yes	No	No	No
Open Source	Yes	No	No	No
Custom Voices	Unlimited	5 (starter)	1	1
Emotional Control	Yes	Partial	No	No
Real-Time	Yes	Yes	Yes	Yes
API Access	Full REST	REST	SDK	SDK
Multi-Language	30+	30+	40+	20+
Privacy	Full	Cloud	Cloud	Cloud
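
For the services billed per character, the Price row translates directly into a cost estimate. This sketch uses the $4 per million characters figure from the table above as its default. Actual pricing depends on the provider and voice tier, and the function name is hypothetical.

```python
def per_character_cost_usd(characters: int, usd_per_million: float = 4.0) -> float:
    """Estimate the cost of synthesizing a given number of characters."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return characters / 1_000_000 * usd_per_million
```

At that rate, a 250,000-character manuscript would cost about $1 to synthesize, which is the kind of recurring fee a local setup avoids.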

Integration Examples #

Python Library Integration #

import voicebox

# Quick TTS
result = voicebox.synthesize(
    text="Hello from VoiceBox!",
    voice="default",
    output_file="hello.wav"
)

# Voice cloning from audio file
cloned = voicebox.clone_voice(
    audio_file="sample.wav",
    voice_name="my-voice"
)

# Start dictation into any application
voicebox.start_dictation(
    hotkey="cmd+space",
    target_app="any"
)

Command-Line Integration #

# Generate audio from a text file
voicebox tts --file script.txt --output narration.wav

# Clone a voice from a podcast episode
voicebox clone --audio podcast_ep1.mp3 --name "podcaster"

# Synthesize the same text with each language setting (the text itself is not translated)
for lang in en zh ko vi; do
  voicebox tts --text "Hello world" --lang "$lang" --output "greeting_${lang}.wav"
done

# Batch process a directory of text files
voicebox tts-batch --input ./scripts/ --output ./audio/ --voice default

Web Interface #

VoiceBox includes a built-in web interface accessible at http://localhost:8000:

Upload audio files for voice cloning
Type or paste text for TTS generation
Configure dictation hotkeys and languages
Monitor system resource usage
Export and manage voice models

Advanced Use Cases #

Podcast Production #

Use VoiceBox to clone your own voice and generate content in multiple languages:

# Clone your voice from existing podcast episodes
voicebox clone --audio ~/podcasts/episodes/*.mp3 --name "my-podcast-voice"

# Generate English version
voicebox tts --file article_en.txt --voice "my-podcast-voice" --output podcast_en.wav

# Generate Chinese version (requires translation first)
voicebox tts --file article_zh.txt --voice "my-podcast-voice" --output podcast_zh.wav

# Generate Korean version
voicebox tts --file article_ko.txt --voice "my-podcast-voice" --output podcast_ko.wav

Accessibility Applications #

VoiceBox can help users with speech difficulties communicate by cloning their original voice:

# Record a few seconds of natural speech
voicebox record --output baseline.wav --duration 10

# Clone the voice
voicebox clone --audio baseline.wav --name "accessible-voice"

# Use the cloned voice for text-to-speech
voicebox tts --text "I would like water please" --voice "accessible-voice" --output response.wav

Content Creation #

Generate voiceovers for videos, presentations, and social media content:

# Generate a voiceover for a video script
voicebox tts \
  --file video_script.txt \
  --voice "professional-narrator" \
  --speed 1.05 \
  --emotion engaging \
  --output voiceover.wav

# Add background music mix
ffmpeg -i voiceover.wav -i background_music.mp3 \
  -filter_complex "[0:a][1:a]amix=inputs=2:duration=first" \
  final_video_audio.mp3

Limitations #

Voice quality depends on training data: Noisy or short recordings produce lower-quality clones
GPU recommended for real-time use: CPU-only mode is functional but significantly slower
Not a replacement for professional voice acting: While impressive, synthetic voices lack the nuance of professional performers
Legal considerations: Ensure you have rights to clone any voice you use, including your own in some jurisdictions
Model updates: New voice models may require re-cloning existing voices for optimal quality

Getting Started Checklist #

# 1. Install VoiceBox
pip install voicebox-ai

# 2. Initialize with default model
voicebox init --model qwen3-tts

# 3. Download voice models
voicebox download-models --all

# 4. Test basic TTS
echo "Hello World" | voicebox tts --output test.wav

# 5. Set up dictation
voicebox dictation --enable --hotkey "ctrl+space"

# 6. Start web interface
voicebox web --port 8000

Conclusion #

VoiceBox democratizes voice AI technology by providing a complete, open-source voice studio that runs entirely on your hardware. Whether you need voice cloning for content creation, dictation for accessibility, or text-to-speech for applications, VoiceBox delivers professional-quality results at zero cost.

With support for both NVIDIA GPUs and Apple Silicon, multi-language capabilities, and a growing ecosystem of integrations, VoiceBox is a strong open-source alternative to commercial voice AI platforms. It is under active development, with an engaged community contributing features and improvements.

CTA #

Get started with VoiceBox today by visiting the GitHub repository. For GPU-accelerated deployments, consider HTStack for affordable NVIDIA GPU instances, or DigitalOcean for their managed cloud platform.

FAQ #

q: How much audio do I need to clone a voice? #

a: VoiceBox can clone a voice from as little as 3 seconds of clear audio, but for best results, provide 30-60 seconds of natural speech. The more training data, the higher the quality of the cloned voice.

q: Does VoiceBox work offline? #

a: Yes. Once the models are downloaded, VoiceBox operates entirely offline. No internet connection is required for voice cloning, text-to-speech, or dictation mode. This makes it ideal for privacy-sensitive applications.

q: Can I use VoiceBox on multiple devices? #

a: Yes. Voice models are stored as files that can be copied between devices. Simply export your cloned voices from one device and import them on another. The web interface and API support remote access for multi-device setups.

q: What audio formats does VoiceBox support? #

a: VoiceBox supports input formats including WAV, MP3, FLAC, OGG, and AAC. Output is available in WAV, MP3, FLAC, and OGG formats. For dictation mode, any microphone input format is accepted.
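
The format lists in this answer can be encoded as a quick pre-flight check on file names. This is a sketch only: it checks the file extension, not the actual contents, and it assumes the usual extension for each format (AAC audio is also often stored in `.m4a` files, which this check would reject). The function name is hypothetical.

```python
from pathlib import Path

# Extensions assumed from the formats named in this answer.
INPUT_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".aac"}
OUTPUT_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}


def is_supported(path: str, direction: str = "input") -> bool:
    """Check a file name against the input or output format list."""
    if direction not in ("input", "output"):
        raise ValueError("direction must be 'input' or 'output'")
    allowed = INPUT_EXTENSIONS if direction == "input" else OUTPUT_EXTENSIONS
    return Path(path).suffix.lower() in allowed
```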

q: Is there a limit on how many voices I can clone? #

a: No. VoiceBox has no artificial limit on the number of voice clones. The only constraint is available storage space and system memory. Each voice model typically requires 500MB-2GB depending on quality settings.

q: Can VoiceBox handle accents and dialects? #

a: Yes. VoiceBox’s models are trained on diverse speech data and can handle various accents and dialects. When cloning a voice, the system captures accent characteristics from the training audio. Multi-language support extends to regional variants within each language.