OpenAI Whisper: 99.8K+ Stars — Complete ASR Setup Tutorial vs WhisperX, faster-whisper in 2026

OpenAI Whisper is an automatic speech recognition (ASR) model built for robust recognition through large-scale weak supervision. It works alongside WhisperX, faster-whisper, and LibreTranslate. This guide covers Whisper setup, Whisper vs WhisperX, Python usage, and Docker deployment.

  • ⭐ 99800
  • MIT
  • Updated 2026-05-19

Introduction #

Speech recognition is the bridge between human conversation and machine-readable data, yet most developers have wrestled with APIs that charge per minute, miss domain terminology, or fail entirely on accented speech. In September 2022, OpenAI released Whisper as an open-source, MIT-licensed alternative, and the uptake was immediate: 99,800 GitHub stars later, it is among the most widely adopted open-source ASR systems. This guide walks through a complete Whisper setup, compares it against WhisperX, faster-whisper, and DeepSpeech, and gives you configs you can adapt for production.

What Is OpenAI Whisper? #

OpenAI Whisper is a general-purpose automatic speech recognition (ASR) model trained on 680,000 hours of multilingual and multitask supervised data. It performs speech-to-text transcription, speech translation to English, spoken language identification, and timestamped segment alignment across 99 languages. Unlike cloud-only APIs, Whisper runs entirely offline on consumer hardware, making it the backbone of transcription pipelines in healthcare, media, call centers, and accessibility tools.

How Whisper Works #

Whisper follows an encoder-decoder Transformer architecture. The audio input is converted to a log-Mel spectrogram and passed through an encoder. A decoder then predicts text tokens autoregressively, conditioned on special task tokens that tell the model whether to transcribe, translate, or detect language.
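
To make the task-token idea concrete, the sketch below assembles the decoder's conditioning prefix as plain strings. The token spellings follow Whisper's published vocabulary; the `build_prompt` helper is hypothetical and only stands in for what the real tokenizer does internally.

```python
def build_prompt(language: str, task: str = "transcribe", timestamps: bool = True) -> list[str]:
    """Return the special-token prefix the decoder is conditioned on (illustrative only)."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask the decoder to emit text only, without timestamp tokens
        tokens.append("<|notimestamps|>")
    return tokens


print(build_prompt("fr", task="translate", timestamps=False))
```

Switching `task` from `transcribe` to `translate` is the only change needed to go from French text to English text, which is why one set of weights can serve both jobs.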

Figure: Whisper architecture diagram

Core design decisions:

  • Large-scale weak supervision: Trained on diverse web-scale audio with noisy labels rather than small, pristine datasets
  • Multitask training: A single model handles transcription, translation, and language ID via task tokens
  • Chunked processing: Long audio is processed in consecutive 30-second windows, and the results are stitched back together
  • Conditioning on previous text: The decoder receives prior segment tokens for consistent formatting across boundaries
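
| Model | Parameters | English WER | Multilingual WER | VRAM (GPU) | Relative Speed |
|---|---|---|---|---|---|
| tiny | 39M | ~7.6% | ~12% | ~1 GB | ~10x |
| base | 74M | ~5.0% | ~10% | ~1 GB | ~7x |
| small | 244M | ~3.4% | ~7% | ~2 GB | ~4x |
| medium | 769M | ~2.9% | ~5% | ~5 GB | ~2x |
| large-v3 | 1.55B | ~2.4% | ~3.5% | ~10 GB | 1x |
| turbo | 809M | ~2.5% | ~3.7% | ~6 GB | ~8x |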
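
One way to read the model table is as a lookup: given a VRAM budget, take the most accurate model that fits. The sketch below does exactly that with the approximate figures from the table; `pick_model` is a hypothetical helper, and the numbers are the table's estimates rather than measurements.

```python
# Approximate figures copied from the model table: (model, VRAM in GB, English WER in %)
MODELS = [
    ("tiny", 1, 7.6),
    ("base", 1, 5.0),
    ("small", 2, 3.4),
    ("medium", 5, 2.9),
    ("turbo", 6, 2.5),
    ("large-v3", 10, 2.4),
]


def pick_model(vram_gb: float) -> str:
    """Return the lowest-WER model whose approximate VRAM need fits the budget."""
    fitting = [m for m in MODELS if m[1] <= vram_gb]
    if not fitting:
        raise ValueError("no model fits this budget; consider CPU inference")
    return min(fitting, key=lambda m: m[2])[0]


print(pick_model(8))
print(pick_model(4))
```

Leave some headroom in practice: the VRAM column does not account for batch size or other processes sharing the GPU.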

Installation & Setup #

Python Installation #

python -m venv whisper-env
source whisper-env/bin/activate  # Linux/macOS
# whisper-env\Scripts\activate  # Windows

# Install OpenAI Whisper
pip install -U openai-whisper

# Verify installation (the whisper CLI has no --version flag)
pip show openai-whisper
whisper --help

System Dependencies #

FFmpeg is required for audio preprocessing:

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Verify
ffmpeg -version | head -1

GPU Acceleration (CUDA) #

# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# Install with CUDA 12 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# For CPU-only inference
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Docker Deployment #

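# Pull the image. OpenAI does not publish an official Whisper image, so treat
# openai/whisper:latest as a placeholder for an image you build yourself or a
# community image whose entrypoint is the whisper CLI.
docker pull openai/whisper:latest

# Transcribe a file via Docker (outputs are written back to the mounted folder)
docker run --rm \
  --gpus all \
  -v "$(pwd)/audio:/audio" \
  openai/whisper:latest \
  /audio/interview.mp3 \
  --model large-v3 \
  --language en \
  --output_format json \
  --output_dir /audio

# CPU-only Docker run
docker run --rm \
  -v "$(pwd)/audio:/audio" \
  openai/whisper:latest \
  /audio/podcast.mp3 \
  --model base \
  --device cpu \
  --output_dir /audio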

Quick First Transcription #

import whisper

# Load model (downloads on first run)
model = whisper.load_model("base")

# Transcribe audio file
result = model.transcribe("audio.mp3")
print(result["text"])

# Get segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
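
The segment dictionaries printed above map directly onto subtitle cues. The CLI can already write SRT through `--output_format`, so the sketch below is only for library users who want to see the conversion spelled out. `srt_timestamp` and `to_srt` are hypothetical helpers, and the sample data is hand-written in the shape of `result["segments"]`.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render segments (dicts with start, end, text) as numbered SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive cues
    return "\n".join(cues)


# Hand-written sample, not real model output
sample = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 3661.25, "text": " Goodbye."},
]
print(to_srt(sample))
```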

CLI Usage Examples #

# Basic transcription
whisper audio.mp3 --model medium --language en

# Output all formats (JSON, SRT, VTT, TXT)
whisper podcast.mp3 --model large-v3 --output_format all

# Translate non-English audio to English text
whisper french_interview.mp3 --model large-v3 --task translate

# Detect the language automatically by omitting --language
whisper unknown.mp3 --model base --task transcribe
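
If you drive the CLI from a script, building the argument list in code avoids shell-quoting mistakes. This is a minimal sketch that uses only the flags shown above; `build_whisper_cmd` is a hypothetical helper, not part of the whisper package.

```python
import shlex


def build_whisper_cmd(audio, model="base", language=None, task="transcribe", output_format=None):
    """Assemble argv for the whisper CLI using only --model, --task, --language, --output_format."""
    cmd = ["whisper", str(audio), "--model", model, "--task", task]
    if language is not None:
        # Leave this out to let Whisper detect the language itself
        cmd += ["--language", language]
    if output_format is not None:
        cmd += ["--output_format", output_format]
    return cmd


cmd = build_whisper_cmd("french_interview.mp3", model="large-v3", task="translate")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```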

WhisperX (Word-Level Timestamps + Diarization) #

WhisperX wraps faster-whisper and adds phoneme-level alignment and speaker diarization. It is the tool of choice for meeting transcripts and interview processing.

pip install whisperx

import whisperx
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
audio_file = "meeting.mp3"
batch_size = 16
# float16 needs a GPU; fall back to int8 when running on CPU
compute_type = "float16" if device == "cuda" else "int8"

# 1. Transcribe with faster-whisper backend
model = whisperx.load_model("large-v3", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# 2. Align for precise word-level timestamps
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"],
    device=device
)
result = whisperx.align(
    result["segments"],
    model_a,
    metadata,
    audio,
    device
)

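# 3. Speaker diarization (needs a Hugging Face token with access to the pyannote models)
# Recent WhisperX releases expose this class as whisperx.diarize.DiarizationPipeline;
# adjust the reference if the attribute below is missing in your version.
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token="YOUR_HF_TOKEN",
    device=device
)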
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

# Print speaker-labeled transcript
for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.2f}s - {end:.2f}s] {speaker}: {text}")
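
Diarized output tends to contain many short segments from the same speaker in a row. For readable meeting notes you may want to merge them into turns. The sketch below assumes segments shaped like the ones printed above (`speaker`, `start`, `end`, `text`); `merge_speaker_turns` and its `max_gap` parameter are hypothetical, and the sample data is hand-written.

```python
def merge_speaker_turns(segments, max_gap=1.0):
    """Merge consecutive segments by the same speaker when the pause is at most max_gap seconds."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        last = turns[-1] if turns else None
        if last and last["speaker"] == speaker and seg["start"] - last["end"] <= max_gap:
            # Same speaker, short pause: extend the current turn
            last["end"] = seg["end"]
            last["text"] += " " + text
        else:
            turns.append({"speaker": speaker, "start": seg["start"], "end": seg["end"], "text": text})
    return turns


# Hand-written sample, not real model output
sample = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.0, "text": " Good morning."},
    {"speaker": "SPEAKER_00", "start": 2.3, "end": 4.0, "text": " Shall we start?"},
    {"speaker": "SPEAKER_01", "start": 4.5, "end": 6.0, "text": " Yes, go ahead."},
]
for turn in merge_speaker_turns(sample):
    print(f"[{turn['start']:.2f}s - {turn['end']:.2f}s] {turn['speaker']}: {turn['text']}")
```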

faster-whisper (Production Inference) #

faster-whisper re-implements Whisper on top of CTranslate2, delivering a 4-8x speedup with quantization support. It is a common default for production APIs.

pip install faster-whisper

from faster_whisper import WhisperModel

# Load with quantization for lower memory
model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="float16",   # Options: int8, int8_float16, float16, float32
    num_workers=4,
    cpu_threads=8
)

# Transcribe with VAD filtering
segments, info = model.transcribe(
    "podcast.mp3",
    beam_size=5,
    vad_filter=True,
    vad_parameters={
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "min_silence_duration_ms": 500
    },
    language="en",
    condition_on_previous_text=True
)

print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")

# segments is a generator: decoding happens as you iterate over it
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
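
The three `vad_parameters` above are easier to tune once you see how they interact. The toy sketch below is not the Silero VAD that faster-whisper ships; it is a simplified stand-in that applies the same three knobs to a list of per-frame speech probabilities. `speech_regions`, the 100 ms frame size, and the probabilities are all assumptions made for illustration.

```python
def speech_regions(probs, frame_ms=100, threshold=0.5,
                   min_speech_duration_ms=250, min_silence_duration_ms=500):
    """Turn per-frame speech probabilities into (start_ms, end_ms) speech regions."""
    # 1. Threshold each frame and collect raw runs of speech
    runs, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            runs.append([start * frame_ms, i * frame_ms])
            start = None
    if start is not None:
        runs.append([start * frame_ms, len(probs) * frame_ms])

    # 2. Bridge silences shorter than min_silence_duration_ms
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_silence_duration_ms:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # 3. Drop regions shorter than min_speech_duration_ms
    return [(s, e) for s, e in merged if e - s >= min_speech_duration_ms]


# Hypothetical probabilities, one value per 100 ms frame
probs = [0.1, 0.9, 0.9, 0.9, 0.2, 0.2, 0.8, 0.8, 0.8,
         0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.7, 0.1]
print(speech_regions(probs))
```

In this example the 200 ms pause is bridged into one region, while the isolated 100 ms blip is discarded as too short to be speech. Raising `min_silence_duration_ms` yields fewer, longer regions; raising `threshold` makes the detector more conservative.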

LibreTranslate Integration (Translation Pipeline) #

import whisper
import requests

# Transcribe non-English audio
model = whisper.load_model("medium")
audio_path = "japanese_podcast.mp3"
result = model.transcribe(audio_path, language="ja")
japanese_text = result["text"]

# Translate via a self-hosted LibreTranslate API
def translate(text, source="ja", target="en"):
    response = requests.post(
        "http://localhost:5000/translate",
        json={"q": text, "source": source, "target": target, "format": "text"},
        timeout=60,
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()["translatedText"]

english_text = translate(japanese_text)
print(f"JA: {japanese_text}")
print(f"EN: {english_text}")

FastAPI Transcription Server (File Upload) #

from fastapi import FastAPI, UploadFile, File
from faster_whisper import WhisperModel
import tempfile
import os

app = FastAPI()
model = WhisperModel("medium", device="cuda", compute_type="float16")

@app.post("/transcribe")
def transcribe(file: UploadFile = File(...)):
    # A plain def lets FastAPI run this blocking handler in its threadpool
    with tempfile.NamedTemporaryFile(delete=False, suffix=".mp3") as tmp:
        tmp.write(file.file.read())
        tmp_path = tmp.name

    try:
        segments, info = model.transcribe(
            tmp_path,
            beam_size=5,
            vad_filter=True
        )
        # segments is a generator; consume it before the temp file is removed
        results = [
            {
                "start": s.start,
                "end": s.end,
                "text": s.text,
                # avg_logprob is always set; s.words is only filled with word_timestamps=True
                "avg_logprob": s.avg_logprob
            }
            for s in segments
        ]
    finally:
        os.unlink(tmp_path)

    return {
        "language": info.language,
        "language_probability": info.language_probability,
        "segments": results
    }

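Run with: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 2

Each worker is a separate process that loads its own copy of the model, so budget GPU memory per worker.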

Prometheus Monitoring Integration #

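from prometheus_client import Counter, Histogram, start_http_server
from faster_whisper import WhisperModel
import time

MODEL_NAME = "medium"
model = WhisperModel(MODEL_NAME, device="cuda", compute_type="float16")

TRANSCRIPTION_COUNT = Counter(
    "whisper_transcriptions_total",
    "Total transcriptions",
    ["model", "language"]
)
TRANSCRIPTION_DURATION = Histogram(
    "whisper_transcription_duration_seconds",
    "Time spent transcribing",
    ["model"]
)

def transcribe_with_metrics(audio_path, model_name=MODEL_NAME):
    start = time.perf_counter()
    segments, info = model.transcribe(audio_path)
    # segments is lazy; consume it here so the timer covers the actual decoding
    segments = list(segments)
    duration = time.perf_counter() - start

    TRANSCRIPTION_COUNT.labels(model=model_name, language=info.language).inc()
    TRANSCRIPTION_DURATION.labels(model=model_name).observe(duration)

    return segments, info

# Start the metrics endpoint on port 9090
# (9090 is also the Prometheus server's default port; pick another if both share a host)
start_http_server(9090)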
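
One pitfall when timing faster-whisper: `transcribe()` returns its segments as a generator, so decoding only happens while you iterate. The stand-in below uses `time.sleep` in place of real decoding to show how much a timer misses when the generator is not consumed inside the timed region. `fake_transcribe` is hypothetical and exists only for this demonstration.

```python
import time


def fake_transcribe(n_segments=3, work_seconds=0.05):
    """Stand-in for a lazy transcriber: each segment costs work_seconds when consumed."""
    def segments():
        for i in range(n_segments):
            time.sleep(work_seconds)  # simulated decoding work
            yield f"segment {i}"
    return segments()


# Timing only the call measures almost nothing: no segment has been decoded yet
start = time.perf_counter()
lazy = fake_transcribe()
lazy_elapsed = time.perf_counter() - start

# Consuming the generator inside the timed region captures the real cost
start = time.perf_counter()
texts = list(fake_transcribe())
eager_elapsed = time.perf_counter() - start

print(f"call only: {lazy_elapsed:.3f}s, fully consumed: {eager_elapsed:.3f}s")
```

The same reasoning applies to any histogram that wraps `model.transcribe`: materialize the segments before stopping the clock.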

Benchmarks / Real-World Use Cases #

Figure: Whisper language breakdown

Word Error Rate Comparison (LibriSpeech test-clean) #

| Model / Engine | WER (clean) | WER (other) | Multilingual | Year |
|---|---|---|---|---|
| Whisper tiny | 7.6% | 12.0% | 12.0% | 2022 |
| Whisper base | 5.0% | 8.1% | 10.0% | 2022 |
| Whisper small | 3.4% | 5.8% | 7.0% | 2022 |
| Whisper medium | 2.9% | 5.0% | 5.0% | 2022 |
| Whisper large-v3 | 2.4% | 4.2% | 3.5% | 2023 |
| Whisper turbo | 2.5% | 4.3% | 3.7% | 2024 |
| faster-whisper (large-v3) | 2.4% | 4.2% | 3.5% | 2024 |
| WhisperX (large-v3) | 2.4% | 4.2% | 3.5% | 2024 |
| Mozilla DeepSpeech | 7.3% | 21.5% | N/A (English only) | 2020 |
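
For reference, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal implementation for checking your own audio. Published benchmarks apply text normalization (punctuation, numbers, casing) before scoring; this version only lowercases and splits on whitespace, so its numbers will not match the table exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


ref = "the quick brown fox jumps over the lazy dog"
hyp = "the quick brown fox jumped over lazy dog"
print(f"WER: {word_error_rate(ref, hyp):.1%}")
```

Here one substitution ("jumps" to "jumped") and one deletion ("the") over nine reference words give a WER of 2/9.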

Inference Speed Benchmark (1-hour audio, NVIDIA RTX 4090) #

| Engine | Model | Time | VRAM | Notes |
|---|---|---|---|---|
| OpenAI Whisper | large-v3 | ~90 min | ~10 GB | Baseline |
| faster-whisper | large-v3 | ~18 min | ~6 GB | float16, 4-8x speedup |
| faster-whisper | large-v3 | ~12 min | ~4 GB | int8 quantization |
| WhisperX | large-v3 | ~25 min | ~8 GB | Includes alignment |
| WhisperX (no diarize) | large-v3 | ~18 min | ~6 GB | Transcription only |
| OpenAI Whisper | turbo | ~12 min | ~6 GB | Pruned decoder (4 layers) |
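
The timings above translate directly into capacity. As a rough sketch, the snippet below converts minutes per audio hour into a speedup over the baseline and into audio hours per GPU per day. The figures are copied from the benchmark table; `audio_hours_per_day` and its `utilization` parameter are hypothetical, and real throughput depends on your audio and batching.

```python
# Wall-clock minutes per hour of audio, copied from the benchmark table
MINUTES_PER_AUDIO_HOUR = {
    "openai-whisper large-v3": 90,
    "faster-whisper large-v3 float16": 18,
    "faster-whisper large-v3 int8": 12,
    "WhisperX large-v3 (with alignment)": 25,
    "openai-whisper turbo": 12,
}


def audio_hours_per_day(minutes_per_audio_hour: float, utilization: float = 1.0) -> float:
    """Hours of audio one GPU can process in 24 hours at the given utilization (0 to 1)."""
    return 24 * 60 / minutes_per_audio_hour * utilization


baseline = MINUTES_PER_AUDIO_HOUR["openai-whisper large-v3"]
for name, minutes in MINUTES_PER_AUDIO_HOUR.items():
    print(f"{name}: {baseline / minutes:.1f}x baseline, "
          f"{audio_hours_per_day(minutes):.0f} audio hours/day")
```

Use a `utilization` below 1.0 to account for queue gaps, model loading, and retries when sizing hardware.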

Production Deployment Scenarios #

| Use Case | Recommended Model | Engine | Hardware | Daily Volume |
|---|---|---|---|---|
| Podcast transcription | large-v3 | faster-whisper | 1x A100 | 500+ hours |
| Real-time meeting notes | turbo | faster-whisper | 1x RTX 4090 | 200+ hours |
| Call center analytics | medium | faster-whisper (int8) | 2x RTX 3080 | 1000+ hours |
| Mobile/edge device | tiny.en | whisper.cpp | 8 GB RAM | Offline |
| Video subtitle generation | large-v3 | WhisperX | 1x A100 | 300+ hours |

Advanced Usage / Production Hardening #

Model Quantization for Lower Memory #

from faster_whisper import WhisperModel

# INT8 quantization: about 1.5x faster and a third less VRAM than float16 in the benchmark above
model_int8 = WhisperModel("large-v3", device="cuda", compute_type="int8")

# INT8 with float16 activations — balanced
model_hybrid = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

# CPU with INT8
model_cpu = WhisperModel("medium", device="cpu", compute_type="int8", cpu_threads=8)
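
If the same code has to run on machines with different GPUs, you can choose the `compute_type` at startup. The sketch below is one possible policy for large-v3. The thresholds are rough assumptions taken from the benchmark table (about 6 GB for float16, about 4 GB for int8), and `pick_compute_type` is a hypothetical helper.

```python
def pick_compute_type(device: str, free_vram_gb: float = 0.0) -> str:
    """Choose a faster-whisper compute_type for large-v3 (thresholds are rough assumptions)."""
    if device == "cpu":
        return "int8"            # quantized weights keep CPU inference practical
    if free_vram_gb >= 6:
        return "float16"         # enough room for half-precision weights
    if free_vram_gb >= 4:
        return "int8_float16"    # int8 weights with float16 activations
    return "int8"


for device, vram in [("cpu", 0), ("cuda", 24), ("cuda", 5), ("cuda", 2)]:
    print(device, vram, "->", pick_compute_type(device, vram))
```

The returned string can be passed straight to `WhisperModel(..., compute_type=...)`. Measure accuracy on your own audio before settling on a quantized type.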

Batch Processing Pipeline #

import os
from concurrent.futures import ThreadPoolExecutor
from faster_whisper import WhisperModel

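# num_workers lets several threads call transcribe() on the same model concurrently
model = WhisperModel("medium", device="cuda", compute_type="float16", num_workers=4)

def process_file(audio_path):
    segments, info = model.transcribe(
        audio_path,
        vad_filter=True,
        beam_size=5
    )
    # Segment text usually starts with a space, so join without a separator
    text = "".join(s.text for s in segments).strip()
    # splitext only swaps the extension; str.replace would also hit ".mp3" inside the path
    output_path = os.path.splitext(audio_path)[0] + ".txt"
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(text)
    return output_path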

# Process directory of audio files
audio_dir = "/data/audio/"
files = [os.path.join(audio_dir, f) for f in os.listdir(audio_dir) if f.endswith(".mp3")]

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_file, files))

print(f"Processed {len(results)} files")

NGINX Load Balancing (Multi-GPU) #

upstream whisper_backend {
    least_conn;
    server 10.0.1.10:8000 weight=1;  # GPU 0
    server 10.0.1.10:8001 weight=1;  # GPU 1
    server 10.0.1.11:8000 weight=1;  # GPU 2
    server 10.0.1.11:8001 weight=1;  # GPU 3
}

server {
    listen 80;
    location /transcribe {
        proxy_pass http://whisper_backend;
        proxy_read_timeout 300s;
        client_max_body_size 500M;
    }
}

Health Check Endpoint #

from fastapi import FastAPI, HTTPException
from faster_whisper import WhisperModel
import torch

app = FastAPI()
model = WhisperModel("medium", device="cuda", compute_type="float16")

@app.get("/health")
async def health():
    gpu_available = torch.cuda.is_available()
    if not gpu_available:
        # The model above is pinned to CUDA, so a missing GPU means the service cannot work
        raise HTTPException(status_code=503, detail="GPU not available")
    gpu_memory = torch.cuda.get_device_properties(0).total_memory
    return {
        "status": "healthy",
        "gpu_available": gpu_available,
        "gpu_memory_gb": gpu_memory / (1024**3),
        "model_loaded": model is not None
    }

Redis Queue for Async Processing #

import redis
import json
from faster_whisper import WhisperModel

r = redis.Redis(host='localhost', port=6379, db=0)
model = WhisperModel("medium", device="cuda", compute_type="float16")

def worker():
    while True:
        # blpop blocks for up to 5 seconds, so no extra sleep is needed
        job = r.blpop("transcription_queue", timeout=5)
        if job is None:
            continue
        _, data = job
        task = json.loads(data)
        try:
            segments, info = model.transcribe(task["file_path"])
            result = {
                "job_id": task["job_id"],
                "status": "done",
                "text": " ".join([s.text for s in segments]),
                "language": info.language
            }
        except Exception as exc:
            # Record the failure and keep the worker alive for the next job
            result = {"job_id": task["job_id"], "status": "error", "error": str(exc)}
        # Results expire after one hour
        r.setex(f"result:{task['job_id']}", 3600, json.dumps(result))

if __name__ == "__main__":
    worker()
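
The worker above needs a producer. The sketch below shows one possible client side, using the same queue name and `result:<job_id>` key layout. `enqueue` and `fetch_result` are hypothetical helpers; they accept any client object with `rpush` and `get` methods, which `redis.Redis` provides.

```python
import json
import uuid

QUEUE = "transcription_queue"


def enqueue(client, file_path: str) -> str:
    """Push a job in the shape the worker expects and return its job_id."""
    job_id = uuid.uuid4().hex
    # rpush appends to the tail; the worker's blpop takes from the head, giving FIFO order
    client.rpush(QUEUE, json.dumps({"job_id": job_id, "file_path": file_path}))
    return job_id


def fetch_result(client, job_id: str):
    """Return the decoded result dict, or None if the job is unfinished or has expired."""
    raw = client.get(f"result:{job_id}")
    return json.loads(raw) if raw is not None else None


# With a real server (assumes the redis package and a local instance):
#   import redis
#   client = redis.Redis(host="localhost", port=6379, db=0)
#   job_id = enqueue(client, "/data/audio/call.mp3")
#   print(fetch_result(client, job_id))
```

Because results expire after an hour, callers should treat `None` as either "not ready yet" or "expired" and apply their own polling timeout.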

Comparison with Alternatives #

Figure: Whisper model variants comparison

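| Feature | OpenAI Whisper | WhisperX | faster-whisper | DeepSpeech |
|---|---|---|---|---|
| GitHub Stars | 99,800 | 19,700 | 20,400 | 26,700 (archived) |
| License | MIT | BSD-2 | MIT | MPL-2.0 |
| Speed vs Baseline | 1x (baseline) | ~3.5-5x (per the benchmark above) | 4-8x | 2x |
| GPU Required | Optional | Recommended | Optional | Optional |
| Quantization | No | Yes (via faster-whisper backend) | Yes (INT8/FP16) | No |
| Word Timestamps | Segment-level by default; word-level with word_timestamps=True | Word-level (phoneme-based alignment) | Segment-level by default; word-level with word_timestamps=True | No |
| Speaker Diarization | No | Built-in | No | No |
| Languages | 99 | 99 | 99 | English only |
| WER (LibriSpeech clean) | 2.4% (large-v3) | 2.4% | 2.4% | 7.3% |
| Active Development | Yes | Yes | Yes | No (archived Jun 2025) |
| Best For | Research, reference | Meeting transcripts | Production APIs | Legacy only |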

Limitations / Honest Assessment #

Whisper is not the right tool for every speech task. Here is what the README does not tell you:

  1. No streaming support: Whisper processes 30-second chunks; it is not designed for true real-time (<200ms latency) transcription. For streaming ASR, look at NVIDIA Parakeet or Moonshine v2.

  2. Hallucination on silence: Large-v3 occasionally generates hallucinated text on silent segments. Use VAD filtering (built into faster-whisper) to mitigate.

  3. English-centric training: While it supports 99 languages, performance on low-resource African and South Asian languages drops noticeably. Fine-tuning on target language data is often necessary.

  4. Memory footprint: Large-v3 requires ~10 GB VRAM at FP32. You need quantization (faster-whisper) or CPU offloading for consumer GPUs.

  5. No speaker diarization: The base Whisper cannot tell you who spoke. You need WhisperX or a separate diarization pipeline for multi-speaker identification.

  6. Translation limitations: The turbo model was not trained for translation. Use medium or large-v3 for translate-to-English tasks.
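
For the hallucination issue in point 2, a post-filter on segment statistics is a cheap second line of defense alongside VAD. The sketch below assumes segment dictionaries as returned by `model.transcribe` in openai-whisper, which carry `no_speech_prob` and `avg_logprob`. The default thresholds mirror Whisper's own `no_speech_threshold` and `logprob_threshold` defaults; `drop_likely_hallucinations` is a hypothetical helper, and the sample data is hand-written.

```python
def drop_likely_hallucinations(segments, no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Drop segments that look like text generated over silence; keep everything else."""
    kept = []
    for seg in segments:
        silent = seg["no_speech_prob"] > no_speech_threshold
        unsure = seg["avg_logprob"] < logprob_threshold
        if silent and unsure:
            # Likely silence and low decoder confidence: treat as hallucinated
            continue
        kept.append(seg)
    return kept


# Hand-written sample, not real model output
sample = [
    {"text": " Welcome back.", "no_speech_prob": 0.02, "avg_logprob": -0.21},
    {"text": " Thanks for watching!", "no_speech_prob": 0.91, "avg_logprob": -1.40},
    {"text": " (quiet but confident)", "no_speech_prob": 0.70, "avg_logprob": -0.30},
]
for seg in drop_likely_hallucinations(sample):
    print(seg["text"].strip())
```

Requiring both conditions keeps quiet but confidently decoded speech, which a filter on `no_speech_prob` alone would discard.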

Frequently Asked Questions #

1. What is the difference between Whisper and faster-whisper? #

faster-whisper is a re-implementation of Whisper using CTranslate2, a C++ inference engine. It provides 4-8x speedup, INT8 quantization, and built-in VAD filtering while using the same model weights, so transcripts closely match the reference implementation; small differences can appear, especially with INT8 quantization. Use faster-whisper for production; use OpenAI Whisper for research and experimentation.

2. Can I run Whisper on a CPU? #

Yes. All models except large-v3 run comfortably on modern CPUs. Use the tiny model for real-time-ish transcription on laptops, or medium with INT8 quantization for batch processing. Expect 3-5x slower inference compared to GPU.

3. Which model size should I choose? #

Start with base for English-only quick tasks, small for daily multilingual use, medium for professional accuracy, and large-v3 when maximum accuracy is non-negotiable. The turbo model is the sweet spot for latency-sensitive production workloads.

4. How do I handle long audio files efficiently? #

Use faster-whisper with vad_filter=True to skip silent segments. For files longer than 1 hour, split into chunks and process in parallel. WhisperX handles long files natively and is more stable on 3+ hour audio than base Whisper.
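
When splitting long files for parallel processing, overlapping the chunks slightly avoids cutting a word in half at a boundary. The sketch below only plans the windows; cutting the audio (for example with ffmpeg's `-ss` and `-t` options) and de-duplicating the overlapping text are left out. `plan_chunks`, the 10-minute chunk size, and the 5-second overlap are assumptions for illustration.

```python
def plan_chunks(duration_s: float, chunk_s: float = 600.0, overlap_s: float = 5.0):
    """Return (start, end) windows in seconds covering the audio, each overlapping the previous one."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    windows, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        # Step back by the overlap so the next window re-covers the boundary
        start = end - overlap_s
    return windows


# A 25-minute file split into 10-minute windows
print(plan_chunks(1500.0))
```

Each window's start offset must be added back to the segment timestamps from that chunk so the merged transcript stays on the original timeline.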

5. Is Whisper free for commercial use? #

Yes. Whisper is released under the MIT license, which permits commercial use, modification, and distribution. There are no API fees because you run it on your own hardware. Your only cost is compute (GPU/cloud instances).

6. How does Whisper compare to cloud APIs like Google Speech-to-Text? #

Whisper large-v3 achieves comparable WER to Google Speech-to-Text on English (2.4% vs 2.1% on LibriSpeech). Whisper wins on privacy (on-premise) and cost (no per-minute fees). Google covers more languages (125 vs 99), and cloud APIs win on integration ecosystem and managed scaling.

7. What hardware do I need for production? #

A single NVIDIA A100 (80 GB) can run 4 large-v3 instances with float16, processing ~200 hours of audio per day. For budget setups, an RTX 4090 (24 GB) with faster-whisper handles medium and large-v3 in float16 comfortably.

Conclusion #

OpenAI Whisper remains the pragmatic choice for production speech recognition in 2026. Its 99,800 GitHub stars reflect not just popularity but ecosystem maturity: faster-whisper gives you speed, WhisperX gives you diarization, and the core model gives you accuracy across 99 languages. Start with faster-whisper and the medium model, add WhisperX when you need speaker labels, and quantize to INT8 when GPU memory is tight.

Next steps:

  • Clone the repo: git clone https://github.com/openai/whisper
  • Benchmark faster-whisper on your own audio data before committing to a model size
