lang: vi slug: faster-whisper title: ‘faster-whisper: 4x Faster Speech-to-Text with 23K+ Stars’ description: ‘faster-whisper (SYSTRAN) reimplements OpenAI Whisper via CTranslate2 for 4x speedup. Covers faster whisper tutorial, benchmark data, Docker setup, Python API, VAD filter, batch processing, and production hardening with WhisperX and whisper.cpp integration.’ tags: [“open-source”] date: 2026-05-19 00:00:00+08:00 lastmod: 2026-05-19 00:00:00+08:00 tech_stack: [] application_domain: Ai Tools source_version: ’' licensing_model: Open Source license_type: MIT file_size: ’' file_md5: ’' download_url: ’' backup_url: ’' github_repo: ‘https://github.com/SYSTRAN/faster-whisper' last_maintained: ‘2026-05-19’ draft: false categories: [‘ai-tools’] aliases:- /posts/faster-whisper/ faqs:

  • q: ‘How much faster is faster-whisper than OpenAI’’s original Whisper?’ a: ‘On GPU (RTX 3070 Ti, large-v2, 13-min audio), faster-whisper fp16 single inference is about 2.3x faster than openai/whisper (1m03s vs 2m23s), and INT8 batched inference reaches 16 seconds, about 8.9x faster. On CPU, INT8 is roughly 4.1x faster than the original.’
  • q: ‘Is faster-whisper accuracy the same as OpenAI Whisper?’ a: ‘Yes. faster-whisper uses the same model weights and tokenizer as OpenAI Whisper, so Word Error Rate differences are within 0.1% and indistinguishable in practice. Any variation comes from beam search randomness, not the inference engine.’
  • q: ‘Does faster-whisper support AMD GPUs or Apple Silicon GPU acceleration?’ a: ‘No. CTranslate2 only supports CUDA for GPU acceleration, so AMD GPUs and Apple Silicon Metal are not supported and fall back to CPU. For AMD use whisper.cpp with Vulkan or ROCm, and for Apple Silicon use whisper.cpp with Metal.’
  • q: ‘Which compute_type should I use in faster-whisper?’ a: ‘Use int8 for maximum VRAM savings on 8 GB and older cards like GTX 10xx; on CPU int8 large-v3 needs only about 3 GB RAM. Use float16 for best speed on modern GPUs (RTX 30xx/40xx/50xx, A100, H100), or int8_float16 as a middle ground.’
  • q: ‘Can faster-whisper do speaker diarization (who spoke when)?’ a: ‘No, faster-whisper outputs transcription only. For speaker labels use WhisperX, which builds on faster-whisper and adds speaker diarization plus word-level timestamp alignment via wav2vec2.’

featureImage: /images/articles/faster-whisper-4x-faster-speech-to-text.png —{{< resource-info >}}OpenAI’s Whisper changed speech-to-text in 2022, but the original Python implementation left significant performance on the table. For a 13-minute audio file, openai/whisper with the large-v2 model takes over 4 minutes on a Tesla V100 GPU — unacceptable for production pipelines processing hundreds of hours daily. SYSTRAN’s faster-whisper reimplements Whisper inference using CTranslate2, delivering up to 4x speedup at identical accuracy while cutting VRAM usage by nearly 70%. With 23,000+ GitHub stars, it has become the de facto runtime for production speech-to-text in Python environments.This guide provides a production-grade faster whisper tutorial covering installation, benchmarking, Docker deployment, and integration with WhisperX and whisper.cpp. Every command and config is copy-paste ready — a complete speech to text setup you can deploy today.

faster-whisper GitHub repo
Figure 1: faster-whisper GitHub repository — 23,000+ stars, actively maintained by SYSTRAN.## How faster-whisper WorksThe architecture replaces PyTorch inference with CTranslate2’s optimized runtime:
CTranslate2 Architecture
Figure 2: CTranslate2 inference engine — the C++ backend powering faster-whisper’s speedup through custom CUDA kernels and quantization.## What Is faster-whisper?faster-whisper is a reimplementation of OpenAI’s Whisper automatic speech recognition (ASR) model using CTranslate2, a high-performance C++ inference engine for Transformer models. It runs the same model weights as the original Whisper but delivers substantially higher throughput and lower memory usage through custom CUDA kernels, INT8/FP16 quantization, and fused attention operations.The project started as a community effort by Guillaume Klein and is now maintained by SYSTRAN under the MIT license. It supports all Whisper model sizes (tiny through large-v3) on both CPU and NVIDIA GPU, with automatic model conversion from Hugging Face Hub on first load.## How faster-whisper WorksThe architecture replaces PyTorch inference with CTranslate2’s optimized runtime:``` Audio Input (wav/mp3/flac) | v PyAV Audio Decoder (FFmpeg bundled, no external dependency) | v Mel Spectrogram Computation | v CTranslate2 Whisper Engine (C++ backend) |– Custom CUDA kernels for NVIDIA GPU |– INT8/FP16 quantized weights |– Fused self-attention + cross-attention |– SIMD-optimized CPU path (AVX2/NEON) | v Tokenizer (Hugging Face tokenizers) | v Transcription Segments (start, end, text, confidence)

e
y
technical decisions that enable the speedup:- **Weight quantization**: INT8 reduces model memory by ~50% with negligible accuracy loss (< 0.1% WER).
- **Fused kernels**: CTranslate2 merges multiple GPU operations into single kernel launches, reducing dispatch overhead.
- **Flash Attention support**: Available on Ampere GPUs (RTX 30xx+) for additional memory bandwidth savings.
- **Batched inference**: Process multiple audio chunks in parallel on GPU for near-linear throughput scaling.
- **Silero VAD integration**: Built-in voice activity detection skips silent segments, reducing wasted compute.## Installation & Setup### Requirements- Python 3.9+
- For GPU: NVIDIA GPU with CUDA 12.x and cuDNN 9.x
- No FFmpeg installation required (bundled via PyAV)### pip Install (CPU)```
bas
h
# Create virtual environment
python -m venv venv-whisper
source venv-whisper/bin/activate  # Linux/Mac
# venv-whisper\Scripts\activate  # Windows# Install faster-whisper
pip install faster-whisper
```### pip Install with GPU Support```
bas
h
# Install cuBLAS and cuDNN via pip (Linux only)
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*# Set library path before running
export LD_LIBRARY_PATH=$(python3 -c 'import os; import nvidia.cublas.lib; import nvid```
bas
h
# Create virtual environment
python -m venv venv-whisper
source venv-whisper/bin/activate  # Linux/Mac
# venv-whisper\Scripts\activate  # Windows

# Install faster-whisper
pip install faster-whisper
```h
e
official NVIDIA CUDA image with cuDNN
docker run -it --rm --gpus all \
  -v $(pwd)/audio:/audio \
  nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04 \
  bash# Inside container
apt-get update && apt-get install -y python3-pip
pip install ```
bas
h
# Install cuBLAS and cuDNN via pip (Linux only)
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*

# Set library path before running
export LD_LIBRARY_PATH=$(python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))')

# Install faster-whisper
pip install faster-whisper
```h
o
n
verify_setup.py
```### First Transcription```
pytho
n
from faster_whisper import WhisperModel# Load model (auto-downloads from Hugging Face on first run)
model = WhisperModel("large-v3", device="cuda", compute_type="int8")# Transcribe audio file
segments, info = model.transcribe("audio.mp3", beam_size=5)print(f"Detected language: {info.language} "
      f"(probability: {info.language_probability:.2f})"```
bas
h
# Pull the official NVIDIA CUDA image with cuDNN
docker run -it --rm --gpus all \
  -v $(pwd)/audio:/audio \
  nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04 \
  bash

# Inside container
apt-get update && apt-get install -y python3-pip
pip install faster-whisper
```h
e
go-to tool for meeting transcription.```
bas
h
pip install whisperx

pytho n import whisperx import torchdevice = “cuda” if torch.cuda.is_available() else “cpu” audio_file = “meeting.mp3”# 1. Transcribe with faster-whisper backend model = whisperx.load_model(“large-v3”, de``` pytho n

verify_setup.py #

from faster_whisper import WhisperModel import torch

print(f"PyTorch CUDA available: {torch.cuda.is_available()}") print(f"CUDA devices: {torch.cuda.device_count()}")

model = WhisperModel(“tiny”, device=“cuda”, compute_type=“float16”) print(f"Model loaded on: {model.model.device}") print(“Setup verified successfully”)

o
n
(requires Hugging Face token)
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token="your_hf_token", device=device
)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")
    print(f"[{segment['star```
bas
h
python verify_setup.py
```] "
          f"{speaker}: {segment['text']}")
```##```
pytho
n
from faster_whisper import WhisperModel

# Load model (auto-downloads from Hugging Face on first run)
model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# Transcribe audio file
segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Detected language: {info.language} "
      f"(probability: {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```", "output": "json"}
    )
print(response.json())
```### Speaches (Self-Hosted OpenAI-Compatible Server)Speaches is a modern, OpenAI-compatible server built on faster-whisper:```
bas
h
docker run -d --gpus all \
  -p 8000:8000 \
  -e WHISPER__MODEL=large-v3 \
  -e WHISPER__COMPUTE_TYPE=int8 \
  fedirz/speaches:latest-gpu

pytho n from openai import OpenAIclient = OpenAI( base_url=“http://localhost:8000/v1”, api_key=“dummy” )with open(“audio.mp3”, “rb”) as f: transcript = client.audio.transcriptions.create( model=“large-v3”, file=f ) print(transcript.text) ### LibreTranslate (Translation Pipeline)Chain transcription with tr bas h pip install whisperx

w
s
:```
pytho
n
from faste```
pytho
n
import whisperx
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
audio_file = "meeting.mp3"

# 1. Transcribe with faster-whisper backend
model = whisperx.load_model("large-v3", device, compute_type="int8")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=8)

# 2. Align for word-level timestamps
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

# 3. Speaker diarization (requires Hugging Face token)
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token="your_hf_token", device=device
)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] "
          f"{speaker}: {segment['text']}")
```nchm
a
r
k
results — faster-whisper achieves up to 8.9x speedup over openai/whisper with batched INT8 inference on RTX 3070 Ti.*### GPU Benchmark: 13 Minutes of Audio, large-v2 Model| Implementation | Precision | Beam Size | Time | VRAM Usage |
|---|---|---|---|---|
| openai/whisper | fp16 | 5 | 2m 23s | 4708 MB |
| whisper.cpp (Flash Attention) | fp16 | 5 | 1m 05s | 4127 MB |
| transformers (SDPA) | fp16 | 5 | 1m 52s | 4960 MB |
| **faster-whisper** | **fp16** | **5** | **1m 03s** | **4525 MB** |
| **faster-whisper (batch_size=8)** | **fp16** | **5** | **17s** | **6090 MB** |
| **faster-whisper** | **int8** | **5** | **59s** | **2926 MB** |
| **faster-whisper (batch_size=8)** | **int8** | **5** | **16s** | **4500 MB** |*Executed with CUDA 12.4 on NVIDIA RTX 3070 Ti 8GB.*Key takeaways from the GPU benchmark:- **Single inference**: faster-whisper fp16 is **2.3x faster** than openai/whisper (1m03s vs 2m23s).
- **Batched inference**: With `batch_size=8`, faster-whisper processes the same audio in **17 seconds** — an **8.4x speedup** over the original.
- **INT8 quantization**: Red```
bas
h
docker run -d --gpus all \
  -p 9000:9000 \
  -e ASR_MODEL=large-v3 \
  -e ASR_ENGINE=faster_whisper \
  -e COMPUTE_TYPE=int8 \
  onerahming/openai-whisper-asr
```pen
a
i
/whisper.### CPU Benchmark: 13 Minutes of Audio, small Model| Implementation | Precision | Beam Size | Time | RAM Usage |
|---|---|---|---|---|
| openai/whis```
pytho
n
import requests

with open("audio.mp3", "rb") as f:
    response = requests.post(
        "http://localhost:9000/asr",
        files={"audio_file": f},
        data={"language": "en", "output": "json"}
    )
print(response.json())
```| **fp32** | **5** | **1m 06s** | **4230 MB** |
| **faster-whisper** | **int8** | **5** | **1m 42s** | **1477 MB** |
| **faster-whisper (batch_size=8)** | **int8** | **5** | **51s** | **3608 MB** |*Executed with 8 threads on Intel Core i7-12700K.*Key takeaways from the CPU benchmark:- **INT8 on CPU**: faster-whisper int8 is **4.1x faster** than openai/whis```
bas
h
docker run -d --gpus all \
  -p 8000:8000 \
  -e WHISPER__MODEL=large-v3 \
  -e WHISPER__COMPUTE_TYPE=int8 \
  fedirz/speaches:latest-gpu
```o
r
y
efficiency**: INT8 uses only 1477 MB RAM vs 2335 MB for openai/whisper (37% reduction).### distil-whisper-large-v3 Benchmark| Implement```
pytho
n
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="large-v3", file=f
    )
print(transcript.text)
```oduct
i
o
n
Use Cases| Use Case | Model | Hardware | Performance |
|---|---|---|---|
| **Meeting transcription** (1h audio) | large-v3 int8 | RTX 4070 | ~3 min processing |
| **Podcast batch processing** (100 files) | large-v3 int8 batch=8 | A100 40GB | ~20 min for 100h |
| **Real-time captions** | small int8 | RTX 3060 | ~200ms latency |
| **Call center analytics** | medi```
pytho
n
from faster_whisper import WhisperModel
import requests

# Transcribe non-English audio
model = WhisperModel("large-v3", device="cuda", compute_type="int8")
segments, info = model.transcribe("japanese_audio.mp3", beam_size=5)
japanese_text = " ".join([s.text for s in segments])

# Translate via LibreTranslate
response = requests.post("http://localhost:5000/translate", json={
    "q": japanese_text,
    "source": "ja",
    "target": "en"
})
translation = response.json()["translatedText"]
print(f"JA: {japanese_text}")
print(f"EN: {translation}")
```len
c
e
_duration_ms=500,   # Split on 500ms+ silence
        speech_pad_ms=200,              # 200ms padding around speech
        threshold=0.5                   # VAD confidence threshold
    ),
    beam_size=5
)
```### Batched Inference for Maximum Throughput```
pytho
n
from faster_whisper import WhisperModel
import glob
import timemodel = WhisperModel("large-v3", device="cuda", compute_type="int8")# Process multiple files in a single batch call
audio_files = glob.glob("podcasts/*.mp3")start = time.time()
for file_path in audio_files:
    segments, _ = model.transcribe(file_path, batch_size=8, beam_size=5)
    text = " ".join([s.text for s in segments])
    print(f"{file_path}: {len(text)} chars")elapsed = time.time() - start
print(f"Total time: {elapsed:.1f}s for {len(audio_files)} files")
```### Word-Level Timestamps```
pytho
n
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s -> {word.end:.2f}s] {word.word}")
```### Custom Model ConversionConvert fine-tuned Whisper models for use with faster-whisper:```
bas
h
# Install conversion dependencies
pip install transformers[torch]>=4.23# Convert a fine-tuned model
ct2-transformers-converter \
  --model openai/whisper-large-v3 \
  --output_dir whisper-large-v3-ct2 \
  --copy_files tokenizer.json preprocessor_config.json \
  --quantization float16

pytho n

Load the converted model #

model = WhisperModel(“whisper-large-v3-ct2”, device=“cuda”) ### Monitoring with Prometheus pytho n from faster_whisper import WhisperModel from prometheus_client import Counter, Histogram, start_http_server import timeREQUEST_COUNT = Counter(“transcription_requests_total”, “Total requests”) REQUEST_DURATION = Histogram(“transcription_duration_seconds”, “Request duration”)model = WhisperModel(“large-v3”, device=“cuda”, compute_type=“int8”)@REQUEST_DURATION.time() def transcribe(audio_path): REQUEST_COUNT.inc() segments, info = model.transcribe(audio_path, beam_size=5) return segments, info# Expose metrics on port 8000 start_http_server(8000) ### Graceful Error Handling pytho n from faster_whisper import WhisperModeldef safe_transcribe(audio_path, device=“cuda”): “““Transcribe with fallback on GPU errors.””” compute_types = [“int8”, “int8_float16”, “float16”, “float32”]

for compute_type in compute_types:
    try:
        model = WhisperModel(
            "large-v3", device=device, compute_type=compute_type
        )
        segments, info = model.transcribe(audio_path, beam_size=5)
        return segments, info, compute_type
    except RuntimeError as e:
        print(f"{compute_type} failed: {e}, retrying...")
        continue

raise RuntimeError("All compute types failed")segments, info, used_type = safe_transcribe("audio.mp3")

print(f"Used compute type: {used_type}")

|---|---|---|---|---|
| **Speed (large-v3 GPU)** | ~12x real-time | ~3x real-time | ~12x real-time | ~8x real-time |
| **VRAM (large-v3)** | ~2.5 GB (int8) | ~11 GB (fp16) | ~3 GB | ~3 GB |
| **Python API** | Native | Native | Native | Wrapper only |
| **Speaker diarization** | No | No | Built-in | No |
| **Word-level timestamps** | Yes | No | Yes (via wav2vec2) | Yes |
| **Apple Silicon** | CPU only | CPU/GPU | CPU only | Metal (fastest) |
| **NVIDIA GPU** | CUDA (fastest) | CUDA | CUDA | CUDA |
| **AMD GPU** | No | No | No | Vulkan/ROCm |
| **CPU optimization** | int8 SIMD | Basic | int8 SIMD | AVX2/NEON |
| **VAD filter** | Built-in Silero | No | Built-in | Manual tuning |
| **Batch inference** | Yes | No | Yes | Limited |
| **Model quantization** | int8/fp16/fp32 | fp16/fp32 | int8/fp16 | q4/q5/fp16 |
| **License** | MIT | MIT | MIT | MIT |
| **Stars (May 2026)** | 23,000+ | 15,000+ | 12,000+ | 44,000+ |### When to Choose Which- **faster-whisper**: Python pipelines on NVIDIA GPU or CPU. Best integration ecosystem. Use for production STT services, batch processing, and real-time Python applications.
- **OpenAI Whisper**: Research and experimentation where you need the```
pytho
n
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# Enable VAD filter with custom parameters
segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,
    vad_parameters=dict(
        min_silence_duration_ms=500,   # Split on 500ms+ silence
        speech_pad_ms=200,              # 200ms padding around speech
        threshold=0.5                   # VAD confidence threshold
    ),
    beam_size=5
)
```l
backend. On M-series Macs, it runs CPU-only at roughly 3x real-time for large-v3. whisper.cpp with Metal achieves ~10x real-time — 3x faster.2. **AMD GPU support**: CTranslate2 only supports CUDA on GPU. AMD GPUs are not supported. Use whisper.cpp with Vulkan or ROCm instead.3. **Extreme edge devices**: While faster-whisper works on CPU, the Python runtime overhead makes it less suitable than whisper.cpp for Raspberry Pi Zero or microcontrollers. whisper.cpp fits in 1 GB RAM and runs without Python.4. **Speaker diariza```
pytho
n
from faster_whisper import WhisperModel
import glob
import time

model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# Process multiple files in a single batch call
audio_files = glob.glob("podcasts/*.mp3")

start = time.time()
for file_path in audio_files:
    segments, _ = model.transcribe(file_path, batch_size=8, beam_size=5)
    text = " ".join([s.text for s in segments])
    print(f"{file_path}: {len(text)} chars")

elapsed = time.time() - start
print(f"Total time: {elapsed:.1f}s for {len(audio_files)} files")
```t
o
run faster-whisper?For GPU inference, any NVIDIA GPU with CUDA 12.x support works. INT8 quantization runs comfortably on 8 GB VRAM cards (RTX 3060, RTX 4060). For CPU inference, 4+ cores and 8 GB RAM are sufficient for the small model. The large-v3 model needs ~3 GB RAM with INT8 on CPU.### How do I install faster-whisper in Docker?Use the official NVIDIA CUDA runtime image with cuDNN 9. A complete faster whisper docker configuration is shown in the Installation & Setup section above. The key requirement is the `nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04````
pytho
n
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s -> {word.end:.2f}s] {word.word}")
```ren
c
e
s
are within 0.1% — indistinguishable in practice. Any variation comes from beam search randomness, not the inference engine.### What is the best compute_type for my GPU?Use `int8` for maximum VRAM savings (suitable for GTX 10xx cards and 8 GB GPUs). Use `float16` for best speed on ```
bas
h
# Install conversion dependencies
pip install transformers[torch]>=4.23

# Convert a fine-tuned model
ct2-transformers-converter \
  --model openai/whisper-large-v3 \
  --output_dir whisper-large-v3-ct2 \
  --copy_files tokenizer.json preprocessor_config.json \
  --quantization float16
```wo
r
d
-level timestamp alignment via wav2vec2. For pure transcription, faster-whisper is sufficient. Use WhisperX when you need "who said what" labels in meetings or interviews.### Can I use faster-whisper for real-time streaming?Yes, via integration with Whisper-Streaming or WhisperLive. ```
pytho
n
# Load the converted model
model = WhisperModel("whisper-large-v3-ct2", device="cuda")
```stream
i
n
g
servers. Typical end-to-end latency is 200-500 ms for the small model on GPU.### How do I convert a fine-tuned Whi```
pytho
n
from faster_whisper import WhisperModel
from prometheus_client import Counter, Histogram, start_http_server
import time

REQUEST_COUNT = Counter("transcription_requests_total", "Total requests")
REQUEST_DURATION = Histogram("transcription_duration_seconds", "Request duration")

model = WhisperModel("large-v3", device="cuda", compute_type="int8")

@REQUEST_DURATION.time()
def transcribe(audio_path):
    REQUEST_COUNT.inc()
    segments, info = model.transcribe(audio_path, beam_size=5)
    return segments, info

# Expose metrics on port 8000
start_http_server(8000)
```f
o
r
OpenAI Whisper in Python environments. With 4x speedup over the reference implementation, 70% VRAM reduction via INT8 quantization, and a mature ecosystem of integrations (WhisperX, speaches, whisper-asr-webservice), it handles everything from single-file transcription to batch-processing hundreds of hours of audio.The data is clear: in any whisper vs faster whisper comparison, the performance advantage goes to the CTranslate2-based runtime. If you run NVIDIA GPUs and Python, faster-whisper is the baseline. Start with the int8 large-v3 model for general use, drop to small for real-time, and in```
pytho
n
from faster_whisper import WhisperModel

def safe_transcribe(audio_path, device="cuda"):
    """Transcribe with fallback on GPU errors."""
    compute_types = ["int8", "int8_float16", "float16", "float32"]
    
    for compute_type in compute_types:
        try:
            model = WhisperModel(
                "large-v3", device=device, compute_type=compute_type
            )
            segments, info = model.transcribe(audio_path, beam_size=5)
            return segments, info, compute_type
        except RuntimeError as e:
            print(f"{compute_type} failed: {e}, retrying...")
            continue
    
    raise RuntimeError("All compute types failed")

segments, info, used_type = safe_transcribe("audio.mp3")
print(f"Used compute type: {used_type}")
```js
o
n
"footer-cta-legacy" "HTStack" >}}** — Hong Kong VPS with low-latency access from mainland China. This is the same IDC that hosts dibi8.com — battle-tested in production.*Affiliate links — they don't cost you extra and they help keep dibi8.com running.*## Sources & Further Reading- faster-whisper GitHub repository: https://github.com/SYSTRAN/faster-whisper
- CTranslate2 documentation: https://opennmt.net/CTranslate2/
- WhisperX (speaker diarization): https://github.com/m-bain/whisperX
- whisper.cpp (C++ port): https://github.com/ggml-org/whisper.cpp
- OpenAI Whisper (reference implementation): https://github.com/openai/whisper
- Speaches (OpenAI-compatible server): https://github.com/speaches-ai/speaches
- Whisper-Streaming (real-time): https://github.com/ufal/whisper_streaming
- PyAV (audio decoding): https://github.com/PyAV-Org/PyAV
- Silero VAD: https://github.com/snakers4/silero-vad

💬 Bình luận & Thảo luận