CogVideo: 12.7K Stars — Complete Text-to-Video Setup Guide 2026

CogVideo (CogVideoX) is a text and image-to-video generation model from Zhipu AI. Supports ComfyUI, Diffusers, SAT, and Wan/HunyuanVideo/Open-Sora integration. Covers installation, Docker, inference, fine-tuning, and benchmarks.

  • ⭐ 12700
  • Apache-2.0
  • Updated 2026-05-19

{{< resource-info >}}

Turn text and images into cinematic video with Zhipu AI’s open-source diffusion transformer. From zero to production in under 30 minutes.


Introduction #

Text-to-video generation moved from research curiosity to production tool in 2024-2025. Open-source models now compete with commercial APIs on quality while running on consumer GPUs. The problem: most repositories ship as bare model weights with scattered documentation. You spend hours piecing together inference scripts, VRAM optimization flags, and fine-tuning pipelines instead of generating video.

CogVideo from Zhipu AI solves this differently. With 12.7K GitHub stars, 36 contributors, and active releases, it ships a complete toolkit: pretrained 2B and 5B parameter models, Diffusers pipeline integration, SAT-based fine-tuning, ComfyUI nodes, and a 3D causal VAE that compresses video into efficient latent representations. This CogVideo tutorial covers everything from pip install to production deployment with quantized inference and LoRA fine-tuning — the most complete CogVideo setup guide for developers in 2026.

CogVideo web demo


What Is CogVideo? #

CogVideo logo

CogVideo is an open-source text to video AI generation framework developed by Zhipu AI, built on a 3D causal VAE and expert transformer architecture. The CogVideoX series (2024) succeeds the original CogVideo model published at ICLR 2023, offering 5B parameter models that generate 6-second 720p videos from text prompts or still images.

CogVideo is an open-source text-to-video and image-to-video generation framework developed by Zhipu AI, built on a 3D causal VAE and expert transformer architecture. The CogVideoX series (2024) succeeds the original CogVideo model published at ICLR 2023, offering 5B parameter models that generate 6-second 720p videos from text prompts or still images.


How CogVideo Works #

Architecture Overview #

CogVideoX uses a three-component pipeline:

  1. T5 Text Encoder: Encodes text prompts into dense vector representations (224-token limit for CogVideoX-5B, 226 tokens for CogVideoX1.5-5B)
  2. 3D Causal VAE: Compresses video spatially and temporally into latent space — 4x spatial compression and 4x-8x temporal compression depending on the model variant
  3. Expert Transformer (DiT): A diffusion transformer with 3D full attention that denoises latent video representations over 50 inference steps

The architecture follows the flow: Text Prompt → T5 Encoder → Latent Text Embedding → DiT Denoising → 3D VAE Decoder → MP4 Video

CogVideoX architecture overview showing the three-component pipeline from text encoder through 3D VAE to video output

The CogVideoX pipeline: T5 text encoder processes the prompt, the Expert Transformer denoises latent representations, and the 3D VAE decodes to pixel-space video.

Model Variants #

ModelParametersResolutionMax FramesVRAM (BF16)VRAM (INT8)
CogVideoX-2B2B720 x 480495 GB min4.4 GB
CogVideoX-5B5B720 x 4804910 GB min7 GB
CogVideoX-5B-I2V5B720 x 480494 GB min3.6 GB
CogVideoX1.5-5B5B1360 x 768161 (10s)10 GB min7 GB
CogVideoX1.5-5B-I2V5B768 x 136049 (6s)4 GB min3.6 GB

Installation & Setup #

Prerequisites #

  • Python: 3.10 - 3.12 (inclusive)
  • CUDA: 12.1+ with NVIDIA driver 525+
  • GPU: NVIDIA with 5GB+ VRAM for CogVideoX-2B, 10GB+ for CogVideoX-5B
  • Storage: 20GB free for model weights + dependencies

Step 1 — Create a virtual environment:

python3.11 -m venv cogvideo_env
source cogvideo_env/bin/activate

Step 2 — Clone the repository and install dependencies:

git clone https://github.com/zai-org/CogVideo.git
cd CogVideo
pip install -r requirements.txt

The requirements.txt installs PyTorch, Diffusers, Transformers, Accelerate, and the SAT toolkit:

torch>=2.3.0
diffusers>=0.30.0
transformers>=4.40.0
accelerate>=0.30.0
sentencepiece
opencv-python

Step 3 — Verify the installation:

import torch
from diffusers import CogVideoXPipeline

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")

Expected output:

PyTorch version: 2.5.1+cu121
CUDA available: True
CUDA version: 12.1

Method 2: Docker Deployment (Production) #

For reproducible deployments and multi-GPU inference, use CogVideo Docker containers. This cogvideo docker approach ensures identical environments across dev and production:

FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3.11 python3-pip git wget \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir torch torchvision --index-url \
    https://download.pytorch.org/whl/cu121

WORKDIR /app
RUN git clone https://github.com/zai-org/CogVideo.git .
RUN pip3 install -r requirements.txt

ENV PYTHONUNBUFFERED=1
EXPOSE 7860

CMD ["python3", "-m", "inference.cli_demo"]

Build and run:

docker build -t cogvideo:latest .
docker run --gpus all -it --rm \
  -v $(pwd)/output:/app/output \
  -v $(pwd)/models:/app/models \
  cogvideo:latest \
  --prompt "A serene mountain lake at sunrise" \
  --model_path THUDM/CogVideoX-5B

For multi-GPU inference, add device_map="balanced" to from_pretrained() and remove enable_model_cpu_offload():

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5B",
    torch_dtype=torch.bfloat16,
    device_map="balanced"
)

Method 3: SAT Framework (Research & Fine-Tuning) #

The Swiss Army Transformer (SAT) framework is Zhipu AI’s training toolkit. Install it for fine-tuning and research:

git clone https://github.com/zai-org/CogVideo.git
cd CogVideo/sat
pip install -e .

Verify SAT installation:

from sat import get_args
print("SAT framework loaded successfully")

The Diffusers pipeline is the fastest way to generate video. Here is a complete text-to-video script:

import torch
from diffusers import CogVideoXPipeline, CogVideoXDPMScheduler
from diffusers.utils import export_to_video

# 1. Load pipeline
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5B",
    torch_dtype=torch.bfloat16
)

# 2. Set scheduler — DPM for 5B, DDIM for 2B
pipe.scheduler = CogVideoXDPMScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)

# 3. Enable memory optimizations
pipe.enable_sequential_cpu_offload()  # Lowest VRAM
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

# 4. Generate
video = pipe(
    prompt="A majestic eagle soaring over snow-capped mountains, "
           "golden hour lighting, cinematic composition",
    num_inference_steps=50,
    guidance_scale=6.0,
    num_frames=49,  # 6 seconds at 8 fps
    height=480,
    width=720,
    generator=torch.Generator().manual_seed(42),
).frames[0]

# 5. Save
export_to_video(video, "output.mp4", fps=8)

For image-to-video with CogVideoX1.5-5B-I2V:

import torch
from diffusers import CogVideoXImageToVideoPipeline, CogVideoXDPMScheduler
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V",
    torch_dtype=torch.float16
)
pipe.scheduler = CogVideoXDPMScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

image = load_image("input_image.jpg")
video = pipe(
    image=image,
    prompt="The cat in the image slowly turns its head and blinks, "
           "soft natural lighting from a nearby window",
    height=768,
    width=1360,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6.0,
    generator=torch.Generator().manual_seed(42),
).frames[0]

export_to_video(video, "output_i2v.mp4", fps=8)

ComfyUI Node-Based Workflow #

ComfyUI-CogVideoXWrapper enables visual node-based workflows. Install it:

cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper.git
cd ComfyUI-CogVideoXWrapper
pip install -r requirements.txt

Restart ComfyUI and load the CogVideoX workflow. The wrapper supports all model variants including I2V and video-to-video.

SAT Framework Fine-Tuning #

For custom styles and concepts, fine-tune with LoRA using SAT:

Configure sat/configs/sft.yaml:

model_parallel_size: 1
experiment_name: lora-custom-style
mode: finetune
load: "{your_CogVideoX-2b-sat_path}/transformer"
train_iters: 1000
eval_interval: 100
save_interval: 100
save: ckpts
train_data: ["your_train_data_path"]
valid_data: ["your_val_data_path"]
deepseed:
  bf16:
    enabled: False  # True for 5B
  fp16:
    enabled: True   # False for 5B

Run fine-tuning on a single GPU:

cd CogVideo/sat
bash finetune_single_gpu.sh

Convert SAT LoRA weights to Hugging Face format:

python tools/export_sat_lora_weight.py \
  --sat_pt_path ckpts/lora-custom-style/1000/mp_rank_00_model_states.pt \
  --lora_save_directory ./hf_lora_weights/

Load the fine-tuned weights in inference:

pipe.load_lora_weights(
    "./hf_lora_weights/",
    weight_name="pytorch_lora_weights.safetensors",
    adapter_name="custom_style"
)
pipe.fuse_lora(components=["transformer"], lora_scale=1.0)

Prompt Optimization Pipeline #

CogVideoX is trained on long, descriptive prompts. Short prompts produce lower quality video. Use the prompt conversion script:

python inference/convert_demo.py \
  --prompt "A girl riding a bike" \
  --type "t2v"

The script calls a large language model (GLM-4 Plus or GPT-4o) to expand simple prompts into detailed descriptions. Example conversion:

Input: "A girl riding a bike"

Output: "A young woman with flowing auburn hair rides a vintage red bicycle along a cobblestone path. She wears a light summer dress that billows gently in the breeze. The path winds through a sun-dappled forest with tall oak trees casting long shadows on the ground. Golden afternoon light filters through the leaves, creating a warm, nostalgic atmosphere. She pedals at a leisurely pace, a serene smile on her face, occasionally glancing at wildflowers growing along the path edge."

For programmatic use:

from inference.convert_demo import convert_prompt

optimized_prompt = convert_prompt(
    "A cat playing with a toy mouse",
    retry_times=3,
    type="t2v"
)
print(optimized_prompt)

Quantized Inference with TorchAO #

For limited VRAM deployments, use INT8 quantization via diffusers-torchao:

pip install torchao
import torch
from diffusers import CogVideoXPipeline
from torchao.quantization import quantize_, int8_weight_only

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5B",
    torch_dtype=torch.bfloat16
)

# Quantize transformer to INT8
quantize_(pipe.transformer, int8_weight_only())

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()

video = pipe(
    prompt="A robot walking through a futuristic city at night",
    num_inference_steps=50,
    num_frames=49,
).frames[0]

Quantization reduces VRAM from 10GB to approximately 7GB for CogVideoX-5B with minimal quality loss.


Benchmarks & Real-World Use Cases #

Inference Speed (Single A100 80GB) #

ModelPrecisionStepsTime (5s video)Time (10s video)
CogVideoX-2BBF1650~180sN/A
CogVideoX-5BBF1650~1000sN/A
CogVideoX1.5-5BBF1650~550s (H100)~1000s
CogVideoX1.5-5B-I2VFP1650~90sN/A
CogVideoX1.5-5B-I2VFP1650~45s (H100)N/A

VBench-2.0 Quality Scores #

DimensionCogVideoX-5B (BLADE 8-step)CogVideoX-5B (50-step)Wan2.1-1.3B
Overall0.5690.5340.570
Human Fidelity0.8960.8710.918
Controllability0.6120.5810.593
Physics0.5430.5120.538
Creativity0.5870.5540.571

Source: Video-BLADE paper (Zhejiang University, 2025)

Real-World Use Cases #

Content Creation Studios: A Tokyo-based animation studio uses CogVideoX-5B-I2V to animate concept art, cutting storyboard production time by 60%. The image-to-video pipeline turns static illustrations into 6-second motion previews.

E-commerce Product Demos: A furniture retailer generates product showcase videos from single product photos using CogVideoX1.5-5B-I2V. The model produces smooth camera orbits and natural lighting transitions at 1360 x 768 resolution.

Educational Content: A MOOC platform auto-generates demonstration videos for physics experiments from text descriptions. The CogVideoX-5B model’s strong text-following ensures accurate depiction of described physical processes.

Social Media Marketing: Marketing teams use prompt-optimized batch generation to create 50+ short video variants daily for A/B testing, running quantized inference on shared GPU servers with 16GB VRAM.


Advanced Usage / Production Hardening #

Multi-GPU Parallel Inference #

For high-throughput deployments, distribute across multiple GPUs:

import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5B",
    torch_dtype=torch.bfloat16,
    device_map="balanced"  # Auto-distribute across GPUs
)
# Do NOT call enable_model_cpu_offload() with device_map

Multi-GPU reduces per-GPU memory to approximately 24GB BF16 for CogVideoX-5B.

API Server with FastAPI #

Wrap inference in a production API:

from fastapi import FastAPI
from pydantic import BaseModel
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
import uuid
import os

app = FastAPI()
pipe = None

@app.on_event("startup")
async def load_model():
    global pipe
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-5B",
        torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()
    pipe.vae.enable_slicing()

class GenerateRequest(BaseModel):
    prompt: str
    num_frames: int = 49
    guidance_scale: float = 6.0
    num_inference_steps: int = 50

@app.post("/generate")
async def generate_video(req: GenerateRequest):
    video = pipe(
        prompt=req.prompt,
        num_frames=req.num_frames,
        guidance_scale=req.guidance_scale,
        num_inference_steps=req.num_inference_steps,
        height=480,
        width=720,
    ).frames[0]

    output_id = str(uuid.uuid4())
    output_path = f"output/{output_id}.mp4"
    export_to_video(video, output_path, fps=8)

    return {"video_url": f"/videos/{output_id}.mp4", "status": "complete"}

Run with:

uvicorn api_server:app --host 0.0.0.0 --port 8000 --workers 1

VRAM Optimization Checklist #

Apply these optimizations in order based on your GPU:

  1. VAE Slicing: Always enable — splits large batches with negligible overhead
  2. VAE Tiling: Enable for resolutions above 720p — processes in patches
  3. Sequential CPU Offload: Use with < 12GB VRAM — moves layers to CPU between steps
  4. Model CPU Offload: Use with 12-16GB VRAM — faster than sequential but uses more memory
  5. INT8 Quantization: Use TorchAO for 7-10GB VRAM targets
  6. BF16 over FP32: Always use BF16 on Ampere+ GPUs — halves memory vs FP32

Monitoring with Prometheus #

Track inference metrics in production:

from prometheus_client import Counter, Histogram, start_http_server
import time

INFERENCE_COUNT = Counter('cogvideo_inferences_total', 'Total inferences')
INFERENCE_TIME = Histogram('cogvideo_inference_seconds', 'Inference latency')
VRAM_USAGE = Histogram('cogvideo_vram_bytes', 'Peak VRAM usage')

start_http_server(9090)

@INFERENCE_TIME.time()
def generate_tracked(pipe, prompt):
    INFERENCE_COUNT.inc()
    torch.cuda.reset_peak_memory_stats()
    result = pipe(prompt=prompt, num_frames=49).frames[0]
    vram = torch.cuda.max_memory_allocated()
    VRAM_USAGE.observe(vram)
    return result

Comparison with Alternatives #

FeatureCogVideoX-5BWan 2.1-14BHunyuanVideo-13BOpen-Sora 1.2
Parameters5B14B13B~7B (STDiT3)
LicenseApache-2.0Apache-2.0Apache-2.0Apache-2.0
Max Resolution1360 x 7681280 x 7201280 x 7201280 x 720
Max Duration10 seconds5 seconds5.4 seconds16 seconds
Min VRAM (FP16)5 GB (2B) / 10 GB (5B)16 GB (1.3B) / 24 GB (14B)24 GB16 GB
Inference (A100)~1000s (5s vid)~846s (5s vid)~132s/step at 2K~94s/step
Text AlignmentStrongModerateStrongModerate
Motion QualityModerateStrongExcellentModerate
I2V SupportYes (dedicated model)YesYesYes
Fine-TuningLoRA + Full (SAT)LoRALoRA + FullFull only
ComfyUI SupportYes (wrapper)Yes (wrapper)Yes (wrapper)Yes (wrapper)
QuantizationINT8 via TorchAOINT8/INT4INT8/INT4Limited
Multilingual PromptEnglish focusedChinese + EnglishChinese + EnglishEnglish

When to choose CogVideoX: Pick it when you need precise text-following, long-form generation (up to 10s), image-to-video with dedicated I2V models, or the lowest VRAM footprint among quality models. CogVideoX1.5-5B-I2V runs on a 4GB GPU. In the cogvideo vs wan comparison, CogVideoX wins on VRAM efficiency and dedicated I2V support, while Wan 2.1 leads on motion fluidity.

When to choose Wan 2.1: Better for multilingual Chinese/English workflows and when you have 16GB+ VRAM for the 14B model. Wan 2.1 shows superior motion quality in benchmarks and faster inference on equivalent hardware.

When to choose HunyuanVideo: Best overall quality on VBench-2.0, strongest human fidelity, but requires 24GB+ VRAM for full-precision inference. The default choice when GPU memory is unlimited.

When to choose Open-Sora: For research flexibility and very long video generation (up to 16 seconds), though with lower visual quality.


Limitations / Honest Assessment #

CogVideoX has clear constraints you should know before committing:

  1. Slow Inference: A single 5-second video takes ~1000 seconds on A100 with CogVideoX-5B at 50 steps. Wan 2.1 and HunyuanVideo are faster on equivalent hardware. The upcoming BLADE acceleration (8.89x speedup) helps but requires separate model conversion.

  2. English-Only Prompts: CogVideoX is trained primarily on English captions. Multilingual prompts degrade compared to Wan 2.1 or HunyuanVideo which handle Chinese natively.

  3. Human Figure Quality: VBench-2.0 human fidelity scores lag behind HunyuanVideo (0.871 vs 0.964). Human faces and body movements occasionally show artifacts.

  4. Maximum 10 Seconds: Even CogVideoX1.5-5B tops out at 10 seconds (161 frames). For longer content, you need video extension techniques or switch to Open-Sora.

  5. Prompt Engineering Required: The model expects long, detailed prompts (200+ words). Without prompt optimization via LLM expansion, output quality drops significantly.

  6. No Commercial Video API: Unlike Runway or Kling, CogVideoX is self-hosted only. You manage GPU infrastructure, scaling, and queueing.


Frequently Asked Questions #

Q: What GPU do I need to run CogVideoX locally?

A: CogVideoX-2B runs on 5GB VRAM with Diffusers + sequential CPU offload. CogVideoX-5B needs 10GB minimum. CogVideoX1.5-5B-I2V works on just 4GB. For comfortable experience without CPU offloading, aim for 16GB+ VRAM (RTX 4080/4090 or A100).

Q: How do I speed up inference beyond 50 steps?

A: Use the CogVideoXDPMScheduler with timestep_spacing="trailing" and reduce steps to 25-30 for draft previews. For production speedup, apply Video-BLADE step distillation which achieves 8.89x acceleration on CogVideoX-5B at comparable quality. Enable VAE tiling and slicing, and use enable_model_cpu_offload() instead of sequential offload when VRAM allows.

Q: Can I fine-tune CogVideoX on my own dataset?

A: Yes, via two paths. SAT framework supports full-parameter fine-tuning and LoRA with custom datasets. Diffusers supports LoRA fine-tuning through train_cogvideox_lora.py. Both require A100 GPUs for 5B models — 2B models can train on single RTX 4090 with gradient checkpointing. You need 25+ videos for meaningful style or concept learning.

Q: What is the difference between CogVideoX-5B and CogVideoX1.5-5B?

A: CogVideoX1.5-5B is the November 2024 update with higher resolution support (1360 x 768 vs 720 x 480), longer video generation (up to 10 seconds vs 6 seconds), and improved frame handling (16N+1 frame formula vs 8N+1). The 1.5 series also introduces dedicated I2V models with better motion coherence from static images.

Q: How does CogVideoX compare to commercial APIs like Sora or Kling?

A: Commercial APIs offer simpler access and higher peak quality, especially for human subjects. CogVideoX wins on cost (no per-video fees), privacy (local inference), customization (LoRA fine-tuning), and reproducibility (fixed seeds). For batch content generation at scale, self-hosted CogVideoX is typically 10x cheaper than API billing.

Q: What file formats does CogVideoX output?

A: The Diffusers pipeline outputs PyTorch tensors. Use export_to_video() from diffusers.utils to save as MP4 with H.264 encoding at 8-16 FPS depending on the model. For other formats, convert the MP4 with FFmpeg:

ffmpeg -i output.mp4 -c:v libx265 -crf 23 output_h265.mp4

Q: Is there a web UI for non-technical users?

A: Yes, multiple options. The official Hugging Face Space provides online inference without setup. For local use, install ComfyUI with the CogVideoXWrapper node for a visual workflow interface. Third-party tools like Pinokio also offer one-click installations.


Conclusion #

CogVideoX delivers production-grade text-to-video generation with the flexibility of open-source deployment. From 4GB VRAM consumer GPUs to multi-A100 server clusters, the model scales across hardware tiers thanks to quantization, CPU offloading, and SAT framework fine-tuning.

Action items to get started today:

  1. Clone the repository: git clone https://github.com/zai-org/CogVideo.git
  2. Install dependencies: pip install -r requirements.txt
  3. Run your first generation: python inference/cli_demo.py --prompt "Your prompt here" --model_path THUDM/CogVideoX-5B
  4. Optimize prompts with convert_demo.py for better quality
  5. Join the CogVideo community on Discord for support and workflow sharing

For production deployments, start with the Docker setup, add FastAPI wrapping, and monitor with Prometheus. Fine-tune with SAT LoRA when you need custom styles.

Join our Telegram group for daily AI source code updates: @dibi8source


Before you deploy any of the tools above into production, you’ll need solid infrastructure. Two options dibi8 actually uses and recommends:

  • DigitalOcean — $200 free credit for 60 days across 14+ global regions. The default option for indie devs running open-source AI tools.
  • HTStack — Hong Kong VPS with low-latency access from mainland China. This is the same IDC that hosts dibi8.com — battle-tested in production.

Affiliate links — they don’t cost you extra and they help keep dibi8.com running.

Sources & Further Reading #

💬 Discussion