What is ViMax and who made it?

ViMax is an open-source agentic video generation framework from the Hong Kong University Data Science Lab (HKUDS) with over 9,807 GitHub stars as of May 2026. It treats video generation as a multi-agent orchestration problem, using four AI roles -- Director, Screenwriter, Producer, and Video Generator -- to produce long-form, multi-scene videos from a single idea.

How does ViMax differ from Sora, Runway, and OpenSora?

Sora, Runway, and OpenSora use a direct single-prompt-to-video pipeline that produces seconds-long clips with frame-level consistency drift. ViMax uses a multi-agent pipeline (script to storyboard to assets to video) with RAG-based structured scripts and a Producer agent that enforces character consistency across shots, enabling coherent multi-scene videos lasting minutes. Sora and Runway have better per-shot pixel quality, but ViMax wins on coherence across shots.

Is ViMax fully open-source and free to run?

ViMax's orchestration code is open-source under the MIT license, but it is not a fully open-source video model. It defers pixel-level generation to commercial APIs -- Google Veo for video and Google Nanobana for image generation -- which are not free, so you must plan for Google API costs. You also need an API key for at least one chat model, such as Gemini via OpenRouter.

What do the four ViMax agent roles do?

The Screenwriter turns a high-level idea into a full structured script using a RAG-based long script engine. The Director translates the script into a shot-level storyboard with camera setups and pacing. The Producer is the consistency engine, selecting reference images and running multimodal LLM checks so characters look the same across shots. The Video Generator renders shots in parallel and assembles the frames into the final video.

How do you install and run ViMax?

Clone the repository with git, then run 'uv sync' to install dependencies (ViMax uses Python 3.12 managed with uv). For idea-to-video, run 'python main_idea2video.py' with an idea, requirements, and style. If you already have a screenplay, run 'python main_script2video.py' to skip the Screenwriter step while the other three agents still run.

ViMax Review: Agentic Multi-Scene Video Generation from HKUDS

The Three Limits That Broke AI Video in 2025 #

Every AI video generation tool that hit consumer awareness in 2024–2025 — Sora, Runway Gen-3, Pika, Luma Dream Machine, OpenSora — shared the same three limits:

Short clips only. 5–10 seconds was the practical ceiling. Anything longer and consistency collapsed.
Consistency chaos. Same character changes face between shots. Same room reshuffles props. The single-prompt pipeline has no concept of “the same dog from scene 1.”
Visual-only output. No script, no narrative arc, no synchronized audio. You got pretty pictures that moved; you did not get a film.

For social-media clips, the limits were tolerable. For anyone who wanted to use AI to actually tell a story — explainer videos, educational content, branded narrative — the pipeline broke the moment the user wanted scene 2 to follow logically from scene 1.

ViMax (GitHub: HKUDS/ViMax, 9,807+ stars as of May 2026) from Hong Kong University Data Science Lab is one of the first widely adopted open-source attempts to break those limits by treating video generation as a multi-agent orchestration problem, not a one-shot generation problem.

The tagline says it plainly: “Director, Screenwriter, Producer, and Video Generator All-in-One.”

The Four Agentic Roles #

ViMax’s architectural bet: video production in the real world is a multi-role pipeline, so AI video production should be too. The framework defines four autonomous agent roles, each with a different LLM-driven task:

🎬 Screenwriter #

Takes a high-level idea (“a cat and dog become friends, then meet a new cat”) and produces a full structured script — characters, scene segmentation, dialogue, transitions. Uses a RAG-based long script engine that can intelligently segment lengthy stories into multi-scene format. This is the layer that makes minute-plus videos coherent.
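How ViMax's long script engine works internally is not covered here, so the following is a minimal sketch, not ViMax's implementation. It assumes a plain-text story with blank-line paragraph breaks and illustrates the two ideas named above: greedy segmentation into scene-sized chunks, and retrieval of earlier chunks that share characters with the current one. Naive keyword overlap stands in for embedding-based retrieval, and all function names are hypothetical.

```python
import re


def segment_story(story: str, max_chars: int = 800) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of <= max_chars.

    A single paragraph longer than max_chars becomes a chunk on its own.
    """
    paragraphs = [p.strip() for p in story.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def names_in(text: str) -> set[str]:
    """Crude entity extraction: capitalized words (a stand-in for real NER)."""
    return set(re.findall(r"\b[A-Z][a-z]+\b", text))


def retrieve_context(chunks: list[str], index: int, k: int = 2) -> list[str]:
    """Return up to k earlier chunks that share the most names with chunks[index]."""
    target = names_in(chunks[index])
    scored = [(len(target & names_in(chunk)), i) for i, chunk in enumerate(chunks[:index])]
    # Highest overlap first; ties broken by story order; zero-overlap chunks dropped.
    ranked = sorted(((s, i) for s, i in scored if s > 0), key=lambda t: (-t[0], t[1]))
    return [chunks[i] for _, i in ranked[:k]]
```

When the Screenwriter writes scene N, the retrieved chunks would be added to its prompt so that characters introduced earlier keep their established traits.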

🎭 Director #

Translates the script into a shot-level storyboard. Decides multi-camera setups, framing, pacing, scene transitions. Outputs explicit shot descriptions that the downstream generator can render.
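To make "explicit shot descriptions" concrete, here is a hypothetical storyboard record. The fields (scene, index, camera, duration_s, description) and the prompt format are assumptions for illustration, not ViMax's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    scene: int
    index: int
    camera: str        # framing, e.g. "wide" or "close-up"
    duration_s: float  # planned on-screen time in seconds
    description: str   # what the generator should render

    def to_prompt(self, style: str) -> str:
        """Flatten the shot into one text prompt for a downstream generator."""
        return f"{style} style, {self.camera} shot: {self.description}"


def scene_durations(storyboard: list[Shot]) -> dict[int, float]:
    """Sum planned shot durations per scene, e.g. for a pacing check."""
    totals: dict[int, float] = {}
    for shot in storyboard:
        totals[shot.scene] = totals.get(shot.scene, 0.0) + shot.duration_s
    return totals


storyboard = [
    Shot(1, 1, "wide", 4.0, "a cat and a dog play in a sunny park"),
    Shot(1, 2, "close-up", 2.5, "the dog wags its tail"),
    Shot(2, 1, "wide", 5.0, "a new cat appears at the park gate"),
]
```

Keeping pacing (duration) and framing as structured fields rather than free text lets later stages check or adjust them without re-parsing prose.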

🎯 Producer #

The consistency engine. Selects reference images, validates that the same character looks the same across shots, orchestrates resources, runs MLLM (multimodal LLM) consistency checks. This is the layer that solves the “character reshuffling” problem.
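The Producer's checks are described above as MLLM-based; the sketch below swaps in a cheaper, different technique to illustrate the same two jobs. It assumes each image of a character has already been turned into an embedding vector by some image encoder (not shown). It picks the medoid candidate as the reference image and flags shots whose cosine similarity to that reference falls below a threshold. The 0.9 default is an arbitrary placeholder, and all names are hypothetical.

```python
import math

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_reference(candidates: list[Vector]) -> int:
    """Index of the candidate most similar on average to the others (a medoid)."""
    def mean_sim(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 1.0
        return sum(cosine(candidates[i], o) for o in others) / len(others)

    return max(range(len(candidates)), key=mean_sim)


def inconsistent_shots(reference: Vector, shots: dict[str, Vector],
                       threshold: float = 0.9) -> list[str]:
    """Ids of shots whose character embedding drifts below the threshold."""
    return [sid for sid, emb in shots.items() if cosine(reference, emb) < threshold]
```

Flagged shots would be sent back for regeneration with the reference image attached; an MLLM check can explain *why* a shot drifted, which a similarity score cannot.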

🎥 Video Generator #

The final rendering layer. Generates shots in parallel, synthesizes images for each frame, assembles the frames into video. Defers the actual pixel-level generation to underlying models (Veo, etc.).
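A minimal sketch of parallel shot rendering with order-preserving assembly, assuming each shot is an independent, I/O-bound API call. render_shot is a hypothetical stub, not a Veo client, and the concat list targets ffmpeg's concat demuxer, which is one possible assembly step rather than ViMax's documented one.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def render_shot(index: int, prompt: str) -> str:
    """Hypothetical stand-in for a remote video-generation call."""
    # Even-numbered shots "take longer", so calls finish out of order.
    time.sleep(0.02 if index % 2 == 0 else 0.0)
    return f"clips/shot_{index:03d}.mp4"


def render_all(prompts: list[str], max_workers: int = 4) -> list[str]:
    """Render every shot concurrently; results come back in storyboard order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order regardless of finish order.
        return list(pool.map(render_shot, range(len(prompts)), prompts))


def concat_list(clips: list[str]) -> str:
    """Build an input list for ffmpeg's concat demuxer."""
    return "\n".join(f"file '{clip}'" for clip in clips)
```

Threads fit here because the work is waiting on the network. With the list saved as clips.txt, `ffmpeg -f concat -safe 0 -i clips.txt -c copy final.mp4` stitches the clips without re-encoding, provided they share codec parameters.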

Each role is a separate LLM agent with its own prompt, its own context window, and its own well-defined output contract, a textbook application of factor 10 of 12-Factor Agents (“small, focused agents”).

Tech Stack #

Language: Python 3.12, managed with uv.
Multi-agent framework: Custom orchestration layer.
Chat models supported: Google Gemini 2.5 Flash Lite (via OpenRouter), MiniMax-M2.7 (1M context), MiniMax-M2.5 (204K context). The long context windows matter — the Screenwriter agent needs to hold an entire script in working memory.
Image generation: Google Nanobana API.
Video generation: Google Veo via API.
License: MIT — code is permissive; the upstream model APIs come with their own commercial terms.

The choice to defer pixel-level generation to commercial APIs (Veo, Nanobana) is a pragmatic one. Open-source video models haven’t yet caught up to the visual quality of frontier commercial models, and building on them would weaken the demo. ViMax’s contribution is the orchestration: bring your own pixel engine.

Quick Setup #

```bash
git clone https://github.com/HKUDS/ViMax.git
cd ViMax
uv sync
```

That’s it for the dependency install. You’ll need API keys for at least one chat model (OpenRouter for Gemini works) and Google’s Veo + Nanobana APIs for the video/image generation.

Idea-to-Video Workflow #

```python
# Inputs for the idea-to-video example; set these in main_idea2video.py
idea = "If a cat and a dog are best friends, what would happen when they meet a new cat?"
user_requirement = "For children, do not exceed 3 scenes."
style = "Cartoon"
# Then run: python main_idea2video.py
```

The Screenwriter expands the idea into a 3-scene script. The Director plans shots. The Producer selects references and enforces consistency. The Video Generator renders each scene and assembles.
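The hand-offs just described can be sketched as a linear pipeline over a shared job record. Every stage body here is a placeholder for an LLM or API call, and `Job`, the stage functions, and `max_scenes` are hypothetical names. In the example above the scene limit is a natural-language requirement for the model; here it is hard-coded as `max_scenes` for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    idea: str
    requirement: str
    style: str
    characters: list[str]
    script: list[str] = field(default_factory=list)           # one entry per scene
    storyboard: list[str] = field(default_factory=list)       # one entry per shot
    references: dict[str, str] = field(default_factory=dict)  # character -> image
    clips: list[str] = field(default_factory=list)            # rendered shot files


# Each stub stands in for an LLM or API call; bodies are placeholders.
def screenwriter(job: Job, max_scenes: int) -> None:
    job.script = [f"Scene {i + 1}: {job.idea}" for i in range(max_scenes)]


def director(job: Job) -> None:
    job.storyboard = [f"{job.style} | wide | {scene}" for scene in job.script]


def producer(job: Job) -> None:
    job.references = {name: f"refs/{name.lower()}.png" for name in job.characters}


def video_generator(job: Job) -> None:
    job.clips = [f"clips/shot_{i:03d}.mp4" for i in range(len(job.storyboard))]


def idea_to_video(job: Job, max_scenes: int = 3) -> Job:
    screenwriter(job, max_scenes)  # idea -> scene-level script
    director(job)                  # script -> shot-level storyboard
    producer(job)                  # characters -> reference images for consistency
    video_generator(job)           # storyboard (+ references) -> clips
    return job
```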

Script-to-Video Workflow #

For users who already have a screenplay, main_script2video.py takes the script directly and skips the Screenwriter step. The other three agents still run.
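The skip logic can be expressed as a stage list in which the Screenwriter is bypassed whenever a script is already present. This is a hypothetical sketch of the control flow only; the stage functions return placeholder strings and do not mirror main_script2video.py.

```python
from typing import Callable

State = dict[str, str]


def screenwriter(state: State) -> State:
    return {**state, "script": f"SCRIPT FOR: {state['idea']}"}


def director(state: State) -> State:
    return {**state, "storyboard": f"SHOTS FROM: {state['script']}"}


def producer(state: State) -> State:
    return {**state, "references": "refs/"}


def video_generator(state: State) -> State:
    return {**state, "video": "final.mp4"}


PIPELINE: list[tuple[str, Callable[[State], State]]] = [
    ("screenwriter", screenwriter),
    ("director", director),
    ("producer", producer),
    ("video_generator", video_generator),
]


def run(state: State) -> tuple[State, list[str]]:
    """Run all stages in order, skipping the Screenwriter if a script is supplied."""
    ran: list[str] = []
    for name, stage in PIPELINE:
        if name == "screenwriter" and state.get("script"):
            continue  # script-to-video: the user's screenplay replaces this stage
        state = stage(state)
        ran.append(name)
    return state, ran
```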

How It Differs from Sora, Runway, OpenSora #

Aspect	ViMax	Sora / Runway / OpenSora
Pipeline	Multi-agent (Script → Storyboard → Assets → Video)	Direct prompt → video
Narrative	RAG-based structured script generation	Single-prompt; no script structure
Consistency	Producer agent + MLLM checks + ref image selection	Frame-level drift across shots
Length	Multi-scene, minutes+	Seconds-long clips
Creative control	Per-agent override (rewrite the script, redo the storyboard)	Limited; mostly post-hoc editing
Audio	Synchronized audio-video binding	Video-primary focus
Open source	Yes (MIT)	OpenSora yes; Sora/Runway no

The honest counter: Sora and Runway have visibly better pixel-level quality per shot. ViMax wins on coherence across shots. If you need a 10-second tech demo, Sora wins. If you need a 90-second explainer where the dog needs to still be the same dog in scene 4, ViMax’s orchestration is what you want.

What ViMax Is NOT #

To calibrate expectations:

Not a fully open-source video model. It orchestrates calls to commercial video/image models. Self-hosting end-to-end requires waiting for the open video model layer to catch up.
Not a no-code tool. Today’s interface is Python scripts and config files. The agentic part is sophisticated; the UX is “researcher’s prototype.”
No formal release yet. 329 commits on main, no tagged releases. Expect API churn.
No performance benchmarks in the README. ViMax markets the qualitative advantages (consistency, length, narrative); quantitative ablations are not yet public.
Google API dependency. Veo and Nanobana are not free or open. Plan for cost.
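A back-of-the-envelope cost model helps with that planning. Every price is a placeholder you must replace with current figures from Google's pricing pages. The function, its parameters, and `retry_factor` (extra spend on shots regenerated after failed consistency checks) are assumptions, not ViMax features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Placeholder unit prices; fill in from the providers' current pricing."""
    video_per_second: float
    image_per_call: float
    chat_per_million_tokens: float


def estimate_cost(shots: int, seconds_per_shot: float, images_per_shot: int,
                  chat_tokens: int, prices: Prices, retry_factor: float = 1.0) -> float:
    """Rough total spend for one run, rounded to two decimals."""
    video = shots * seconds_per_shot * prices.video_per_second
    images = shots * images_per_shot * prices.image_per_call
    chat = chat_tokens / 1_000_000 * prices.chat_per_million_tokens
    # Regenerated shots repeat the video and image spend, not the planning chat.
    return round((video + images) * retry_factor + chat, 2)
```

With placeholder prices `Prices(0.50, 0.04, 1.00)`, 12 shots of 8 seconds with 2 images each and 200K chat tokens come to 49.16 in whatever currency the prices use; the video term dominates, which is why the storyboard is the cheap place to iterate.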

Real Use Cases #

Where ViMax’s agentic pipeline actually moves the needle:

Educational / explainer videos — multi-scene, character continuity, narrative structure. The classic “teacher’s voice plus animated examples” format.
Children’s content — short stories with consistent characters across scenes (the example use case in the README).
Marketing storyboards — generate a full script + storyboard from a campaign brief, then have the marketing team approve before the (more expensive) generation step.
Long-form social content — TikTok / Reels content that’s 60-90 seconds with a coherent micro-narrative (vs. 5-second single-shot clips that already saturate the feed).
Pre-visualization for film/TV — affordable previs that respects character consistency for actual production planning.

For each of these, the alternative without ViMax is either expensive human production or short-clip AI tools that can’t sustain a story.

Where ViMax Fits in the 2026 AI Video Landscape #

Pair ViMax with:

Image generators — already integrated (Nanobana), but you can swap to Stable Diffusion / ComfyUI for self-hosted image gen workflows.
TTS for voiceover — Supertonic for on-device multi-language voice; pair with ViMax for fully integrated narrated video.
Long-context LLMs — MiniMax-M2.7’s 1M context is the practical choice for full-feature scripts. The 12-Factor “own your context window” principle applies — the Screenwriter agent is exactly where context discipline matters most.
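Context discipline for the Screenwriter can be sketched as a budgeted prompt builder: always keep a character bible, then add the most recent scene summaries until a rough token budget is spent. The bible-plus-recent-summaries layout and the 4-characters-per-token estimate are assumptions for illustration, not how ViMax manages context.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token for English prose."""
    return (len(text) + 3) // 4


def build_context(bible: str, summaries: list[str], budget_tokens: int) -> str:
    """Keep the character bible, then as many of the most recent scene
    summaries as fit in the rough token budget, in story order."""
    used = estimate_tokens(bible)
    kept: list[str] = []
    for summary in reversed(summaries):  # newest first
        cost = estimate_tokens(summary)
        if used + cost > budget_tokens:
            break
        kept.append(summary)
        used += cost
    return "\n".join([bible, *reversed(kept)])
```

The bible is always included, even if it alone exceeds the budget, because losing character definitions is exactly the consistency failure the pipeline exists to prevent.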

The combination ViMax + Supertonic + open-source image gen is the closest 2026 has come to a “describe a movie, get a movie” pipeline that’s mostly under the user’s control.

Who Should Try ViMax #

Install if you:

Need narrative-coherent video longer than 30 seconds.
Are okay paying Google API rates for the final generation but want orchestration in your control.
Are researching multi-agent creative workflows and want a reference implementation.
Build content tooling for clients and want a pipeline that can produce drafts in minutes that a human can review.

Skip if you:

Need single-shot 10-second video and Sora/Runway already work for you.
Aren’t comfortable with researcher-grade Python tooling.
Need fully self-hosted end-to-end (wait one more cycle of open video models).

Verdict #

ViMax is the most credible 2026 evidence that the next jump in AI video quality isn’t a bigger model — it’s better orchestration. By treating video production as a multi-agent problem with separate Director, Screenwriter, Producer, and Generator roles, HKUDS unlocks the long-form coherent video that a single-prompt diffusion model fundamentally cannot deliver.

The MIT license, HKUDS academic backing, and the 9,807 stars in a few months point to a tool the open video community has been waiting for. It’s early — no formal release, no benchmarks, hard dependency on commercial APIs — but the architecture is right. Expect this pattern (agentic orchestration of generation models) to spread through every creative AI vertical in the next 12 months.

If you’ve ever produced a video with a script, this is the AI workflow that finally maps to how the work actually gets done.

GitHub: HKUDS/ViMax · License: MIT · Stars: 9.8K+ · Authors: Hong Kong University Data Science Lab · Status: Active development, no tagged release yet

Recommended Infrastructure for Self-Hosting #

If you want to run this stack reliably 24/7, infrastructure choice matters:

DigitalOcean — $200 free credit for 60 days across 14+ global regions. Default choice for indie devs running open-source AI tools.
HTStack — Hong Kong VPS with low-latency access from mainland China. dibi8.com is hosted here — battle-tested in production.

Affiliate links — they do not cost you extra and help keep dibi8.com running.
