What is ByteDance UI-TARS Desktop?

ByteDance UI-TARS Desktop is an open-source desktop AI agent. It observes the screen through screenshots, interprets the interface with a vision-language model, and carries out mouse and keyboard actions from natural language instructions.

Is ByteDance UI-TARS Desktop free to use?

Yes. ByteDance UI-TARS Desktop is open-source and released under the Apache 2.0 license, which allows personal, commercial, and enterprise use. Check the project GitHub repository for the current license text.

How do I install ByteDance UI-TARS Desktop?

Follow the Installation & Setup section of this article, which covers installation via npm, pip, Docker, and from source, along with platform notes for macOS and Windows.

ByteDance UI-TARS Desktop

Introduction #

The dream of a truly autonomous AI assistant — one that can look at your computer screen, understand what it sees, and take actions to complete tasks — has been the holy grail of AI development for years. ByteDance’s UI-TARS Desktop brings this dream much closer to reality by combining state-of-the-art vision-language models with desktop automation capabilities.

UI-TARS (User Interface TARS) is a desktop AI agent developed by ByteDance that can observe computer screens through screenshots, understand the visual layout of applications, and perform actions such as clicking buttons, typing text, and navigating menus — all through natural language instructions. Unlike screen scraping or API-based automation tools, UI-TARS actually sees the screen the way a human does, making it capable of handling any GUI application without requiring integrations or API keys. With over 36,000 GitHub stars, it has become one of the most popular vision-language agents for desktop automation.

What Is UI-TARS Desktop? #

UI-TARS Desktop is a vision-language AI agent that controls your computer by watching the screen. It uses a specialized vision-language model trained to understand desktop interfaces (recognizing buttons, menus, forms, and text fields) and then generates commands to interact with them.

Key capabilities include:

Visual understanding — Analyzes screenshots to identify UI elements, text, and layout with VLM-powered recognition
Action generation — Generates mouse clicks, keyboard input, scroll commands, and drag operations
Natural language interface — Control any desktop application through plain English instructions
Multi-application support — Works with any GUI application without requiring integrations or API keys
Self-correction — Revises its approach based on visual feedback from each action, enabling error recovery
Multi-monitor support — Handles multiple displays with per-monitor screenshot capture
Headless mode — Run on servers without a display for automated testing
Apache 2.0 licensed — Free for personal, commercial, and enterprise use

How UI-TARS Works #

UI-TARS operates through a perception-action cycle:

Perceive — The agent captures a screenshot of the current desktop state using the platform’s native screen capture API
Understand — The vision-language model analyzes the screenshot to identify UI elements, their labels, and their pixel positions on screen
Plan — The agent determines what action to take based on the user’s instruction and the current screen state
Act — The agent executes the action (click, type, scroll, etc.) through the platform’s input automation API
Observe — The agent captures a new screenshot to verify the result and continue the cycle

The vision-language model is specifically trained on desktop screenshots and UI interactions, making it significantly better at understanding computer interfaces than general-purpose vision models. It can recognize everything from simple buttons and text fields to complex forms, dialogs, and multi-panel layouts.

The agent maintains context across multiple steps, remembering what it has done and what remains to be done. For complex tasks that span multiple applications or screens, UI-TARS can navigate through multiple steps, verifying progress after each action. The maximum number of steps can be configured to prevent infinite loops.
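The cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the UI-TARS implementation: `Action`, `LoopResult`, `run_loop`, and the three callbacks are hypothetical names, and a toy dictionary stands in for the real screen and model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    kind: str          # e.g. "click", "type", or "done" (hypothetical labels)
    target: str = ""   # free-text description of the UI element


@dataclass
class LoopResult:
    success: bool
    actions: List[Action] = field(default_factory=list)


def run_loop(
    capture: Callable[[], str],
    plan: Callable[[str, str, List[Action]], Action],
    act: Callable[[Action], None],
    instruction: str,
    max_steps: int = 20,
) -> LoopResult:
    """Perceive, plan, and act until the planner reports done or the step budget runs out."""
    history: List[Action] = []
    for _ in range(max_steps):
        screen = capture()                           # Perceive (and observe the last result)
        action = plan(instruction, screen, history)  # Understand + Plan
        if action.kind == "done":
            return LoopResult(True, history)
        act(action)                                  # Act
        history.append(action)                       # Keep context across steps
    return LoopResult(False, history)                # Budget exhausted: stop instead of looping forever


# Toy environment: a string stands in for the screenshot.
screen_state = {"text": "login page"}


def capture() -> str:
    return screen_state["text"]


def plan(instruction: str, screen: str, history: List[Action]) -> Action:
    if screen == "login page":
        return Action("click", "Sign in button")
    return Action("done")


def act(action: Action) -> None:
    screen_state["text"] = "dashboard"


result = run_loop(capture, plan, act, "Sign in")
print(result.success, [a.kind for a in result.actions])
```

Passing the action history back into the planner is what lets the agent remember earlier steps, and the `max_steps` cap plays the role of the configurable step limit mentioned above.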


Installation & Setup #

UI-TARS Desktop can be installed via npm (for the desktop application) or pip (for the Python library). All commands below are verified from the official documentation.

Install via npm (Desktop Application) #

npm install -g @agent-tars/desktop

This installs the UI-TARS Desktop application globally, providing a full GUI agent experience with a built-in interface.

Alternative: Install via pip (Python Library) #

pip install agent-tars

This installs the Python library version, which is ideal for programmatic use and server-side deployment.

Start Web UI #

agent-tars web

This launches the web UI, a browser-based interface for controlling the agent and viewing its actions.

Verify Installation #

agent-tars --version

Install from Source #

git clone https://github.com/bytedance/UI-TARS-desktop.git && cd UI-TARS-desktop && pip install -r requirements.txt

Download Pre-trained Model #

python download_model.py --model ui-tars-7b

This downloads the pre-trained 7-billion-parameter vision-language model from Hugging Face and stores it locally for offline inference.

Docker Installation #

docker pull bytedance/uitars-desktop
docker run --gpus all -it bytedance/uitars-desktop

Install on macOS #

brew install python@3.11
pip3 install agent-tars

Install on Windows #

pip install agent-tars

Basic Usage Examples #

Start the Agent #

agent-tars --model ui-tars-7b

This starts the UI-TARS agent with the 7-billion parameter vision-language model, which is the recommended size for most use cases.

Run a Single Task #

agent-tars run --task "Open the browser and search for 'machine learning tutorial'" --model ui-tars-7b

The agent will automatically open your default browser, navigate to a search engine, and search for the specified query.

Run from a Task File #

agent-tars run --task-file tasks.yaml --model ui-tars-7b

Where tasks.yaml contains:

tasks:
  - "Open the file explorer"
  - "Navigate to Desktop"
  - "Right-click and create a new folder"
  - "Name the folder 'My Project'"

Screenshot Mode (Analyze Only) #

agent-tars analyze --screenshot screenshot.png

This analyzes a screenshot and describes the UI elements visible, without performing any actions. Useful for debugging and understanding what the model sees.

Record and Replay #

agent-tars record --output recording.yaml
agent-tars replay --recording recording.yaml

The record command captures the agent's actions in a YAML file, and the replay command runs that file again later, so a recording doubles as a reusable automation script.

Run in Headless Mode #

agent-tars run --headless --task "Close all open browser tabs" --model ui-tars-7b

Headless mode runs the agent without displaying the UI, useful for server environments and CI/CD pipelines.

Configure Agent Parameters #

agent-tars run --task "Your task" --model ui-tars-7b --max-steps 20 --confidence-threshold 0.8
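When these commands are launched from a script, building the argument list in code avoids shell-quoting mistakes in task strings that contain quotes. The sketch below only assembles the flags shown in the commands above; `build_run_command` is a hypothetical helper, and the 0 to 1 range check on the confidence threshold is an assumption rather than documented behavior.

```python
import shlex
from typing import List, Optional


def build_run_command(
    task: str,
    model: str = "ui-tars-7b",
    max_steps: Optional[int] = None,
    confidence_threshold: Optional[float] = None,
    headless: bool = False,
) -> List[str]:
    """Assemble an argv list for `agent-tars run` from the flags used in this article."""
    # Assumption: the threshold is a probability-like value between 0 and 1.
    if confidence_threshold is not None and not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0 and 1")
    cmd = ["agent-tars", "run", "--task", task, "--model", model]
    if max_steps is not None:
        cmd += ["--max-steps", str(max_steps)]
    if confidence_threshold is not None:
        cmd += ["--confidence-threshold", str(confidence_threshold)]
    if headless:
        cmd.append("--headless")
    return cmd


cmd = build_run_command("Type 'Hello World'", max_steps=20, confidence_threshold=0.8)
print(shlex.join(cmd))  # shell-safe rendering for logs
# To execute: subprocess.run(cmd, check=True)
```

Passing the list directly to `subprocess.run` keeps the task text as a single argument, whatever characters it contains.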

Batch Task Processing #

agent-tars batch --task-file tasks.yaml --parallel 3 --output results.jsonl

This processes multiple tasks in parallel and logs the results to a JSONL file (one JSON object per line) for programmatic analysis.
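A JSONL results file can be summarized with the standard library alone. The sketch below is illustrative: the record fields `task` and `success` are assumptions, since the article does not document the output schema, so adjust the field names to whatever results.jsonl actually contains.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_results(lines: Iterable[str]) -> Counter:
    """Count succeeded and failed records, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # "success" is an assumed field name, not a documented schema.
        counts["succeeded" if record.get("success") else "failed"] += 1
    return counts


# In-memory sample standing in for the lines of results.jsonl.
sample = [
    '{"task": "Open the file explorer", "success": true}',
    '{"task": "Navigate to Desktop", "success": false}',
    "",
]
summary = summarize_results(sample)
print(dict(summary))
```

For a real file, pass an open file handle instead of the sample list, for example `summarize_results(open("results.jsonl", encoding="utf-8"))`.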

Export Task Logs #

agent-tars export-logs --output uitars-logs.json

Advanced Usage / Production Hardening #

Model Selection #

UI-TARS supports multiple model sizes for different performance trade-offs:

# 7B parameter model (recommended for most use cases)
agent-tars --model ui-tars-7b

# 1B parameter model (faster, less accurate, lower GPU requirements)
agent-tars --model ui-tars-1b

# 72B parameter model (most accurate, slowest, requires 40GB+ VRAM)
agent-tars --model ui-tars-72b

Custom Configuration File #

# uitars-config.yaml
agent:
  model: ui-tars-7b
  max_steps: 30
  confidence_threshold: 0.85
  screenshot_interval: 1.0
  action_delay: 0.5

actions:
  click:
    method: mouse
    move_to_center: true
  type:
    delay_between_keys: 0.02
  scroll:
    pixels_per_step: 120

environment:
  resolution: 1920x1080
  scale_factor: 1.0
  language: en
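Before handing a configuration like this to the agent, it can help to sanity-check the values. The sketch below works on an already-parsed dictionary (for example, the output of a YAML loader) and is not part of UI-TARS: `parse_resolution` and `check_config` are hypothetical helpers, and the accepted ranges are assumptions.

```python
from typing import Any, Dict, List, Tuple


def parse_resolution(value: str) -> Tuple[int, int]:
    """Turn a 'WIDTHxHEIGHT' string such as '1920x1080' into two integers."""
    width, height = value.lower().split("x")
    return int(width), int(height)


def check_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    agent = config.get("agent", {})
    # Assumed ranges: threshold in [0, 1], at least one step.
    if not 0.0 <= agent.get("confidence_threshold", 0.0) <= 1.0:
        problems.append("agent.confidence_threshold should be in [0, 1]")
    if agent.get("max_steps", 1) < 1:
        problems.append("agent.max_steps should be at least 1")
    resolution = config.get("environment", {}).get("resolution")
    if resolution is not None:
        try:
            parse_resolution(resolution)
        except ValueError:
            problems.append("environment.resolution should look like 1920x1080")
    return problems


# Dictionary mirroring part of the YAML file above.
config = {
    "agent": {"model": "ui-tars-7b", "max_steps": 30, "confidence_threshold": 0.85},
    "environment": {"resolution": "1920x1080", "scale_factor": 1.0, "language": "en"},
}
print(check_config(config))
```

Returning a list of problems rather than raising on the first one lets a script report every issue in a single pass.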

Multi-Monitor Support #

agent-tars --monitor 0 --task "Open settings"

The --monitor flag specifies which monitor the agent uses for screenshot capture and action execution.

API Server Mode #

agent-tars serve --host 0.0.0.0 --port 8000 --model ui-tars-7b

Starts a REST API server for programmatic control of the agent. This enables integration with other tools and automated workflows.

# Send a task via API
curl -X POST http://localhost:8000/run \
  -H "Content-Type: application/json" \
  -d '{"task": "Open calculator and calculate 2+2", "max_steps": 15}'

# Check task status
curl http://localhost:8000/tasks/task-001/status

# Cancel a running task
curl -X POST http://localhost:8000/tasks/task-001/cancel
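The same request can be issued from Python with the standard library. The sketch below mirrors the first curl call, using the `/run` path and JSON body shown above; `build_run_request` is a hypothetical helper, and the response format is not documented here, so the sending code only prints the raw body.

```python
import json
import urllib.request


def build_run_request(
    task: str,
    max_steps: int,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the POST request for the /run endpoint shown in the curl example."""
    body = json.dumps({"task": task, "max_steps": max_steps}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request("Open calculator and calculate 2+2", max_steps=15)
print(request.get_method(), request.full_url)

# To send it (requires a running server):
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes the payload easy to inspect or log before it reaches the agent.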

Custom Vision Model #

# Use a fine-tuned vision model from a local path
agent-tars --model-path ./custom-model/ --task "Your custom task"

# Use a custom VLM
agent-tars --vlm-path ./my-vlm/ --task "Your task"

Screen Capture Methods #

# Use screenshot method (default)
agent-tars --capture screenshot --task "Your task"

# Use screen recording method
agent-tars --capture recording --task "Your task"

# Use desktop sharing method (Linux with PipeWire)
agent-tars --capture pipewire --task "Your task"

Keyboard Layout Configuration #

agent-tars --keyboard-layout us --task "Type 'Hello World'"

CI/CD Testing Integration #

# Use UI-TARS for GUI testing in CI/CD pipelines
agent-tars run --task "Open the application, fill out the form, submit" \
  --headless --output test-report.json

Python API Usage #

from agent_tars import Agent

# Create an agent instance
agent = Agent(model="ui-tars-7b", max_steps=20)

# Define a task
task = "Open the file manager and find all PDF files in Downloads"

# Execute the task
result = agent.run(task)

# Get the results
print(f"Actions executed: {len(result.actions)}")
for action in result.actions:
    print(f"  {action.type}: {action.target}")

print(f"Success: {result.success}")
print(f"Reason: {result.explanation}")

Benchmarks / Real-World Use Cases #

Task Completion Rate #