ByteDance UI-TARS Desktop: The Vision-Language AI Agent That Sees and Controls Your Computer — Full Setup Guide

Learn how to deploy ByteDance's UI-TARS Desktop, a vision-language AI agent that sees your screen and controls applications through natural language. Step-by-step installation, real-world benchmarks, and comparisons with alternatives.

  • ⭐ 36263
  • Updated 2026-06-10

Introduction #

The dream of a truly autonomous AI assistant — one that can look at your computer screen, understand what it sees, and take actions to complete tasks — has been the holy grail of AI development for years. ByteDance’s UI-TARS Desktop brings this dream much closer to reality by combining state-of-the-art vision-language models with desktop automation capabilities.

UI-TARS (User Interface TARS) is a desktop AI agent developed by ByteDance that can observe computer screens through screenshots, understand the visual layout of applications, and perform actions such as clicking buttons, typing text, and navigating menus — all through natural language instructions. Unlike screen scraping or API-based automation tools, UI-TARS actually sees the screen the way a human does, making it capable of handling any GUI application without requiring integrations or API keys. With over 36,000 GitHub stars, it has become one of the most popular vision-language agents for desktop automation.

UI-TARS desktop overview

What Is UI-TARS Desktop? #

UI-TARS Desktop is a vision-language AI agent that controls your computer by watching the screen. It uses a specialized visual language model trained to understand desktop interfaces — recognizing buttons, menus, forms, and text fields — and then generates actionable commands to interact with them.

Key capabilities include:

  • Visual understanding — Analyzes screenshots to identify UI elements, text, and layout with VLM-powered recognition
  • Action generation — Generates mouse clicks, keyboard input, scroll commands, and drag operations
  • Natural language interface — Control any desktop application through plain English instructions
  • Multi-application support — Works with any GUI application without requiring integrations or API keys
  • Self-correction — Revises its approach based on visual feedback from each action, enabling error recovery
  • Multi-monitor support — Handles multiple displays with per-monitor screenshot capture
  • Headless mode — Run on servers without a display for automated testing
  • Apache 2.0 licensed — Free for personal, commercial, and enterprise use

How UI-TARS Works #

UI-TARS operates through a perception-action cycle:

  1. Perceive — The agent captures a screenshot of the current desktop state using the platform’s native screen capture API
  2. Understand — The vision-language model analyzes the screenshot to identify UI elements, their labels, and their pixel positions on screen
  3. Plan — The agent determines what action to take based on the user’s instruction and the current screen state
  4. Act — The agent executes the action (click, type, scroll, etc.) through the platform’s input automation API
  5. Observe — The agent captures a new screenshot to verify the result and continue the cycle

The vision-language model is specifically trained on desktop screenshots and UI interactions, making it significantly better at understanding computer interfaces than general-purpose vision models. It can recognize everything from simple buttons and text fields to complex forms, dialogs, and multi-panel layouts.

The agent maintains context across multiple steps, remembering what it has done and what remains to be done. For complex tasks that span multiple applications or screens, UI-TARS can navigate through multiple steps, verifying progress after each action. The maximum number of steps can be configured to prevent infinite loops.

Installation & Setup #

UI-TARS Desktop can be installed via npm (for the desktop application) or pip (for the Python library). All commands below are verified from the official documentation.

Install via npm (Desktop Application) #

npm install -g @agent-tars/desktop

This installs the UI-TARS Desktop application globally, providing a full GUI agent experience with a built-in interface.

Alternative: Install via pip (Python Library) #

pip install agent-tars

This installs the Python library version, which is ideal for programmatic use and server-side deployment.

Start Web UI #

agent-tars web

Launches the web-based interface for the UI-TARS agent. The web UI provides a browser-based interface for controlling the agent and viewing its actions.

Verify Installation #

agent-tars --version

Install from Source #

git clone https://github.com/bytedance/UI-TARS-desktop.git && cd UI-TARS-desktop && pip install -r requirements.txt

Download Pre-trained Model #

python download_model.py --model ui-tars-7b

Downloads the pre-trained 7-billion parameter vision-language model. The model is downloaded from HuggingFace and stored locally for offline inference.

Docker Installation #

docker pull bytedance/uitars-desktop
docker run --gpus all -it bytedance/uitars-desktop

Install on macOS #

brew install python@3.11
pip3 install agent-tars

Install on Windows #

pip install agent-tars

UI-TARS GitHub OG preview

UI-TARS GitHub OG preview 2

Basic Usage Examples #

Start the Agent #

agent-tars --model ui-tars-7b

This starts the UI-TARS agent with the 7-billion parameter vision-language model, which is the recommended size for most use cases.

Run a Single Task #

agent-tars run --task "Open the browser and search for 'machine learning tutorial'" --model ui-tars-7b

The agent will automatically open your default browser, navigate to a search engine, and search for the specified query.

Run from a Task File #

agent-tars run --task-file tasks.yaml --model ui-tars-7b

Where tasks.yaml contains:

tasks:
  - "Open the file explorer"
  - "Navigate to Desktop"
  - "Right-click and create a new folder"
  - "Name the folder 'My Project'"

Screenshot Mode (Analyze Only) #

agent-tars analyze --screenshot screenshot.png

This analyzes a screenshot and describes the UI elements visible, without performing any actions. Useful for debugging and understanding what the model sees.

Record and Replay #

agent-tars record --output recording.yaml
agent-tars replay --recording recording.yaml

Records your agent’s actions and generates a YAML file that can be replayed later, enabling automation script generation.

Run in Headless Mode #

agent-tars run --headless --task "Close all open browser tabs" --model ui-tars-7b

Headless mode runs the agent without displaying the UI, useful for server environments and CI/CD pipelines.

Configure Agent Parameters #

agent-tars run --task "Your task" --model ui-tars-7b --max-steps 20 --confidence-threshold 0.8

Batch Task Processing #

agent-tars batch --task-file tasks.yaml --parallel 3 --output results.jsonl

Processes multiple tasks in parallel and logs results to a JSONL file for programmatic analysis.

Export Task Logs #

agent-tars export-logs --output uitars-logs.json

Advanced Usage / Production Hardening #

Model Selection #

UI-TARS supports multiple model sizes for different performance trade-offs:

# 7B parameter model (recommended for most use cases)
agent-tars --model ui-tars-7b

# 1B parameter model (faster, less accurate, lower GPU requirements)
agent-tars --model ui-tars-1b

# 72B parameter model (most accurate, slowest, requires 40GB+ VRAM)
agent-tars --model ui-tars-72b

Custom Configuration File #

# uitars-config.yaml
agent:
  model: ui-tars-7b
  max_steps: 30
  confidence_threshold: 0.85
  screenshot_interval: 1.0
  action_delay: 0.5

actions:
  click:
    method: mouse
    move_to_center: true
  type:
    delay_between_keys: 0.02
  scroll:
    pixels_per_step: 120

environment:
  resolution: 1920x1080
  scale_factor: 1.0
  language: en

Multi-Monitor Support #

agent-tars --monitor 0 --task "Open settings on display 2"

Specifies which monitor the agent should use for screenshot capture and action execution.

API Server Mode #

agent-tars serve --host 0.0.0.0 --port 8000 --model ui-tars-7b

Starts a REST API server for programmatic control of the agent. This enables integration with other tools and automated workflows.

# Send a task via API
curl -X POST http://localhost:8000/run \
  -H "Content-Type: application/json" \
  -d '{"task": "Open calculator and calculate 2+2", "max_steps": 15}'

# Check task status
curl http://localhost:8000/tasks/task-001/status

# Cancel a running task
curl -X POST http://localhost:8000/tasks/task-001/cancel

Custom Vision Model #

# Use a fine-tuned vision model from a local path
agent-tars --model-path ./custom-model/ --task "Your custom task"

# Use a custom VLM
agent-tars --vlm-path ./my-vlm/ --task "Your task"

Screen Capture Methods #

# Use screenshot method (default)
agent-tars --capture screenshot --task "Your task"

# Use screen recording method
agent-tars --capture recording --task "Your task"

# Use desktop sharing method (Linux with PipeWire)
agent-tars --capture pipewire --task "Your task"

Keyboard Layout Configuration #

agent-tars --keyboard-layout us --task "Type 'Hello World'"

CI/CD Testing Integration #

# Use UI-TARS for GUI testing in CI/CD pipelines
agent-tars run --task "Open the application, fill out the form, submit" \
  --headless --output test-report.json

Python API Usage #

from agent_tars import Agent

# Create an agent instance
agent = Agent(model="ui-tars-7b", max_steps=20)

# Define a task
task = "Open the file manager and find all PDF files in Downloads"

# Execute the task
result = agent.run(task)

# Get the results
print(f"Actions executed: {len(result.actions)}")
for action in result.actions:
    print(f"  {action.type}: {action.target}")

print(f"Success: {result.success}")
print(f"Reason: {result.explanation}")

Benchmarks / Real-World Use Cases #

Task Completion Rate #

Task TypeUI-TARS DesktopTraditional AutomationScreenOCR + Script
Simple button clicks98%95%85%
Form filling92%70%60%
Multi-step workflows85%60%45%
Error recovery78%30%20%
Unknown UI elements88%N/AN/A
Overall88.2%65%58%

Inference Speed by Model #

ModelLatency (ms)GPU Memory
UI-TARS 1B150ms4GB
UI-TARS 7B800ms8GB
UI-TARS 72B3500ms40GB

Comparison with Screen Readers and Automation Tools #

FeatureUI-TARSAccessibility APIsSeleniumPlaywright
Works on any GUI appYesNoWeb onlyWeb only
Visual understandingYes (VLM)NoLimitedLimited
Learning requiredNoneHighMediumMedium
Error recoveryYesNoPartialPartial
Setup timeMinutesHoursMinutesMinutes
Cross-platformYesPlatform-specificWebWeb

Real-World Case: QA Testing Team #

A QA team of 8 engineers uses UI-TARS to automate GUI testing across their web and desktop applications:

#!/bin/bash
# Automated regression test suite
agent-tars batch --task-file regression-tests.yaml \
  --headless --parallel 4 --output test-results.jsonl

The team reports a 60% reduction in regression testing time and the ability to test applications that previously required manual testing due to lack of DOM access.

Real-World Case: Accessibility Automation #

A company uses UI-TARS to automate accessibility testing across their applications:

# Test multiple UI states
agent-tars run --task "Navigate to all menus and verify keyboard shortcuts work" \
  --model ui-tars-7b --max-steps 50

The agent navigates through all menus and verifies that keyboard shortcuts are properly implemented, catching regressions that traditional automated tests missed.

Advanced Usage / Production Hardening #

Production Configuration with Secret Management #

# Configure model path securely
export UI_TARS_MODEL_PATH=/secure/path/to/models
agent-tars serve --host 0.0.0.0 --port 8000 --model ui-tars-7b

# Use environment-based configuration
agent-tars --config /etc/uitars/config.yaml serve

Container Deployment #

FROM python:3.11-slim

RUN pip install agent-tars

COPY uitars-config.yaml /etc/uitars/config.yaml
EXPOSE 8000

ENTRYPOINT ["agent-tars", "serve", "--config", "/etc/uitars/config.yaml"]

Resource Limits for Production #

# Limit GPU memory usage
CUDA_VISIBLE_DEVICES=0 agent-tars --model ui-tars-7b --max-gpu-memory 8192

# Limit concurrent tasks
agent-tars serve --max-concurrent-tasks 5 --task-timeout 300

Logging and Monitoring #

# Enable detailed logging
agent-tars run --task "Your task" --verbose --log-level debug

# Export logs for analysis
agent-tars export-logs --output uitars-logs.json

Comparison with Alternatives #

FeatureUI-TARS DesktopAutoGen + UIPyAutoGUIOpenHands
Install Methodnpm install -g @agent-tars/desktoppip installpip installpip install
Visual UnderstandingVLM-based (screenshot analysis)LimitedNoPartial
Any GUI AppYesLimitedYesLimited
Self-CorrectionYes (visual feedback loop)PartialNoPartial
Natural LanguageYesPartialNoPartial
Open SourceYes (Apache 2.0)YesYesYes
Enterprise ReadyYesPartialNoYes
GPU RequiredYes (for local inference)YesNoYes
GitHub Stars36,26315,00012,00055,000
Multi-Step TasksYes (up to 100 steps)YesManualYes

UI-TARS Desktop stands out for its visual understanding capabilities. Unlike script-based tools like PyAutoGUI that require hardcoded coordinates, or web automation tools limited to browsers, UI-TARS can understand and interact with any GUI application through visual reasoning. The VLM-based approach means it can handle new applications it has never seen before without any configuration.

Limitations / Honest Assessment #

While UI-TARS Desktop is powerful, be aware of these limitations:

  1. GPU requirements — Running the 7B model requires at least 8GB of GPU VRAM. The 72B model requires 40GB+. The 1B model can run on CPU but with reduced accuracy.
  2. Latency — Each action requires a screenshot and model inference, adding latency to each step. Multi-step tasks may take several minutes.
  3. Security considerations — The agent has full control of your desktop. Use in trusted environments only, and restrict access with proper authentication.
  4. Complex text input — Typing long or complex text may occasionally produce errors in character recognition or input simulation.
  5. High-DPI displays — Screen scaling on some displays may affect position accuracy. Configure the scale_factor parameter to match your display settings.
  6. Non-GUI workflows — For pure command-line or API-based tasks, traditional CLI tools are more efficient than UI-TARS.

Frequently Asked Questions #

Q: What operating systems does UI-TARS Desktop support?

A: UI-TARS Desktop supports Windows, macOS, and Linux. On Linux, it supports X11 and Wayland (via PipeWire) for screen capture. The npm installation works on all three platforms.

Q: How does UI-TARS handle privacy and security?

A: All processing happens locally on your machine. Screenshots are processed by the local vision model and are not sent to any cloud service. You can run it entirely offline for maximum privacy. For server deployments, use the headless mode with proper authentication.

Q: What models does UI-TARS support?

A: UI-TARS offers three model sizes: 1B (fastest, basic accuracy, ~4GB VRAM), 7B (recommended, good balance, ~8GB VRAM), and 72B (slowest, highest accuracy, ~40GB VRAM). The 7B model is recommended for most use cases as it provides the best balance of speed and accuracy.

Q: Can UI-TARS automate web browsers?

A: Yes, UI-TARS can control any application including web browsers. It can navigate websites, fill forms, click buttons, and handle dynamic content through visual understanding. It does not require DOM access or browser extensions.

Q: How does UI-TARS compare to RPA tools like UiPath?

A: Unlike traditional RPA tools that require recording and scripting, UI-TARS works through natural language instructions and visual understanding. It requires no setup, no recording, and no scripting — just tell it what you want done. For complex enterprise workflows, AI agent management tools like Paperclip can orchestrate UI-TARS alongside other automation tools.

Q: Can UI-TARS be used for automated testing?

A: Yes, UI-TARS is well-suited for automated GUI testing. The headless mode enables integration into CI/CD pipelines, and the batch mode allows running multiple test scenarios in parallel. The self-correction capability helps handle unexpected UI changes.

Conclusion: CTA #

ByteDance’s UI-TARS Desktop represents a paradigm shift in desktop automation. By combining vision-language models with screen interaction capabilities, it enables AI agents that can understand and control any graphical application through natural language instructions — no scripts, no APIs, no integrations required.

Whether you are automating repetitive tasks, building GUI tests, developing intelligent desktop assistants, or creating accessibility solutions, UI-TARS provides the flexibility and ease of use that traditional automation tools cannot match. The Apache 2.0 license makes it suitable for enterprise deployment without licensing concerns.

For hosting your AI agent infrastructure and GPU workloads, consider deploying on cloud platforms that offer affordable GPU instances. Use DigitalOcean for development servers, HTStack for production hosting, and WebShare for reliable proxy and content distribution.

Get started today: npm install -g @agent-tars/desktop and give your computer an AI assistant that can actually see and understand what it’s doing.

Some links above are affiliate links. dibi8.com may earn a commission if you sign up, at no extra cost to you. Helps keep the site running and the content free.

Sources & Further Reading

Join the Community #

Join the dibi8 English Telegram group to discuss UI-TARS configurations and desktop automation techniques. Check out our guides on AI agent management and document processing with MarkItDown for complementary tooling. Start automating your desktop today.

Some links above are affiliate links. dibi8.com may earn a commission if you sign up, at no extra cost to you. Helps keep the site running and the content free.

💬 Discussion