AgentConn

UI-TARS-desktop

Categories: Coding, Free

About UI-TARS-desktop

UI-TARS-desktop is the desktop application from ByteDance's open-source UI-TARS visual-agent stack. Driven by the UI-TARS-1.5 multimodal model and the Seed-1.5-VL/1.6 model series, it operates a local computer by taking screenshots, reasoning over them, and emitting mouse/keyboard actions. UI-TARS leads on the VisualWebBench benchmark (82.8%, ahead of GPT-4o and Claude 3.5 Sonnet) and on OSWorld at a 50-step budget (24.6 vs. Claude Computer Use's 22.0). The v0.2.0 release added a Remote Computer Operator and a Remote Browser Operator, both free. The agent is the visual-agent counterpart to Anthropic's Computer Use and OpenAI's Operator, with two structural differences: it ships open-source, and it transfers strongly to the mobile domain, where Claude visibly weakens.

Key Features

  • Vision-driven GUI agent — works against any application by screenshotting and reasoning
  • Strong cross-domain coverage: web (VisualWebBench 82.8%), desktop (OSWorld 24.6 at 50 steps), mobile (AndroidWorld), and Windows (WindowsAgentArena)
  • Open-source weights and repo, runs against local or remote model backends
  • v0.2.0 Remote Computer Operator and Remote Browser Operator — both free
  • Multimodal stack also ships Agent TARS (terminal/computer/browser/product) alongside the desktop app
  • Cross-platform — macOS, Windows, and Linux desktop builds published

Overview

UI-TARS-desktop is the local-machine GUI agent piece of ByteDance’s broader UI-TARS open-source visual-agent stack. The repository description is direct: “The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra.” Two projects ship from the same effort — Agent TARS (a general multimodal stack across terminal, computer, browser, and product surfaces) and UI-TARS-desktop (the local-machine desktop product). The desktop product is the one most operators will encounter first.

The agent is a pure visual-agent design: the model takes screenshots of the user’s screen, reasons over them, and emits mouse/keyboard actions back. There is no DOM access, no application API integration, no privileged hooks. The agent operates only through what it can see — the same architectural family as Anthropic’s Claude Computer Use and OpenAI’s Operator.
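The observe-reason-act loop described above can be sketched in a few lines. This is a minimal illustration, not the real UI-TARS implementation: the `Action` schema and the injected `capture_screen`, `query_model`, and `execute` callables are all hypothetical names standing in for whatever the actual stack provides.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """A single GUI action the model requests (hypothetical schema)."""
    kind: str          # e.g. "click", "type", "done"
    x: int = 0
    y: int = 0
    text: str = ""

def run_agent(task, capture_screen, query_model, execute, max_steps=15):
    """Pure-visual agent loop: screenshot -> reason -> act.

    The model sees only pixels: no DOM access, no application API,
    no privileged hooks. The three callables are injected so the
    loop stays backend-agnostic.
    """
    history = []
    for _ in range(max_steps):
        shot = capture_screen()                  # pixels are the only observation
        action = query_model(task, shot, history)
        history.append(action)
        if action.kind == "done":                # model signals task completion
            break
        execute(action)                          # emit the mouse/keyboard event
    return history
```

Note the `max_steps` budget: it is the same knob the OSWorld numbers vary (15 vs. 50 steps), and capping it bounds how long a misbehaving run can flail.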

Why It Matters

UI-TARS-desktop sat at GitHub trending #6 in early May 2026 inside a broader Chinese-AI ecosystem surge — alongside DeepSeek-TUI at #1 (positioned as a “Claude Code killer”) and datawhalechina/hello-agents at #8. It is also one of the few open-source visual-agent stacks with credible published benchmarks against the Western incumbents:

  • VisualWebBench: UI-TARS 72B at 82.8%, GPT-4o at 78.5%, Claude 3.5 Sonnet at 78.2%
  • OSWorld (50 steps): UI-TARS at 24.6, Claude Computer Use at 22.0
  • OSWorld (15 steps): UI-TARS at 22.7, Claude at 14.9
  • OSWorld (UI-TARS-2): 47.5%; AndroidWorld: 73.3%; WindowsAgentArena: 50.6%
  • OpenAI Operator on OSWorld: 38.1% (record at launch but trailing UI-TARS-2)

By ByteDance’s own measurement, Claude Computer Use “performs strongly in web-based tasks but significantly struggles with mobile scenarios — the GUI operation ability of Claude has not been well transferred to the mobile domain.” UI-TARS is the only one of the three that ships strong cross-domain coverage out of the box. For agent operators whose use cases include a phone or a tablet, that is the structural difference.

Use Cases

UI-TARS-desktop fits best in scenarios that combine one or more of: cross-domain coverage (web + mobile + desktop), open-source self-hosted deployment, no U.S.-vendor dependency, or high-volume cost-sensitive automation where API-priced incumbents become expensive faster than expected. The free remote operators in v0.2.0 also make it a low-friction first run — operators can evaluate the agent loop without standing up a local model.

Limitations

Documentation density is lower than Anthropic’s polished how-to material. The tooling ecosystem is younger — there is no UI-TARS equivalent of cc-switch or the broader Claude Code agentic stack yet. Safety reasoning is less developed than Anthropic’s filters or OpenAI Operator’s managed-environment safety layer; UI-TARS expects the operator to apply guardrails. The English-language community is smaller than the Chinese-language community — most operator commentary, recipes, and examples surface in Chinese first.

Pricing

The desktop application is free and open-source. Compute cost depends on the model backend chosen — local UI-TARS weights, a hosted UI-TARS endpoint, or another vision-capable model via the SDK.
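Since the backend is swappable, the choice reduces to pointing the app at a different endpoint. A minimal sketch of that selection, assuming OpenAI-compatible endpoints; the URLs and model names below are placeholders, not real UI-TARS defaults:

```python
# Hypothetical backend descriptors. base_url and model values are
# illustrative placeholders, not documented UI-TARS settings.
BACKENDS = {
    "local":  {"base_url": "http://localhost:8000/v1", "model": "ui-tars-1.5-7b"},
    "hosted": {"base_url": "https://example.com/v1",   "model": "ui-tars-1.5-72b"},
}

def backend_config(name: str) -> dict:
    """Return the endpoint config for the chosen backend, or raise."""
    if name not in BACKENDS:
        raise ValueError(f"unknown backend: {name}")
    return BACKENDS[name]
```

The point of the indirection is cost control: the free desktop app stays the same while the operator trades off local compute (weights on their own GPU) against a hosted endpoint's per-token pricing.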

Similar Agents