
UI-TARS Desktop

Framework-agnostic · Advanced · Computer Use · Open Source

UI-TARS Desktop is ByteDance's open-source multimodal agent that can see, understand, and interact with desktop applications. It uses vision models to understand the screen, plans actions, and executes them through mouse and keyboard — enabling AI agents that operate any desktop software.

Input / Output

Accepts

screenshot, task-description

Produces

mouse-actions, keyboard-actions, task-result
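The input/output contract above can be modeled as a small schema. The type and field names below are illustrative only, assumed for this sketch; the real project defines its own internal types:

```python
from dataclasses import dataclass, field

# Hypothetical schema for the inputs and outputs listed above;
# not part of the actual UI-TARS Desktop API.

@dataclass
class AgentInput:
    screenshot: bytes        # raw screen capture (e.g. PNG bytes)
    task_description: str    # natural-language goal

@dataclass
class AgentAction:
    kind: str                         # "mouse" or "keyboard"
    detail: dict = field(default_factory=dict)

@dataclass
class AgentOutput:
    actions: list                     # sequence of AgentAction
    task_result: str                  # summary of what was accomplished

inp = AgentInput(screenshot=b"\x89PNG", task_description="open settings")
out = AgentOutput(
    actions=[AgentAction("mouse", {"type": "click", "x": 120, "y": 48})],
    task_result="Settings window opened",
)
```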

Overview

UI-TARS Desktop brings AI computer use to open source. ByteDance’s multimodal agent sees your screen, understands UI elements, and interacts with any desktop application — handling multi-step tasks across applications.

How It Works

  1. Capture — Takes a screenshot of the current screen
  2. Understand — A vision model identifies UI elements and layout
  3. Plan — Determines the next action sequence toward the task goal
  4. Execute — Performs mouse clicks, keyboard input, and window management
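The loop above can be sketched in a few lines. Every component here is a stand-in of my own naming: a real agent would call a vision-language model for `understand`/`plan` and an OS automation layer for `execute`:

```python
# Minimal sketch of the capture -> understand -> plan -> execute loop.
# All functions are stand-ins, not UI-TARS Desktop APIs.

def capture_screen() -> bytes:
    # Stand-in for a real screenshot (e.g. via an OS API)
    return b"fake-screenshot"

def understand(screenshot: bytes, task: str) -> dict:
    # Stand-in for the vision model: returns located UI elements
    return {"elements": [{"name": "OK button", "x": 200, "y": 150}]}

def plan(observation: dict, task: str) -> list:
    # Stand-in planner: turn the observation into concrete actions
    el = observation["elements"][0]
    return [{"type": "click", "x": el["x"], "y": el["y"]}]

def execute(action: dict, log: list) -> None:
    # Stand-in executor: a real agent would drive mouse/keyboard here
    log.append(action)

def run_step(task: str, log: list) -> None:
    shot = capture_screen()
    obs = understand(shot, task)
    for action in plan(obs, task):
        execute(action, log)

log = []
run_step("press the OK button", log)
# log now holds one click action at the button's coordinates
```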

Use Cases

  • Desktop automation — Automate tasks across any application
  • Software testing — AI-driven testing of desktop apps
  • Data entry — Form filling across legacy apps
  • Process automation — Multi-application workflows

Getting Started

git clone https://github.com/bytedance/UI-TARS-desktop
cd UI-TARS-desktop
npm install && npm run start

Example

Task: "Open Photoshop, create 1920x1080 canvas, add blue gradient"
Agent: Opens Photoshop → New canvas → Gradient tool → Apply
Total: ~30 seconds (vs 2 minutes manually)
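Under the hood, UI-TARS-style models emit each step as a short function-call string that the executor parses into a concrete mouse or keyboard event. The grammar below (e.g. `click(start_box='(x,y)')`) is a simplified illustration and varies by model version:

```python
import re

# Minimal parser for a simplified UI-TARS-style action string such as
# "click(start_box='(120,48)')". Illustrative only; the real grammar
# differs between model releases.

ACTION_RE = re.compile(r"^(\w+)\((.*)\)$")
ARG_RE = re.compile(r"(\w+)='([^']*)'")

def parse_action(text: str) -> dict:
    m = ACTION_RE.match(text.strip())
    if not m:
        raise ValueError(f"unrecognized action: {text!r}")
    name, argstr = m.groups()
    args = dict(ARG_RE.findall(argstr))
    # Convert "(x,y)" coordinate strings into integer tuples
    for key, val in args.items():
        coords = re.match(r"\((\d+),\s*(\d+)\)", val)
        if coords:
            args[key] = (int(coords.group(1)), int(coords.group(2)))
    return {"action": name, "args": args}

parsed = parse_action("click(start_box='(120,48)')")
# parsed == {"action": "click", "args": {"start_box": (120, 48)}}
```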

Alternatives

  • CUA — Desktop sandboxes for computer-use agents
  • Anthropic Computer Use — Claude’s built-in computer use
  • Open Interpreter — Terminal-based computer control

Tags

#computer-use #desktop #multimodal #vision #gui-automation
