🦞 Vision-based desktop automation skills for OpenClaw agents on macOS. See, learn, click — any app.

GUI Agent Skills

Your AI can finally see the screen — and use it like a human.
Visual memory • One-shot UI learning • Zero hardcoded selectors


🇺🇸 English · 🇨🇳 中文


🔥 News

  • [2026-03-30] 📐 ImageContext coordinate system — Replaced the dual-space model with an ImageContext class. detect_all() now returns image-pixel coordinates (no conversion). Cropping is scale-independent. pixel_scale comes from backingScaleFactor (not img_size/screen_size). Fixes component-crop bugs on non-fullscreen images. Tests →
  • [2026-03-29] 🎬 v0.3 — Unified Actions & Cross-Platform GUI — gui_action.py is now the single entry point for all GUI operations. Platform-specific backends (mac_local.py, http_remote.py) are auto-selected via --remote. activate.py handles platform detection. OSWorld Multi-Apps: 54.3% (44/81). Results →
  • [2026-03-24] 🧠 Smart workflow navigation — Target-state verification with tiered fallback (template match → full detection → LLM). Automatic performance tracking via detect_all.
  • [2026-03-23] 🏆 OSWorld benchmark (Chrome) — one attempt: 93.5% (43/46); up to two attempts: 97.8% (45/46). Results →
  • [2026-03-23] 🔄 Memory overhaul — Split storage, automatic component forgetting (15 consecutive misses → removed), state merging by Jaccard similarity.
  • [2026-03-22] 🔍 Unified detection pipeline — detect_all() as the single entry point; atomic detect → match → execute → verify loop.
  • [2026-03-21] 🌐 Cross-platform support — GPA-GUI-Detector runs on screenshots from any OS (Linux VMs, remote servers).
  • [2026-03-10] 🚀 Initial release — GPA-GUI-Detector + Apple Vision OCR + template matching + per-app visual memory.

📖 Skills Overview

GUI Agent Skills is organized as a main skill (SKILL.md) that orchestrates 7 specialized sub-skills, each handling a distinct aspect of GUI automation:

7 Skills Powering Visual GUI Automation

| Skill | Description |
|---|---|
| 👁️ gui-observe | Screenshot capture, OCR text extraction, current-state identification. The agent's eyes — always runs first, before any action. |
| 🎓 gui-learn | First-contact app learning — detect all UI components via GPA-GUI-Detector, have the VLM label each one, filter duplicates, save to visual memory. |
| 🖱️ gui-act | Unified action execution — detect → match → execute → diff → save as one atomic flow. Handles clicks, typing, and all UI interactions. |
| 💾 gui-memory | Visual memory management — split storage (components/states/transitions), browser site isolation, activity-based forgetting, state merging. |
| 🔄 gui-workflow | State-graph navigation and workflow automation — record successful task sequences, replay with tiered verification, plan paths with BFS. |
| 📊 gui-report | Task performance tracking — automatic timing, token usage, and success/failure logging for every GUI operation. |
| ⚙️ gui-setup | First-time setup on a new machine — install dependencies, download models, configure accessibility permissions. |

The main SKILL.md acts as the orchestration layer: it defines the safety protocol (INTENT → OBSERVE → VERIFY → ACT → CONFIRM → REPORT) and the vision-vs-command boundary, and routes to sub-skills as needed. The agent reads SKILL.md first, then loads sub-skills on demand.

🔄 How It Works

You: "Send a message to John in WeChat saying see you tomorrow"

OBSERVE  → Screenshot, identify current state
           ├── Current app: Finder (not WeChat)
           └── Action: need to switch to WeChat

STATE    → Check WeChat memory
           ├── Learned before? Yes (24 components)
           ├── OCR visible text: ["Chat", "Cowork", "Code", "Search", ...]
           ├── State identified: "initial" (89% match)
           └── Components for this state: 18 → use these for matching

NAVIGATE → Find contact "John"
           ├── Template match search_bar → found (conf=0.96) → click
           ├── Paste "John" into search field (clipboard → Cmd+V)
           ├── OCR search results → found → click
           └── New state: "click:John" (chat opened)

VERIFY   → Confirm correct chat opened
           ├── OCR chat header → "John" ✅
           └── Wrong contact? → ABORT

ACT      → Send message
           ├── Click input field (template match)
           ├── Paste "see you tomorrow" (clipboard → Cmd+V)
           └── Press Enter

CONFIRM  → Verify message sent
           ├── OCR chat area → "see you tomorrow" visible ✅
           └── Done

📖 More examples

"Scan my Mac for malware"

OBSERVE  β†’ Screenshot β†’ CleanMyMac X not in foreground β†’ activate
           β”œβ”€β”€ Get main window bounds (largest window, skip status bar panels)
           └── OCR window content β†’ identify current state

STATE    β†’ Check memory for CleanMyMac X
           β”œβ”€β”€ OCR visible text: ["Smart Scan", "Malware Removal", "Privacy", ...]
           β”œβ”€β”€ State identified: "initial" (92% match)
           └── Know which components to match: 21 components

NAVIGATE β†’ Click "Malware Removal" in sidebar
           β”œβ”€β”€ Find element in window (exact match, filter by window bounds)
           β”œβ”€β”€ Click β†’ new state: "click:Malware_Removal"
           └── OCR confirms new state (87% match)

ACT      β†’ Click "Scan" button
           β”œβ”€β”€ Find "Scan" (exact match, bottom position β€” prevents matching "Deep Scan")
           └── Click β†’ scan starts

POLL     β†’ Wait for completion (event-driven, no fixed sleep)
           β”œβ”€β”€ Every 2s: screenshot β†’ OCR check for "No threats"
           └── Target found β†’ proceed immediately

CONFIRM  β†’ "No threats found" βœ…

"Check if my GPU training is still running"

OBSERVE  β†’ Screenshot β†’ Chrome is open
           └── Identify target: JupyterLab tab

NAVIGATE β†’ Find JupyterLab tab in browser
           β”œβ”€β”€ OCR tab bar or use bookmarks
           └── Click to switch

EXPLORE  β†’ Multiple terminal tabs visible
           β”œβ”€β”€ Screenshot terminal area
           β”œβ”€β”€ LLM vision analysis β†’ identify which tab has nvitop
           └── Click the correct tab

READ     β†’ Screenshot terminal content
           β”œβ”€β”€ LLM reads GPU utilization table
           └── Report: "8 GPUs, 7 at 100% β€” experiment running" βœ…

"Kill GlobalProtect via Activity Monitor"

OBSERVE  β†’ Screenshot current state
           └── Neither GlobalProtect nor Activity Monitor in foreground

ACT      β†’ Launch both apps
           β”œβ”€β”€ open -a "GlobalProtect"
           └── open -a "Activity Monitor"

EXPLORE  β†’ Screenshot Activity Monitor window
           β”œβ”€β”€ LLM vision β†’ "Network tab active, search field empty at top-right"
           └── Decide: click search field first

ACT      β†’ Search for process
           β”œβ”€β”€ Click search field (identified by explore)
           β”œβ”€β”€ Paste "GlobalProtect" (clipboard β†’ Cmd+V, never cliclick type)
           └── Wait for filter results

VERIFY   β†’ Process found in list β†’ select it

ACT      β†’ Kill process
           β”œβ”€β”€ Click stop button (X) in toolbar
           └── Confirmation dialog appears

VERIFY   β†’ Click "Force Quit"

CONFIRM  β†’ Screenshot β†’ process list empty β†’ terminated βœ…

📋 Prerequisites

GUI Agent Skills is an OpenClaw skill — it runs inside OpenClaw and uses OpenClaw's LLM orchestration to reason about UI actions. It is not a standalone API, CLI tool, or Python library. You need:

  1. OpenClaw installed and running
  2. macOS with Apple Silicon (recommended) — enables Apple Vision OCR for high-accuracy text detection. Linux is also supported (local, or remote VMs via HTTP API, e.g., OSWorld).
  3. Accessibility permissions granted to OpenClaw/Terminal (macOS only)

The LLM (Claude, GPT, etc.) is provided by your OpenClaw configuration — GUI Agent Skills itself does not call any external APIs directly.

🚀 Quick Start

1. Clone & install

git clone https://github.com/Fzkuji/GUI-Agent-Skills.git
cd GUI-Agent-Skills
bash scripts/setup.sh

2. Grant accessibility permissions

System Settings → Privacy & Security → Accessibility → Add Terminal / OpenClaw

3. Configure OpenClaw

Add to ~/.openclaw/openclaw.json:

{
  "skills": { "entries": { "gui-agent": { "enabled": true } } },
  "tools": { "exec": { "timeoutSec": 300 } }
}

⚠️ timeoutSec: 300 is important — GUI Agent Skills operation chains (screenshot → detect → click → wait → verify) can take a while. A 5-minute timeout is recommended; the default is too short and will kill commands mid-execution.

Then just chat with your OpenClaw agent — it reads SKILL.md and handles everything automatically.

πŸ—οΈ Architecture

GUI Agent Skills Architecture

GUI Agent Skills transforms GUI agents from stateless (re-perceive everything every step) to stateful (learn, remember, reuse) through three core mechanisms:

1. Unified Component Memory

Problem: Existing GUI agents treat every screenshot as a fresh perception task — even on interfaces they have seen hundreds of times before.

When a UI element is first detected, GUI Agent Skills creates a dual representation: a cropped visual template (for fast matching) and a VLM-assigned semantic label (for reasoning). This pair is stored in per-app memory and reused across all future interactions.

Detection and annotation:

  • GPA-GUI-Detector (YOLO-based) detects UI components → bounding boxes with coordinates, but no semantic labels
  • Apple Vision OCR extracts visible text with precise bounding boxes
  • The VLM (Claude, GPT, etc.) assigns a semantic label to each detected element ("Search button", "Settings icon")
  • Result: each component carries both a visual template and a semantic label

Template matching and reuse:

  • On subsequent screenshots, stored templates are matched via normalized cross-correlation
  • Matches are validated against the target application's window bounds (prevents false positives from overlapping apps)
  • Matched components carry their previously assigned labels — no VLM call needed

Activity-based forgetting:

  • Each component tracks consecutive_misses — incremented when a full detection cycle fails to re-detect it
  • After 15 consecutive misses, the component is automatically removed (the removal cascades through states and transitions)
  • Keeps memory aligned with the app's current UI as it updates over time

memory/apps/
├── wechat/
│   ├── meta.json              # Metadata (detect_count, forget_threshold)
│   ├── components.json        # Component registry + activity tracking
│   ├── states.json            # States defined by component sets
│   ├── transitions.json       # State transitions (dict, deduped)
│   ├── components/            # Cropped UI element images
│   │   ├── search_bar.png
│   │   └── emoji_button.png
│   └── workflows/             # Saved task sequences
├── chromium/
│   ├── components.json        # Browser UI components
│   └── sites/                 # ⭐ Per-website memory (same structure)
│       ├── united.com/
│       ├── delta.com/
│       └── amazon.com/
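
The forgetting rule is simple bookkeeping over the component registry. A minimal sketch, assuming each components.json entry carries a consecutive_misses counter as described (update_activity is an illustrative name):

```python
def update_activity(components: dict, detected: set, threshold: int = 15) -> list:
    """After a full detection cycle, reset the miss counter for components that
    were re-detected and increment it for the rest; drop any component whose
    counter reaches `threshold`. Returns the names of removed components."""
    removed = []
    for name, meta in list(components.items()):
        if name in detected:
            meta["consecutive_misses"] = 0
        else:
            meta["consecutive_misses"] = meta.get("consecutive_misses", 0) + 1
            if meta["consecutive_misses"] >= threshold:
                removed.append(name)
                del components[name]
    return removed
```

In the real system the returned names would also be purged from states.json and transitions.json, which is the cascading removal mentioned above.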

2. Component-Based State Transition Modeling

Problem: Knowing "what's on screen" isn't enough — the agent also needs to know "what happens when I click X."

The UI is modeled as a directed graph of states, where each state is defined by a set of visible components.

State definition and matching:

  • A state s = {c₁, c₂, ..., cₙ} is the set of components currently on screen
  • States are matched using Jaccard similarity: J(s, s') = |s ∩ s'| / |s ∪ s'|
  • Match threshold > 0.7 → identifies the current state
  • Merge threshold > 0.85 → similar states auto-merge (prevents state explosion)

Transition recording with pending-confirm validation:

  • Each click records a transition tuple: (state_before, component_clicked, state_after)
  • Transitions are not committed immediately — they accumulate as pending
  • Only when a task succeeds are all pending transitions confirmed and written to the graph
  • On failure, all pending transitions are discarded (prevents exploratory clicks from polluting the graph)

BFS path planning:

  • The accumulated transitions form a directed graph G = (S, E)
  • Given the current state sᶜ and a target state sᵗ, BFS finds the shortest action sequence
  • Enables direct navigation to any previously visited state without re-exploration
  • No path exists? → falls back to exploration mode with VLM reasoning
// states.json
{
  "state_0": {
    "defining_components": ["Chat_tab", "Cowork_tab", "Search", "Ideas"],
    "description": "Main app view"
  },
  "state_1": {
    "defining_components": ["Chat_tab", "Account", "Billing", "Usage"],
    "description": "Settings page"
  }
}

// transitions.json — click Settings in state_0 → arrive at state_1
{
  "state_0": { "Settings": "state_1" },
  "state_1": { "Chat_tab": "state_0" }
}
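
Given transitions in the shape above, BFS path planning reduces to a standard queue traversal (plan_path is an illustrative name):

```python
from collections import deque

def plan_path(transitions: dict, start: str, goal: str):
    """BFS over the transition graph; returns the shortest list of components
    to click to reach `goal`, or None when no known path exists."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for component, nxt in transitions.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [component]))
    return None  # no known path: fall back to exploration mode
```

A None result is where the agent would drop back to VLM-driven exploration.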

3. Progressive Visual-to-Semantic Grounding

Problem: VLMs hallucinate coordinates. Most existing GUI agents ask the VLM to estimate pixel positions — leading to misclicks and cascading failures.

GUI Agent Skills progressively shifts from image-level to text-level grounding as memory accumulates:

Phase 1 — Image-level grounding (unfamiliar interfaces):

  • The detector provides bounding boxes; OCR extracts text
  • The VLM receives the full screenshot to understand the scene
  • The VLM decides which element to interact with
  • Components are annotated and saved to memory
  • This expensive process happens only once per component

Phase 2 — Text-level grounding (familiar interfaces):

  • Template matching identifies known components on screen
  • The VLM receives a list of component names (e.g., [Search, Settings, Profile, Chat]) — not a screenshot
  • The VLM selects a target by name (e.g., "click Settings")
  • The system resolves the name to precise coordinates via the stored template
  • The VLM never estimates pixel positions

Why this matters:

  1. No coordinate hallucination — coordinates come exclusively from template matching
  2. No redundant visual processing — familiar interfaces are handled in pure text space
  3. Decreasing cost over time — as memory grows, more interactions use text-level grounding, reducing both latency (~5.3× faster) and token consumption (~60-100× fewer tokens per step)

Hierarchical verification during workflow execution:

| Level | Method | Speed | When |
|---|---|---|---|
| Level 0 | Template-match the target component | ~0.3s | Default first check |
| Level 1 | Full detection + state identification | ~2s | Level 0 fails or is ambiguous |
| Level 2 | VLM vision fallback | ~5s+ | Level 1 can't determine the state |
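
The tiered fallback is a short-circuit chain: each level runs only when the cheaper one fails. A minimal sketch, where the three callables stand in for the template-match, full-detection, and VLM layers (all names are illustrative):

```python
def verify_state(expected: str, template_check, full_detect, vlm_check):
    """Tiered verification. Returns (level_used, verified).

    template_check(expected) -> bool   # Level 0: cheap template match (~0.3s)
    full_detect() -> state name | None # Level 1: full detection (~2s)
    vlm_check(expected) -> bool        # Level 2: VLM vision fallback (~5s+)
    """
    if template_check(expected):            # Level 0: fast path
        return "level0", True
    state = full_detect()                   # Level 1: re-detect everything
    if state is not None:
        return "level1", state == expected
    return "level2", vlm_check(expected)    # Level 2: last resort
```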

Detection Stack

| Detector | Speed | Finds |
|---|---|---|
| GPA-GUI-Detector | ~0.3s | Icons, buttons, input fields |
| Apple Vision OCR | ~1.6s | Text elements (CN + EN) |
| Template match | ~0.3s | Known components (after first learn) |

🔴 Vision vs Command

GUI Agent Skills uses visual detection for decisions and the most efficient method for execution:

|  | Must be vision-based | May use keyboard/CLI |
|---|---|---|
| What | Determining state, locating elements, verifying results | Shortcuts (Ctrl+L), text input, system commands |
| Why | The agent must SEE what's on screen before acting | Execution can use the fastest available method |
| Rule | Decision = Visual | Execution = Best Tool |

Three Visual Methods

| Method | Returns | Use for |
|---|---|---|
| OCR (detect_text) | Text + coordinates ✅ | Finding text labels, links, menu items |
| GPA-GUI-Detector (detect_icons) | Bounding boxes + coordinates ✅ (no labels) | Finding icons, buttons, non-text elements |
| image tool (LLM vision) | Semantic understanding ⛔ NO coordinates | Understanding the scene, deciding WHAT to click |

πŸ›‘οΈ Safety & Protocol

Every action follows a unified detect-match-execute-save protocol:

Step What Why
DETECT Screenshot + OCR + GPA-GUI-Detector Know what's on screen with coordinates
MATCH Compare against saved memory components Reuse learned elements (skip re-detection)
DECIDE LLM picks target element Visual understanding drives decisions
EXECUTE Click detected coordinates / keyboard shortcut Act using best tool
DETECT AGAIN Screenshot + OCR + GPA-GUI-Detector after action See what changed
DIFF Compare before vs after (appeared/disappeared/persisted) Understand state transition
SAVE Update memory: components, labels, transitions, pages Learn for future reuse
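
The DIFF step is a plain set comparison over component names. An illustrative sketch (diff_states is not the project's API):

```python
def diff_states(before: set, after: set) -> dict:
    """Classify components across an action: appeared / disappeared / persisted."""
    return {
        "appeared": sorted(after - before),      # new UI elements the action revealed
        "disappeared": sorted(before - after),   # elements the action dismissed
        "persisted": sorted(before & after),     # elements present in both states
    }
```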

Safety rules enforced in code:

  • ✅ Verify the chat recipient before sending messages (OCR the header)
  • ✅ Window-bounded operations (no clicking outside the target app)
  • ✅ Exact text matching (prevents "Scan" from matching "Deep Scan")
  • ✅ Largest-window detection (skips status-bar panels for multi-window apps)
  • ✅ No blind clicks after a timeout — screenshot and inspect instead
  • ✅ Mandatory timing and token-delta reporting after every task

πŸ—‚οΈ Project Structure

GUI-Agent-Skills/
β”œβ”€β”€ SKILL.md                   # 🧠 Main skill β€” orchestration layer
β”‚                              #    Safety protocol, vision-vs-command boundary,
β”‚                              #    routes to sub-skills as needed
β”œβ”€β”€ skills/                    # πŸ“– Sub-skills (7 specialized modules)
β”‚   β”œβ”€β”€ gui-observe/SKILL.md   #   πŸ‘οΈ Screenshot, OCR, identify state
β”‚   β”œβ”€β”€ gui-learn/SKILL.md     #   πŸŽ“ Detect components, label, filter, save
β”‚   β”œβ”€β”€ gui-act/SKILL.md       #   πŸ–±οΈ Unified: detectβ†’matchβ†’executeβ†’diffβ†’save
β”‚   β”œβ”€β”€ gui-memory/SKILL.md    #   πŸ’Ύ Memory structure, browser sites/, cleanup
β”‚   β”œβ”€β”€ gui-workflow/SKILL.md  #   πŸ”„ State graph navigation, workflow replay
β”‚   β”œβ”€β”€ gui-report/SKILL.md    #   πŸ“Š Task performance tracking
β”‚   └── gui-setup/SKILL.md     #   βš™οΈ First-time setup on a new machine
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ setup.sh               # πŸ”§ One-command setup
β”‚   β”œβ”€β”€ activate.py            # 🌐 Platform detection β€” detects OS, prints platform info
β”‚   β”œβ”€β”€ gui_action.py          # 🎯 Unified GUI action interface (click/type/key/screenshot)
β”‚   β”‚                          #    Auto-selects backend: mac_local or http_remote via --remote
β”‚   β”œβ”€β”€ backends/              # πŸ”Œ Platform-specific backends
β”‚   β”‚   β”œβ”€β”€ mac_local.py       #     macOS: cliclick + AppleScript
β”‚   β”‚   └── http_remote.py     #     Remote VMs: pyautogui via HTTP API (e.g., OSWorld)
β”‚   β”œβ”€β”€ ui_detector.py         # πŸ” Detection engine (GPA-GUI-Detector + OCR + Swift window info)
β”‚   β”œβ”€β”€ app_memory.py          # 🧠 Visual memory (learn/detect/click/verify/learn_site)
β”‚   └── template_match.py      # 🎯 Template matching utilities
β”œβ”€β”€ memory/                    # πŸ”’ Visual memory (gitignored but ESSENTIAL)
β”‚   β”œβ”€β”€ apps/<appname>/        #   Per-app memory:
β”‚   β”‚   β”œβ”€β”€ meta.json          #     Metadata (detect_count, forget_threshold)
β”‚   β”‚   β”œβ”€β”€ components.json    #     Component registry + activity tracking
β”‚   β”‚   β”œβ”€β”€ states.json        #     States defined by component sets
β”‚   β”‚   β”œβ”€β”€ transitions.json   #     State transitions (dict, deduped)
β”‚   β”‚   β”œβ”€β”€ components/        #     Template images
β”‚   β”‚   β”œβ”€β”€ pages/             #     Page screenshots
β”‚   β”‚   └── sites/<domain>/    #   Per-website memory (browsers only, same structure)
β”œβ”€β”€ platforms/                  # 🌐 Platform-specific guides & detection
β”‚   β”œβ”€β”€ detect.py              #     Platform auto-detection script
β”‚   β”œβ”€β”€ macos.md               #     macOS-specific tips & workarounds
β”‚   β”œβ”€β”€ linux.md               #     Linux-specific tips & workarounds
β”‚   └── DESIGN.md              #     Cross-platform architecture design
β”œβ”€β”€ benchmarks/osworld/        # πŸ“ˆ OSWorld benchmark results
β”œβ”€β”€ assets/                    # 🎨 Architecture diagrams, banners
β”œβ”€β”€ actions/
β”‚   β”œβ”€β”€ _actions_macos.yaml    # πŸ“‹ macOS-specific action definitions
β”‚   └── _actions_linux.yaml    # πŸ“‹ Linux-specific action definitions
β”œβ”€β”€ docs/
β”‚   β”œβ”€β”€ core.md                # πŸ“š Lessons learned & hard-won rules
β”‚   └── README_CN.md           # πŸ‡¨πŸ‡³ δΈ­ζ–‡ζ–‡ζ‘£
β”œβ”€β”€ LICENSE                    # πŸ“„ MIT
└── requirements.txt

📦 Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4) — for local GUI automation
  • Linux (Ubuntu 22.04+) — for remote VM automation via HTTP API
  • Accessibility permissions (macOS only): System Settings → Privacy → Accessibility
  • Everything else is installed by bash scripts/setup.sh

🤝 Ecosystem

  • 🦞 OpenClaw — AI assistant framework; loads GUI Agent Skills as a skill
  • 🔍 GPA-GUI-Detector (Salesforce/GPA-GUI-Detector) — general-purpose UI element detection model
  • 💬 Discord Community — get help, share feedback

📄 License

MIT — see LICENSE for details.


📌 Citation

If you find GUI Agent Skills useful in your research, please cite:

@misc{fu2026gui-agent-skills,
  author       = {Fu, Zichuan},
  title        = {GUI Agent Skills: Visual Memory-Driven GUI Automation for macOS},
  year         = {2026},
  publisher    = {GitHub},
  url          = {https://github.com/Fzkuji/GUI-Agent-Skills},
}

⭐ Star History

Star History Chart

Built with 🦞 by the GUI Agent Skills team · Powered by OpenClaw
