# GUI-Agent-Skills

Vision-based desktop automation skills for OpenClaw agents on macOS. See, learn, click — any app.

Your AI can finally see the screen — and use it like a human.

Visual memory · One-shot UI learning · Zero hardcoded selectors

English · 中文
## News

- [2026-03-30] ImageContext coordinate system — Replaced the dual-space model with an `ImageContext` class. `detect_all()` now returns image pixel coordinates (no conversion), cropping is scale-independent, and `pixel_scale` comes from `backingScaleFactor` (not `img_size / screen_size`). Fixes component crop bugs on non-fullscreen images. Tests →
- [2026-03-29] v0.3 — Unified Actions & Cross-Platform GUI — `gui_action.py` is the single entry point for all GUI operations. Platform-specific backends (`mac_local.py`, `http_remote.py`) are auto-selected via `--remote`; `activate.py` handles platform detection. OSWorld Multi-Apps: 54.3% (44/81). Results →
- [2026-03-24] Smart workflow navigation — Target state verification with tiered fallback (template match → full detection → LLM). Automatic performance tracking via `detect_all`.
- [2026-03-23] OSWorld benchmark (Chrome) — one attempt: 93.5% (43/46); up to two attempts: 97.8% (45/46). Results →
- [2026-03-23] Memory overhaul — Split storage, automatic component forgetting (15 consecutive misses → removed), state merging by Jaccard similarity.
- [2026-03-22] Unified detection pipeline — `detect_all()` as the single entry point; atomic detect → match → execute → verify loop.
- [2026-03-21] Cross-platform support — GPA-GUI-Detector runs on screenshots from any OS (Linux VMs, remote servers).
- [2026-03-10] Initial release — GPA-GUI-Detector + Apple Vision OCR + template matching + per-app visual memory.
## Skills Overview

GUI Agent Skills is organized as a main skill (SKILL.md) that orchestrates 7 specialized sub-skills, each handling a distinct aspect of GUI automation:

### 7 Skills Powering Visual GUI Automation

| Skill | Description |
|---|---|
| `gui-observe` | Screenshot capture, OCR text extraction, current state identification. The agent's eyes — always runs first, before any action. |
| `gui-learn` | First-contact app learning — detect all UI components via GPA-GUI-Detector, have the VLM label each one, filter duplicates, save to visual memory. |
| `gui-act` | Unified action execution — detect → match → execute → diff → save as one atomic flow. Handles clicks, typing, and all UI interactions. |
| `gui-memory` | Visual memory management — split storage (components/states/transitions), browser site isolation, activity-based forgetting, state merging. |
| `gui-workflow` | State graph navigation and workflow automation — record successful task sequences, replay with tiered verification, BFS path planning. |
| `gui-report` | Task performance tracking — automatic timing, token usage, and success/failure logging for every GUI operation. |
| `gui-setup` | First-time setup on a new machine — install dependencies, download models, configure accessibility permissions. |
The main SKILL.md acts as the orchestration layer: it defines the safety protocol (INTENT → OBSERVE → VERIFY → ACT → CONFIRM → REPORT) and the vision-vs-command boundary, and routes to sub-skills as needed. The agent reads SKILL.md first, then loads sub-skills on demand.
## How It Works

> You: "Send a message to John in WeChat saying see you tomorrow"

```text
OBSERVE → Screenshot, identify current state
├── Current app: Finder (not WeChat)
└── Action: need to switch to WeChat

STATE → Check WeChat memory
├── Learned before? Yes (24 components)
├── OCR visible text: ["Chat", "Cowork", "Code", "Search", ...]
├── State identified: "initial" (89% match)
└── Components for this state: 18 → use these for matching

NAVIGATE → Find contact "John"
├── Template match search_bar → found (conf=0.96) → click
├── Paste "John" into search field (clipboard → Cmd+V)
├── OCR search results → found → click
└── New state: "click:John" (chat opened)

VERIFY → Confirm correct chat opened
├── OCR chat header → "John" ✓
└── Wrong contact? → ABORT

ACT → Send message
├── Click input field (template match)
├── Paste "see you tomorrow" (clipboard → Cmd+V)
└── Press Enter

CONFIRM → Verify message sent
├── OCR chat area → "see you tomorrow" visible ✓
└── Done
```
### More examples

**"Scan my Mac for malware"**

```text
OBSERVE → Screenshot → CleanMyMac X not in foreground → activate
├── Get main window bounds (largest window, skip status bar panels)
└── OCR window content → identify current state

STATE → Check memory for CleanMyMac X
├── OCR visible text: ["Smart Scan", "Malware Removal", "Privacy", ...]
├── State identified: "initial" (92% match)
└── Know which components to match: 21 components

NAVIGATE → Click "Malware Removal" in sidebar
├── Find element in window (exact match, filter by window bounds)
├── Click → new state: "click:Malware_Removal"
└── OCR confirms new state (87% match)

ACT → Click "Scan" button
├── Find "Scan" (exact match, bottom position — prevents matching "Deep Scan")
└── Click → scan starts

POLL → Wait for completion (event-driven, no fixed sleep)
├── Every 2s: screenshot → OCR check for "No threats"
└── Target found → proceed immediately

CONFIRM → "No threats found" ✓
```

**"Check if my GPU training is still running"**

```text
OBSERVE → Screenshot → Chrome is open
└── Identify target: JupyterLab tab

NAVIGATE → Find JupyterLab tab in browser
├── OCR tab bar or use bookmarks
└── Click to switch

EXPLORE → Multiple terminal tabs visible
├── Screenshot terminal area
├── LLM vision analysis → identify which tab has nvitop
└── Click the correct tab

READ → Screenshot terminal content
├── LLM reads GPU utilization table
└── Report: "8 GPUs, 7 at 100% → experiment running" ✓
```

**"Kill GlobalProtect via Activity Monitor"**

```text
OBSERVE → Screenshot current state
└── Neither GlobalProtect nor Activity Monitor in foreground

ACT → Launch both apps
├── open -a "GlobalProtect"
└── open -a "Activity Monitor"

EXPLORE → Screenshot Activity Monitor window
├── LLM vision → "Network tab active, search field empty at top-right"
└── Decide: click search field first

ACT → Search for process
├── Click search field (identified by explore)
├── Paste "GlobalProtect" (clipboard → Cmd+V, never cliclick type)
└── Wait for filter results

VERIFY → Process found in list → select it

ACT → Kill process
├── Click stop button (X) in toolbar
└── Confirmation dialog appears

VERIFY → Click "Force Quit"

CONFIRM → Screenshot → process list empty → terminated ✓
```
## Prerequisites

GUI Agent Skills is an OpenClaw skill — it runs inside OpenClaw and uses OpenClaw's LLM orchestration to reason about UI actions. It is not a standalone API, CLI tool, or Python library. You need:

- OpenClaw installed and running
- macOS with Apple Silicon (recommended) — enables Apple Vision OCR for high-accuracy text detection. Linux is also supported (local, or remote VMs via the HTTP API, e.g., OSWorld).
- Accessibility permissions granted to OpenClaw/Terminal (macOS only)

The LLM (Claude, GPT, etc.) is provided by your OpenClaw configuration — GUI Agent Skills itself does not call any external APIs directly.
## Quick Start

1. Clone & install

```bash
git clone https://github.com/Fzkuji/GUI-Agent-Skills.git
cd GUI-Agent-Skills
bash scripts/setup.sh
```

2. Grant accessibility permissions

System Settings → Privacy & Security → Accessibility → Add Terminal / OpenClaw

3. Configure OpenClaw

Add to `~/.openclaw/openclaw.json`:

```json
{
  "skills": { "entries": { "gui-agent": { "enabled": true } } },
  "tools": { "exec": { "timeoutSec": 300 } }
}
```

> ⚠️ `timeoutSec: 300` is important — GUI Agent Skills operation chains (screenshot → detect → click → wait → verify) can take a while. A 5-minute timeout is recommended; the default is too short and will kill commands mid-execution.

Then just chat with your OpenClaw agent — it reads SKILL.md and handles everything automatically.
## Architecture

GUI Agent Skills transforms GUI agents from stateless (re-perceive everything at every step) to stateful (learn, remember, reuse) through three core mechanisms:

### 1. Unified Component Memory

Problem: existing GUI agents treat every screenshot as a fresh perception task — even on interfaces they've seen hundreds of times before.

When a UI element is first detected, GUI Agent Skills creates a dual representation: a cropped visual template (for fast matching) and a VLM-assigned semantic label (for reasoning). This pair is stored in per-app memory and reused across all future interactions.

Detection and annotation:

- GPA-GUI-Detector (YOLO-based) detects UI components — bounding boxes with coordinates, but no semantic labels
- Apple Vision OCR extracts visible text with precise bounding boxes
- A VLM (Claude, GPT, etc.) assigns a semantic label to each detected element ("Search button", "Settings icon")
- Result: each component carries both a visual template and a semantic label
Template matching and reuse:

- On subsequent screenshots, stored templates are matched via normalized cross-correlation
- Matches are validated against the target application's window bounds (prevents false positives from overlapping apps)
- Matched components carry their previously assigned labels — no VLM call needed
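The matching step can be sketched as a tiny pure-Python zero-mean normalized cross-correlation. This is illustrative only — the function names and toy 2-D pixel lists are made up here, and a real implementation would use an optimized library routine (e.g. OpenCV's `cv2.matchTemplate` with `TM_CCOEFF_NORMED`):

```python
import math

def ncc_score(window, template):
    """Zero-mean normalized cross-correlation of two equal-sized flat patches."""
    n = len(window)
    mw = sum(window) / n
    mt = sum(template) / n
    num = sum((w - mw) * (t - mt) for w, t in zip(window, template))
    den = math.sqrt(sum((w - mw) ** 2 for w in window) *
                    sum((t - mt) ** 2 for t in template))
    return num / den if den else 0.0

def match_template(image, template):
    """Slide the template over a 2-D grayscale image.

    Returns (best_score, (row, col)) of the top-left corner of the best match;
    a score near 1.0 means the stored template was found on screen.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    flat_t = [p for row in template for p in row]
    best = (-1.0, (0, 0))
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            window = [image[r + dr][c + dc] for dr in range(th) for dc in range(tw)]
            score = ncc_score(window, flat_t)
            if score > best[0]:
                best = (score, (r, c))
    return best
```

Because the score is normalized, a confidence threshold (like the `conf=0.96` in the WeChat trace above) is directly comparable across screenshots with different brightness.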
Activity-based forgetting:

- Each component tracks `consecutive_misses` — incremented whenever a full detection cycle fails to re-detect it
- After 15 consecutive misses, the component is automatically removed (the removal cascades through states and transitions)
- This keeps memory aligned with the app's current UI as it changes over time
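A minimal sketch of that forgetting rule — `update_activity` is a hypothetical name, and the real bookkeeping in `app_memory.py` would also cascade removals into states and transitions:

```python
FORGET_THRESHOLD = 15  # consecutive misses before a component is forgotten

def update_activity(components, detected_names):
    """Age out stale components after a full detection cycle.

    components: name -> record dict carrying a "consecutive_misses" counter.
    detected_names: set of component names re-detected this cycle.
    Returns the list of names that were forgotten.
    """
    forgotten = []
    for name, rec in list(components.items()):
        if name in detected_names:
            rec["consecutive_misses"] = 0           # seen again: reset the counter
        else:
            rec["consecutive_misses"] += 1
            if rec["consecutive_misses"] >= FORGET_THRESHOLD:
                del components[name]                # would cascade into states/transitions
                forgotten.append(name)
    return forgotten
```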
```text
memory/apps/
├── wechat/
│   ├── meta.json          # Metadata (detect_count, forget_threshold)
│   ├── components.json    # Component registry + activity tracking
│   ├── states.json        # States defined by component sets
│   ├── transitions.json   # State transitions (dict, deduped)
│   ├── components/        # Cropped UI element images
│   │   ├── search_bar.png
│   │   └── emoji_button.png
│   └── workflows/         # Saved task sequences
└── chromium/
    ├── components.json    # Browser UI components
    └── sites/             # Per-website memory (same structure)
        ├── united.com/
        ├── delta.com/
        └── amazon.com/
```
### 2. Component-Based State Transition Modeling

Problem: knowing "what's on screen" isn't enough — the agent also needs to know "what happens when I click X."

The UI is modeled as a directed graph of states, where each state is defined by a set of visible components.

State definition and matching:

- A state `s = {c1, c2, ..., cn}` is the set of components currently on screen
- States are matched using Jaccard similarity: `J(s, s') = |s ∩ s'| / |s ∪ s'|`
- Match threshold > 0.7 → identifies the current state
- Merge threshold > 0.85 → similar states auto-merge (prevents state explosion)
Transition recording with pending-confirm validation:

- Each click records a transition tuple: `(state_before, component_clicked, state_after)`
- Transitions are not immediately committed — they accumulate as pending
- Only when a task succeeds are all pending transitions confirmed and written to the graph
- On failure, all pending transitions are discarded (prevents exploratory clicks from polluting the graph)
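The pending-confirm rule can be sketched in a few lines (the class name is hypothetical):

```python
class TransitionLog:
    """Buffer transitions during a task; commit on success, discard on failure."""

    def __init__(self, graph):
        self.graph = graph          # {state: {component: next_state}}
        self.pending = []

    def record(self, before, component, after):
        self.pending.append((before, component, after))

    def finish(self, success):
        if success:
            for before, comp, after in self.pending:
                # dict assignment dedupes repeated (state, component) pairs
                self.graph.setdefault(before, {})[comp] = after
        self.pending.clear()        # on failure, everything is simply dropped
```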
BFS path planning:

- The accumulated transitions form a directed graph `G = (S, E)`
- Given the current state `s_c` and a target state `s_t`, BFS finds the shortest action sequence
- This enables direct navigation to any previously visited state without re-exploration
- No path exists? → fall back to exploration mode with VLM reasoning
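With transitions stored as nested dicts (the transitions.json format in this section), path planning is a textbook breadth-first search — a sketch:

```python
from collections import deque

def plan_path(graph, start, goal):
    """BFS over {state: {component: next_state}}.

    Returns the shortest list of components to click to reach goal from
    start, or None when no path exists (→ fall back to exploration mode).
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for component, nxt in graph.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [component]))
    return None
```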
```jsonc
// states.json
{
  "state_0": {
    "defining_components": ["Chat_tab", "Cowork_tab", "Search", "Ideas"],
    "description": "Main app view"
  },
  "state_1": {
    "defining_components": ["Chat_tab", "Account", "Billing", "Usage"],
    "description": "Settings page"
  }
}
```

```jsonc
// transitions.json — click Settings in state_0 → arrive at state_1
{
  "state_0": { "Settings": "state_1" },
  "state_1": { "Chat_tab": "state_0" }
}
```
### 3. Progressive Visual-to-Semantic Grounding

Problem: VLMs hallucinate coordinates. Every existing GUI agent asks the VLM to estimate pixel positions — leading to misclicks and cascading failures.

GUI Agent Skills progressively shifts from image-level to text-level grounding as memory accumulates.

Phase 1 — Image-level grounding (unfamiliar interfaces):

- The detector provides bounding boxes; OCR extracts text
- The VLM receives the full screenshot to understand the scene
- The VLM decides which element to interact with
- Components are annotated and saved to memory
- This expensive process happens only once per component
Phase 2 — Text-level grounding (familiar interfaces):

- Template matching identifies known components on screen
- The VLM receives a list of component names (e.g., `[Search, Settings, Profile, Chat]`) — not a screenshot
- The VLM selects a target by name (e.g., "click Settings")
- The system resolves the name to precise coordinates via the stored template
- The VLM never estimates pixel positions
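The name-to-coordinates step can then be as simple as a memory lookup. A sketch with a hypothetical record shape — in the real skill the bounding box would be refreshed by template matching just before the click:

```python
def resolve_click(memory, chosen_name):
    """The VLM picks a component *by name*; coordinates come from memory,
    never from the model. Returns the click point (x, y)."""
    rec = memory.get(chosen_name)
    if rec is None:
        # unknown name -> fall back to Phase 1 (full detection + VLM)
        raise KeyError(f"unknown component {chosen_name!r}")
    x, y, w, h = rec["bbox"]           # box last confirmed by template matching
    return (x + w // 2, y + h // 2)    # click the center of the matched box
```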
Why this matters:

- No coordinate hallucination — coordinates come exclusively from template matching
- No redundant visual processing — familiar interfaces are handled in pure text space
- Decreasing cost over time — as memory grows, more interactions use text-level grounding, reducing both latency (~5.3× faster) and token consumption (~60–100× fewer tokens per step)
Hierarchical verification during workflow execution:
| Level | Method | Speed | When |
|---|---|---|---|
| Level 0 | Template match target component | ~0.3s | Default first check |
| Level 1 | Full detection + state identification | ~2s | Level 0 fails or ambiguous |
| Level 2 | VLM vision fallback | ~5s+ | Level 1 can't determine state |
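The three levels compose into a simple escalation chain. A sketch where each level is a callable returning `True`/`False`, or `None` when that level is inconclusive (names are illustrative):

```python
def verify_state(expected, template_check, full_detect, vlm_check):
    """Run cheap checks first; escalate only when a level is inconclusive.

    Returns (verified: bool, level_used: int).
    """
    result = template_check(expected)       # Level 0: ~0.3s template match
    if result is not None:
        return result, 0
    result = full_detect(expected)          # Level 1: ~2s full detection + state id
    if result is not None:
        return result, 1
    return bool(vlm_check(expected)), 2     # Level 2: ~5s+ VLM vision fallback
```

In the common case the ~0.3s template match settles the question, so the expensive VLM call is reserved for genuinely ambiguous screens.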
### Detection Stack
| Detector | Speed | Finds |
|---|---|---|
| GPA-GUI-Detector | ~0.3s | Icons, buttons, input fields |
| Apple Vision OCR | ~1.6s | Text elements (CN + EN) |
| Template Match | ~0.3s | Known components (after first learn) |
## Vision vs Command

GUI Agent Skills uses visual detection for decisions and the most efficient method for execution:

| | Must be vision-based | May use keyboard/CLI |
|---|---|---|
| What | Determining state, locating elements, verifying results | Shortcuts (Ctrl+L), text input, system commands |
| Why | The agent must SEE what's on screen before acting | Execution can use the fastest available method |
| Rule | Decision = Visual | Execution = Best Tool |
### Three Visual Methods

| Method | Returns | Use for |
|---|---|---|
| OCR (`detect_text`) | Text + coordinates ✓ | Finding text labels, links, menu items |
| GPA-GUI-Detector (`detect_icons`) | Bounding boxes + coordinates ✓ (no labels) | Finding icons, buttons, non-text elements |
| `image` tool (LLM vision) | Semantic understanding — NO coordinates | Understanding the scene, deciding WHAT to click |
## Safety & Protocol

Every action follows a unified detect-match-execute-save protocol:

| Step | What | Why |
|---|---|---|
| DETECT | Screenshot + OCR + GPA-GUI-Detector | Know what's on screen, with coordinates |
| MATCH | Compare against saved memory components | Reuse learned elements (skip re-detection) |
| DECIDE | LLM picks the target element | Visual understanding drives decisions |
| EXECUTE | Click detected coordinates / keyboard shortcut | Act using the best tool |
| DETECT AGAIN | Screenshot + OCR + GPA-GUI-Detector after the action | See what changed |
| DIFF | Compare before vs. after (appeared/disappeared/persisted) | Understand the state transition |
| SAVE | Update memory: components, labels, transitions, pages | Learn for future reuse |
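The DIFF step boils down to set arithmetic over the component names seen before and after the action — a sketch:

```python
def diff_states(before, after):
    """Describe a state transition from two detection snapshots.

    before/after: iterables of component names visible in each screenshot.
    """
    before, after = set(before), set(after)
    return {
        "appeared": sorted(after - before),       # new UI, e.g. a dialog opened
        "disappeared": sorted(before - after),    # UI that went away
        "persisted": sorted(before & after),      # stable chrome around the change
    }
```

An empty `appeared`/`disappeared` pair after a click is itself a useful signal: the action probably had no effect and should be retried or escalated.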
Safety rules enforced in code:

- ✓ Verify the chat recipient before sending messages (OCR the header)
- ✓ Window-bounded operations (no clicking outside the target app)
- ✓ Exact text matching (prevents "Scan" matching "Deep Scan")
- ✓ Largest-window detection (skips status bar panels for multi-window apps)
- ✓ No blind clicks after a timeout — screenshot and inspect instead
- ✓ Mandatory timing and token-delta reporting after every task
## Project Structure

```text
GUI-Agent-Skills/
├── SKILL.md                     # Main skill — orchestration layer:
│                                # safety protocol, vision-vs-command boundary,
│                                # routes to sub-skills as needed
├── skills/                      # Sub-skills (7 specialized modules)
│   ├── gui-observe/SKILL.md     # Screenshot, OCR, identify state
│   ├── gui-learn/SKILL.md       # Detect components, label, filter, save
│   ├── gui-act/SKILL.md         # Unified: detect → match → execute → diff → save
│   ├── gui-memory/SKILL.md      # Memory structure, browser sites/, cleanup
│   ├── gui-workflow/SKILL.md    # State graph navigation, workflow replay
│   ├── gui-report/SKILL.md      # Task performance tracking
│   └── gui-setup/SKILL.md       # First-time setup on a new machine
├── scripts/
│   ├── setup.sh                 # One-command setup
│   ├── activate.py              # Platform detection — detects OS, prints platform info
│   ├── gui_action.py            # Unified GUI action interface (click/type/key/screenshot);
│   │                            # auto-selects backend: mac_local or http_remote via --remote
│   ├── backends/                # Platform-specific backends
│   │   ├── mac_local.py         # macOS: cliclick + AppleScript
│   │   └── http_remote.py       # Remote VMs: pyautogui via HTTP API (e.g., OSWorld)
│   ├── ui_detector.py           # Detection engine (GPA-GUI-Detector + OCR + Swift window info)
│   ├── app_memory.py            # Visual memory (learn/detect/click/verify/learn_site)
│   └── template_match.py        # Template matching utilities
├── memory/                      # Visual memory (gitignored but ESSENTIAL)
│   └── apps/<appname>/          # Per-app memory:
│       ├── meta.json            # Metadata (detect_count, forget_threshold)
│       ├── components.json      # Component registry + activity tracking
│       ├── states.json          # States defined by component sets
│       ├── transitions.json     # State transitions (dict, deduped)
│       ├── components/          # Template images
│       ├── pages/               # Page screenshots
│       └── sites/<domain>/      # Per-website memory (browsers only, same structure)
├── platforms/                   # Platform-specific guides & detection
│   ├── detect.py                # Platform auto-detection script
│   ├── macos.md                 # macOS-specific tips & workarounds
│   ├── linux.md                 # Linux-specific tips & workarounds
│   └── DESIGN.md                # Cross-platform architecture design
├── benchmarks/osworld/          # OSWorld benchmark results
├── assets/                      # Architecture diagrams, banners
├── actions/
│   ├── _actions_macos.yaml      # macOS-specific action definitions
│   └── _actions_linux.yaml      # Linux-specific action definitions
├── docs/
│   ├── core.md                  # Lessons learned & hard-won rules
│   └── README_CN.md             # Chinese documentation
├── LICENSE                      # MIT
└── requirements.txt
```
## Requirements

- macOS with Apple Silicon (M1/M2/M3/M4) — for local GUI automation
- Linux (Ubuntu 22.04+) — for remote VM automation via the HTTP API
- Accessibility permissions (macOS only): System Settings → Privacy → Accessibility
- Everything else is installed by `bash scripts/setup.sh`
## Ecosystem

| Project | Role |
|---|---|
| OpenClaw | AI assistant framework — loads GUI Agent Skills as a skill |
| GPA-GUI-Detector | Salesforce/GPA-GUI-Detector — general-purpose UI element detection model |
| Discord Community | Get help, share feedback |
## License

MIT — see LICENSE for details.
## Citation

If you find GUI Agent Skills useful in your research, please cite:

```bibtex
@misc{fu2026gui-agent-skills,
  author    = {Fu, Zichuan},
  title     = {GUI Agent Skills: Visual Memory-Driven GUI Automation for macOS},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/Fzkuji/GUI-Agent-Skills},
}
```
## Star History

Built by the GUI Agent Skills team · Powered by OpenClaw