clawdcursor

Security Audit: Fail

Health: Pass
  • License — MIT
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Community trust — 189 GitHub stars

Code: Fail
  • rm -rf — Recursive force deletion command in docs/install.sh

Permissions: Pass
  • Permissions — No dangerous permissions requested


SUMMARY

OS-agnostic, model-agnostic desktop automation server. Gives any AI agent eyes, hands, and ground-truth verification on Windows, macOS, and Linux.

README.md

Clawd Cursor

OS-level desktop automation server. Gives any AI model eyes, hands, and ears on a real computer.
Model-agnostic · Works with Claude, GPT, Gemini, Llama, or any tool-calling model · Free with local models


Website · Discord · Quick Start · Connect · API · Changelog


What's New in v0.8.0 — V2 Architecture

A vision-first alternative to the legacy cascade; opt in with --v2:

clawdcursor start --v2
  • Ground-truth verifier — six independent signals (pixel diff, window state, focus change, OCR delta, task-type assertions, error-pattern detection). Independent of the agent, so it can't be fooled by "done" self-reports. Caught false positives in testing where the legacy pipeline reported UNVERIFIED_SUCCESS.
  • Single vision-first agent loop — screenshot → tool call → new screenshot → repeat. 6-rule system prompt (down from 36). Works with Anthropic, OpenAI, OpenRouter, or anything with vision + tool calls.
  • PlatformAdapter abstraction — platform-specific code now lives in src/v2/platform/{macos,windows,linux}.ts behind one interface. Replaces 142+ scattered if (IS_MAC) branches across 34 files. Adding a new OS is a single file.
  • Legacy pipeline untouched — clawdcursor start (no flag) behaves exactly as before. Zero breaking changes.
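
The six verifier signals could be combined along these lines (a hypothetical sketch only; the Signal type, verdict function, and majority threshold are illustrative assumptions, not the shipped v2 implementation):

```typescript
// Hypothetical sketch of ground-truth verification: each signal inspects
// the screen independently of the agent's own "done" claim.
type Signal = { name: string; passed: boolean };

// Require the error-pattern check to pass and a majority of the
// positive-evidence checks to agree before accepting "done".
function verdict(signals: Signal[]): "VERIFIED" | "UNVERIFIED_SUCCESS" {
  const errors = signals.filter((s) => s.name === "error-patterns");
  if (errors.some((s) => !s.passed)) return "UNVERIFIED_SUCCESS";
  const positive = signals.filter((s) => s.name !== "error-patterns");
  const agreeing = positive.filter((s) => s.passed).length;
  return agreeing * 2 > positive.length ? "VERIFIED" : "UNVERIFIED_SUCCESS";
}
```

The point of the design is the independence: because every signal is computed from the screen, not from the agent's transcript, a self-reported "done" that changed nothing on screen cannot pass.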

Full history in CHANGELOG.md.


What It Does

Clawd Cursor is a tool server. It wraps your desktop as 42 callable tools: mouse, keyboard, screen, windows, browser. Any AI that can call functions can use it.

Your AI → "Click the Send button"  →  find_element + mouse_click
Your AI → "What's on screen?"      →  desktop_screenshot + read_screen
Your AI → "Open Chrome to gmail"   →  open_app + navigate_browser

No app-specific integrations. No per-service API keys. If it's on screen, clawdcursor can interact with it.


Quick Start

Windows

powershell -c "irm https://clawdcursor.com/install.ps1 | iex"
clawdcursor start

macOS

curl -fsSL https://clawdcursor.com/install.sh | bash
clawdcursor grant     # grant Accessibility + Screen Recording permissions
clawdcursor start

Linux

curl -fsSL https://clawdcursor.com/install.sh | bash
clawdcursor start

First run auto-detects your AI provider from environment variables. Or be explicit:

clawdcursor start --provider anthropic --api-key sk-ant-...
clawdcursor start --provider gemini     # GEMINI_API_KEY in env
clawdcursor start                       # free with Ollama

See docs/MACOS-SETUP.md for macOS permission setup.


Connect

Three modes. Same 42 tools.

1. Built-in Agent (start)

Full autonomous agent. Send a task, get a result.

clawdcursor start
curl http://localhost:3847/task -d '{"task": "Open Notepad and write Hello"}'

2. Tools-Only Server (serve)

Exposes tools over REST. You bring the AI.

clawdcursor serve
curl http://localhost:3847/tools          # discover tools
curl http://localhost:3847/execute/mouse_click -d '{"x":500,"y":300}'
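
In a client, the two REST calls above might be wrapped like this (a sketch; the endpoint paths come from this README, but the helper names and response handling are assumptions):

```typescript
// Minimal sketch of a tools-only client. Endpoint paths (/tools,
// /execute/:name) are from the README; everything else is assumed.
const BASE = "http://localhost:3847";

function executeUrl(tool: string): string {
  return `${BASE}/execute/${encodeURIComponent(tool)}`;
}

async function executeTool(tool: string, args: Record<string, unknown>) {
  const res = await fetch(executeUrl(tool), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(args),
  });
  if (!res.ok) throw new Error(`tool ${tool} failed: ${res.status}`);
  return res.json();
}
```

Against a running clawdcursor serve, executeTool("mouse_click", { x: 500, y: 300 }) would mirror the curl example above.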

3. MCP Mode (mcp)

MCP stdio server for Claude Code, Cursor, Windsurf, Zed.

// ~/.claude/settings.json
{
  "mcpServers": {
    "clawdcursor": {
      "command": "node",
      "args": ["/path/to/clawdcursor/dist/index.js", "mcp"]
    }
  }
}

Tools

42 tools across 6 categories:

Category       Count  Examples
Perception     9      desktop_screenshot, read_screen, get_active_window, smart_read, ocr_read_screen
Mouse          6      mouse_click, mouse_double_click, mouse_drag, mouse_scroll
Keyboard       5      key_press, type_text, smart_type, shortcuts_list, shortcuts_execute
Window / App   6      focus_window, open_app, get_windows, invoke_element
Browser (CDP)  10     cdp_connect, cdp_click, cdp_type, cdp_read_text, cdp_evaluate
Orchestration  6      smart_click, navigate_browser, delegate_to_agent, wait

Pipeline

Two pipelines ship side by side. Same 42 tools, same MCP interface — only the decision-maker differs.

V2 — vision-first (--v2)

Three stages, each does one thing:

┌──────────┐     ┌────────────────┐     ┌──────────────────────┐
│  Router  │  →  │  VisionAgent   │  →  │  GroundTruthVerifier │
│          │     │                │     │                      │
│  regex   │     │  screenshot    │     │  pixel diff · window │
│  shortcut│     │  → tool call   │     │  focus · OCR delta   │
│  zero    │     │  → screenshot  │     │  task assertions     │
│  LLM     │     │  → repeat      │     │  anti-patterns       │
└──────────┘     └────────────────┘     └──────────────────────┘

Router handles trivial tasks ("open Safari") without a model. Everything else hits the VisionAgent (16 tools, 6-rule prompt, model-agnostic). The Verifier runs six independent checks against the screen after the agent claims done — so "done" has to be true, not just asserted.
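
The Router's no-LLM shortcut for trivial tasks might look roughly like this (illustrative only; the real patterns, and any tool name beyond open_app, are assumptions):

```typescript
// Illustrative router: match trivial "open X" phrasing directly to a
// tool call with zero LLM involvement; return null to fall through
// to the VisionAgent.
type ToolCall = { tool: string; args: Record<string, string> };

function routeTask(task: string): ToolCall | null {
  const open = task.match(/^\s*(?:open|launch|start)\s+(.+?)\s*$/i);
  if (open) return { tool: "open_app", args: { name: open[1] } };
  return null; // not trivial → hand off to the vision-first agent loop
}
```

Anything the router declines goes to the VisionAgent, so the shortcut only has to be precise, never complete.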

Legacy — text-first cascade (default, no flag)

Cheapest-first. Kept for backwards compatibility.

L1.5   Deterministic flows  →  hardcoded sequences. Zero LLM.
L2     Skill Cache          →  learned action patterns. Zero LLM.
L2.5   OCR Reasoner         →  OS OCR + cheap text LLM. ~90% of tasks.
L2.5b  A11y Reasoner        →  fallback when OCR is unavailable.
L3     Computer Use         →  vision model. Last resort.
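
The cheapest-first behavior above can be sketched as a fall-through over layers (the layer ordering is from the table; the Layer interface and null-means-pass convention are assumptions for illustration):

```typescript
// Sketch of a cheapest-first cascade: try each layer in order; a layer
// returns null when it can't handle the task, and the next one runs.
type Layer = { name: string; run: (task: string) => string | null };

function cascade(layers: Layer[], task: string): { layer: string; result: string } {
  for (const layer of layers) {
    const result = layer.run(task);
    if (result !== null) return { layer: layer.name, result };
  }
  throw new Error("no layer handled the task");
}
```

The expensive vision model at the bottom only ever runs when every cheaper layer has declined.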

API

http://localhost:3847

Endpoint        Method  Description
/tools          GET     All tools in OpenAI function-calling format
/execute/:name  POST    Execute a tool
/task           POST    Submit a plain-English task
/status         GET     Agent state
/screenshot     GET     Current screen as PNG
/task-logs      GET     Recent task logs (JSONL)
/confirm        POST    Approve/reject a safety-gated action
/abort          POST    Stop current task
/health         GET     Version + health check

Safety

Tier     Actions                                Behavior
Auto     Navigation, reading, opening apps      Runs immediately
Preview  Typing, form filling                   Logged before executing
Confirm  Sending messages, deleting, purchases  Pauses for approval

Server binds to localhost only. Dangerous key combos blocked. Consent required on first run.
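
A tier lookup matching the table above might be structured like this (hypothetical; the specific tool-to-tier mapping shown is illustrative, not the shipped policy):

```typescript
// Hypothetical safety-tier lookup: unknown tools default to the most
// restrictive tier rather than the most permissive.
type Tier = "auto" | "preview" | "confirm";

const TIERS: Record<string, Tier> = {
  desktop_screenshot: "auto", // reading
  open_app: "auto",           // opening apps
  type_text: "preview",       // typing, form filling
  smart_type: "preview",
};

function tierFor(tool: string): Tier {
  return TIERS[tool] ?? "confirm"; // fail closed: unlisted → needs approval
}
```

Defaulting unlisted tools to Confirm means a newly added tool can't silently run destructive actions before someone classifies it.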


CLI

clawdcursor start        Full agent (built-in LLM pipeline)
clawdcursor serve        Tools-only REST server
clawdcursor mcp          MCP stdio server
clawdcursor doctor       Diagnose and configure
clawdcursor grant        Grant macOS permissions (interactive)
clawdcursor task <t>     Send task to running agent
clawdcursor stop         Stop server
clawdcursor dashboard    Open web dashboard

Options:
  --port <port>          Default: 3847
  --provider <name>      anthropic | openai | gemini | groq | ollama | deepseek | ...
  --model <model>        Override model
  --api-key <key>        Provider API key
  --base-url <url>       OpenAI-compatible endpoint
  --accept               Skip consent prompt (non-interactive)
  --v2                   Use v2 architecture (vision-first agent + ground truth verifier)

Platform Support

Platform-specific code lives in src/v2/platform/{macos,windows,linux}.ts behind one PlatformAdapter interface — business logic never reads process.platform.
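
The pattern just described, one interface with per-OS implementations and process.platform read only at the selection boundary, can be sketched like so (the interface members are assumptions; only the file layout comes from this README):

```typescript
// Sketch of the PlatformAdapter pattern: business logic depends only on
// the interface; process.platform is consulted once, at selection time.
interface PlatformAdapter {
  os: "macos" | "windows" | "linux";
  // e.g. screenshot(), ocr(), listWindows() would live here
}

const adapters: Record<string, PlatformAdapter> = {
  darwin: { os: "macos" },
  win32: { os: "windows" },
  linux: { os: "linux" },
};

function selectAdapter(platform: string = process.platform): PlatformAdapter {
  const adapter = adapters[platform];
  if (!adapter) throw new Error(`unsupported platform: ${platform}`);
  return adapter;
}
```

Supporting a new OS then means adding one entry (and one implementation file) instead of touching scattered if (IS_MAC) branches.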

Platform                     UI Automation               OCR                Browser
Windows x64 / ARM64          PowerShell + UI Automation  Windows.Media.Ocr  Chrome / Edge
macOS Intel / Apple Silicon  JXA + System Events         Apple Vision       Chrome / Edge
Linux x64 / ARM64            AT-SPI                      Tesseract          Chrome / Edge

Prerequisites

  • Node.js 20+
  • macOS — Xcode CLI tools: xcode-select --install, then clawdcursor grant for Accessibility + Screen Recording
  • Linux — sudo apt install tesseract-ocr
  • AI key — optional; works fully offline with Ollama

Tech Stack

TypeScript · Node.js · nut-js · Playwright · sharp · Express · MCP SDK · Zod

License

MIT — see LICENSE.


clawdcursor.com
