openclaw-self-healing

agent
Guvenlik Denetimi
Basarisiz
Health Gecti
  • License — License: MIT
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Community trust — 35 GitHub stars
Code Basarisiz
  • rm -rf — Recursive force deletion command in scripts/emergency-recovery.sh
Permissions Gecti
  • Permissions — No dangerous permissions requested

Bu listing icin henuz AI raporu yok.

SUMMARY

AI-powered self-healing system for OpenClaw Gateway • 4-tier autonomous recovery • macOS & Linux

README.md

🦞 OpenClaw Self-Healing System

Autonomous AI-Powered Recovery for Any Service

Stop getting paged at 3 AM. Let AI fix your crashes automatically.

Version
Platform
License: MIT
GitHub Stars
Recovery Rate
LLM-Agnostic
Prometheus
Lint

🚀 Quick Start · 🎬 Demo · 🏗️ Architecture · 📖 Docs

한국어

openclaw-self-healing

If this saved your night 🌙, a ⭐ helps others find it.


🎬 Demo

Self-Healing Demo

4-tier recovery in action: KeepAlive → Watchdog → AI Doctor → Alert


🔥 Why This Exists

This system wraps any long-running service with autonomous crash recovery. OpenClaw Gateway is the primary example — a self-hosted AI gateway for Claude/GPT-4 routing — but the watchdog/recovery architecture adapts to any service.

Your service crashes at midnight. A basic watchdog restarts it — but what if the config is corrupted? The API rate limit hit? A dependency broken?

Simple restart = crash loop. You get paged. Your weekend is ruined.

This system doesn't just restart — it understands and fixes root causes.


🆚 Why openclaw-self-healing vs. rolling your own

Feature Basic Watchdog supervisord openclaw-self-healing
Auto-restart on crash
HTTP health polling
Crash loop prevention (backoff) partial ✅ exponential backoff
Config validation before start ✅ Level 0 preflight
AI root-cause diagnosis ✅ Claude / GPT-4 / Gemini / Ollama
Auto-fix corrupted config
Multi-channel alerts (Discord/Slack/Telegram)
Prometheus metrics
Works on macOS + Linux + Docker partial
Zero vendor lock-in ✅ MIT

The gap that matters: when a crash loop is caused by something that can't be fixed by restarting alone, every other tool pages you. This one tries to fix it first.


⚡ Try It First (Dry Run)

Not ready to commit? Preview exactly what the installer will do — no changes made:

curl -fsSL https://raw.githubusercontent.com/Ramsbaby/openclaw-self-healing/main/install.sh | bash -s -- --dry-run

Sample output:

╔═══════════════════════════════════════════════════════════╗
║  🔍 DRY RUN — nothing will be installed or modified       ║
╚═══════════════════════════════════════════════════════════╝

[Pre-flight] All prerequisites present ✅

[Step 1] Directories that would be created: ~/.openclaw/...
[Step 2] Scripts that would be downloaded: gateway-watchdog.sh, ...
[Step 3] Environment file: ~/.openclaw/.env
[Step 4] LaunchAgents: ai.openclaw.watchdog, com.openclaw.healthcheck

  ✓ Level 0: Pre-flight validation              READY
  ✓ Level 1: KeepAlive (instant restart)         READY
  ✓ Level 2: Watchdog + HealthCheck (3-5 min)    READY
  ✓ Level 3: AI Emergency Recovery (auto-trigger) READY
  ✓ Level 4: Discord/Telegram Human Alert         READY

  Run without --dry-run to install.

🚀 Quick Start

Prerequisites

  • macOS 12+ or Linux (Ubuntu 20.04+ / systemd) or Docker
  • OpenClaw Gateway installed and running
  • Any major LLM — Claude CLI (default), OpenAI, Gemini, or Ollama. See LLM-Agnostic
  • tmux, jq (brew install tmux jq or apt install tmux jq)

Note: While this was built for OpenClaw Gateway, the watchdog/recovery architecture works for any service. See docs/configuration.md to adapt it.

Option 1: One-line Install (macOS / Linux)

curl -fsSL https://raw.githubusercontent.com/ramsbaby/openclaw-self-healing/main/install.sh | bash

The installer walks you through everything:

╔═══════════════════════════════════════════════╗
║  🦞 OpenClaw Self-Healing System Installer    ║
╚═══════════════════════════════════════════════╝

[1/6] Checking prerequisites...          ✅
[2/6] Creating directories...            ✅
[3/6] Installing scripts...              ✅
[4/6] Configuring environment..
      Discord webhook URL (optional): https://discord.com/api/webhooks/...
      Gateway port [18789]: 
      Gateway token (auto-detected): ✅
[5/6] Installing Watchdog LaunchAgent... ✅
[6/6] Verifying installation...
      Health check: HTTP 200 ✅
      Chain: Watchdog → HealthCheck → Emergency Recovery ✅

🎉 Self-Healing System Active!

Option 2: Docker Compose

git clone https://github.com/Ramsbaby/openclaw-self-healing.git
cd openclaw-self-healing
cp .env.example .env   # edit with your config
docker compose up -d

See docs/DOCKER.md for full configuration guide.

Verify It Works

# Kill your Gateway to test auto-recovery
kill -9 $(pgrep -f openclaw-gateway)

# Wait ~30 seconds, then check
curl http://localhost:18789/
# Expected: HTTP 200 ✅

🎬 How It Works

5-Tier Autonomous Recovery

graph TD
    A[🚀 LaunchAgent Starts Gateway] --> B[Level 0: Preflight]
    B -->|"Config valid"| C[exec gateway — launchd tracks PID]
    B -->|"Config invalid"| D[AI Recovery Session + backoff → retry]
    C --> E{Stable?}
    E -->|Repeated crashes| F[Level 1: KeepAlive]
    F -->|"Instant restart (0-30s)"| G{Stable?}
    G -->|Yes| Z[✅ Online]
    G -->|Repeated crashes| H[Level 2: Watchdog]
    H -->|"HTTP check every 3min"| I{Stable?}
    I -->|Yes| Z
    I -->|"30min continuous failure"| J[Level 3: AI Recovery]
    J -->|"Autonomous diagnosis & fix"| K{Fixed?}
    K -->|Yes| Z
    K -->|No| L[Level 4: Human Alert]
    L -->|"Discord / Slack / Telegram"| M[👤 Manual Fix]

    style A fill:#74c0fc
    style B fill:#74c0fc
    style Z fill:#51cf66
    style J fill:#4dabf7

Each Level Explained

Level What When How
0 Preflight Validation Every cold start Validate binary, .env keys, JSON configs before exec
1 LaunchAgent KeepAlive Any crash Instant restart (0–30s)
2 Watchdog v4.1 + HealthCheck Repeated crashes PID + HTTP + memory monitoring, exponential backoff
3 AI Emergency Recovery 30min continuous failure PTY session → log analysis → auto-fix (Claude/GPT-4/Gemini/Ollama)
4 Human Alert All automation fails Discord/Slack/Telegram with full context

Level 0 (new in v3.2): Catches config corruption, missing .env keys, and broken JSON before the gateway even starts — preventing crash loops from bad config entirely.


📊 Real Production Numbers

Based on an audit of 14 real incidents (Feb 2026):

Scenario Result
17 consecutive crashes ✅ Full recovery via Level 1
Config corruption ✅ Auto-fixed in ~3 min
All services killed (nuclear) ✅ Recovered in ~3 min
38+ crash loop ⛔ Stopped by design (prevents infinite loops)

9 of 14 incidents resolved fully autonomously. The remaining 5 escalated correctly to Level 4 — the system worked as designed.


🏗️ Architecture

4-tier recovery architecture

Level 0: Preflight 🔍 (every cold start)
│  Validates binary, .env keys, JSON configs before exec
│  On failure: AI recovery session (tmux) + exponential backoff
│  scripts/gateway-preflight.sh
│
▼  passes
Level 1: KeepAlive ⚡ (0-30s)
│  Instant restart on any crash
│  Built into ai.openclaw.gateway.plist
│
▼  repeated failures
Level 2: Watchdog v4.1 🔍 (3-5 min)
│  HTTP + PID + memory monitoring every 3 min
│  Exponential backoff: 10s → 30s → 90s → 180s → 600s
│  Crash counter auto-decay after 6 hours
│
▼  30 minutes of continuous failure
Level 3: AI Emergency Recovery 🧠 (5-30 min)
│  Auto-triggered — no manual intervention
│  Supports Claude, GPT-4, Gemini, Ollama (OPENCLAW_LLM_PROVIDER)
│  PTY session: reads logs → diagnoses → fixes
│  Documents learnings for future incidents
│
▼  all automation fails
Level 4: Human Alert 🚨
   Discord/Slack/Telegram notification with full context
   Log paths + recovery report attached

📢 Multi-Channel Notifications

Unified notification library supporting Discord, Slack, Telegram, and extensible adapters — one config, all channels.

# Auto-detects channel from available env vars
# Set one (or more) of:
DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/..."
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/..."
TELEGRAM_BOT_TOKEN="..."  # + TELEGRAM_CHAT_ID

# Or explicitly force a channel
NOTIFICATION_CHANNEL="slack"

Implementation: scripts/lib/notify.sh

Scripts Reference

Script Level Purpose
scripts/gateway-preflight.sh 0 Proactive config validation before service start
scripts/gateway-watchdog.sh 2 Reactive recovery after crash detection
scripts/gateway-healthcheck.sh 2 HTTP health polling + Level 3 escalation
scripts/emergency-recovery-v2.sh 3 AI autonomous diagnosis and repair
scripts/emergency-recovery-monitor.sh 3 Monitor active recovery sessions
scripts/incident-digest.sh ops Weekly incident report with autonomy rate
scripts/lib/llm-gateway.sh 3 LLM-agnostic wrapper (Claude/GPT-4/Gemini/Ollama)
scripts/lib/notify.sh all Unified notification dispatcher (Discord/Slack/Telegram)
scripts/prometheus-exporter.py obs Prometheus metrics HTTP server
scripts/start-metrics-exporter.sh obs Start/stop/status for the metrics exporter

✅ What v3.4 Added

Before v3.4 After v3.4
Docker users had to adapt manually Docker Compose support with gateway + watchdog services
Discord-only in some scripts Unified notify.sh library (Discord/Slack/Telegram)

Previous: What v3.2 Added

Before v3.2 After v3.2
Config corruption caused crash loops at start Preflight catches it before exec
ANTHROPIC_API_KEY silently missing in tmux sessions spawned from launchd Key forwarded via tmux -e flag
No proactive validation layer Level 0: gateway-preflight.sh

Previous: What v3.1 Fixed

Before v3.1 After v3.1
Manual LaunchAgent/systemd setup install.sh does everything
.env had to be created by hand Interactive wizard generates it
Level 2 → Level 3 was disconnected Auto-triggers after 30 min
macOS only macOS + Linux (systemd)
Install often failed mid-way Verified end-to-end

🤖 LLM-Agnostic Recovery (New in v3.3)

Level 3 Emergency Recovery is no longer Claude-only. Set OPENCLAW_LLM_PROVIDER in your .env:

Provider Config Default model Requires
Claude (default) OPENCLAW_LLM_PROVIDER=claude Claude Code CLI Claude Max subscription
OpenAI OPENCLAW_LLM_PROVIDER=openai gpt-4o OPENAI_API_KEY + pip install openai
Google Gemini OPENCLAW_LLM_PROVIDER=gemini gemini-2.0-flash GOOGLE_API_KEY + pip install google-generativeai
Ollama (local/offline) OPENCLAW_LLM_PROVIDER=ollama llama3.2 Ollama running locally
# Switch to GPT-4o
echo 'OPENCLAW_LLM_PROVIDER=openai' >> ~/.openclaw/.env
echo 'OPENAI_API_KEY=sk-...'        >> ~/.openclaw/.env

# Fully offline with Ollama (no API key needed)
echo 'OPENCLAW_LLM_PROVIDER=ollama' >> ~/.openclaw/.env
echo 'OPENCLAW_LLM_MODEL=llama3.2'  >> ~/.openclaw/.env

Implementation: scripts/lib/llm-gateway.sh


📈 Prometheus Metrics (New in v3.3)

Expose real-time recovery metrics for Grafana dashboards and alerting rules.

Metric Type Description
openclaw_gateway_healthy gauge 1 if gateway returns HTTP 200, 0 otherwise
openclaw_recovery_attempts gauge Total Level-3 recovery attempts
openclaw_recovery_success gauge Successful recoveries
openclaw_recovery_failed gauge Failed recoveries
openclaw_recovery_rate_percent gauge Autonomous recovery success rate (0–100)
openclaw_last_recovery_duration_seconds gauge Duration of last recovery attempt
openclaw_last_recovery_success gauge 1 if last recovery succeeded, 0 if failed
openclaw_last_recovery_timestamp_seconds gauge Unix timestamp of last recovery
# Start the exporter (port 9090 by default)
bash scripts/start-metrics-exporter.sh start

# Test
curl -s http://localhost:9090/metrics

# Stop / restart
bash scripts/start-metrics-exporter.sh stop
bash scripts/start-metrics-exporter.sh restart

# Custom port
OPENCLAW_METRICS_PORT=8080 bash scripts/start-metrics-exporter.sh start

Prometheus scrape config

# prometheus.yml
scrape_configs:
  - job_name: 'openclaw'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 30s

Grafana alerting example

# Alert: gateway down for >5 minutes
openclaw_gateway_healthy == 0

# Alert: recovery rate drops below 50%
openclaw_recovery_rate_percent < 50

Implementation: scripts/prometheus-exporter.py · scripts/start-metrics-exporter.sh


📊 Weekly Incident Digest

Generate a Markdown report of the last 7 days of incidents:

# Print to stdout
bash scripts/incident-digest.sh

# Print + post to Discord
bash scripts/incident-digest.sh --discord

Sample output:

## 📊 Weekly Incident Digest

**Period**: 2026-03-18 → 2026-03-25

| Metric | Value |
|--------|-------|
| Total incidents | 7 |
| Auto-resolved | 5 (71%) |
| Escalated to human | 2 |
| Autonomy rate | 71% |

✅ System performing well (autonomy rate on target)

🗺️ Roadmap

✅ Done: 4-tier architecture · Claude AI integration · install.sh automation · Linux systemd · Level 2→3 auto-escalation · Discord/Telegram alerts · Preflight validation (v3.2) · LLM-agnostic layer — Claude, GPT-4, Gemini, Ollama (v3.3) · Prometheus metrics exporter (v3.3) · Multi-channel notifications — Discord/Slack/Telegram (v3.4) · Docker Compose support (v3.4) · --dry-run demo mode · Weekly incident digest

🚧 Next: Grafana dashboard template · Multi-node clusters

🔮 Future: Kubernetes Operator

🗳️ Vote on features →


💬 Community

Found this useful? A ⭐ on GitHub helps others discover it.


📚 Docs

📖 Quick Start Installation guide
🏗️ Architecture System design
🔧 Configuration Environment variables
🐳 Docker Docker Compose setup
🐛 Troubleshooting Common issues
📜 Changelog Version history

🔒 Security

No secrets in code. .env for all webhooks. Lock files prevent races. All recoveries logged.

Level 3 AI access: OpenClaw config, gateway restart, log files — intentional for autonomous recovery.


🌐 OpenClaw Ecosystem

Project Role
openclaw-self-healing ← you are here 4-tier autonomous crash recovery
openclaw-memorybox Memory hygiene CLI — prevents the bloat that causes crashes
openclaw-self-evolving AI agent that proposes its own AGENTS.md improvements
jarvis 24/7 AI ops system using Claude Max — self-healing, RAG, cron automation

All MIT licensed, all battle-tested on the same 24/7 production instance.


🤝 Contributing

Bug reports, feature requests, docs improvements welcome. 📋 Contribution Guide →

Community: Discussions · Issues · Discord


MIT License · Made with 🦞 by @ramsbaby

"The best system is one that fixes itself before you notice it's broken."

Yorumlar (0)

Sonuc bulunamadi