agentic-chatops
Health Warn
- No license — Repository has no license file
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Community trust — 95 GitHub stars
Code Fail
- rm -rf — Recursive force deletion command in a2a/agent-cards/openclaw-t1.json
- process.env — Environment variable access in mcp-proxmox/index.js
- network request — Outbound network request in mcp-proxmox/index.js
Permissions Pass
- Permissions — No dangerous permissions requested
This tool is a multi-tier ChatOps platform that automates infrastructure alert triage and remediation. It integrates n8n, GPT-4o, and Claude to investigate alerts and propose infrastructure fixes while keeping a human in the loop for final approval.
Security Assessment
The overall risk is rated as Medium. The repository contains a recursive force deletion command (`rm -rf`), which requires careful review to ensure it cannot be accidentally triggered or misused. The project accesses environment variables and makes outbound network requests, which are expected behaviors for an infrastructure automation tool, though they still warrant a manual code review. Fortunately, no hardcoded secrets or dangerous permission scopes were detected.
Quality Assessment
The project is in active development, with its last push occurring today. It has earned a solid 95 GitHub stars, indicating a good level of community trust and interest from fellow developers. However, it lacks a standard open-source license. Without a license, the software is technically proprietary, meaning you do not have explicit legal permission to use, modify, or distribute the code.
Verdict
Use with caution due to destructive shell commands and the complete lack of a formal software license.
3-tier agentic ChatOps (n8n + GPT-4o + Claude Code) implementing all 21 patterns from "Agentic Design Patterns" — solo operator managing 137 devices
agentic-chatops
Production agentic ChatOps platform implementing all 21 design patterns from Agentic Design Patterns by Antonio Gulli — cross-referenced against the Claude Certified Architect Exam Guide — running on a self-hosted homelab managed by a single operator.

Why This Exists
Managing 310 infrastructure objects — 113 physical devices, 197 virtual machines, 421 IP addresses, 39 VLANs, 653 interfaces across 6 sites (Netherlands, Greece x2, Switzerland, Norway) and 3 Proxmox clusters — as a solo operator is unsustainable without automation.
That's 3 firewalls, 3 managed switches, 12 Kubernetes nodes with Cilium ClusterMesh, self-hosted everything (Matrix, GitLab, YouTrack, n8n, LibreNMS, Grafana, Nextcloud HA, SeaweedFS, Thanos), and no team to delegate to. When an alert fires at 3am, there's one person on call. Always.
This platform bridges the gap: infrastructure alerts flow in, AI agents triage and investigate, propose remediation plans, and wait for human approval before executing. The human stays in the loop for critical decisions but doesn't have to do the detective work.
The 3-Tier Architecture
LibreNMS/Prometheus Alert
       │
       ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ n8n          │────▶│ OpenClaw     │────▶│ Claude Code  │
│ Orchestrator │     │ (Tier 1)     │     │ (Tier 2)     │
│ 11 workflows │     │ GPT-4o       │     │ Claude Opus  │
│ ~354 nodes   │     │ L1/L2 triage │     │ Deep analysis│
└──────┬───────┘     └──────────────┘     └──────┬───────┘
       │                                         │
       ▼                                         ▼
┌──────────────┐                          ┌──────────────┐
│ Matrix       │◀─────────────────────────│ Human (T3)   │
│ Chat rooms   │ polls, reactions, replies│ Approval     │
└──────────────┘                          └──────────────┘
- Tier 1 (OpenClaw / GPT-4o): Fast triage (7-21s). Creates YouTrack issues, deduplicates alerts, investigates via SSH/kubectl, outputs confidence scores. Handles 80%+ of alerts without escalation.
- Tier 2 (Claude Code / Opus): Deep analysis (5-15 min). Reads Tier 1 findings, verifies independently using ReAct reasoning, proposes remediation plans via interactive polls, executes after human approval.
- Tier 3 (Human): Clicks a poll option in Matrix, reacts with thumbs up/down, or types a reply. The system stops and waits for this — it never makes infrastructure changes autonomously.
Agentic Design Patterns — 21/21 Implemented
After reading Antonio Gulli's Agentic Design Patterns (Springer, 2025), we benchmarked this platform against all 21 patterns and upgraded each to a grade of A- or higher. Full audit: docs/agentic-patterns-audit.md.
| # | Pattern | Implementation | Grade |
|---|---|---|---|
| 1 | Prompt Chaining | n8n 44-node sequential workflow (Runner) | A |
| 2 | Routing | Issue prefix → room → slot, alert category detection (8 types) | A- |
| 3 | Parallelization | 3 concurrent session slots (dev, infra-nl, infra-gr) | A- |
| 4 | Reflection | Cross-tier review: OpenClaw critiques Claude with 5-step chain-of-verification | A- |
| 5 | Tool Use | 9 MCP servers, 150+ tools (NetBox, Proxmox, K8s, YouTrack, GitLab, n8n) | A |
| 6 | Planning | Interactive [POLL] plan selection via MSC3381 Matrix polls + plan-only mode | A- |
| 7 | Multi-Agent | 3-tier production system (GPT-4o → Claude Opus → Human) | A |
| 8 | Memory | 4 types: short-term (SQLite sessions), long-term (incident KB), episodic (OpenClaw memory), procedural (SOUL.md/CLAUDE.md) | A- |
| 9 | Learning & Adaptation | A/B prompt testing, outcome scoring, lessons-to-prompt pipeline, regression detection | A |
| 10 | MCP | 9 servers including custom Proxmox MCP (15 tools), mcporter Docker bridge | A |
| 11 | Goal Setting | Confidence gating (< 0.5 = STOP), budget enforcement ($5/session, $25/day) | A- |
| 12 | Exception Handling | 5-layer watchdog, ERROR_CONTEXT structured propagation, fallback ladders | A |
| 13 | Human-in-the-Loop | MSC3381 polls, thumbs up/down reactions, 15min/30min approval timeouts | A |
| 14 | RAG | Vector embeddings (nomic-embed-text via Ollama) + keyword fallback, 3-tier injection | A- |
| 15 | A2A Communication | NL-A2A/v1 protocol, agent cards, REVIEW_JSON auto-action, task lifecycle logging | A |
| 16 | Resource Optimization | Cost prediction per alert category, dynamic timeout (300-600s), per-type metrics | A |
| 17 | Reasoning | ReAct (THOUGHT/ACTION/OBSERVATION), step-back prompting, tree-of-thought, self-consistency check, A/B variants | A |
| 18 | Guardrails | Code-level exec enforcement (safe-exec.sh), input sanitization (10 injection patterns), output fact-checking, credential scanning | A |
| 19 | Evaluation | Multi-dimensional quality scoring (5 dimensions, 0-100), SLA metrics, CI golden tests, confidence calibration | A |
| 20 | Prioritization | Slot-based, burst detection (3+ hosts = correlated triage), flap escalation | A- |
| 21 | Exploration | Daily proactive health scan (disk, certs, stale issues, VPN) | A- |
Book gap analysis for remaining polish items: docs/book-gap-analysis.md
How It Works
Alert Lifecycle (End-to-End)
1. LibreNMS detects "Devices up/down" on host X
2. n8n LibreNMS Receiver → dedup, flap detection, burst detection
3. Posts to Matrix #infra room: "[LibreNMS] ALERT: host X — Devices up/down (critical)"
4. OpenClaw (Tier 1) auto-triages:
   a. Checks YouTrack for existing issues (24h dedup)
   b. Creates issue IFRNLLEI01PRD-XXX
   c. Queries NetBox CMDB for device identity
   d. Queries incident knowledge base (semantic search)
   e. Investigates via SSH (PVE status, container logs, etc.)
   f. Posts findings + CONFIDENCE score to YouTrack + Matrix
   g. If confidence < 0.7 or critical: escalates to Claude Code
5. Claude Code (Tier 2) activates:
   a. Reads YouTrack issue + Tier 1 comments
   b. Uses ReAct reasoning: THOUGHT → ACTION → OBSERVATION loop
   c. Checks if recurring alert → step-back analysis
   d. Proposes 2-3 remediation plans via [POLL]
6. Matrix renders interactive poll — operator clicks preferred plan
7. Claude Code executes selected plan
8. Reports results, moves issue to "To Verify"
9. Session End: archives to session_log, populates incident KB, computes quality score, extracts lessons learned
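
The escalation rule in step 4g can be sketched as a small predicate. This is a minimal illustration, not the platform's code: the 0.7 threshold and the "critical" condition come from the lifecycle above, while `TriageResult`, its field names, and `should_escalate` are hypothetical.

```python
from dataclasses import dataclass

# Threshold taken from step 4g; all names below are hypothetical.
ESCALATION_THRESHOLD = 0.7


@dataclass
class TriageResult:
    issue_id: str
    confidence: float  # 0.0-1.0, as reported by Tier 1
    severity: str      # e.g. "critical" or "warning"


def should_escalate(result: TriageResult) -> bool:
    """Return True when Tier 1 should hand the issue to Tier 2."""
    low_confidence = result.confidence < ESCALATION_THRESHOLD
    is_critical = result.severity == "critical"
    return low_confidence or is_critical
```

Under this sketch a critical alert is escalated even when Tier 1 is confident, which matches the "or critical" wording of the step.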
Real Incident Example
IFRNLLEI01PRD-82 — Full L1→L2→L3→approval→fix→recovery cycle:
- LibreNMS alert → n8n → Matrix → OpenClaw triage (30s) → Claude Code investigation (8min) → [POLL] with 3 options → operator clicks Plan A → fix applied → recovery confirmed → YT closed
Operating Modes
| Mode | Frontend | Backend | Use Case |
|---|---|---|---|
| `oc-cc` | OpenClaw | Claude Code | Default — full 3-tier pipeline |
| `oc-oc` | OpenClaw | OpenClaw (self-contained) | Quick lookups, no Claude needed |
| `cc-cc` | n8n/Claude | Claude Code | Direct Claude access (legacy) |
| `cc-oc` | n8n | OpenClaw as backend | Testing OpenClaw capabilities |
Switch with !mode <mode> in any Matrix room.
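
As an illustration of how a `!mode` message might be resolved to a frontend/backend pair, here is a hedged sketch. The four mode names and their pairings come from the operating-modes table; the `MODES` dictionary and `parse_mode_command` are hypothetical and do not reflect how the n8n Bridge actually parses commands.

```python
# Mode -> (frontend, backend), copied from the operating-modes table.
MODES = {
    "oc-cc": ("OpenClaw", "Claude Code"),
    "oc-oc": ("OpenClaw", "OpenClaw"),
    "cc-cc": ("n8n/Claude", "Claude Code"),
    "cc-oc": ("n8n", "OpenClaw"),
}


def parse_mode_command(message: str):
    """Return (frontend, backend) for a valid '!mode <mode>' message, else None."""
    parts = message.strip().split()
    if len(parts) != 2 or parts[0] != "!mode":
        return None
    # Unknown arguments (including "status") fall through to None here.
    return MODES.get(parts[1])
```

Returning `None` for anything unrecognized keeps the sketch conservative: the caller decides whether to reply with an error or handle another subcommand.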
Architecture Components
n8n Workflows (11 workflows, ~354 nodes)
| Workflow | Nodes | Purpose |
|---|---|---|
| YouTrack Receiver | 5 | Webhook listener, fires Runner async |
| Claude Runner | 44 | Lock/cooldown → RAG → Build Prompt → Launch Claude → Parse → Validate → Post |
| Progress Poller | 10 | Polls JSONL log every 30s, posts tool activity to Matrix |
| Matrix Bridge | 73 | Polls /sync, routes commands, manages sessions, handles reactions/polls |
| Session End | 12 | Summarize → archive → populate KB → quality score → YT comment |
| LibreNMS Receiver (NL) | 26 | Alert dedup, flap detection, burst detection, recovery tracking |
| LibreNMS Receiver (GR) | 26 | Clone of NL for second site |
| Prometheus Receiver (NL) | 26 | K8s alert processing, fingerprint dedup |
| Prometheus Receiver (GR) | 26 | Clone of NL for second site |
| Synology DSM Receiver | 7 | I/O latency, SMART, iSCSI errors (beyond SNMP) |
| WAL Self-Healer (GR) | 16 | Auto-restart Prometheus on WAL corruption (6h cooldown, recovery verify) |
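
The burst detection performed by the receivers (3+ hosts triggers correlated triage) could look roughly like the sliding-window sketch below. The 3-host threshold is from the patterns table; the 300-second window, the `BurstDetector` class, and the assumption that timestamps arrive in non-decreasing order are illustrative choices, not documented behavior.

```python
from collections import deque


class BurstDetector:
    """Flags a burst when enough distinct hosts alert within a time window."""

    def __init__(self, min_hosts: int = 3, window_seconds: float = 300.0):
        self.min_hosts = min_hosts            # from the README: 3+ hosts
        self.window_seconds = window_seconds  # assumed value, not documented
        self._events = deque()                # (timestamp, host), oldest first

    def observe(self, host: str, timestamp: float) -> bool:
        """Record an alert and return True if a burst is in progress."""
        self._events.append((timestamp, host))
        cutoff = timestamp - self.window_seconds
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        distinct_hosts = {h for _, h in self._events}
        return len(distinct_hosts) >= self.min_hosts
```

Counting distinct hosts rather than raw alerts means a single flapping host does not register as a burst; flap handling would be a separate check.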
MCP Servers (9)
| MCP | Tools | Purpose |
|---|---|---|
| `netbox` | ~20 | CMDB: 310 devices/VMs, 421 IPs, 39 VLANs across 6 sites |
| `n8n-mcp` | 20 | Build, update, test n8n workflows programmatically |
| `youtrack` | 55 | Issue management, custom fields, state transitions |
| `proxmox` | 15 | VM/LXC lifecycle, node status, storage (custom MCP) |
| `kubernetes` | 21 | kubectl operations via MCP |
| `gitlab-mcp` | — | MRs, pipelines, commits |
| `codegraph` | ~15 | Code graph database (KuzuDB), call chain analysis |
| `opentofu` | — | Registry provider/resource docs |
| `tfmcp` | — | Terraform module analysis |
OpenClaw (v2026.3.23) — 10 native skills
| Skill | Purpose |
|---|---|
| `infra-triage` | L1+L2 infra alert triage (YT dedup → investigate → escalate) |
| `k8s-triage` | Kubernetes alert triage (control plane deep investigation) |
| `correlated-triage` | Multi-host burst analysis (master + child issues) |
| `escalate-to-claude` | Tier 2 escalation via n8n webhook |
| `youtrack-lookup` | Issue CRUD operations |
| `netbox-lookup` | CMDB device/VM/IP/VLAN lookup |
| `playbook-lookup` | Query incident knowledge base for past resolutions |
| `memory-recall` | Episodic memory: past triage outcomes by host/alert |
| `proactive-scan` | Daily health checks (disk, certs, stale issues, VPN) |
| `safe-exec.sh` | Exec enforcement wrapper (30+ blocked patterns, rate limiting, exfiltration detection) |
Inter-Agent Communication (NL-A2A/v1)
Standardized protocol for all tier-to-tier messages. See docs/a2a-protocol.md.
- Agent Cards — machine-readable capability declarations per tier (`a2a/agent-cards/`)
- Message Envelope — standard wrapper with protocol, messageId, from/to, type, payload
- REVIEW_JSON Auto-Action — Bridge parses OpenClaw reviews: AGREE→auto-approve, DISAGREE→pause, AUGMENT→resume with context
- Task Lifecycle — `a2a_task_log` table tracks escalation→in_progress→completed
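
A rough sketch of the envelope and the REVIEW_JSON auto-action follows. The envelope keys (protocol, messageId, from, to, type, payload) and the AGREE/DISAGREE/AUGMENT mapping are taken from the list above. The `verdict` payload field, the function names, the agent identifiers in the usage, and the choice to pause on an unrecognized verdict are assumptions; docs/a2a-protocol.md is the authoritative specification.

```python
import uuid

# Verdict -> bridge action, as described for REVIEW_JSON Auto-Action.
REVIEW_ACTIONS = {
    "AGREE": "auto-approve",
    "DISAGREE": "pause",
    "AUGMENT": "resume-with-context",
}


def build_envelope(sender: str, recipient: str, msg_type: str, payload: dict) -> dict:
    """Wrap a payload in an NL-A2A/v1-style envelope (field layout assumed)."""
    return {
        "protocol": "NL-A2A/v1",
        "messageId": str(uuid.uuid4()),
        "from": sender,
        "to": recipient,
        "type": msg_type,
        "payload": payload,
    }


def action_for_review(envelope: dict) -> str:
    """Map a review verdict to a bridge action; unknown verdicts pause (assumed)."""
    verdict = str(envelope["payload"].get("verdict", "")).upper()
    return REVIEW_ACTIONS.get(verdict, "pause")
```

Defaulting to "pause" is the conservative option in a human-in-the-loop system: a malformed review stops the pipeline instead of approving anything.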
Data & Intelligence
SQLite Tables
| Table | Purpose |
|---|---|
| `sessions` | Active sessions (issue_id, session_id, cost, confidence) |
| `session_log` | Archived sessions with cost/duration/confidence/resolution/variant/category |
| `session_quality` | 5-dimension quality scores (confidence, cost efficiency, completeness, feedback, speed) |
| `session_feedback` | Thumbs up/down reactions linked to issues |
| `incident_knowledge` | Alert resolutions with vector embeddings (nomic-embed-text, 768 dims) |
| `lessons_learned` | Operational insights extracted from sessions |
| `openclaw_memory` | Episodic memory for Tier 1 triage outcomes |
| `a2a_task_log` | Inter-agent message lifecycle tracking |
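
To make the "vector embeddings + keyword fallback" lookup over `incident_knowledge` concrete, here is a dependency-free sketch. It ranks rows by cosine similarity when a query embedding is available and falls back to word overlap otherwise. The row layout (`text`, `embedding`), the `search` function, and the scoring details are hypothetical; the real logic lives in scripts/kb-semantic-search.py and may differ.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_text, query_embedding, rows, top_k=3):
    """Rank rows by embedding similarity, or by keyword overlap as a fallback."""
    if query_embedding is not None:
        scored = [
            (cosine_similarity(query_embedding, r["embedding"]), r)
            for r in rows
            if r.get("embedding")
        ]
    else:
        # Fallback path, e.g. when the embedding model is unavailable.
        words = set(query_text.lower().split())
        scored = [(len(words & set(r["text"].lower().split())), r) for r in rows]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [row for _, row in scored[:top_k]]
```

In practice the query embedding would be 768-dimensional to match the stored vectors; the function itself is dimension-agnostic.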
Prometheus Metrics
| Metric | What it tracks |
|---|---|
| Session cost/duration/confidence/turns | Per-project, rolling 7d/30d |
| Quality score (5 dimensions) | Rolling 7d averages, composite score |
| SLA: MTTR avg/p90 | Per-project, per-category |
| Confidence calibration | Predicted vs actual success rate per band |
| Cost per alert category | 8 categories, avg cost + duration |
| A/B variant comparison | Per-variant confidence, cost, session count |
| Feedback (thumbs up/down) | Total + 7d rolling |
| A2A messages | By type (escalation/review/completion) |
| Exec guardrail | Blocked vs allowed commands |
| Golden test results | Pass/fail counts, last run timestamp |
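
The confidence calibration metric (predicted vs actual success rate per band) can be illustrated as follows. This is a sketch under assumptions: five equal-width bands, input as `(confidence, succeeded)` pairs, and the function name `calibration_by_band` are all illustrative, and the actual metric scripts may bucket differently.

```python
from collections import defaultdict


def calibration_by_band(sessions, n_bands: int = 5) -> dict:
    """Group (confidence, succeeded) pairs into bands and compare rates.

    Returns {band_index: {"predicted", "actual", "count"}}, where "predicted"
    is the mean stated confidence and "actual" is the observed success rate.
    """
    buckets = defaultdict(list)
    for confidence, succeeded in sessions:
        # Clamp so that confidence == 1.0 lands in the top band.
        band = min(int(confidence * n_bands), n_bands - 1)
        buckets[band].append((confidence, succeeded))

    report = {}
    for band, items in sorted(buckets.items()):
        report[band] = {
            "predicted": sum(c for c, _ in items) / len(items),
            "actual": sum(1 for _, ok in items if ok) / len(items),
            "count": len(items),
        }
    return report
```

A well-calibrated agent shows "predicted" close to "actual" in every band; a large gap in the top band means high-confidence answers are failing more often than they should.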
Grafana Dashboards (5 dashboards, 63+ panels)
- ChatOps Platform Performance — sessions, queue, locks, API status, costs, quality, knowledge
- Infrastructure Overview — CPU/memory/disk per host, GPU metrics, service availability
- Infra Alerts & Remediation — alert rates, triage outcomes, MTTR trends
- CubeOS Project / MeshSat Project — pipeline success, MRs, issue states
Guardrails & Safety
Defense-in-depth — not just prompt instructions, but code-level enforcement:
| Layer | Mechanism | Level |
|---|---|---|
| Exec enforcement | `safe-exec.sh` — 30+ blocked patterns, rate limiting (30/min), exfiltration detection | Code |
| Input sanitization | 10 prompt injection patterns stripped from Matrix messages | Code |
| Credential scanning | 10 regex patterns redact tokens/keys before posting to Matrix | Code |
| Output fact-checking | Hostname validation, TRIAGE_JSON/REVIEW_JSON schema validation | Code |
| Self-consistency | Detects confidence/reasoning mismatches, triggers retry | Code |
| Exec blocklist | 15+ forbidden commands in SOUL.md | Prompt |
| AUTHORIZED_SENDERS | Only designated operator can interact | Code |
| Approval gates | Infrastructure changes require human thumbs-up or poll vote | Workflow |
| Budget ceiling | $5/session warning, $25/day → plan-only mode | Code |
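
The budget ceiling row can be read as a two-level check, sketched below. The $5 per-session warning and the $25 per-day switch to plan-only mode are from the table; the state names, the function `budget_state`, and the use of `>=` at the boundaries are assumptions.

```python
SESSION_WARN_USD = 5.0    # per-session warning threshold (from the table)
DAILY_CEILING_USD = 25.0  # daily ceiling that triggers plan-only mode


def budget_state(session_cost_usd: float, daily_cost_usd: float) -> str:
    """Return "plan-only", "warn", or "ok" for the current spend."""
    # The daily ceiling takes precedence over the session warning.
    if daily_cost_usd >= DAILY_CEILING_USD:
        return "plan-only"
    if session_cost_usd >= SESSION_WARN_USD:
        return "warn"
    return "ok"
```

Checking the daily ceiling first means an expensive session on an already over-budget day is reported as plan-only, not merely as a warning.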
Installation
Prerequisites
- n8n (v2.11+) — workflow automation
- Matrix (Synapse) — chat server with bot account
- YouTrack — issue tracking with webhook support
- Claude Code — Anthropic CLI (`~/.local/bin/claude`)
- OpenClaw — GPT-4o agent (Docker-based)
- SQLite3 — session/knowledge storage
- Python 3.11+ — semantic search script
- Ollama (optional) — local embedding model for RAG
Setup Steps
- Clone and configure:
git clone https://github.com/papadopouloskyriakos/agentic-chatops.git
cd agentic-chatops
cp .env.example .env # Edit with your credentials
- Import n8n workflows:
# Via n8n-mcp or manual import
for wf in workflows/*.json; do
# Import each workflow into your n8n instance
npx n8n-mcp import "$wf"
done
- Configure Matrix bot:
  - Create a bot user on your Matrix server
  - Set Bearer token in n8n credentials
  - Join bot to your rooms
- Configure OpenClaw:
  - Deploy `openclaw/openclaw.json` to your OpenClaw instance
  - Deploy `openclaw/SOUL.md` as system prompt
  - Deploy skills to `/workspace/skills/`
- Initialize SQLite:
# Tables are auto-created by n8n workflows on first run
# Or manually:
sqlite3 gateway.db < schema.sql
- Set up crons:
# Session + agent metrics (every 5 min)
*/5 * * * * /path/to/scripts/write-session-metrics.sh
*/5 * * * * /path/to/scripts/write-agent-metrics.sh
*/5 * * * * /path/to/scripts/write-sla-metrics.sh
# Watchdog (every 5 min)
*/5 * * * * /path/to/scripts/gateway-watchdog.sh
# Regression detection (every 6 hours)
0 */6 * * * /path/to/scripts/regression-detector.sh
# Weekly lessons digest (Monday 07:00 UTC)
0 7 * * 1 /path/to/scripts/weekly-lessons-digest.sh
# Golden test suite (1st of month 04:00 UTC)
0 4 1 * * /path/to/scripts/golden-test-suite.sh
# Proactive scan (daily 06:03 UTC)
3 6 * * * /path/to/scripts/trigger-proactive-scan.sh
- Configure alert sources:
  - LibreNMS: create HTTP transport pointing to `https://your-n8n/webhook/librenms-alert`
  - Prometheus/Alertmanager: add webhook receiver pointing to `https://your-n8n/webhook/prometheus-alert`
Repository Structure
.
├── CLAUDE.md # Full technical reference (600+ lines)
├── a2a/ # NL-A2A/v1 inter-agent protocol
│ └── agent-cards/ # Machine-readable capability declarations
│ ├── openclaw-t1.json # Tier 1 capabilities + constraints
│ ├── claude-code-t2.json # Tier 2 capabilities + reasoning config
│ └── human-t3.json # Tier 3 approval policies
├── docs/
│ ├── a2a-protocol.md # A2A protocol specification
│ ├── agentic-patterns-audit.md # 21/21 pattern scorecard
│ ├── book-gap-analysis.md # Remaining improvements from the book
│ ├── known-failure-rules.md # 27 rules from 26 bugs
│ └── chatops-audit-2026-03-24.md # Cross-reference audit (Gulli book + Anthropic exam guide)
├── grafana/ # Dashboard JSON exports (5 dashboards)
├── openclaw/
│ ├── SOUL.md # OpenClaw system prompt (source of truth)
│ ├── openclaw.json # OpenClaw configuration
│ ├── escalate-to-claude.sh # Tier 2 escalation script
│ └── skills/ # 9 native skills (SKILL.md format)
│ ├── infra-triage/ # L1+L2 infrastructure triage
│ ├── k8s-triage/ # Kubernetes alert triage
│ ├── correlated-triage/ # Multi-host burst analysis
│ ├── safe-exec.sh # Exec enforcement wrapper
│ └── ...
├── scripts/
│ ├── compute-quality-score.sh # 5-dimension session quality scoring
│ ├── regression-detector.sh # 7d rolling regression detection
│ ├── golden-test-suite.sh # 42-test benchmark suite
│ ├── kb-semantic-search.py # Vector similarity search (nomic-embed-text)
│ ├── gateway-watchdog.sh # 5-layer health monitor
│ ├── maintenance-companion.sh # Planned maintenance lifecycle
│ ├── write-session-metrics.sh # Prometheus: cost, quality, calibration
│ ├── write-sla-metrics.sh # Prometheus: MTTR, duration, trends
│ └── ...
├── workflows/ # n8n workflow JSON exports (11 workflows)
│ ├── claude-gateway-runner.json # Main orchestration (44 nodes)
│ ├── claude-gateway-matrix-bridge.json # Matrix integration (73 nodes)
│ └── ...
├── mcp-proxmox/ # Custom MCP server for Proxmox VE API
│ ├── index.js # 15 tools: discovery, config, lifecycle
│ └── package.json
└── .gitlab-ci.yml # CI: validate, test, review, GitHub sync
Commands
Matrix bang commands (processed by n8n Bridge):
| Command | Description |
|---|---|
| `!session current/list/done/cancel/pause/resume` | Session management |
| `!issue status/info/start/stop/verify/done/close` | Issue lifecycle |
| `!pipeline status/logs/retry` | GitLab CI pipelines |
| `!mode status/oc-cc/oc-oc/cc-cc/cc-oc` | Operating mode switching |
| `!system status/processes` | System health |
| `!gateway offline/online/status` | Gateway control |
| `!debug` | Dump lock, sessions, queue, cooldown state |
Inspiration & References
- Agentic Design Patterns by Antonio Gulli (Springer, 2025) — 21 patterns, all implemented. Cross-reference audit: `docs/chatops-audit-2026-03-24.md`
- Claude Certified Architect — Foundations Exam Guide — Exam domains map to this architecture: agentic orchestration, MCP integration, CLAUDE.md configuration, prompt engineering, context management. Full domain mapping in the audit report.
- n8n Workflow Template — Published on n8n creator portal: "Manage Claude Code sessions from Matrix with YouTrack and GitLab"
- n8n — Workflow automation engine (self-hosted)
- Model Context Protocol — Standardized LLM-tool integration
License
This is a sanitized mirror of a private GitLab repository. Internal hostnames, IP addresses, credentials, and personal identifiers have been replaced with placeholders.
The code is provided as-is for educational and reference purposes. See individual components for their respective licenses.
Built by a solo infrastructure operator who got tired of waking up at 3am for alerts that an AI could triage.