OpenSRE
Open-source AI SRE agent that investigates production incidents using episodic memory and Neo4j knowledge graph. 46 production skills. Self-hosted.
Your AI SRE that investigates production incidents
Long-term memory · Knowledge graph · 46 production skills
OpenSRE is an open-source AI SRE agent that automatically investigates production incidents, finds root causes, and learns from every investigation. It combines episodic memory (remembering past incidents and what fixed them) with a Neo4j knowledge graph (understanding service dependencies and blast radius) and 46 production-ready skills for tools like Datadog, Grafana, PagerDuty, Elasticsearch, Kubernetes, and AWS. Self-hosted, provider-agnostic via LiteLLM, and licensed Apache 2.0.
Website · Docs · Live Demo · Contributing
Why OpenSRE?
| Learns from every incident | OpenSRE remembers past investigations — what worked, what didn't. Similar incident at 3am? It already knows the playbook. |
| Understands your infrastructure | Neo4j knowledge graph maps service dependencies, so the agent knows blast radius before it starts investigating. |
| Plugs into what you already use | 46 production skills for Datadog, Grafana, PagerDuty, Elasticsearch, Kubernetes, AWS, and more. No rip-and-replace. |
Quick Start
git clone https://github.com/swapnildahiphale/OpenSRE.git
cd OpenSRE
cp .env.example .env
# Add your OPENROUTER_API_KEY (or ANTHROPIC_API_KEY) to .env
make dev
This starts Postgres, config-service, LiteLLM proxy, Neo4j, sre-agent, and the web console. Migrations run automatically. Open http://localhost:3002 and paste the admin token shown in the terminal to sign in.
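Before running make dev, it can help to confirm that .env actually contains one of the API keys mentioned above. The following is a minimal sketch, not part of OpenSRE: the function names env_has_key and check_llm_key are hypothetical, and the check assumes plain KEY=value lines with no quoting or export prefixes.

```shell
# Hypothetical helpers (not shipped with OpenSRE).
# env_has_key FILE KEY: succeeds if FILE has a line "KEY=<non-empty value>".
env_has_key() {
  grep -Eq "^${2}=.+" "$1"
}

# check_llm_key [FILE]: succeeds if either supported API key is set in FILE.
# FILE defaults to .env in the current directory.
check_llm_key() {
  env_file="${1:-.env}"
  if env_has_key "$env_file" OPENROUTER_API_KEY ||
     env_has_key "$env_file" ANTHROPIC_API_KEY; then
    echo "ok: LLM API key found in $env_file"
  else
    echo "missing: set OPENROUTER_API_KEY or ANTHROPIC_API_KEY in $env_file" >&2
    return 1
  fi
}

# Example usage (run from the OpenSRE checkout):
#   check_llm_key .env && make dev
```

Chaining the check with && means make dev only starts when a key is present, which avoids bringing up the whole stack only to have the agent fail on its first LLM call.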
Architecture
Features
| Feature | Description |
|---|---|
| 46 Production Skills | Elasticsearch, Datadog, Grafana, PagerDuty, K8s, AWS, and more |
| Long-term Memory | Stores investigations, surfaces past solutions for similar incidents |
| Knowledge Graph | Neo4j service topology, dependency traversal, blast radius |
| Multi-provider LLM | Claude, OpenAI, Gemini, DeepSeek, Mistral, Ollama, and more |
| Web Console | Dashboard, agent runs, memory browser |
| Slack Integration | Investigate incidents directly from Slack |
Useful Commands
| Command | What it does |
|---|---|
| make dev | Start all services (Postgres, config, LiteLLM, agent, web UI) |
| make dev-slack | Start all services + Slack bot |
| make stop | Stop all services |
| make status | Show service health status |
| make logs | Follow all service logs |
| make logs-agent | Follow sre-agent logs only |
| make clean | Remove containers, volumes, and images |
Slack integration
Create a Slack app, add SLACK_BOT_TOKEN and SLACK_APP_TOKEN to .env, and run make dev-slack. Full guide.
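As a rough illustration, the two Slack entries in .env might look like the fragment below. The values are placeholders, not real tokens. The prefixes follow Slack's usual token formats (xoxb- for bot tokens, xapp- for app-level tokens); check the full guide for the scopes and app settings OpenSRE expects.

```shell
# .env fragment (placeholder values; replace with the tokens from your Slack app)
SLACK_BOT_TOKEN=xoxb-your-bot-token
SLACK_APP_TOKEN=xapp-your-app-level-token
```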
E2E Testing with EKS
Run OpenSRE against a real Kubernetes cluster with the OpenTelemetry Demo app to test end-to-end investigations.
Prerequisites
- An existing EKS cluster with kubectl and helm installed
- AWS CLI configured with access to the cluster
Setup
export EKS_CLUSTER=my-cluster
export EKS_REGION=us-west-2
make e2e-setup-eks
This installs the otel-demo app on your EKS cluster, sets up port-forward tunnels to Prometheus/Grafana/Jaeger, starts sre-agent and the web UI, and generates a team token you can use to sign in.
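The prerequisites and variables above can be verified before running make e2e-setup-eks. This is a minimal sketch under stated assumptions: e2e_preflight is a hypothetical helper, not an OpenSRE command, and it only checks that the tools are on PATH and the variables are non-empty. It does not verify that your AWS credentials can actually reach the cluster.

```shell
# Hypothetical preflight check (not part of the OpenSRE Makefile).
# Succeeds only if kubectl, helm, and aws are on PATH and both
# EKS_CLUSTER and EKS_REGION are set to non-empty values.
e2e_preflight() {
  missing=0

  for tool in kubectl helm aws; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      missing=1
    fi
  done

  for var in EKS_CLUSTER EKS_REGION; do
    # Indirect lookup of the variable named by $var (names are hardcoded above).
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "unset variable: $var" >&2
      missing=1
    fi
  done

  return "$missing"
}

# Example usage:
#   e2e_preflight && make e2e-setup-eks
```

The function reports every missing item rather than stopping at the first one, so a single run shows everything that needs fixing.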
Run fault injection tests
make e2e-test # Quick cart failure investigation (raw curl)
make e2e-test-cart # Cart service fault — ~10% EmptyCart failures
make e2e-test-product # Product catalog fault — ~5% GetProduct failures
make e2e-test-recommendation # Recommendation service cache failure
make e2e-test-ad # Ad service failure — all requests fail
make e2e-test-all # Run all 4 fault injection tests sequentially
Each test injects a fault into the otel-demo app via feature flags, then triggers an OpenSRE investigation to diagnose it.
EKS commands
| Command | What it does |
|---|---|
| make e2e-setup-eks | Full setup: otel-demo on EKS + tunnels + agent + token |
| make e2e-teardown-eks | Uninstall otel-demo from EKS and stop tunnels |
| make e2e-status | Show cluster, pods, and observability status |
| make e2e-token | Generate a team token for web UI access |
| make eks-port-forward | Start port-forward tunnels to EKS |
| make eks-port-forward-stop | Stop port-forward tunnels |
Local cluster (Kind)
For testing without a cloud cluster, you can use Kind instead:
make e2e-setup # Create Kind cluster + install otel-demo + start agent
make e2e-teardown # Delete Kind cluster and clean up
Comparing OpenSRE
How does OpenSRE compare to commercial incident response tools like PagerDuty Copilot, Rootly AI, and Shoreline? See the full breakdown:
→ Comparison matrix · Blog: OpenSRE vs Commercial Tools
Built With
OpenSRE is built on top of proven open-source technologies:
- LangGraph — Agent orchestration (planner → subagents → synthesizer)
- Neo4j — Knowledge graph for service topology and dependency traversal
- FastAPI — Backend API with SSE streaming
- Next.js — Web console (dashboard, memory browser, config editor)
- LiteLLM — Multi-provider LLM proxy (18+ providers)
- PostgreSQL — Persistent storage for configs and agent state
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines. Please open an issue before starting major work.
Creator
Swapnil Dahiphale · SRE · Builder

"Built by someone who's been paged at 3am."
License
OpenSRE is licensed under the Apache License 2.0.