Your AI SRE that investigates production incidents
Long-term memory · Knowledge graph · 46 production skills

OpenSRE is an open-source AI SRE agent that automatically investigates production incidents, finds root causes, and learns from every investigation. It combines episodic memory (remembering past incidents and what fixed them) with a Neo4j knowledge graph (understanding service dependencies and blast radius) and 46 production-ready skills for tools like Datadog, Grafana, PagerDuty, Elasticsearch, Kubernetes, and AWS. Self-hosted, provider-agnostic via LiteLLM, and licensed Apache 2.0.


Website · Docs · Live Demo · Contributing

Why OpenSRE?


Learns from every incident	OpenSRE remembers past investigations — what worked, what didn't. Similar incident at 3am? It already knows the playbook.
Understands your infrastructure	Neo4j knowledge graph maps service dependencies, so the agent knows blast radius before it starts investigating.
Plugs into what you already use	46 production skills for Datadog, Grafana, PagerDuty, Elasticsearch, Kubernetes, AWS, and more. No rip-and-replace.

Quick Start

git clone https://github.com/swapnildahiphale/OpenSRE.git
cd OpenSRE
cp .env.example .env
# Add your OPENROUTER_API_KEY (or ANTHROPIC_API_KEY) to .env
make dev

This starts Postgres, config-service, LiteLLM proxy, Neo4j, sre-agent, and the web console. Migrations run automatically. Open http://localhost:3002 and paste the admin token shown in the terminal to sign in.
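
Before running `make dev`, it can help to confirm that `.env` actually contains one of the two keys mentioned above. The following is a minimal sketch, not part of the repository: it assumes `.env` uses plain `KEY=value` lines, and the helper names and the `ENV_FILE` override are hypothetical.

```shell
# Succeeds when KEY ($2) is assigned a non-empty value in FILE ($1).
# Commented-out lines and empty assignments such as KEY= or KEY="" do not count.
env_has_value() {
  [ -f "$1" ] && grep -Eq "^[[:space:]]*(export[[:space:]]+)?$2=[\"']?[^\"'[:space:]#]" "$1"
}

# Succeeds when at least one of the LLM keys named in the Quick Start is set.
has_llm_key() {
  env_has_value "$1" OPENROUTER_API_KEY || env_has_value "$1" ANTHROPIC_API_KEY
}

# ENV_FILE is a hypothetical override; it defaults to the .env created above.
if has_llm_key "${ENV_FILE:-.env}"; then
  echo "LLM key found"
else
  echo "Add OPENROUTER_API_KEY or ANTHROPIC_API_KEY before running make dev" >&2
fi
```

The check only looks at whether a value is present; it does not verify that the key is valid with the provider.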

Full setup guide · Slack integration · Configuration

Architecture

OpenSRE architecture diagram — LangGraph orchestration with episodic memory, 46 investigation skills, and Neo4j knowledge graph

→ Detailed architecture docs · Architecture overview

Features

Feature	Description
46 Production Skills	Elasticsearch, Datadog, Grafana, PagerDuty, K8s, AWS, and more
Long-term Memory	Stores investigations, surfaces past solutions for similar incidents
Knowledge Graph	Neo4j service topology, dependency traversal, blast radius
Multi-provider LLM	Claude, OpenAI, Gemini, DeepSeek, Mistral, Ollama, and more
Web Console	Dashboard, agent runs, memory browser
Slack Integration	Investigate incidents directly from Slack

→ See all features · Roadmap

Useful Commands

Command	What it does
`make dev`	Start all services (Postgres, config-service, LiteLLM proxy, Neo4j, sre-agent, web console)
`make dev-slack`	Start all services + Slack bot
`make stop`	Stop all services
`make status`	Show service health status
`make logs`	Follow all service logs
`make logs-agent`	Follow sre-agent logs only
`make clean`	Remove containers, volumes, and images

Slack integration

Create a Slack app, add `SLACK_BOT_TOKEN` and `SLACK_APP_TOKEN` to `.env`, and run `make dev-slack`. See the full guide for details.
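
A common mistake is to paste the two tokens into the wrong variables. The sketch below checks their prefixes before starting the bot. It assumes the usual Slack token types (bot tokens start with `xoxb-`, app-level tokens with `xapp-`) and that OpenSRE expects those types, which the guide should confirm; the function name is hypothetical.

```shell
# Warns about each token whose prefix looks wrong.
# Returns non-zero if either token fails the check.
check_slack_tokens() {
  bot=$1 app=$2 rc=0
  case $bot in
    xoxb-?*) ;;
    *) echo "SLACK_BOT_TOKEN should start with xoxb-" >&2; rc=1 ;;
  esac
  case $app in
    xapp-?*) ;;
    *) echo "SLACK_APP_TOKEN should start with xapp-" >&2; rc=1 ;;
  esac
  return $rc
}

# Usage (assumes .env holds plain KEY=value lines that the shell can source):
#   set -a; . ./.env; set +a
#   check_slack_tokens "$SLACK_BOT_TOKEN" "$SLACK_APP_TOKEN" && make dev-slack
```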

E2E Testing with EKS

Run OpenSRE against a real Kubernetes cluster with the OpenTelemetry Demo app to test end-to-end investigations.

Prerequisites

An existing EKS cluster, plus kubectl and helm installed locally
AWS CLI configured with access to the cluster

Setup

export EKS_CLUSTER=my-cluster
export EKS_REGION=us-west-2
make e2e-setup-eks

This installs the otel-demo app on your EKS cluster, sets up port-forward tunnels to Prometheus/Grafana/Jaeger, starts sre-agent and the web UI, and generates a team token you can use to sign in.
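
Since the setup depends on several tools and two environment variables, a quick pre-flight check can save a failed run. This is a minimal sketch under the assumption that the prerequisites listed above are the complete set; the helper names are hypothetical and not part of the Makefile.

```shell
# Prints the name of every command in the argument list that is not on PATH.
missing_cmds() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Prints the name of every variable in the argument list that is unset or empty.
missing_vars() {
  for name in "$@"; do
    eval "val=\${$name:-}"
    [ -n "$val" ] || printf '%s\n' "$name"
  done
}

# Collect everything that is missing and report it in one line.
missing=$(missing_cmds kubectl helm aws; missing_vars EKS_CLUSTER EKS_REGION)
if [ -n "$missing" ]; then
  echo "Missing before make e2e-setup-eks:" $missing >&2
else
  echo "Prerequisites look good"
fi
```

The check confirms only that the tools exist and the variables are set. It does not verify that the AWS credentials can reach the cluster.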

Run fault injection tests

make e2e-test                    # Quick cart failure investigation (raw curl)
make e2e-test-cart               # Cart service fault — ~10% EmptyCart failures
make e2e-test-product            # Product catalog fault — ~5% GetProduct failures
make e2e-test-recommendation     # Recommendation service cache failure
make e2e-test-ad                 # Ad service failure — all requests fail
make e2e-test-all                # Run all 4 fault injection tests sequentially

Each test injects a fault into the otel-demo app via feature flags, then triggers an OpenSRE investigation to diagnose it.
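
To run only a subset of the tests and get a short pass/fail summary, a small wrapper loop is one option. The sketch below assumes each make target exits non-zero when its test fails, which the Makefile may or may not guarantee; the function name and the `RUNNER` override (for dry runs) are hypothetical.

```shell
# Runs each make target in turn, writes its output to <target>.log,
# and prints one PASS/FAIL line per target.
# RUNNER is a hypothetical override, e.g. RUNNER=echo for a dry run.
# The return status is the number of failed targets.
run_fault_tests() {
  failed=0
  for target in "$@"; do
    if ${RUNNER:-make} "$target" > "$target.log" 2>&1; then
      echo "PASS $target"
    else
      echo "FAIL $target"
      failed=$((failed + 1))
    fi
  done
  return "$failed"
}

# Usage:
#   run_fault_tests e2e-test-cart e2e-test-ad
```

Unlike `make e2e-test-all`, this keeps going after a failure so the summary covers every target that was requested.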

EKS commands

Command	What it does
`make e2e-setup-eks`	Full setup: otel-demo on EKS + tunnels + agent + token
`make e2e-teardown-eks`	Uninstall otel-demo from EKS and stop tunnels
`make e2e-status`	Show cluster, pods, and observability status
`make e2e-token`	Generate a team token for web UI access
`make eks-port-forward`	Start port-forward tunnels to EKS
`make eks-port-forward-stop`	Stop port-forward tunnels

Local cluster (Kind)

For testing without a cloud cluster, you can use Kind instead:

make e2e-setup       # Create Kind cluster + install otel-demo + start agent
make e2e-teardown    # Delete Kind cluster and clean up

Comparing OpenSRE

How does OpenSRE compare to commercial incident response tools like PagerDuty Copilot, Rootly AI, and Shoreline? See the full breakdown:

→ Comparison matrix · Blog: OpenSRE vs Commercial Tools

Built With

OpenSRE is built on top of proven open-source technologies:

LangGraph — Agent orchestration (planner → subagents → synthesizer)
Neo4j — Knowledge graph for service topology and dependency traversal
FastAPI — Backend API with SSE streaming
Next.js — Web console (dashboard, memory browser, config editor)
LiteLLM — Multi-provider LLM proxy (18+ providers)
PostgreSQL — Persistent storage for configs and agent state

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines. Please open an issue before starting major work.

Creator

Swapnil Dahiphale · SRE · Builder
"Built by someone who's been paged at 3am."

License

OpenSRE is licensed under the Apache License 2.0.