OpenSRE

SUMMARY

Open-source AI SRE agent that investigates production incidents using episodic memory and a Neo4j knowledge graph. 46 production skills. Self-hosted.

README.md

OpenSRE — Open-source AI SRE platform for automated incident investigation

Your AI SRE that investigates production incidents
Long-term memory · Knowledge graph · 46 production skills

OpenSRE is an open-source AI SRE agent that automatically investigates production incidents, finds root causes, and learns from every investigation. It combines episodic memory (remembering past incidents and what fixed them) with a Neo4j knowledge graph (understanding service dependencies and blast radius) and 46 production-ready skills for tools like Datadog, Grafana, PagerDuty, Elasticsearch, Kubernetes, and AWS. Self-hosted, provider-agnostic via LiteLLM, and licensed Apache 2.0.

[Demo video: OpenSRE investigating a production incident with episodic memory and the knowledge graph, in about 60 seconds]

Website · Docs · Live Demo · Contributing

Why OpenSRE?

  • Learns from every incident. OpenSRE remembers past investigations: what worked and what didn't. Similar incident at 3am? It already knows the playbook.
  • Understands your infrastructure. The Neo4j knowledge graph maps service dependencies, so the agent knows the blast radius before it starts investigating.
  • Plugs into what you already use. 46 production skills for Datadog, Grafana, PagerDuty, Elasticsearch, Kubernetes, AWS, and more. No rip-and-replace.

Quick Start

git clone https://github.com/swapnildahiphale/OpenSRE.git
cd OpenSRE
cp .env.example .env
# Add your OPENROUTER_API_KEY (or ANTHROPIC_API_KEY) to .env
make dev

This starts Postgres, config-service, LiteLLM proxy, Neo4j, sre-agent, and the web console. Migrations run automatically. Open http://localhost:3002 and paste the admin token shown in the terminal to sign in.
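Before running make dev, it can help to confirm that .env really contains one of the provider keys mentioned above. The following is a minimal sketch and not part of the OpenSRE repository: the function name has_llm_key is hypothetical, and it assumes .env uses plain KEY=value lines.

```shell
# Hypothetical helper (not shipped with OpenSRE): succeeds if the given env
# file sets OPENROUTER_API_KEY or ANTHROPIC_API_KEY to a non-empty value.
has_llm_key() {
  local env_file="${1:-.env}"
  # Fail early if the file does not exist.
  [ -f "$env_file" ] || return 1
  # Match uncommented KEY=value lines with at least one character after "=".
  grep -Eq '^(OPENROUTER_API_KEY|ANTHROPIC_API_KEY)=.+' "$env_file"
}
```

One way to use it is has_llm_key .env && make dev, so the stack only starts once a key is present.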

Full setup guide · Slack integration · Configuration

Architecture

[Architecture diagram: LangGraph orchestration with episodic memory, 46 investigation skills, and Neo4j knowledge graph]

→ Detailed architecture docs · Architecture overview

Features

| Feature | Description |
| --- | --- |
| 46 Production Skills | Elasticsearch, Datadog, Grafana, PagerDuty, K8s, AWS, and more |
| Long-term Memory | Stores investigations, surfaces past solutions for similar incidents |
| Knowledge Graph | Neo4j service topology, dependency traversal, blast radius |
| Multi-provider LLM | Claude, OpenAI, Gemini, DeepSeek, Mistral, Ollama, and more |
| Web Console | Dashboard, agent runs, memory browser |
| Slack Integration | Investigate incidents directly from Slack |

→ See all features · Roadmap

Useful Commands

| Command | What it does |
| --- | --- |
| make dev | Start all services (Postgres, config, LiteLLM, agent, web UI) |
| make dev-slack | Start all services + Slack bot |
| make stop | Stop all services |
| make status | Show service health status |
| make logs | Follow all service logs |
| make logs-agent | Follow sre-agent logs only |
| make clean | Remove containers, volumes, and images |

Slack integration

Create a Slack app, add SLACK_BOT_TOKEN and SLACK_APP_TOKEN to .env, and run make dev-slack. Full guide.
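As a rough illustration, the two additions to .env might look like the fragment below. The values are placeholders, not real credentials. Slack bot tokens normally start with xoxb- and app-level tokens with xapp-, which is a quick way to check that each value went into the right variable.

```shell
# Placeholder values only; copy the real tokens from your Slack app settings.
SLACK_BOT_TOKEN=xoxb-your-bot-token
SLACK_APP_TOKEN=xapp-your-app-level-token
```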

E2E Testing with EKS

Run OpenSRE against a real Kubernetes cluster with the OpenTelemetry Demo app to test end-to-end investigations.

Prerequisites

  • An existing EKS cluster with kubectl and helm installed
  • AWS CLI configured with access to the cluster

Setup

export EKS_CLUSTER=my-cluster
export EKS_REGION=us-west-2
make e2e-setup-eks

This installs the otel-demo app on your EKS cluster, sets up port-forward tunnels to Prometheus/Grafana/Jaeger, starts sre-agent and the web UI, and generates a team token you can use to sign in.
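Since the setup target depends on the CLI tools listed under Prerequisites, a small preflight check can surface a missing tool before anything is installed on the cluster. This is a sketch under stated assumptions: missing_tools is a hypothetical helper that is not part of the OpenSRE Makefile, and it only checks that each command is on PATH, not that it is configured for your cluster.

```shell
# Hypothetical preflight check (not part of OpenSRE): prints each given
# command that is not found on PATH and returns non-zero if any are missing.
missing_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, missing_tools kubectl helm aws && make e2e-setup-eks would run the setup only when all three tools are available.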

Run fault injection tests

make e2e-test                    # Quick cart failure investigation (raw curl)
make e2e-test-cart               # Cart service fault — ~10% EmptyCart failures
make e2e-test-product            # Product catalog fault — ~5% GetProduct failures
make e2e-test-recommendation     # Recommendation service cache failure
make e2e-test-ad                 # Ad service failure — all requests fail
make e2e-test-all                # Run all 4 fault injection tests sequentially

Each test injects a fault into the otel-demo app via feature flags, then triggers an OpenSRE investigation to diagnose it.

EKS commands

| Command | What it does |
| --- | --- |
| make e2e-setup-eks | Full setup: otel-demo on EKS + tunnels + agent + token |
| make e2e-teardown-eks | Uninstall otel-demo from EKS and stop tunnels |
| make e2e-status | Show cluster, pods, and observability status |
| make e2e-token | Generate a team token for web UI access |
| make eks-port-forward | Start port-forward tunnels to EKS |
| make eks-port-forward-stop | Stop port-forward tunnels |

Local cluster (Kind)

For testing without a cloud cluster, you can use Kind instead:

make e2e-setup       # Create Kind cluster + install otel-demo + start agent
make e2e-teardown    # Delete Kind cluster and clean up

Comparing OpenSRE

How does OpenSRE compare to commercial incident response tools like PagerDuty Copilot, Rootly AI, and Shoreline? See the full breakdown:

→ Comparison matrix · Blog: OpenSRE vs Commercial Tools

Built With

OpenSRE is built on top of proven open-source technologies:

  • LangGraph — Agent orchestration (planner → subagents → synthesizer)
  • Neo4j — Knowledge graph for service topology and dependency traversal
  • FastAPI — Backend API with SSE streaming
  • Next.js — Web console (dashboard, memory browser, config editor)
  • LiteLLM — Multi-provider LLM proxy (18+ providers)
  • PostgreSQL — Persistent storage for configs and agent state

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines. Please open an issue before starting major work.

Creator

Swapnil Dahiphale · SRE · Builder
"Built by someone who's been paged at 3am."
Portfolio   LinkedIn

License

OpenSRE is licensed under the Apache License 2.0.
