otel-k8s-graph

A live Kubernetes service dependency graph (service map) built from OpenTelemetry traces. See what runs where and which service calls what — then query it over REST or from any LLM / MCP client (Claude, etc.).

Keywords: Kubernetes service map · service dependency graph · service topology · OpenTelemetry · distributed tracing · observability · trace-based service map · MCP server for Kubernetes.

A live relationship graph — a service map / service topology — of a Kubernetes
cluster: what runs where, and which service talks to what. Built from two sources,
queryable by humans, scripts, and LLM agents.

Cluster structure — namespaces, nodes, zones, regions, workloads
(deployments, statefulsets, daemonsets, jobs, rollouts, …), autoscalers
(HPA, KEDA), pods and containers, and how they contain/manage/scale each
other — from the Kubernetes API.
Service relationships — which service calls which HTTP endpoint or RPC
method (e.g. gRPC), queries which database, publishes to which topic — derived
from OpenTelemetry (OTel) trace spans, with no spanmetrics connector or
metrics pipeline required.
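
As a rough illustration of the second source, the sketch below maps a span's kind and attributes to one of the edge kinds named in this README. The `Span` type, the `EdgeKind` function, and the rules themselves are assumptions made for this example; the actual logic lives in graph-spans and may differ.

```go
package main

import "fmt"

// Span is a hypothetical, flattened view of an OTel span: only its kind and
// string attributes. It is not the project's real span record type.
type Span struct {
	Kind  string            // "client", "server", "producer", "consumer", "internal"
	Attrs map[string]string // e.g. db.system, http.route, rpc.system
}

// EdgeKind guesses which relationship edge a span implies, using the edge
// names from this README. The rules themselves are assumed for illustration.
func EdgeKind(s Span) string {
	_, isDB := s.Attrs["db.system"]
	switch {
	case s.Kind == "client" && isDB:
		return "QUERIES"
	case s.Kind == "client":
		return "CALLS"
	case s.Kind == "server":
		return "EXPOSES"
	case s.Kind == "producer":
		return "PUBLISHES"
	case s.Kind == "consumer":
		return "CONSUMES"
	}
	return "" // e.g. internal spans: no relationship derived
}

func main() {
	spans := []Span{
		{Kind: "client", Attrs: map[string]string{"db.system": "mysql"}},
		{Kind: "server", Attrs: map[string]string{"http.route": "/v1/checkout"}},
		{Kind: "internal"},
	}
	for _, s := range spans {
		fmt.Printf("%-8s -> %q\n", s.Kind, EdgeKind(s))
	}
}
```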

What you can ask

One graph answers questions that normally need a service mesh, a tracing UI, and a
cluster inspector at once — over REST or from Claude / any MCP client:

Safe deprecation — "what pods call /v1/checkout vs /v2/checkout — can I safely retire v1?"
Blast radius — "how many upstream dependencies does this deployment have — what's impacted if it changes?"
Dead-code / unused APIs — "which exposed endpoints have no callers?"
Data-store reach — "which services talk to the mysql database at 10.0.0.0?"
Cross-zone traffic — "which services call auth-service from another zone?"
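
To make one of these questions concrete, here is a minimal in-memory sketch of the "unused APIs" query: endpoints that have an EXPOSES edge but no CALLS edge. The `Edge` struct and `UncalledEndpoints` are hypothetical names for this example; the real service answers such questions from Redis through graph-read.

```go
package main

import (
	"fmt"
	"sort"
)

// Edge is a hypothetical in-memory edge: From --Kind--> To.
type Edge struct{ From, Kind, To string }

// UncalledEndpoints returns every endpoint that is the target of an EXPOSES
// edge but never the target of a CALLS edge, sorted for stable output.
func UncalledEndpoints(edges []Edge) []string {
	exposed := map[string]bool{}
	called := map[string]bool{}
	for _, e := range edges {
		switch e.Kind {
		case "EXPOSES":
			exposed[e.To] = true
		case "CALLS":
			called[e.To] = true
		}
	}
	var out []string
	for ep := range exposed {
		if !called[ep] {
			out = append(out, ep)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	edges := []Edge{
		{"container:shop/cart-0/app", "EXPOSES", "endpoint:grpc/oteldemo.CartService/GetCart"},
		{"container:shop/cart-0/app", "EXPOSES", "endpoint:grpc/oteldemo.CartService/EmptyCart"},
		{"container:shop/web-0/app", "CALLS", "endpoint:grpc/oteldemo.CartService/GetCart"},
	}
	fmt.Println(UncalledEndpoints(edges))
}
```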

How it compares

Unlike a service mesh (Istio/Linkerd) or eBPF agent, otel-k8s-graph needs no
sidecars and no kernel access — it derives the service map straight from the
OpenTelemetry spans your apps already emit. Unlike a tracing backend or mesh
console (Jaeger, Tempo, Kiali), it stores the distilled relationship graph, not raw
traces, so it stays small and is directly queryable by scripts and LLMs.

Architecture

  Kubernetes API ──watch──────>  graph-k8s ──┐
   (pods, nodes, ...)           (structure)  │
                                             ├──>  Redis  <──reads──  graph-read ──> REST API
  OTel Collector ──OTLP/gRPC──> otel-hub ──NATS──> graph-spans ──┐  (graph)          └─> MCP server
   (trace spans)               (ingest +           graph-flows ──┘                     (graph-read mcp)
                                NATS pub)

Five small Go binaries coordinating through Redis and NATS:

Component	Source	Role	Docs
graph-k8s	Kubernetes API	Writes structural entities + containment/management edges. Single writer.	cmd/graph-k8s
otel-hub	OTel trace spans (OTLP/gRPC)	Ingests raw OTLP spans, assembles traces, publishes `spans.raw` + `traces.assembled` to NATS. Writes no Redis.	cmd/otel-hub
graph-spans	NATS `spans.raw`	Writes relationship entities + CALLS/QUERIES/PUBLISHES/EXPOSES edges.	cmd/graph-spans
graph-flows	NATS `traces.assembled`	Derives flows from assembled traces.	cmd/graph-flows
graph-read	Redis	Serves the read/query HTTP API; also the MCP server (`graph-read mcp`).	cmd/graph-read

Quickstart

The Helm chart bundles a single-replica Redis by default, so a fresh install
is self-contained:

# 1. Build + push the three images (versioned + latest) and install the chart.
#    REGISTRY must be a registry your k8s cluster can pull from. The script
#    builds and pushes the images, updates the Helm chart, and runs helm install.
REGISTRY=<your-registry> ./deploy.sh

# 2. Add an otlp exporter to your OTel Collector's traces pipeline, pointing at otel-hub:
#      endpoint: otel-hub-otlp:4317
#    (when running multiple otel-hub replicas, use a traceID loadbalancing
#     exporter; see "Scaling & the bus" below)

Already have Redis? Pass --set redis.internal.enabled=false --set redis.host=... to the install.
See the chart README for every value, including
existingSecret for credentials and persistence for the bundled Redis.

Sample values: self-contained install

The chart needs only a registry; everything else has working defaults
(bundled Redis included):

# graph-values.yaml
image:
  registry: <your-registry>   # e.g. ghcr.io/<you>

# Optional: keep the graph across Redis restarts. Persistence is off by default
# because the graph is rebuilt from the K8s API and OTel trace spans.
# redis:
#   internal:
#     persistence:
#       enabled: true
#       size: 1Gi

helm upgrade --install graph helm/graph -f graph-values.yaml

Sample values: OTel Collector feeding trace spans

The pipeline consumes trace spans directly (otel-hub ingests raw OTLP); no
spanmetrics connector or metrics pipeline required. Just fan the trace spans your
apps already emit to otel-hub alongside your existing trace backend. Sample values
for the upstream opentelemetry-collector chart:

# otel-collector-values.yaml
mode: deployment

image:
  repository: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib

# Adds the k8sattributes processor (+ RBAC) to every pipeline, so spans carry
# k8s.namespace.name / k8s.pod.name / k8s.container.name. otel-hub needs
# these to attach relationships to the right container.
presets:
  kubernetesAttributes:
    enabled: true

config:
  exporters:
    otlp/graph:
      # <service>.<namespace>: adjust if you installed the chart elsewhere
      endpoint: otel-hub-otlp.default.svc.cluster.local:4317
      tls:
        insecure: true

  service:
    pipelines:
      # Trace spans in (apps send OTLP to this collector) -> fanned out to
      # otel-hub, which assembles traces and publishes them to NATS. graph-spans
      # and graph-flows then derive the relationship graph from span.kind,
      # span.name and the span attributes (http.route; rpc.system / rpc.service /
      # rpc.method for gRPC etc.; db.system, server.address, peer.service, ...).
      # Add your own trace backend to the exporters list alongside otlp/graph.
      traces:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [otlp/graph]

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
  -f otel-collector-values.yaml

Already exporting traces from a collector? Just add the otlp/graph exporter
to your existing traces pipeline.

The snippet above (plain otlp/graph → the otel-hub-otlp Service) is
correct for a single otel-hub replica. For multiple replicas you need
traceID load-balancing — see Scaling & the bus below.

Scaling & the bus

otel-hub assembles traces in memory, so a trace's spans must all land on the
same otel-hub replica. When running more than one replica (otelHub.replicas),
use a collector loadbalancing exporter keyed by traceID, pointed at the headless
Service the chart ships (otel-hub-otlp-headless, which resolves to individual
pod IPs):

config:
  exporters:
    loadbalancing:
      routing_key: traceID                 # all spans of a trace -> the same pod
      protocol:
        otlp:
          tls: { insecure: true }
      resolver:
        dns:
          hostname: otel-hub-otlp-headless.default.svc.cluster.local
          port: 4317
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [loadbalancing]         # + your own trace backend

Run this as a single gateway collector tier (not a per-node daemonset — each
instance must see whole traces). For a single hub replica (--set otelHub.replicas=1) the plain otlp/graph exporter above is sufficient.
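
The reason traceID routing gives trace affinity can be sketched in a few lines: any deterministic function of the trace ID sends all spans of a trace to the same replica. The example below uses a plain FNV-1a hash modulo the replica count, which is a simplification chosen for this sketch and not the loadbalancing exporter's actual hashing scheme; `ReplicaFor` is a hypothetical name.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ReplicaFor picks a replica index for a trace ID. It uses FNV-1a modulo the
// replica count, a simplification of what a real load balancer would do.
// replicas must be greater than zero.
func ReplicaFor(traceID string, replicas int) int {
	h := fnv.New32a()
	h.Write([]byte(traceID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(replicas))
}

func main() {
	const replicas = 3
	spans := []struct{ TraceID, Name string }{
		{"4bf92f3577b34da6a3ce929d0e0e4736", "GET /cart"},
		{"4bf92f3577b34da6a3ce929d0e0e4736", "oteldemo.CartService/GetCart"},
		{"00f067aa0ba902b7a3ce929d0e0e4736", "GET /checkout"},
	}
	for _, s := range spans {
		fmt.Printf("trace %s span %q -> replica %d\n",
			s.TraceID[:8], s.Name, ReplicaFor(s.TraceID, replicas))
	}
}
```

The first two spans share a trace ID, so they always print the same replica index, whatever that index happens to be.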

graph-spans writes idempotent entity upserts and needs no trace affinity —
scale graphSpans.replicas freely. graph-flows consumes traces.assembled
via a NATS queue group, so whole-trace affinity is automatic; scale
graphFlows.replicas freely as well. Neither downstream binary requires
collector changes when scaled.
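
Why idempotent upserts remove the need for affinity can be shown with a small in-memory stand-in. This is a sketch under the assumption that edges behave like members of a set, as the Redis schema in this README suggests; the `Graph` and `Edge` types are hypothetical and do not mirror the project's writer.

```go
package main

import "fmt"

// Edge is a hypothetical comparable edge value: From --Kind--> To.
type Edge struct{ From, Kind, To string }

// Graph stores edges as a set, so writing the same edge twice is a no-op.
type Graph struct{ edges map[Edge]struct{} }

// NewGraph returns an empty graph.
func NewGraph() *Graph { return &Graph{edges: map[Edge]struct{}{}} }

// Upsert adds the edge if absent; repeating it leaves the graph unchanged.
func (g *Graph) Upsert(e Edge) { g.edges[e] = struct{}{} }

// Len reports the number of distinct edges.
func (g *Graph) Len() int { return len(g.edges) }

func main() {
	g := NewGraph()
	e := Edge{
		From: "container:shop/web-0/app",
		Kind: "CALLS",
		To:   "endpoint:grpc/oteldemo.CartService/GetCart",
	}
	g.Upsert(e) // derived from a span seen by one replica
	g.Upsert(e) // the same relationship derived again by another replica
	fmt.Println("distinct edges:", g.Len())
}
```

Because the end state is the same no matter which replica writes or how often, spans can be spread across replicas in any order.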

Full collector configuration details: helm/graph/README.md.

The graph

Entity kinds: namespace, node, zone, region, deployment, statefulset, daemonset, job, cronjob, rollout, pod, container, hpa, scaledobject (K8s derived); endpoint, topic, database (span derived).

Edge kinds: CONTAINS/RUNS_IN, MANAGES/MANAGED_BY, SCALES/SCALED_BY (K8s derived);
EXPOSES/EXPOSED_BY, CALLS/CALLED_BY, PUBLISHES/PUBLISHED_BY,
CONSUMES/CONSUMED_BY, QUERIES/QUERIED_BY (span derived). Each edge is
directional and is stored together with its counterpart in the opposite
direction (for example CALLS with CALLED_BY). QUERIES edges also carry an
action (the SQL/command).
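
The forward/counterpart pairing can be sketched as a lookup table. The pairs below are taken from the list above; the `Edge` struct and the `WithCounterpart` helper are hypothetical, and how the project actually writes the reverse edge is not shown here.

```go
package main

import "fmt"

// inverse pairs each forward edge kind with its counterpart, as listed in
// this README.
var inverse = map[string]string{
	"CONTAINS":  "RUNS_IN",
	"MANAGES":   "MANAGED_BY",
	"SCALES":    "SCALED_BY",
	"EXPOSES":   "EXPOSED_BY",
	"CALLS":     "CALLED_BY",
	"PUBLISHES": "PUBLISHED_BY",
	"CONSUMES":  "CONSUMED_BY",
	"QUERIES":   "QUERIED_BY",
}

// Edge is a hypothetical directional edge: From --Kind--> To.
type Edge struct{ From, Kind, To string }

// WithCounterpart returns the forward edge followed by its reverse-direction
// counterpart, or an error for an edge kind that has no known inverse.
func WithCounterpart(e Edge) ([]Edge, error) {
	inv, ok := inverse[e.Kind]
	if !ok {
		return nil, fmt.Errorf("unknown edge kind %q", e.Kind)
	}
	return []Edge{e, {From: e.To, Kind: inv, To: e.From}}, nil
}

func main() {
	pair, err := WithCounterpart(Edge{
		From: "container:shop/web-0/app",
		Kind: "CALLS",
		To:   "endpoint:grpc/oteldemo.CartService/GetCart",
	})
	if err != nil {
		panic(err)
	}
	for _, e := range pair {
		fmt.Printf("%s --%s--> %s\n", e.From, e.Kind, e.To)
	}
}
```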

Entity IDs are namespace-qualified where applicable: pod:<ns>/<name>,
container:<ns>/<pod>/<name>, endpoint:<service>/<METHOD>/<route> for HTTP and
endpoint:<rpc.system>/<rpc.service>/<rpc.method> for RPC (host-independent, so a
caller and callee converge — e.g. endpoint:grpc/oteldemo.CartService/GetCart),
database:<system>/<host>[:<port>], topic:<name>.
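
The ID formats above can be sketched as small string builders. The function names are hypothetical and the code only follows the patterns written in this section. The HTTP endpoint form is left out because this README does not spell out how the route is joined into the ID.

```go
package main

import (
	"fmt"
	"strconv"
)

// PodID builds pod:<ns>/<name>.
func PodID(ns, name string) string { return "pod:" + ns + "/" + name }

// ContainerID builds container:<ns>/<pod>/<name>.
func ContainerID(ns, pod, name string) string {
	return "container:" + ns + "/" + pod + "/" + name
}

// RPCEndpointID builds endpoint:<rpc.system>/<rpc.service>/<rpc.method>.
// It contains no host, so caller and callee spans produce the same ID.
func RPCEndpointID(system, service, method string) string {
	return "endpoint:" + system + "/" + service + "/" + method
}

// DatabaseID builds database:<system>/<host>[:<port>]; a port of 0 or less
// is treated as "unknown" and omitted (an assumption of this sketch).
func DatabaseID(system, host string, port int) string {
	id := "database:" + system + "/" + host
	if port > 0 {
		id += ":" + strconv.Itoa(port)
	}
	return id
}

// TopicID builds topic:<name>.
func TopicID(name string) string { return "topic:" + name }

func main() {
	fmt.Println(PodID("shop", "cart-0"))
	fmt.Println(ContainerID("shop", "cart-0", "app"))
	fmt.Println(RPCEndpointID("grpc", "oteldemo.CartService", "GetCart"))
	fmt.Println(DatabaseID("mysql", "10.0.0.0", 3306))
	fmt.Println(TopicID("orders"))
}
```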

Zones & regions. Nodes carrying the well-known topology.kubernetes.io/zone and topology.kubernetes.io/region labels (or their legacy failure-domain.beta.kubernetes.io forms) produce zone and region entities: region CONTAINS zone CONTAINS node. Cross-zone questions such as "which services call auth-service from another zone?" become short graph walks: pod → node → zone on each side of a CALLS edge.

Workloads & autoscalers. Beyond deployments, graph-k8s models statefulset, daemonset, job/cronjob, and Argo rollout (each MANAGES its pods), plus hpa and KEDA scaledobject autoscalers, each of which SCALES a target workload. HPA replica bounds, KEDA triggers/scaling policy, and cron schedules are captured as metadata. The Argo and KEDA resources are CRDs; graph-k8s detects their absence and skips them, so it also runs on clusters where they are not installed.

Redis schema (prefix configurable, default `graph`)

<prefix>:entity:<id>            HASH  id, kind, name, last_seen_at_ms
<prefix>:entity:<id>:metadata   HASH  arbitrary string key/values
<prefix>:entity:<id>:edges      SET   JSON-encoded Edge objects
<prefix>:by_kind:<kind>         SET   entity IDs of the given kind
<prefix>:ids                    SET   all entity IDs
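
The key layout above can be expressed as a handful of string builders. This is an illustrative sketch: the `Keys` type and its methods are hypothetical names, and the project's own key helpers in internal/graph may be shaped differently.

```go
package main

import "fmt"

// Keys builds Redis key names for a given prefix, following the schema
// listed above. The type itself is a hypothetical helper for this example.
type Keys struct{ Prefix string }

// Entity is the HASH holding id, kind, name, last_seen_at_ms.
func (k Keys) Entity(id string) string { return k.Prefix + ":entity:" + id }

// Metadata is the HASH of arbitrary string key/values for an entity.
func (k Keys) Metadata(id string) string { return k.Entity(id) + ":metadata" }

// Edges is the SET of JSON-encoded Edge objects for an entity.
func (k Keys) Edges(id string) string { return k.Entity(id) + ":edges" }

// ByKind is the SET of entity IDs of one kind.
func (k Keys) ByKind(kind string) string { return k.Prefix + ":by_kind:" + kind }

// IDs is the SET of all entity IDs.
func (k Keys) IDs() string { return k.Prefix + ":ids" }

func main() {
	k := Keys{Prefix: "graph"} // "graph" is the documented default prefix
	id := "pod:shop/cart-0"
	fmt.Println(k.Entity(id))
	fmt.Println(k.Metadata(id))
	fmt.Println(k.Edges(id))
	fmt.Println(k.ByKind("pod"))
	fmt.Println(k.IDs())
}
```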

Querying

REST (graph-read): GET /search, /entities, /entity/{id},
/subgraph/{id}, POST /prune, GET /healthz.
MCP (graph-read mcp): the tools search, get_entity,
list_entities, get_subgraph for LLM clients (Claude Code, Claude
Desktop, …). See cmd/graph-read.
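
Entity IDs contain slashes, which matters when they are placed in a URL path such as /entity/{id}. The sketch below escapes the ID as a single path segment. Whether graph-read expects the ID escaped this way is an assumption of this example, as are the `EntityURL` name and the base URL; check cmd/graph-read for the actual contract.

```go
package main

import (
	"fmt"
	"net/url"
)

// EntityURL builds a URL for GET /entity/{id}. Escaping the ID as one path
// segment (so "/" becomes "%2F") is an assumption, not a documented rule.
func EntityURL(base, id string) string {
	return base + "/entity/" + url.PathEscape(id)
}

func main() {
	// The host and port are placeholders for wherever graph-read is reachable.
	fmt.Println(EntityURL("http://localhost:8080", "pod:shop/cart-0"))
}
```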

Repository layout

cmd/graph-k8s/       K8s watcher binary
cmd/otel-hub/        OTLP ingest + NATS publisher
cmd/graph-spans/     NATS spans.raw consumer (relationship graph)
cmd/graph-flows/     NATS traces.assembled consumer (flows)
cmd/graph-read/      read API + MCP binary
internal/k8swatch/   informer wiring + object->entity mapping + diff/apply
internal/builder/    span record -> relationship entities/edges
internal/graph/      Redis read (RedisGraph) + write (RedisWriter, BatchWriteSet) + keys/schema
internal/api/        HTTP query handlers
internal/mcp/        MCP server + tools
helm/graph/          Helm chart (+ generated README)

Development

go build ./...
go test ./...
go test -race ./internal/graph/ ./internal/k8swatch/

Requires Go 1.25+. Tests use miniredis and the client-go fake clientset, so no
real Redis or cluster is needed.

See CONTRIBUTING.md for how to contribute.

License

Apache License 2.0 — see LICENSE.
