Context Runtime

An efficiency optimizer for a fleet of apps — a database query planner for LLM
context. The application says "I need an answer"; the runtime decides what the model
sees — what to retrieve, compress, route, and verify — emits an inspectable,
replayable plan, and learns from the outcome. It does for AI context what query
planners did for SQL. See POSITIONING.md for the thesis.

It optimizes any app with (a) a decision point about what context/config to use and
(b) a measurable outcome. Eleven tenants are built and green (each number is the
learned-vs-baseline reward its offline examples/<tenant>.py prints):

Tenant	Context Runtime tunes	Result
sidekick	which skills to recall · budget	drop-in for `SkillStore`; 67% acceptance vs 33% for the naive baseline
redevops-rag	`pool · limit · threshold · rerank …` per query	`ContextRuntimeRetrieverTuner`; 0.773 vs 0.323 for the fixed default
edge-sentinel (SOC)	which sources to pull per alert (CrowdSec · threat-intel · EDR)	tool-using + approval-gated; 0.900 vs 0.800 for the always-full baseline
growth-engine	which attribution window + source bundle per lead-source query	7.851 vs 5.282 for the fixed window
control-tower	which Metabase query set per "ask anything" question	5.326 vs 1.643 for the core query set
agentic-billing	which usage/invoice/dunning signals to pull per account	4.122 vs 2.442 for full-stack
social-autopilot	which channel/timing/content strategy per goal	3.875 vs 0.773 for the fixed strategy
agentic-support	which KB/tickets/account context to retrieve per ticket	3.679 vs 2.394 for full-context
agentic-books	which ledgers/reports to pull per books question	3.632 vs 2.430 for full-books
market-radar	which competitor watches to sweep per intel question	3.611 vs 0.403 for full-sweep
agentic-compliance	which rule-family evidence to pull per finding	3.562 vs 2.463 for full-evidence

PYTHONPATH=. python examples/sidekick_learning.py   # discrete-strategy bandit
PYTHONPATH=. python examples/rag_tuning.py          # numeric-knob tuning
PYTHONPATH=. python examples/soc_triage.py          # tool-using cybersecurity tenant
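
To make "learned vs baseline" concrete, here is a minimal epsilon-greedy bandit over discrete strategies. It is a sketch of the general technique only, not the code in examples/sidekick_learning.py. The strategy names, reward probabilities, and the `run_bandit` helper are all hypothetical.

```python
import random

def run_bandit(strategies, reward_fn, rounds=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy: explore a random strategy with probability epsilon,
    otherwise exploit the strategy with the best running mean reward."""
    rng = random.Random(seed)
    counts = {s: 0 for s in strategies}
    means = {s: 0.0 for s in strategies}
    total = 0.0
    for _ in range(rounds):
        if rng.random() < epsilon:
            choice = rng.choice(strategies)
        else:
            choice = max(strategies, key=lambda s: means[s])
        reward = reward_fn(choice, rng)
        counts[choice] += 1
        # Incremental update of the running mean for the chosen strategy.
        means[choice] += (reward - means[choice]) / counts[choice]
        total += reward
    return means, counts, total / rounds

# Made-up acceptance probabilities, for illustration only.
TRUE_ACCEPTANCE = {"recall_all": 0.2, "recall_recent": 0.5, "recall_by_tag": 0.8}

def simulated_outcome(strategy, rng):
    """Return 1.0 if the (simulated) user accepted the result, else 0.0."""
    return 1.0 if rng.random() < TRUE_ACCEPTANCE[strategy] else 0.0

means, counts, learned_avg = run_bandit(list(TRUE_ACCEPTANCE), simulated_outcome)
best = max(means, key=means.get)
print(f"best strategy: {best}, learned average reward: {learned_avg:.3f}")
```

The baseline in this toy setup would be always choosing one fixed strategy; the learned policy should approach the reward of the best strategy after enough rounds.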

Also included: the ToolPlugin seam (context_runtime/tools/), which is how plans reach
external systems with an approval-gated audit trail, and trace exporters
(context_runtime/observability/exporters.py): JSONL offline, or Langfuse /
OpenLLMetry-OTel when the extras are installed.
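
As a rough illustration of what "approval-gated with an audit trail" can mean, the sketch below runs a tool only when an approval callback agrees and appends one JSON line per decision. It does not use the real ToolPlugin interface. `gated_call`, `block_host`, and the approval policy are hypothetical names chosen for this example.

```python
import io
import json
import time

def gated_call(tool_name, args, tool_fn, approve, audit):
    """Run tool_fn(**args) only if approve(tool_name, args) is truthy.
    Every decision, approved or not, is written to `audit` as one JSON line."""
    approved = bool(approve(tool_name, args))
    result = tool_fn(**args) if approved else None
    record = {
        "ts": time.time(),
        "tool": tool_name,
        "args": args,
        "approved": approved,
        "executed": approved,
    }
    audit.write(json.dumps(record) + "\n")
    return result

def block_host(host):
    """Stand-in for a side-effecting action against an external system."""
    return f"blocked {host}"

def approve_internal_only(tool_name, args):
    """Hypothetical policy: only approve actions on 10.x.x.x hosts."""
    return args.get("host", "").startswith("10.")

audit_log = io.StringIO()  # a real exporter might append to a JSONL file instead
ok = gated_call("block_host", {"host": "10.0.0.7"}, block_host, approve_internal_only, audit_log)
denied = gated_call("block_host", {"host": "8.8.8.8"}, block_host, approve_internal_only, audit_log)
entries = [json.loads(line) for line in audit_log.getvalue().splitlines()]
print(ok, denied, len(entries))
```

Writing the record for denied calls as well as approved ones is the point of the design: the trail shows what the plan wanted to do, not only what it did.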

Status: v0.1 vertical slice. Runs fully offline with stub plugins; the real
redevops-rag retrieval and LiteLLM model bindings are wired and lazy-imported.
See SPEC.md §10 for the conformance checklist the tests assert against.

Install

pip install -e .                 # core (offline stub path, zero heavy deps)
pip install -e ".[litellm]"      # real models across 100+ providers
pip install -e ".[rag]"          # redevops-rag — single-hop hybrid retrieval
pip install -e ".[hipporag]"     # HippoRAG — multi-hop graph retrieval (the planner picks per query)

Single-hop vs multi-hop is a per-query decision. The planner classifies intent and
routes: BM25/hybrid (redevops-rag) when the answer is in one chunk, graph (HippoRAG)
when it lives in the connections between documents — and the cost model only pays the
graph premium when it's warranted. python examples/hop_routing.py shows single-hop
missing the bridge document that multi-hop surfaces.
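
The routing decision described above can be sketched as a tiny classifier plus a budget check. This is an illustration under stated assumptions, not the planner's actual logic: the cue phrases, the `Route` type, and the per-query cost figures are invented for the example.

```python
from dataclasses import dataclass

# Hypothetical phrases suggesting the answer spans several documents.
MULTI_HOP_CUES = ("connection between", "related to", "which led to", "in common", "depends on")

@dataclass(frozen=True)
class Route:
    retriever: str
    est_cost_usd: float  # made-up per-query estimate

SINGLE_HOP = Route("redevops-rag", 0.001)
MULTI_HOP = Route("hipporag", 0.010)

def classify_hops(query: str) -> str:
    """Return 'multi' if the query contains a multi-hop cue, else 'single'."""
    q = query.lower()
    return "multi" if any(cue in q for cue in MULTI_HOP_CUES) else "single"

def route(query: str, max_cost_usd: float) -> Route:
    """Pay the graph premium only when the query looks multi-hop and the budget allows it."""
    if classify_hops(query) == "multi" and MULTI_HOP.est_cost_usd <= max_cost_usd:
        return MULTI_HOP
    return SINGLE_HOP

print(route("What is our incident process?", 0.05).retriever)
print(route("What is the connection between deploy X and the outage?", 0.05).retriever)
print(route("What is the connection between deploy X and the outage?", 0.005).retriever)
```

The third call shows the cost constraint winning: the query looks multi-hop, but the budget is below the assumed graph cost, so the cheaper single-hop route is chosen.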

30-second tour

from context_runtime import ContextRuntime, SourceRef

# `docs` is your document corpus, loaded elsewhere (not shown in this tour)
rt = ContextRuntime.default(docs)          # offline: stub model + in-memory store

# RUN — the core abstraction (plan → build_context → execute → verify)
res = rt.run("Explain why deployment X failed",
             sources=[SourceRef("docs", "docs")],
             constraints={"max_cost_usd": 2.0, "require_citations": True})
print(res.answer, res.cost_usd, res.trace)

# EXPLAIN — debug the plan like SQL (add analyze=True for EXPLAIN ANALYZE)
ex = rt.explain("Explain why deployment X failed")
print(ex.intent.bucket, len(ex.candidates), ex.chosen.score.total)

# SIMULATE — forecast cost/latency/tokens with confidence intervals, no execution
sim = rt.simulate("Explain why deployment X failed")
print(sim.expected_cost_usd, sim.expected_models, sim.based_on_samples)

Or from the CLI / config:

PYTHONPATH=. python examples/incident_review.py
context-runtime --corpus ./docs run "what's our incident process?"
context-runtime --config context_runtime.yaml explain --analyze "why did deploy X fail?"
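
The SIMULATE step in the tour forecasts cost from past samples. One simple way to produce such a forecast is a mean with a normal-approximation confidence interval, sketched below. This is an assumption about the general approach, not the runtime's implementation. The `forecast` helper and the sample values are hypothetical, although the output keys echo the `expected_cost_usd` and `based_on_samples` fields printed in the tour.

```python
import math
import statistics

def forecast(samples, z=1.96):
    """Mean of past observations with an approximate 95% confidence interval.
    With fewer than two samples the interval collapses to the mean."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one past sample to forecast")
    mean = statistics.fmean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(n) if n > 1 else 0.0
    return {
        "expected_cost_usd": mean,
        "low": mean - half_width,
        "high": mean + half_width,
        "based_on_samples": n,
    }

# Made-up costs (USD) of earlier runs of a similar plan.
past_costs = [0.010, 0.012, 0.011, 0.013]
sim = forecast(past_costs)
print(sim)
```

The interval narrows as `based_on_samples` grows, which is why reporting the sample count alongside the estimate is useful.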

What's implemented (v0.1)

Seam (SPEC)	v0.1 implementation	Real binding (lazy)
Planner trio (intent/candidate/optimizer)	rule-table intent → candidate gen → heuristic cost model	— (the genuinely new core)
Cost model + statistics	`PlanScore` weighted utility + `pg_statistic`-style calibration	learned/neural (v0.3+)
Optimizer	knapsack / greedy-by-utility over the feasible set	OR-Tools CP-SAT (v0.2)
Execution Graph IR	linear graph carrying branch/loop/rollback kinds	full shapes (v0.4)
Scheduler	topo-sort waves	Dagster / cost-aware (v2)
Reasoner	`SingleShotReasoner` (one model)	mixtures: plan-worker-critic (v0.3+)
Model plugin	offline `StubModel`	LiteLLM + native cost-tiered routing
Retriever/Store	`InMemoryStore` (keyword)	redevops-rag (DuckDB+BM25+RRF+rerank)
Compression	sidekick `clip` structural pack	LLMLingua-2 semantic (v0.1 optional)
Verifier	citation/grounding check	RAGAS / Instructor
Observability	in-process `Trace` + JSON	OpenLLMetry → Langfuse
Plan Cache	null/always-miss stub	semantic cache (v0.2)
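
For readers unfamiliar with the Optimizer row, "greedy-by-utility over the feasible set" can be sketched as follows: score each candidate with a weighted utility, rank by utility per token, and take items while the token and cost budgets hold. This is a minimal sketch of the general technique, not the project's `PlanScore` or optimizer. The `Candidate` fields, the weights, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    relevance: float  # 0..1, higher is better
    tokens: int       # must be > 0
    cost_usd: float

def utility(c, w_relevance=1.0, w_cost=10.0):
    """Weighted utility: reward relevance, penalize cost (weights are assumptions)."""
    return w_relevance * c.relevance - w_cost * c.cost_usd

def greedy_select(candidates, token_budget, max_cost_usd):
    """Greedy knapsack heuristic: best utility-per-token first, skipping anything
    that would break either budget or that has non-positive utility."""
    ranked = sorted(candidates, key=lambda c: utility(c) / c.tokens, reverse=True)
    chosen, tokens, cost = [], 0, 0.0
    for c in ranked:
        if utility(c) <= 0:
            continue
        if tokens + c.tokens <= token_budget and cost + c.cost_usd <= max_cost_usd:
            chosen.append(c)
            tokens += c.tokens
            cost += c.cost_usd
    return chosen

candidates = [
    Candidate("runbook", 0.9, 400, 0.004),
    Candidate("postmortem", 0.8, 600, 0.006),
    Candidate("full_logs", 0.6, 3000, 0.030),
    Candidate("changelog", 0.3, 200, 0.002),
]
picked = greedy_select(candidates, token_budget=1500, max_cost_usd=0.02)
print([c.name for c in picked])
```

Greedy selection is fast but not guaranteed optimal, which is presumably why the table lists an exact solver (OR-Tools CP-SAT) as the later binding.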

Architecture

The decision layer is thin; the substrate is reused. See:

ARCHITECTURE.md — the layered design and the cost-based optimizer loop
SPEC.md — the normative interface contracts (six plugin seams, IR, trace, plan-cache key)
ROADMAP.md — v0.1 → v2 phasing with per-phase exit benchmarks

Test

pip install -e ".[dev]" && pytest      # 18 tests; test_conformance.py == SPEC §10