Operations Runbook¶

Logging & metrics - Use Instrumentation to pass a logger (LoggerSink) and metrics (MetricsSink). - Async runner emits events: plan.built, retrieve.batch.ms, gate.results, judge.score_batch.ms, aggregate.overall, plus error counters. - Add your own sinks to forward to stdout/JSON logs or metrics backends (Datadog, Prometheus).

Retries and budgets - LLM: wrap providers with BudgetedProvider (prompt/output caps) and RetryingProvider (limited retries, backoff). - Retrieval/judge errors: async runner logs and re-raises; wrap at call site for circuit-breaker/rate-limit if needed.

Resource limits - Configure planner total_k and judge chunk/text limits via settings; profile-specific configs tighten gating/aggregation.

Tracing - Use TraceBuilder to collect a run; export DOT for audit.

Deployment tips - Keep optional deps isolated (llm/qdrant/chroma/cloud). Only install what you use. - Externalize credentials (env/secret manager); do not log PII/secrets. Add scrubbing in your sinks.