Flagship case study · Python · SAP ecosystem

Autonomous observability triage for SAP HANA workloads

Python pipeline: SQL-grounded features, hybrid rules plus confidence scoring for root cause, allowlist-only remediations, and audit logging; an optional LLM is limited to narrative wording.

Python 3.11+ · SQL / SQLScript · SAP HANA & BTP (target) · pytest · Offline eval harness

Problem

SAP HANA workloads emit rich session, statement, and wait telemetry. In practice, operators still face alert fatigue, slow mean-time-to-diagnose, and governance risk when someone acts on a half-formed theory.

The goal is not an unconstrained auto-remediation bot. It is a governed triage assistant: signals become ranked hypotheses and bounded actions, with explicit approvals where impact is high, rollback where defined, and an append-only audit trail.

Constraints

  • Core detection, ranking, and action eligibility stay deterministic or rule-driven; no LLM as sole authority for root cause or safety.
  • Remediations are allowlist-only (`config/action_allowlist.yaml`); high-risk paths require human approval and rollback hooks.
  • Secrets and HANA connectivity via `.env` / BTP destinations—never committed; synthetic CSV fixtures for local regression.
  • Evaluation is offline and honest: synthetic incidents, rubric-based narrative scoring, no claim of production representativeness without evidence.
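The allowlist-only constraint can be sketched as a deterministic eligibility check. This is a minimal illustration, not the repo's actual code: the action names and the dict standing in for the parsed `config/action_allowlist.yaml` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the parsed contents of
# config/action_allowlist.yaml (action names are illustrative).
ALLOWLIST = {
    "kill_session": {"risk": "high", "requires_approval": True},
    "refresh_statistics": {"risk": "low", "requires_approval": False},
}

@dataclass
class ActionDecision:
    eligible: bool
    requires_approval: bool
    reason: str

def check_action(action_name: str, allowlist: dict) -> ActionDecision:
    """Deterministic eligibility: anything off the allowlist is rejected,
    no model output can override this decision."""
    entry = allowlist.get(action_name)
    if entry is None:
        return ActionDecision(False, False, "not in allowlist")
    return ActionDecision(True, entry["requires_approval"],
                          f"allowlisted ({entry['risk']} risk)")
```

The key property is that rejection is the default: an action absent from the config can never become eligible, regardless of how confident the ranking stage is.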

End-to-end pipeline

The pipeline mirrors the package layout in the triage repo: ingest through audit, with safety checks and human approval on the critical path before narrative generation and logging.

Figure: End-to-end triage pipeline, from telemetry sources through ingest, store, detect, rank, impact, plan actions, safety, narrative, and audit.
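The stage-after-stage flow above can be sketched as an ordered chain of callables, with each stage's name appended to the audit trail as it runs. This is a toy illustration under assumed names; the real `ingest`/`detector`/`reasoner` modules are not reproduced here.

```python
from typing import Callable

# A stage takes the incident state and returns the updated state.
Stage = Callable[[dict], dict]

def run_pipeline(incident: dict, stages: list[tuple[str, Stage]]) -> dict:
    """Run stages in order, recording each stage name for the audit trail."""
    for name, stage in stages:
        incident = stage(incident)
        incident.setdefault("audit", []).append(name)
    return incident

# Toy stages standing in for the real detect/rank steps.
def detect(inc: dict) -> dict:
    return {**inc, "alerts": ["high_wait_time"]}

def rank(inc: dict) -> dict:
    return {**inc, "hypotheses": ["lock contention"]}

result = run_pipeline({"incident_id": "INC-1"},
                      [("detect", detect), ("rank", rank)])
```

Keeping the stage order explicit in one place is what lets safety and approval sit on the critical path before narrative and logging.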

Key technical decisions

  • Package boundaries mirror the pipeline: `ingest`, `detector`, `reasoner`, `impact`, `actions`, `safety`, `reporter`, `audit`—so the story matches the code layout.
  • Hybrid rules + confidence for `rank` and `impact` (Phase 6) instead of LLM-first diagnosis; reduces brittle “model guessed the cause” failure modes.
  • Template-first narrative (`what / why / so what / now what`); optional model only polishes wording, not facts or eligibility.
  • Eval harness (`scripts/eval_run.py`) forces LLM explainer off by default for reproducible CI-style gates on triage quality and safety metrics.
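The template-first narrative decision can be illustrated with a plain format string: the four sections and the version footer come from a fixed template, and only field values flow in from deterministic stages. Section headings and the version string here are illustrative, not the repo's actual template.

```python
# Illustrative template: an optional LLM may later rephrase the prose,
# but it never alters the field values or the section structure.
NARRATIVE_TEMPLATE = """\
## What
{what}

## Why
{why}

## So what
{so_what}

## Now what
{now_what}

_template v{version}_"""

def render_narrative(facts: dict, version: str = "1.0") -> str:
    """Render the four-section narrative from deterministic facts."""
    return NARRATIVE_TEMPLATE.format(version=version, **facts)
```

Because the template owns the structure, a completeness rubric can check for required sections and the version footer mechanically, which is exactly what the narrative-completeness metric below relies on.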

Security & governance

Allowlist violations target zero in aggregate metrics. High-risk actions hit an approval gate; rollback manager exercises stubbed paths under eval. Audit logs correlate decisions to `incident_id`.
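The approval gate and the `incident_id`-correlated audit line can be sketched together. Field names and the JSON-lines shape are assumptions for illustration, not the repo's actual audit schema.

```python
import json
import time

def audit_event(incident_id: str, decision: str, detail: str) -> str:
    """One append-only audit line, correlated by incident_id."""
    return json.dumps({"ts": time.time(), "incident_id": incident_id,
                       "decision": decision, "detail": detail})

def gate(incident_id: str, risk: str, approved: bool) -> tuple[bool, str]:
    """High-risk actions execute only with explicit human approval;
    every decision, blocked or allowed, emits an audit line."""
    if risk == "high" and not approved:
        return False, audit_event(incident_id, "blocked", "awaiting approval")
    return True, audit_event(incident_id, "allowed", f"risk={risk}")
```

Emitting the audit line from the same function that makes the decision keeps the two from drifting apart: there is no code path that acts without logging.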

The risk of a hallucinated narrative being misread as ground truth is mitigated by template-first output, by labeling model-assisted sections when the LLM is enabled, and by keeping ranked causes tied to SQL feature evidence.

When HANA reads are stale or unreachable, the pipeline blocks or degrades auto-execute, surfaces errors to operators, and records the failure in audit (see the architecture failure-mode table).

Evaluation & metrics

Seven offline metrics run against fixed CSV fixtures and regression YAML (see `docs/evaluation-metrics.md`). Baseline-vs-v2 profile comparisons (e.g. alert dedup) ship to `reports/eval-baseline-vs-v2.md`.

  • MTTD proxy

    Symptom onset → first alert (minutes); lower is better; report-only target.

  • Precision@k (root cause)

    Expected hypothesis in top-k; k=1 and k=3 reported.

  • False alert rate

    On negative cases, fraction with alerts; target 0 on fixture-defined negatives.

  • Allowlist violation rate

    Any proposed action not in allowlist; target 0.

  • Narrative completeness

    Rubric over required markdown sections + template version footer; target 1.0 on synthetic run.

  • Rollback success rate

    RollbackManager returns expected status for proposed actions.

  • Human approval latency (simulated)

    Mean simulated seconds from regression YAML; report-only.
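Of the metrics above, precision@k is the most mechanical to pin down: a hypothesis list either contains the fixture's expected root cause in the top k or it does not, and the reported number is the mean over incidents. A minimal sketch, with hypothetical case shapes:

```python
def precision_at_k(expected: str, ranked: list[str], k: int) -> float:
    """1.0 if the expected root-cause hypothesis appears in the
    top-k ranked hypotheses, else 0.0."""
    return 1.0 if expected in ranked[:k] else 0.0

def mean_precision_at_k(cases: list[tuple[str, list[str]]], k: int) -> float:
    """Average over fixture incidents; each case pairs the expected
    cause with the pipeline's ranked hypothesis list."""
    return sum(precision_at_k(exp, ranked, k) for exp, ranked in cases) / len(cases)
```

Reporting both k=1 and k=3, as the metric list does, separates "nailed it outright" from "surfaced it among the top candidates", which matters when an operator reviews the ranked list rather than trusting the first entry.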

What I'd change next

Tighten the loop between detector YAML thresholds and fixture-defined alert rules so eval stories and dashboard thresholds never diverge silently.

Add a second eval mode that replays production-shaped traces once we have consenting staging data—keeping synthetic gates for regression, adding realism checks separately.

Wire optional SAP AI Core / Cloud SDK AI only behind explicit policy objects and destination binding, with cost and latency caps per incident class.

Repository

Public source: triage pipeline, docs, and offline eval harness.

GitHub