
Server modes

Operating mode is a per-server toggle that controls how much of the agent pipeline is allowed to run when an incident opens on that server. It exists so operators can onboard a new host without immediately handing the agent permission to act, and so the platform can give them an explicit, audited path from observation to autonomy.

Three modes ship in Stage A of the server-modes plan: live, shadow, and audit. The default is live — no behaviour change for any server that exists today.


What each mode does

| Mode | Triage | Diagnose | Propose recipe | Execute | Net effect |
|------|--------|----------|----------------|---------|------------|
| live | yes | yes | yes | gated by trust × risk | current behaviour, unchanged |
| shadow | yes | yes | yes | always awaiting_approval regardless of trust × risk | the agent does the thinking, the operator owns every action |
| audit | yes | no | no | no | classification + evidence only; nothing is proposed and nothing executes |

live

The mode every existing server starts in. The pipeline runs end to end. Whether a recipe runs without human approval is decided by should_request_approval(trust_level, risk_level) in swarm/guardrails.py — the same trust × risk gate that has always governed execution.
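
As a rough illustration, the gate reduces to a lookup over the two dimensions. This is a minimal sketch, not the shipped guardrails.py: the level names and the decision matrix below are assumptions.

```python
# Hypothetical sketch of the trust × risk gate. The level names and the
# decision matrix are illustrative assumptions, not the shipped values.
TRUST_LEVELS = ("supervised", "trusted", "autonomous")
RISK_LEVELS = ("low", "medium", "high")

def should_request_approval(trust_level: str, risk_level: str) -> bool:
    """Return True when a human must approve before the recipe runs."""
    # Higher trust tolerates higher risk; everything else waits for a human.
    auto_runnable = {
        "supervised": (),                 # always ask
        "trusted": ("low",),              # auto-run low-risk only
        "autonomous": ("low", "medium"),  # ask only for high-risk
    }
    return risk_level not in auto_runnable[trust_level]
```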

shadow

The full agent pipeline runs through diagnose and recipe proposal, but every recipe stops at awaiting_approval. The trust × risk gate is bypassed in the strict direction: even a low-risk recipe on an autonomous-trust agent waits for a human. After diagnose completes the manager flips the incident status to awaiting_approval so the timeline reads correctly even if no recipe gets proposed.

This is the recommended setting for any server during its first weeks under OpenRemedy. The operator gets to see what the agent would do without the agent doing it.

audit

The pipeline stops after triage. No diagnose round, no recipe proposal, no execution. The incident is closed immediately with:

  • status = 'resolved'
  • resolution = 'Auto-classified (audit mode) — no remediation proposed.'
  • resolution_summary = 'audit-mode classification'

The resolved status is reused rather than introducing a dedicated classified status — that keeps the incident state machine, the SLA timer logic, and the existing frontend filters working unchanged. A dedicated bucket may be added later if it proves useful for reporting.

The audit close also publishes an incident_resolved websocket event and writes an audit_resolved agent event titled "Audit mode — incident classified, no remediation proposed".
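
Pieced together from the fields above, the audit close can be pictured roughly as follows. The field values and event names are from this page; the publish_event and add_agent_event helpers are illustrative assumptions.

```python
# Hypothetical shape of the audit-mode close. The field values and event
# names are from this page; the helper functions are assumptions.
def close_incident_audit_mode(incident, session, publish_event, add_agent_event):
    incident.status = "resolved"
    incident.resolution = "Auto-classified (audit mode) — no remediation proposed."
    incident.resolution_summary = "audit-mode classification"
    session.commit()

    # Keep dashboards in sync and leave a marker on the timeline.
    publish_event("incident_resolved", incident_id=incident.id)
    add_agent_event(
        incident_id=incident.id,
        kind="audit_resolved",
        title="Audit mode — incident classified, no remediation proposed",
    )
```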

audit is the right setting for a server you want classified and trended but where you want zero AI involvement beyond that — for instance during a regulated change-freeze window, or for a host whose owners haven't signed off on agent action yet.


Promotion path

The intended progression is:

audit ──► shadow ──► live
  • audit first if the host's owners haven't signed off on AI remediation yet. The platform still classifies and trends the incidents — the operator gets visibility without commitment.
  • shadow once the owners are comfortable with classification and want to see the agent's recommendations. Every recommendation is reviewed by a human before it runs.
  • live once enough shadow approvals have accumulated that the operator trusts the pairing. Trust × risk takes over.

There is no enforced ordering — an operator can flip a server between any two modes at any time. The progression above is the intended onboarding shape, not a gate.

Stage D of the server-modes plan adds an automated promotion suggestion engine driven by accumulated (server, recipe) approval counts. Stage A only ships the manual control.


Setting the mode

From the dashboard

  1. Open the server detail page (/servers/{id}).
  2. Switch to the Settings tab.
  3. Choose the new mode from the Mode dropdown. Each option carries a one-line description of what that mode does.
  4. The change is persisted on selection; a toast confirms.

The server list (/servers) renders a coloured badge in each row when the mode is not live: amber for shadow, sky for audit.

From the API

PATCH /api/v1/servers/{id}
Content-Type: application/json

{ "mode": "shadow" }

mode accepts "live", "shadow", or "audit". Any other value is rejected with 422 Unprocessable Entity.
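
The 422 falls out naturally if the request model is backed by the ServerMode enum referenced in enums.py. A sketch; the request-model name is an assumption.

```python
# Sketch of the validation behind the 422. ServerMode mirrors the enum
# referenced in enums.py; the request-model name is an assumption.
from enum import Enum
from pydantic import BaseModel

class ServerMode(str, Enum):
    live = "live"
    shadow = "shadow"
    audit = "audit"

class ServerPatch(BaseModel):
    # Any value outside the enum fails validation, which FastAPI
    # surfaces as 422 Unprocessable Entity.
    mode: ServerMode | None = None
```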

Audit log

Every mode change writes an audit row with action server.mode_changed and payload {old, new}. The actor is the authenticated user (or the impersonator's id if the change came from a superadmin acting as another tenant). The row appears in /audit like any other audit event.


Interaction with the trust × risk gate

The mode does not replace the trust × risk gate — it sits in front of it and short-circuits when relevant.

| Server mode | What the approval gate does |
|-------------|-----------------------------|
| live | Unchanged. should_request_approval(trust, risk) decides. |
| shadow | Forced to "request approval" regardless of trust or risk. The trust × risk computation is skipped. |
| audit | Never reached. The pipeline stops before recipe proposal. |

In guardrails.py the early return for shadow always returns True; for audit the function returns a sentinel that signals "no execution stages should run" and the manager treats that as a trigger to stop the pipeline.
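
In outline, the early branches might look like the sketch below. The NO_EXECUTION sentinel name and the function shape are assumptions; only the shadow and audit behaviour is documented.

```python
# Sketch of the mode short-circuit in front of the trust × risk matrix.
# NO_EXECUTION is an illustrative sentinel name, not the shipped one.
NO_EXECUTION = object()  # manager reads this as "run no execution stages"

def _trust_risk_matrix(trust_level: str, risk_level: str) -> bool:
    # Stand-in for the unchanged live-mode decision (see the earlier sketch).
    return not (trust_level == "autonomous" and risk_level == "low")

def should_request_approval(trust_level, risk_level, server_mode="live"):
    if server_mode == "shadow":
        return True                    # every recipe waits for a human
    if server_mode == "audit":
        return NO_EXECUTION            # manager stops the pipeline here
    return _trust_risk_matrix(trust_level, risk_level)
```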

The builder.py agent assembly is the second line of defence: in audit mode propose_recipe and execute_recipe are excluded from the agent's ROLE_TOOLS, so even a misbehaving prompt cannot work around the manager-level skip.
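
A sketch of that second line of defence, assuming a dict-shaped ROLE_TOOLS registry. The structure and role name are illustrative; the two excluded tool names are the real ones.

```python
# Sketch of the audit-mode tool filter in the agent assembly. The
# ROLE_TOOLS structure and role name are assumptions; the two excluded
# tool names are the real ones.
ROLE_TOOLS = {
    "remediator": ["triage", "diagnose", "propose_recipe", "execute_recipe"],
}

EXECUTION_TOOLS = {"propose_recipe", "execute_recipe"}

def tools_for(role: str, server_mode: str) -> list[str]:
    tools = ROLE_TOOLS[role]
    if server_mode == "audit":
        # Belt and braces: a prompt that ignores the manager-level skip
        # still cannot call tools that were never registered.
        tools = [t for t in tools if t not in EXECUTION_TOOLS]
    return tools
```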


Maintenance lockout — adjacent control, not a safety gate

The five-gate safety story (server mode, trust × risk, approval, tool filter, safety classifier) is about preventing unsafe action. Maintenance lockout is a sixth independent layer with a different purpose: operator coordination during planned work. It belongs in this doc because it can suppress automatic remediation in the same way an audit mode would, but the reason is "a human is doing planned work right now," not "this server isn't trusted enough to auto-remediate."

When is_server_in_maintenance(server_id) returns a schedule, the daemon evidence path in api/daemon.py records the resulting incident with suppressed_by_maintenance_id set and skips the Redis publish, so the swarm never sees it. The pipeline never starts. The schedule's lifecycle (approved → running → completed | cancelled | failed | paused) controls when this gate is open or shut.
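
The suppression branch can be sketched as below. is_server_in_maintenance, suppressed_by_maintenance_id, and the skipped publish are from this page; the surrounding function and helper shapes are assumptions.

```python
# Sketch of the maintenance-lockout branch in the daemon evidence path.
# is_server_in_maintenance and suppressed_by_maintenance_id are from this
# page; the surrounding function and helpers are assumptions.
def record_incident(evidence, session, redis, create_incident,
                    is_server_in_maintenance):
    incident = create_incident(session, evidence)
    schedule = is_server_in_maintenance(evidence.server_id)
    if schedule is not None:
        # Planned work in progress: persist the incident but skip the
        # publish, so the swarm never sees it and the pipeline never starts.
        incident.suppressed_by_maintenance_id = schedule.id
        session.commit()
        return incident
    session.commit()
    redis.publish("incident.created", str(incident.id))
    return incident
```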

This layer is independent from the safety gates by construction:

  • It reads from a different DB table (maintenance_schedules / maintenance_runs), not from servers.mode or agents.trust_level.
  • It fires for a different reason (planned operator work), not in response to evidence.
  • Its failure mode is "schedule misconfigured → remediation blocked" — fail-closed in the same direction as the safety gates, but for a different category of input.

When a maintenance schedule transitions to a non-active terminal state (cancelled, completed, failed) — or to paused — the service layer (services/maintenance.py:un_suppress_incidents_for_schedule) clears suppressed_by_maintenance_id on every active incident that pointed at it and re-publishes incident.created, so the swarm picks up anything that accumulated during the window. This self-healing path is what stops a stuck schedule from leaving incidents permanently orphaned.
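
A sketch of that unsuppress pass, assuming a SQLAlchemy-style Incident model. The model and the TERMINAL_STATUSES set are illustrative assumptions.

```python
# Sketch of the self-healing unsuppress pass
# (services/maintenance.py:un_suppress_incidents_for_schedule). The
# Incident model and TERMINAL_STATUSES are illustrative assumptions.
TERMINAL_STATUSES = ("resolved", "closed")

def un_suppress_incidents_for_schedule(schedule_id, session, redis):
    # Incident is the assumed ORM model, imported from the models module
    # in the real code.
    active = (
        session.query(Incident)
        .filter(
            Incident.suppressed_by_maintenance_id == schedule_id,
            Incident.status.notin_(TERMINAL_STATUSES),  # active incidents only
        )
        .all()
    )
    for incident in active:
        incident.suppressed_by_maintenance_id = None
        # Re-publish so the swarm picks up whatever accumulated
        # during the maintenance window.
        redis.publish("incident.created", str(incident.id))
    session.commit()
```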

The whitepaper's "five independent gates" framing intentionally omits maintenance lockout because it is operator-coordination, not safety. Both descriptions are accurate; they describe different slices of the same defence-in-depth posture.


Fleet Map badges

The Fleet Map (/incidents map view) reads mode off the FleetTile payload. Tiles render a small mode pill in the tile head, right of the role label:

  • shadow → amber pill (amber-100 / amber-800 / amber-200).
  • audit → sky pill (sky-100 / sky-800 / sky-200).
  • live → no pill. The absence is the signal.

Mode is purely informational on the Map — it does not feed into the worst-of health colour. A server in shadow is not unhealthy on its own.

The pill updates live: a PATCH that changes only the mode still fires publish_tile_changed post-commit, so the Map redraws without needing a manual refresh.


Where the mode is sourced from

The mode applied to an incident is server.mode at the moment the incident is created. The pipeline reads it once at the start and threads it through the manager context as server_mode.

In Stage A there is no per-incident snapshot column. Two consequences:

  1. An incident already in flight when an operator flips the mode continues with the mode it started under.
  2. Incidents created after the flip use the new mode.

Stage B of the plan adds an incidents.mode_at_open column so historical incidents stay bucketed by their original mode even if the server is later promoted. Until Stage B lands, "what mode was this incident under" can only be inferred from the timeline.


Observed behaviour by mode

| Stage | live | shadow | audit |
|-------|------|--------|-------|
| Triage | runs | runs | runs |
| Diagnose | runs | runs | skipped |
| Recipe proposal | runs | runs | skipped |
| Execute | gated by trust × risk | always awaiting_approval | skipped |
| Review | runs after execute | runs after operator-approved execute | skipped |
| Final status | normal pipeline outcome | awaiting_approval (or whatever the operator chooses next) | resolved with resolution_summary='audit-mode classification' |
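
stages_for_mode in the manager encodes the table above. A minimal sketch; the stage identifiers are assumptions.

```python
# Sketch of the per-mode stage list (manager.py's stages_for_mode).
# The stage identifiers are illustrative assumptions.
def stages_for_mode(server_mode: str) -> list[str]:
    if server_mode == "audit":
        return ["triage"]  # classify, then close immediately
    # live and shadow run the full pipeline; shadow differs only in that
    # every recipe parks at awaiting_approval (see the guardrails sketch).
    return ["triage", "diagnose", "propose_recipe", "execute", "review"]
```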

Edge cases

  • Maintenance windows on a shadow server. Maintenance plans bypass shadow gating — they are an explicit operator action with their own approval gate.
  • Patrol against an audit server. Agent patrols still run read-only diagnostic verbs and may open incidents. The pipeline then stops after triage because the server is in audit mode, and the incident is classified and resolved exactly like any other audit-mode incident.
  • In-flight incidents during a mode change. They finish with the mode they started under. The change applies to incidents created after the flip.
  • Mode revert during shadow approval wait. If the operator flips a server out of shadow while an incident is parked at awaiting_approval, the incident stays at awaiting_approval until a human acts. Nothing kicks it back into the pipeline automatically.

Reading the shadow stats

Stage B adds a per-server mode stats card to the server detail page, scoped to a rolling window (30 days by default, configurable). It is the dashboard view operators use to decide when a (server, recipe) pair has earned promotion out of shadow.

What each counter means

  • Proposed. Recipes the agent recommended while the server was in shadow. Counted once per recipe proposal, not per recipe execution.
  • Approved. Of those proposals, the count that an operator approved (the recipe ran). The headline ratio is approved / proposed.
  • Rejected. Proposals an operator explicitly rejected, or that expired without being acted on within the approval window.
  • Avg time to approve. Despite the label, this is the median wall-clock time between the agent emitting awaiting_approval and an operator clicking approve. Useful for spotting recipes that consistently get approved fast (low operational cost) vs. ones that stall.
  • Audit observed. Incidents the server saw while in audit mode — classified but never advanced to diagnose. A high audit-observed count alongside zero shadow proposals means the server is generating signal but the operator hasn't yet allowed the agent to think about it.
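
The proposal-derived counters reduce to a small aggregation. A sketch, assuming proposal rows that carry status, proposed_at, and approved_at fields; all names here are assumptions, and the audit-observed counter comes from incident rows counted separately.

```python
# Sketch of the proposal-derived counters, assuming proposal rows with
# status, proposed_at, and approved_at fields (all names are assumptions).
from statistics import median

def shadow_stats(proposals, window_start):
    rows = [p for p in proposals if p.proposed_at >= window_start]
    approved = [p for p in rows if p.status == "approved"]
    rejected = [p for p in rows if p.status in ("rejected", "expired")]
    waits = [(p.approved_at - p.proposed_at).total_seconds() for p in approved]
    return {
        "proposed": len(rows),
        "approved": len(approved),
        "rejected": len(rejected),
        "approval_ratio": len(approved) / len(rows) if rows else None,
        "median_time_to_approve_s": median(waits) if waits else None,
    }
```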

The 30-day window

The card aggregates over the last 30 days by default. The window is configurable on the endpoint (?window_days=N) so an operator investigating a long-running pairing can widen it. Counters outside the window age out, which is intentional — promotion should reflect recent behaviour, not behaviour from before the last prompt or recipe revision.

Promotion heuristic

The rule of thumb operators should follow:

Once the approved / proposed ratio is steady at 90%+ over a meaningful sample, the (server, recipe) pair is a candidate for promotion.

"Meaningful sample" is judgement — a single approval is not a trend. In practice, look for at least a dozen proposals of the same recipe on the same server before treating the ratio as load-bearing. Stage D will automate this suggestion; Stage B just surfaces the inputs.

Historical incidents stay in their original bucket

Stage B introduces incidents.mode_at_open, snapshotted at incident creation time. When the operator promotes a server from shadow to live, the incidents that opened while it was in shadow keep mode_at_open = 'shadow' forever. The incident list filter (?mode_at_open=shadow) and the mode stats card both key off this column, so historical bucketing survives mode flips. The live server.mode column reflects only the current setting; mode_at_open is the audit trail.

Tenant default mode

Stage B also adds a tenant-level default applied to newly-onboarded servers. Find it under /settings → tenant settings; pick live, shadow, or audit. New servers inherit this on creation (existing servers are unaffected). The intended use is to set the default to shadow for tenants whose onboarding policy is "observe before acting," so operators don't have to remember to flip every new host manually.


Per-execution dry-run

Stage C adds a per-execution preview flag, distinct from server mode. A dry-run execution runs Ansible with --check, so tasks report what would change without touching the host. No files are written, no services restarted, no packages installed.

Enabling it

  • UI — on the approval screen for a pending execution, tick Run as preview. The execution detail page renders a banner ("Preview run — no host changes") for the duration.
  • API — POST /executions/{id}/approve with body {"dry_run": true}. The flag is persisted on the execution row (executions.dry_run) and read by the worker.
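
On the worker side the flag plausibly translates into one extra Ansible argument. A sketch; the command assembly is an assumption, while --check itself is Ansible's built-in no-changes flag.

```python
# Sketch of how the worker might honour executions.dry_run. The command
# assembly is an assumption; --check is Ansible's built-in
# "report what would change, change nothing" flag.
import subprocess

def run_recipe(playbook: str, inventory: str, dry_run: bool) -> int:
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if dry_run:
        cmd.append("--check")  # no files written, no services restarted
    return subprocess.run(cmd).returncode
```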

preview_completed status

A successful preview run terminates in preview_completed, not success. They are deliberately different: success means the recipe ran for real and the host was changed; preview_completed means the recipe was rehearsed and nothing happened. Filtering and metrics that count "applied changes" should ignore preview_completed. Failures during a preview still go to failed — the dry-run flag does not mask errors, only side effects.
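
The status choice is a two-way branch on the flag and the outcome; a sketch:

```python
# Sketch of the terminal-status choice described above.
def terminal_status(returncode: int, dry_run: bool) -> str:
    if returncode != 0:
        return "failed"  # dry-run masks side effects, not errors
    return "preview_completed" if dry_run else "success"
```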

Interaction with server modes

Dry-run is orthogonal to mode. It works in any of live, shadow, or audit:

  • live — the natural use case: sanity-check a risky recipe on a production host before letting it apply.
  • shadow — operators already see the proposal; ticking preview lets them rehearse the would-be command stream before promoting the server to live.
  • audit — execute steps don't surface in audit mode in the first place (the pipeline closes after triage), so dry-run is rarely reachable here.

Rollback

Rollback is not applicable to preview executions — there is nothing to undo. The Rollback button is hidden on executions where dry_run = true, regardless of recipe.


Promotion ladder (Stage D)

Stage D automates the shadow → live step by accumulating (server, recipe) decisions and surfacing suggestions when a pairing has earned its way out of shadow.

How a suggestion is generated

Every approval, rejection, auto-execute, and post-run failure for a recipe writes a row into recipe_outcomes. A scanner aggregates those rows per (server, recipe) over a rolling window and materialises a promotion_suggestions row when the pairing clears the threshold.

Default thresholds:

  • >= 10 approvals in the window
  • 0 rejections in the window (a single rejection disqualifies)
  • 30-day rolling window
  • Safety floors: threshold cannot drop below 5; window cannot drop below 7 days

The scanner runs hourly inside the proactive container (PromotionScanner, alongside CheckScheduler, CheckEvaluator, IncidentWatcher, and MaintenanceScheduler). Re-scans are idempotent — pending or accepted suggestions for the same pair are not duplicated. Once a suggestion is dismissed, the scanner is free to re-suggest the same pair on the next cycle if the metrics still hold.
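
The per-pair decision can be sketched as below, using the documented default thresholds; has_open_suggestion and create_suggestion are illustrative stand-ins for the real persistence helpers.

```python
# Sketch of the scanner's per-pair decision, using the documented default
# thresholds. has_open_suggestion and create_suggestion are illustrative
# stand-ins for the real persistence helpers.
MIN_APPROVALS = 10   # default; the hard floor is 5
WINDOW_DAYS = 30     # default; the hard floor is 7 days

def eligible(approved_count: int, rejected_count: int) -> bool:
    # A single rejection in the window disqualifies the pair outright.
    return approved_count >= MIN_APPROVALS and rejected_count == 0

def scan_pair(session, server_id, recipe_slug, counts):
    if not eligible(counts["approved_count"], counts["rejected_count"]):
        return
    if has_open_suggestion(session, server_id, recipe_slug):
        return  # idempotent: pending/accepted suggestions are not duplicated
    create_suggestion(session, server_id, recipe_slug, metric_snapshot=counts)
```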

Accepting or dismissing a suggestion

Suggestions appear on the server detail page Overview tab as a card listing each candidate recipe. Operators can:

  • Accept — flips the server from shadow to live (via the same set_mode helper used by the manual dropdown) and stamps the suggestion as accepted.
  • Dismiss — marks the suggestion dismissed without changing the server mode. The pair becomes eligible again on the next scan.

Both actions require operator role; listing requires viewer. Endpoints:

  • GET /api/v1/servers/{id}/promotions
  • POST /api/v1/promotions/{id}/accept
  • POST /api/v1/promotions/{id}/dismiss

Audit trail

An accept writes two audit rows:

  1. server.mode_changed — emitted by set_mode ({old, new}).
  2. server.promoted — emitted by the accept endpoint, with {from_mode, to_mode, recipe_slug, suggestion_id, metric_snapshot}.

The two-row pattern means the generic mode-flip filter still catches every shadow → live transition, while server.promoted lets reporting distinguish operator-initiated promotions (with the recipe context) from manual dropdown flips.

The metric snapshot

Each suggestion stores the counters that triggered it as metric_snapshot JSONB:

  • approved_count — approvals in the window (the headline number).
  • rejected_count — rejections (always 0 at suggestion time).
  • auto_executed_count — recipes that ran without operator review because the server was already in live and trust × risk allowed it. Counted for context — does not factor into the threshold.
  • failed_count — post-run failures of the recipe in the window. Surfaced for operator awareness; does not block the suggestion.

The snapshot is frozen at scanner time, so the audit row reflects the exact metrics the operator saw when they accepted.
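
An illustrative snapshot; the values below are made up, not real data.

```json
{
  "approved_count": 12,
  "rejected_count": 0,
  "auto_executed_count": 3,
  "failed_count": 1
}
```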

Out of scope

  • Tenant-configurable thresholds. Defaults are hard-coded today; per-tenant overrides are a future iteration.
  • Auto-promote. Promotion is operator-gated by design. The scanner never flips a server's mode on its own — it only materialises suggestions for a human to act on. This is the same principle as the trust × risk gate on live: the platform proposes, the operator decides.


See also

  • docs/dashboard/servers.md — the Settings tab where the dropdown lives.
  • backend/src/openremedy/enums.py — ServerMode enum.
  • backend/src/openremedy/swarm/guardrails.py — should_request_approval early branches.
  • backend/src/openremedy/swarm/manager.py — stages_for_mode and the audit close.
  • backend/src/openremedy/services/server.py — set_mode helper, the single source of truth for mode flips and audit emission.
  • backend/src/openremedy/services/promotion.py — promotion candidate query + materialisation.
  • backend/src/openremedy/proactive/promotion_scanner.py — the hourly scanner loop.
  • backend/src/openremedy/api/promotion.py — list/accept/dismiss endpoints.