
Server modes

Operating mode is a per-server toggle that controls how much of the agent pipeline is allowed to run when an incident opens on that server. It exists so operators can onboard a new host without immediately handing the agent permission to act, and so the platform can give them an explicit, audited path from observation to autonomy.

Three modes ship in Stage A of the server-modes plan: live, shadow, and audit. The default is live — no behaviour change for any server that exists today.


What each mode does

| Mode | Triage | Diagnose | Propose recipe | Execute | Net effect |
|------|--------|----------|----------------|---------|------------|
| live | yes | yes | yes | gated by trust × risk | current behaviour, unchanged |
| shadow | yes | yes | yes | always awaiting_approval regardless of trust × risk | the agent does the thinking, the operator owns every action |
| audit | yes | no | no | no | classification + evidence only; nothing is proposed and nothing executes |

live

The mode every existing server starts in. The pipeline runs end to end. Whether a recipe runs without human approval is decided by should_request_approval(trust_level, risk_level) in swarm/guardrails.py — the same trust × risk gate that has always governed execution.
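
As a rough illustration, the gate reduces to a lookup over the two dimensions. This is a minimal sketch, not the shipped guardrails.py: the level names and the decision matrix below are assumptions.

```python
# Hypothetical sketch of the trust × risk gate. The level names and the
# decision matrix are illustrative assumptions, not the shipped values.
TRUST_LEVELS = ("supervised", "trusted", "autonomous")
RISK_LEVELS = ("low", "medium", "high")

def should_request_approval(trust_level: str, risk_level: str) -> bool:
    """Return True when a human must approve before the recipe runs."""
    # Higher trust tolerates higher risk; everything else waits for a human.
    auto_runnable = {
        "supervised": (),                 # always ask
        "trusted": ("low",),              # auto-run low-risk only
        "autonomous": ("low", "medium"),  # ask only for high-risk
    }
    return risk_level not in auto_runnable[trust_level]
```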

shadow

The full agent pipeline runs through diagnose and recipe proposal, but every recipe stops at awaiting_approval. The trust × risk gate is bypassed in the strict direction: even a low-risk recipe on an autonomous-trust agent waits for a human. After diagnose completes the manager flips the incident status to awaiting_approval so the timeline reads correctly even if no recipe gets proposed.

This is the recommended setting for any server during its first weeks under OpenRemedy. The operator gets to see what the agent would do without the agent doing it.

audit

The pipeline stops after triage. No diagnose round, no recipe proposal, no execution. The incident is closed immediately with:

  • status = 'resolved'
  • resolution = 'Auto-classified (audit mode) — no remediation proposed.'
  • resolution_summary = 'audit-mode classification'

The resolved status is reused rather than introducing a dedicated classified status — that keeps the incident state machine, the SLA timer logic, and the existing frontend filters working unchanged. A dedicated bucket may be added later if it proves useful for reporting.

The audit close also publishes an incident_resolved websocket event and writes an audit_resolved agent event titled "Audit mode — incident classified, no remediation proposed".
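
Pieced together from the fields above, the audit close can be pictured roughly as follows. The field values and event names are from this page; the publish_event and add_agent_event helpers are illustrative assumptions.

```python
# Hypothetical shape of the audit-mode close. The field values and event
# names are from this page; the helper functions are assumptions.
def close_incident_audit_mode(incident, session, publish_event, add_agent_event):
    incident.status = "resolved"
    incident.resolution = "Auto-classified (audit mode) — no remediation proposed."
    incident.resolution_summary = "audit-mode classification"
    session.commit()

    # Keep dashboards in sync and leave a marker on the timeline.
    publish_event("incident_resolved", incident_id=incident.id)
    add_agent_event(
        incident_id=incident.id,
        kind="audit_resolved",
        title="Audit mode — incident classified, no remediation proposed",
    )
```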

audit is the right setting for a server you want classified and trended but where you want zero AI involvement beyond that — for instance during a regulated change-freeze window, or for a host whose owners haven't signed off on agent action yet.


Promotion path

The intended progression is:

audit ──► shadow ──► live
  • audit first if the host's owners haven't signed off on AI remediation yet. The platform still classifies and trends the incidents — the operator gets visibility without commitment.
  • shadow once the owners are comfortable with classification and want to see the agent's recommendations. Every recommendation is reviewed by a human before it runs.
  • live once enough shadow approvals have accumulated that the operator trusts the pairing. Trust × risk takes over.

There is no enforced ordering — an operator can flip a server between any two modes at any time. The progression above is the intended onboarding shape, not a gate.

Stage D of the server-modes plan adds an automated promotion suggestion engine driven by accumulated (server, recipe) approval counts. Stage A only ships the manual control.


Setting the mode

From the dashboard

  1. Open the server detail page (/servers/{id}).
  2. Switch to the Settings tab.
  3. Choose the new mode from the Mode dropdown. Each option carries a one-line description of what that mode does.
  4. The change is persisted on selection; a toast confirms.

The server list (/servers) renders a coloured badge in each row when the mode is not live: amber for shadow, sky for audit.

From the API

PATCH /api/v1/servers/{id}
Content-Type: application/json

{ "mode": "shadow" }

mode accepts "live", "shadow", or "audit". Any other value is rejected with 422 Unprocessable Entity.
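
The 422 falls out naturally if the request model is backed by the ServerMode enum referenced in enums.py. A sketch; the request-model name is an assumption.

```python
# Sketch of the validation behind the 422. ServerMode mirrors the enum
# referenced in enums.py; the request-model name is an assumption.
from enum import Enum
from pydantic import BaseModel

class ServerMode(str, Enum):
    live = "live"
    shadow = "shadow"
    audit = "audit"

class ServerPatch(BaseModel):
    # Any value outside the enum fails validation, which FastAPI
    # surfaces as 422 Unprocessable Entity.
    mode: ServerMode | None = None
```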

Audit log

Every mode change writes an audit row with action server.mode_changed and payload {old, new}. The actor is the authenticated user (or the impersonator's id if the change came from a superadmin acting as another tenant). The row appears in /audit like any other audit event.


Interaction with the trust × risk gate

The mode does not replace the trust × risk gate — it sits in front of it and short-circuits when relevant.

| Server mode | What the approval gate does |
|-------------|-----------------------------|
| live | Unchanged. should_request_approval(trust, risk) decides. |
| shadow | Forced to "request approval" regardless of trust or risk. The trust × risk computation is skipped. |
| audit | Never reached. The pipeline stops before recipe proposal. |

In guardrails.py the early return for shadow always returns True; for audit the function returns a sentinel that signals "no execution stages should run" and the manager treats that as a trigger to stop the pipeline.
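
In outline, the early branches might look like the sketch below. The NO_EXECUTION sentinel name and the function shape are assumptions; only the shadow and audit behaviour is documented.

```python
# Sketch of the mode short-circuit in front of the trust × risk matrix.
# NO_EXECUTION is an illustrative sentinel name, not the shipped one.
NO_EXECUTION = object()  # manager reads this as "run no execution stages"

def _trust_risk_matrix(trust_level: str, risk_level: str) -> bool:
    # Stand-in for the unchanged live-mode decision (see the earlier sketch).
    return not (trust_level == "autonomous" and risk_level == "low")

def should_request_approval(trust_level, risk_level, server_mode="live"):
    if server_mode == "shadow":
        return True                    # every recipe waits for a human
    if server_mode == "audit":
        return NO_EXECUTION            # manager stops the pipeline here
    return _trust_risk_matrix(trust_level, risk_level)
```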

The builder.py agent assembly is the second line of defence: in audit mode propose_recipe and execute_recipe are excluded from the agent's ROLE_TOOLS, so even a misbehaving prompt cannot work around the manager-level skip.
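
A sketch of that second line of defence, assuming a dict-shaped ROLE_TOOLS registry. The structure and role name are illustrative; the two excluded tool names are the real ones.

```python
# Sketch of the audit-mode tool filter in the agent assembly. The
# ROLE_TOOLS structure and role name are assumptions; the two excluded
# tool names are the real ones.
ROLE_TOOLS = {
    "remediator": ["triage", "diagnose", "propose_recipe", "execute_recipe"],
}

EXECUTION_TOOLS = {"propose_recipe", "execute_recipe"}

def tools_for(role: str, server_mode: str) -> list[str]:
    tools = ROLE_TOOLS[role]
    if server_mode == "audit":
        # Belt and braces: a prompt that ignores the manager-level skip
        # still cannot call tools that were never registered.
        tools = [t for t in tools if t not in EXECUTION_TOOLS]
    return tools
```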


Maintenance lockout — adjacent control, not a safety gate

The five-gate safety story (server mode, trust × risk, approval, tool filter, safety classifier) is about preventing unsafe action. Maintenance lockout is a sixth independent layer with a different purpose: operator coordination during planned work. It belongs in this doc because it can suppress automatic remediation in the same way an audit mode would, but the reason is "a human is doing planned work right now," not "this server isn't trusted enough to auto-remediate."

When is_server_in_maintenance(server_id) returns a schedule, the daemon evidence path in api/daemon.py records the resulting incident with suppressed_by_maintenance_id set and skips the Redis publish, so the swarm never sees it. The pipeline never starts. The schedule's lifecycle (approved → running → completed | cancelled | failed | paused) controls when this gate is open or shut.
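
The suppression branch can be sketched as below. is_server_in_maintenance, suppressed_by_maintenance_id, and the skipped publish are from this page; the surrounding function and helper shapes are assumptions.

```python
# Sketch of the maintenance-lockout branch in the daemon evidence path.
# is_server_in_maintenance and suppressed_by_maintenance_id are from this
# page; the surrounding function and helpers are assumptions.
def record_incident(evidence, session, redis, create_incident,
                    is_server_in_maintenance):
    incident = create_incident(session, evidence)
    schedule = is_server_in_maintenance(evidence.server_id)
    if schedule is not None:
        # Planned work in progress: persist the incident but skip the
        # publish, so the swarm never sees it and the pipeline never starts.
        incident.suppressed_by_maintenance_id = schedule.id
        session.commit()
        return incident
    session.commit()
    redis.publish("incident.created", str(incident.id))
    return incident
```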

This layer is independent from the safety gates by construction:

  • It reads from a different DB table (maintenance_schedules / maintenance_runs), not from servers.mode or agents.trust_level.
  • It fires for a different reason (planned operator work), not in response to evidence.
  • Its failure mode is "schedule misconfigured → remediation blocked" — fail-closed in the same direction as the safety gates, but for a different category of input.

When a maintenance schedule transitions to a non-active terminal state (cancelled, completed, failed) — or to paused — the service layer (services/maintenance.py:un_suppress_incidents_for_schedule) clears suppressed_by_maintenance_id on every active incident that pointed at it and re-publishes incident.created, so the swarm picks up anything that accumulated during the window. This self-healing path is what stops a stuck schedule from leaving incidents permanently orphaned.
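
A sketch of that unsuppress pass, assuming a SQLAlchemy-style Incident model. The model and the TERMINAL_STATUSES set are illustrative assumptions.

```python
# Sketch of the self-healing unsuppress pass
# (services/maintenance.py:un_suppress_incidents_for_schedule). The
# Incident model and TERMINAL_STATUSES are illustrative assumptions.
TERMINAL_STATUSES = ("resolved", "closed")

def un_suppress_incidents_for_schedule(schedule_id, session, redis):
    # Incident is the assumed ORM model, imported from the models module
    # in the real code.
    active = (
        session.query(Incident)
        .filter(
            Incident.suppressed_by_maintenance_id == schedule_id,
            Incident.status.notin_(TERMINAL_STATUSES),  # active incidents only
        )
        .all()
    )
    for incident in active:
        incident.suppressed_by_maintenance_id = None
        # Re-publish so the swarm picks up whatever accumulated
        # during the maintenance window.
        redis.publish("incident.created", str(incident.id))
    session.commit()
```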

The whitepaper's "five independent gates" framing intentionally omits maintenance lockout because it is operator-coordination, not safety. Both descriptions are accurate; they describe different slices of the same defence-in-depth posture.


Fleet Map badges

The Fleet Map (/incidents map view) reads mode off the FleetTile payload. Tiles render a small mode pill in the tile head, right of the role label:

  • shadow → amber pill (amber-100 / amber-800 / amber-200).
  • audit → sky pill (sky-100 / sky-800 / sky-200).
  • live → no pill. The absence is the signal.

Mode is purely informational on the Map — it does not feed into the worst-of health colour. A server in shadow is not unhealthy on its own.

The pill updates live: a PATCH that changes only the mode still fires publish_tile_changed post-commit, so the Map redraws without needing a manual refresh.


Where the mode is sourced from

The mode applied to an incident is server.mode at the moment the incident is created. The pipeline reads it once at the start and threads it through the manager context as server_mode.

In Stage A there is no per-incident snapshot column. Two consequences:

  1. An incident already in flight when an operator flips the mode continues with the mode it started under.
  2. Incidents created after the flip use the new mode.

Stage B of the plan adds an incidents.mode_at_open column so historical incidents stay bucketed by their original mode even if the server is later promoted. Until Stage B lands, "what mode was this incident under" can only be inferred from the timeline.


Observed behaviour by mode

| Stage | live | shadow | audit |
|-------|------|--------|-------|
| Triage | runs | runs | runs |
| Diagnose | runs | runs | skipped |
| Recipe proposal | runs | runs | skipped |
| Execute | gated by trust × risk | always awaiting_approval | skipped |
| Review | runs after execute | runs after operator-approved execute | skipped |
| Final status | normal pipeline outcome | awaiting_approval (or whatever the operator chooses next) | resolved with resolution_summary='audit-mode classification' |
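
stages_for_mode in the manager encodes the table above. A minimal sketch; the stage identifiers are assumptions.

```python
# Sketch of the per-mode stage list (manager.py's stages_for_mode).
# The stage identifiers are illustrative assumptions.
def stages_for_mode(server_mode: str) -> list[str]:
    if server_mode == "audit":
        return ["triage"]  # classify, then close immediately
    # live and shadow run the full pipeline; shadow differs only in that
    # every recipe parks at awaiting_approval (see the guardrails sketch).
    return ["triage", "diagnose", "propose_recipe", "execute", "review"]
```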

Edge cases

  • Maintenance windows on a shadow server. Maintenance plans bypass shadow gating — they are an explicit operator action with their own approval gate.
  • Patrol against an audit server. Agent patrols still run read-only diagnostic verbs and may open incidents. The pipeline then stops after triage because the server is in audit mode, and the incident is classified and resolved exactly like any other audit-mode incident.
  • In-flight incidents during a mode change. They finish with the mode they started under. The change applies to incidents created after the flip.
  • Mode revert during shadow approval wait. If the operator flips a server out of shadow while an incident is parked at awaiting_approval, the incident stays at awaiting_approval until a human acts. Nothing kicks it back into the pipeline automatically.

Reading the shadow stats

Stage B adds a per-server mode stats card to the server detail page, scoped to a rolling window (30 days by default, configurable). It is the dashboard view operators use to decide when a (server, recipe) pair has earned promotion out of shadow.

What each counter means

  • Proposed. Recipes the agent recommended while the server was in shadow. Counted once per recipe proposal, not per recipe execution.
  • Approved. Of those proposals, the count that an operator approved (the recipe ran). The headline ratio is approved / proposed.
  • Rejected. Proposals an operator explicitly rejected, or that expired without being acted on within the approval window.
  • Avg time to approve. Despite the label, this is the median wall-clock time between the agent emitting awaiting_approval and an operator clicking approve. Useful for spotting recipes that consistently get approved fast (low operational cost) vs. ones that stall.
  • Audit observed. Incidents the server saw while in audit mode — classified but never advanced to diagnose. A high audit-observed count alongside zero shadow proposals means the server is generating signal but the operator hasn't yet allowed the agent to think about it.
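
The proposal-derived counters reduce to a small aggregation. A sketch, assuming proposal rows that carry status, proposed_at, and approved_at fields; all names here are assumptions, and the audit-observed counter comes from incident rows counted separately.

```python
# Sketch of the proposal-derived counters, assuming proposal rows with
# status, proposed_at, and approved_at fields (all names are assumptions).
from statistics import median

def shadow_stats(proposals, window_start):
    rows = [p for p in proposals if p.proposed_at >= window_start]
    approved = [p for p in rows if p.status == "approved"]
    rejected = [p for p in rows if p.status in ("rejected", "expired")]
    waits = [(p.approved_at - p.proposed_at).total_seconds() for p in approved]
    return {
        "proposed": len(rows),
        "approved": len(approved),
        "rejected": len(rejected),
        "approval_ratio": len(approved) / len(rows) if rows else None,
        "median_time_to_approve_s": median(waits) if waits else None,
    }
```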

The 30-day window

The card aggregates over the last 30 days by default. The window is configurable on the endpoint (?window_days=N) so an operator investigating a long-running pairing can widen it. Counters outside the window age out, which is intentional — promotion should reflect recent behaviour, not behaviour from before the last prompt or recipe revision.

Promotion heuristic

The rule of thumb operators should follow:

Once the approved / proposed ratio is steady at 90%+ over a meaningful sample, the (server, recipe) pair is a candidate for promotion.

"Meaningful sample" is judgement — a single approval is not a trend. In practice, look for at least a dozen proposals of the same recipe on the same server before treating the ratio as load-bearing. Stage D will automate this suggestion; Stage B just surfaces the inputs.

Historical incidents stay in their original bucket

Stage B introduces incidents.mode_at_open, snapshotted at incident creation time. When the operator promotes a server from shadow to live, the incidents that opened while it was in shadow keep mode_at_open = 'shadow' forever. The incident list filter (?mode_at_open=shadow) and the mode stats card both key off this column, so historical bucketing survives mode flips. The live server.mode column reflects only the current setting; mode_at_open is the audit trail.

Tenant default mode

Stage B also adds a tenant-level default applied to newly-onboarded servers. Find it under /settings → tenant settings; pick live, shadow, or audit. New servers inherit this on creation (existing servers are unaffected). The intended use is to set the default to shadow for tenants whose onboarding policy is "observe before acting," so operators don't have to remember to flip every new host manually.


Per-execution dry-run

Stage C adds a per-execution preview flag, distinct from server mode. A dry-run execution runs Ansible with --check, so tasks report what would change without touching the host. No files are written, no services restarted, no packages installed.

Enabling it

  • UI — on the approval screen for a pending execution, tick Run as preview. The execution detail page renders a banner ("Preview run — no host changes") for the duration.
  • API — POST /executions/{id}/approve with body {"dry_run": true}. The flag is persisted on the execution row (executions.dry_run) and read by the worker.
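
On the worker side the flag plausibly translates into one extra Ansible argument. A sketch; the command assembly is an assumption, while --check itself is Ansible's built-in no-changes flag.

```python
# Sketch of how the worker might honour executions.dry_run. The command
# assembly is an assumption; --check is Ansible's built-in
# "report what would change, change nothing" flag.
import subprocess

def run_recipe(playbook: str, inventory: str, dry_run: bool) -> int:
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if dry_run:
        cmd.append("--check")  # no files written, no services restarted
    return subprocess.run(cmd).returncode
```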

preview_completed status

A successful preview run terminates in preview_completed, not success. They are deliberately different: success means the recipe ran for real and the host was changed; preview_completed means the recipe was rehearsed and nothing happened. Filtering and metrics that count "applied changes" should ignore preview_completed. Failures during a preview still go to failed — the dry-run flag does not mask errors, only side effects.
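
The status choice is a two-way branch on the flag and the outcome; a sketch:

```python
# Sketch of the terminal-status choice described above.
def terminal_status(returncode: int, dry_run: bool) -> str:
    if returncode != 0:
        return "failed"  # dry-run masks side effects, not errors
    return "preview_completed" if dry_run else "success"
```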

Interaction with server modes

Dry-run is orthogonal to mode. It works in any of live, shadow, or audit:

  • live — the natural use case: sanity-check a risky recipe on a production host before letting it apply.
  • shadow — operators already see the proposal; ticking preview lets them rehearse the would-be command stream before promoting the server to live.
  • audit — execute steps don't surface in audit mode in the first place (the pipeline closes after triage), so dry-run is rarely reachable here.

Rollback

Rollback is not applicable to preview executions — there is nothing to undo. The Rollback button is hidden on executions where dry_run = true, regardless of recipe.


Promotion ladder (Stage D)

Stage D automates the shadow → live step by accumulating (server, recipe) decisions and surfacing suggestions when a pairing has earned its way out of shadow.

How a suggestion is generated

Every approval, rejection, auto-execute, and post-run failure for a recipe writes a row into recipe_outcomes. A scanner aggregates those rows per (server, recipe) over a rolling window and materialises a promotion_suggestions row when the pairing clears the threshold.

Default thresholds:

  • >= 10 approvals in the window
  • 0 rejections in the window (a single rejection disqualifies)
  • 30-day rolling window
  • Safety floors: threshold cannot drop below 5; window cannot drop below 7 days

The scanner runs hourly inside the proactive container (PromotionScanner, alongside CheckScheduler, CheckEvaluator, IncidentWatcher, and MaintenanceScheduler). Re-scans are idempotent — pending or accepted suggestions for the same pair are not duplicated. Once a suggestion is dismissed, the scanner is free to re-suggest the same pair on the next cycle if the metrics still hold.
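
The per-pair decision can be sketched as below, using the documented default thresholds; has_open_suggestion and create_suggestion are illustrative stand-ins for the real persistence helpers.

```python
# Sketch of the scanner's per-pair decision, using the documented default
# thresholds. has_open_suggestion and create_suggestion are illustrative
# stand-ins for the real persistence helpers.
MIN_APPROVALS = 10   # default; the hard floor is 5
WINDOW_DAYS = 30     # default; the hard floor is 7 days

def eligible(approved_count: int, rejected_count: int) -> bool:
    # A single rejection in the window disqualifies the pair outright.
    return approved_count >= MIN_APPROVALS and rejected_count == 0

def scan_pair(session, server_id, recipe_slug, counts):
    if not eligible(counts["approved_count"], counts["rejected_count"]):
        return
    if has_open_suggestion(session, server_id, recipe_slug):
        return  # idempotent: pending/accepted suggestions are not duplicated
    create_suggestion(session, server_id, recipe_slug, metric_snapshot=counts)
```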

Accepting or dismissing a suggestion

Suggestions appear on the server detail page Overview tab as a card listing each candidate recipe. Operators can:

  • Accept — flips the server from shadow to live (via the same set_mode helper used by the manual dropdown) and stamps the suggestion as accepted.
  • Dismiss — marks the suggestion dismissed without changing the server mode. The pair becomes eligible again on the next scan.

Both actions require operator role; listing requires viewer. Endpoints:

  • GET /api/v1/servers/{id}/promotions
  • POST /api/v1/promotions/{id}/accept
  • POST /api/v1/promotions/{id}/dismiss

Audit trail

An accept writes two audit rows:

  1. server.mode_changed — emitted by set_mode ({old, new}).
  2. server.promoted — emitted by the accept endpoint, with {from_mode, to_mode, recipe_slug, suggestion_id, metric_snapshot}.

The two-row pattern means the generic mode-flip filter still catches every shadow → live transition, while server.promoted lets reporting distinguish operator-initiated promotions (with the recipe context) from manual dropdown flips.

The metric snapshot

Each suggestion stores the counters that triggered it as metric_snapshot JSONB:

  • approved_count — approvals in the window (the headline number).
  • rejected_count — rejections (always 0 at suggestion time).
  • auto_executed_count — recipes that ran without operator review because the server was already in live and trust × risk allowed it. Counted for context — does not factor into the threshold.
  • failed_count — post-run failures of the recipe in the window. Surfaced for operator awareness; does not block the suggestion.

The snapshot is frozen at scanner time, so the audit row reflects the exact metrics the operator saw when they accepted.
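
An illustrative snapshot; the values below are made up, not real data.

```json
{
  "approved_count": 12,
  "rejected_count": 0,
  "auto_executed_count": 3,
  "failed_count": 1
}
```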

Out of scope

  • Tenant-configurable thresholds. Defaults are hard-coded today; per-tenant overrides are a future iteration.
  • Auto-promote. Promotion is operator-gated by design. The scanner never flips a server's mode on its own — it only materialises suggestions for a human to act on. This is the same principle as the trust × risk gate on live: the platform proposes, the operator decides.


See also

  • docs/dashboard/servers.md — the Settings tab where the dropdown lives.
  • backend/src/openremedy/enums.py — ServerMode enum.
  • backend/src/openremedy/swarm/guardrails.py — should_request_approval early branches.
  • backend/src/openremedy/swarm/manager.py — stages_for_mode and the audit close.
  • backend/src/openremedy/services/server.py — set_mode helper, the single source of truth for mode flips and audit emission.
  • backend/src/openremedy/services/promotion.py — promotion candidate query + materialisation.
  • backend/src/openremedy/proactive/promotion_scanner.py — the hourly scanner loop.
  • backend/src/openremedy/api/promotion.py — list/accept/dismiss endpoints.