Server modes¶
Operating mode is a per-server toggle that controls how much of the agent pipeline is allowed to run when an incident opens on that server. It exists so operators can onboard a new host without immediately handing the agent permission to act, and so the platform can give them an explicit, audited path from observation to autonomy.
Three modes ship in Stage A of the server-modes plan: live,
shadow, and audit. The default is live — no behaviour change
for any server that exists today.
What each mode does¶
| Mode | Triage | Diagnose | Propose recipe | Execute | Net effect |
|---|---|---|---|---|---|
| live | yes | yes | yes | gated by trust × risk | current behaviour, unchanged |
| shadow | yes | yes | yes | always awaiting_approval regardless of trust × risk | the agent does the thinking, the operator owns every action |
| audit | yes | — | — | — | classification + evidence only; nothing is proposed and nothing executes |
live¶
The mode every existing server starts in. The pipeline runs end to
end. Whether a recipe runs without human approval is decided by
should_request_approval(trust_level, risk_level) in
swarm/guardrails.py — the same trust × risk gate that has always
governed execution.
shadow¶
The full agent pipeline runs through diagnose and recipe proposal,
but every recipe stops at awaiting_approval. The trust × risk
gate is bypassed in the strict direction: even a low-risk recipe
on an autonomous-trust agent waits for a human. After diagnose
completes the manager flips the incident status to
awaiting_approval so the timeline reads correctly even if no
recipe gets proposed.
This is the recommended setting for any server during its first weeks under OpenRemedy. The operator gets to see what the agent would do without the agent doing it.
audit¶
The pipeline stops after triage. No diagnose round, no recipe proposal, no execution. The incident is closed immediately with:
- status = 'resolved'
- resolution = 'Auto-classified (audit mode) — no remediation proposed.'
- resolution_summary = 'audit-mode classification'
The resolved status is reused rather than introducing a classified status
of its own — that keeps the incident state machine, the SLA timer
logic, and the existing frontend filters working unchanged. A
dedicated bucket may be added later if it proves useful for
reporting.
The audit close also publishes an incident_resolved websocket
event and writes an audit_resolved agent event titled
"Audit mode — incident classified, no remediation proposed".
audit is the right setting for a server you want classified and
trended but where you want zero AI involvement beyond that — for
instance during a regulated change-freeze window, or for a host
whose owners haven't signed off on agent action yet.
Promotion path¶
The intended progression is:
1. audit first if the host's owners haven't signed off on AI remediation yet. The platform still classifies and trends the incidents — the operator gets visibility without commitment.
2. shadow once the owners are comfortable with classification and want to see the agent's recommendations. Every recommendation is reviewed by a human before it runs.
3. live once enough shadow approvals have accumulated that the operator trusts the pairing. Trust × risk takes over.
There is no enforced ordering — an operator can flip a server between any two modes at any time. The progression above is the intended onboarding shape, not a gate.
Stage D of the server-modes plan adds an automated promotion
suggestion engine driven by accumulated (server, recipe)
approval counts. Stage A only ships the manual control.
Setting the mode¶
From the dashboard¶
- Open the server detail page (/servers/{id}).
- Switch to the Settings tab.
- Choose the new mode from the Mode dropdown. Each option carries a one-line description of what that mode does.
- The change is persisted on selection; a toast confirms.
The server list (/servers) renders a coloured badge in each row
when the mode is not live: amber for shadow, sky for audit.
From the API¶
The mode field accepts "live", "shadow", or "audit"; any other value
is rejected with 422 Unprocessable Entity.
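A hedged illustration of the request shape (the exact route is not spelled out in this doc; a PATCH on the server resource under /api/v1/servers/{id} is an assumption):

```python
import requests

# Assumed route: PATCH on the server resource. Adjust host, path, and auth to your deployment.
resp = requests.patch(
    "https://openremedy.example.com/api/v1/servers/42",
    headers={"Authorization": "Bearer <token>"},
    json={"mode": "shadow"},   # must be "live", "shadow", or "audit"
)
resp.raise_for_status()        # any other mode value comes back as 422 Unprocessable Entity
```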
Audit log¶
Every mode change writes an audit row with action
server.mode_changed and payload {old, new}. The actor is the
authenticated user (or the impersonator's id if the change came
from a superadmin acting as another tenant). The row appears in
/audit like any other audit event.
Interaction with the trust × risk gate¶
The mode does not replace the trust × risk gate — it sits in front of it and short-circuits when relevant.
| Server mode | What the approval gate does |
|---|---|
| live | Unchanged. should_request_approval(trust, risk) decides. |
| shadow | Forced to "request approval" regardless of trust or risk. The trust × risk computation is skipped. |
| audit | Never reached. The pipeline stops before recipe proposal. |
In guardrails.py the early return for shadow always returns
True; for audit the function returns a sentinel that signals
"no execution stages should run" and the manager treats that as a
trigger to stop the pipeline.
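A minimal sketch of how those early branches could sit in front of the existing trust × risk check. The mode parameter, the AUDIT_STOP sentinel, and allowed_risk_for are illustrative; the real signature in swarm/guardrails.py may differ:

```python
# Illustrative only; the real guardrails code may use a different signature and sentinel.
AUDIT_STOP = object()  # sentinel meaning "no execution stages should run"

def should_request_approval(trust_level, risk_level, server_mode="live"):
    if server_mode == "audit":
        return AUDIT_STOP      # the manager treats this as "stop the pipeline"
    if server_mode == "shadow":
        return True            # always park the recipe at awaiting_approval
    # live: fall through to the original trust × risk decision
    return risk_level > allowed_risk_for(trust_level)  # stand-in for the real trust × risk table
```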
The builder.py agent assembly is the second line of defence: in
audit mode propose_recipe and execute_recipe are excluded
from the agent's ROLE_TOOLS, so even a misbehaving prompt cannot
work around the manager-level skip.
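A sketch of that tool filter, with hypothetical names (ROLE_TOOLS comes from the text above; the helper itself is illustrative):

```python
# Hypothetical helper: strip action tools from the agent's toolset when the server is in audit mode.
ACTION_TOOLS = {"propose_recipe", "execute_recipe"}

def tools_for_mode(role_tools: list[str], server_mode: str) -> list[str]:
    if server_mode == "audit":
        return [tool for tool in role_tools if tool not in ACTION_TOOLS]
    return role_tools
```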
Maintenance lockout — adjacent control, not a safety gate¶
The five-gate safety story (server mode, trust × risk, approval,
tool filter, safety classifier) is about preventing unsafe action.
Maintenance lockout is a sixth independent layer with a different
purpose: operator coordination during planned work. It belongs in
this doc because it can suppress automatic remediation in the same
way an audit mode would, but the reason is "a human is doing
planned work right now," not "this server isn't trusted enough to
auto-remediate."
When is_server_in_maintenance(server_id) returns a schedule, the
daemon evidence path in api/daemon.py records the resulting
incident with suppressed_by_maintenance_id set and skips the
Redis publish, so the swarm never sees it. The pipeline never
starts. The schedule's lifecycle (approved → running → completed
| cancelled | failed | paused) controls when this gate is open or
shut.
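Condensed into pseudocode, the evidence path looks roughly like this (is_server_in_maintenance and suppressed_by_maintenance_id come from the text; the surrounding structure is illustrative, not the actual api/daemon.py code):

```python
# Illustrative sketch of the suppression branch on the daemon evidence path.
async def record_incident(evidence, db, redis):
    schedule = await is_server_in_maintenance(evidence.server_id)
    incident = Incident(server_id=evidence.server_id, evidence=evidence.payload)
    if schedule is not None:
        # Planned work in progress: persist the incident but keep it away from the swarm.
        incident.suppressed_by_maintenance_id = schedule.id
        await db.save(incident)
        return incident            # no Redis publish, so the pipeline never starts
    await db.save(incident)
    await redis.publish("incident.created", str(incident.id))
    return incident
```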
This layer is independent from the safety gates by construction:
- It reads from a different DB table (maintenance_schedules / maintenance_runs), not from servers.mode or agents.trust_level.
- It fires for a different reason (planned operator work), not in response to evidence.
- Its failure mode is "schedule misconfigured → remediation blocked" — fail-closed in the same direction as the safety gates, but for a different category of input.
When a maintenance schedule transitions to a non-active terminal
state (cancelled, completed, failed) — or to paused — the
service layer (services/maintenance.py:un_suppress_incidents_for_schedule)
clears suppressed_by_maintenance_id on every active incident
that pointed at it and re-publishes incident.created, so the
swarm picks up anything that accumulated during the window. This
self-healing path is what stops a stuck schedule from leaving
incidents permanently orphaned.
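Reconstructed from that description, the self-healing path is roughly (a sketch, not the actual services/maintenance.py code):

```python
# Sketch of un_suppress_incidents_for_schedule, based only on the behaviour described above.
async def un_suppress_incidents_for_schedule(schedule_id, db, redis) -> None:
    incidents = await db.active_incidents_suppressed_by(schedule_id)
    for incident in incidents:
        incident.suppressed_by_maintenance_id = None               # clear the suppression marker
        await redis.publish("incident.created", str(incident.id))  # let the swarm pick it up
    await db.commit()
```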
The whitepaper's "five independent gates" framing intentionally omits maintenance lockout because it is operator-coordination, not safety. Both descriptions are accurate; they describe different slices of the same defence-in-depth posture.
Fleet Map badges¶
The Fleet Map (/incidents map view) reads mode off the
FleetTile payload. Tiles render a small mode pill in the tile
head, right of the role label:
- shadow → amber pill (amber-100/amber-800/amber-200).
- audit → sky pill (sky-100/sky-800/sky-200).
- live → no pill. The absence is the signal.
Mode is purely informational on the Map — it does not feed into
the worst-of health colour. A server in shadow is not unhealthy
on its own.
The pill updates live: a PATCH that changes only the mode still
fires publish_tile_changed post-commit, so the Map redraws
without needing a manual refresh.
Where the mode is sourced from¶
The mode applied to an incident is server.mode at the moment
the incident is created. The pipeline reads it once at the start
and threads it through the manager context as server_mode.
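In other words, the flow is roughly as follows (ManagerContext and run_stages are illustrative names, not the real manager API):

```python
# Illustrative: the mode is snapshotted into the manager context when the pipeline starts.
async def start_pipeline(incident, db):
    server = await db.get_server(incident.server_id)
    ctx = ManagerContext(
        incident_id=incident.id,
        server_mode=server.mode,   # read once here; a later flip does not affect this run
    )
    return await run_stages(ctx)
```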
In Stage A there is no per-incident snapshot column. Two consequences:
- An incident already in flight when an operator flips the mode continues with the mode it started under.
- Incidents created after the flip use the new mode.
Stage B of the plan adds an incidents.mode_at_open column so
historical incidents stay bucketed by their original mode even if
the server is later promoted. Until Stage B lands, "what mode was
this incident under" can only be inferred from the timeline.
Observed behaviour by mode¶
| Stage | live | shadow | audit |
|---|---|---|---|
| Triage | runs | runs | runs |
| Diagnose | runs | runs | skipped |
| Recipe proposal | runs | runs | skipped |
| Execute | gated by trust × risk | always awaiting_approval | skipped |
| Review | runs after execute | runs after operator-approved execute | skipped |
| Final status | normal pipeline outcome | awaiting_approval (or whatever the operator chooses next) | resolved with resolution_summary='audit-mode classification' |
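The table maps directly onto the stage-selection helper. A minimal sketch of the idea behind stages_for_mode (the stage names are taken from the table above; the exact values in swarm/manager.py may differ):

```python
# Sketch of the stage-selection idea; the real stages_for_mode may use different stage names.
def stages_for_mode(server_mode: str) -> list[str]:
    if server_mode == "audit":
        return ["triage"]   # classify, then close the incident immediately
    # live and shadow run the full pipeline; shadow differs only at the approval gate.
    return ["triage", "diagnose", "propose_recipe", "execute", "review"]
```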
Edge cases¶
- Maintenance windows on a shadow server. Maintenance plans bypass shadow gating — they are an explicit operator action with their own approval gate.
- Patrol against an audit server. Agent patrols still run read-only diagnostic verbs and may open incidents. The pipeline then stops before propose_recipe because the server is in audit mode. The incident is classified and resolved exactly like any other audit-mode incident.
- In-flight incidents during a mode change. They finish with the mode they started under. The change applies to incidents created after the flip.
- Mode revert during shadow approval wait. If the operator flips a server out of shadow while an incident is parked at awaiting_approval, the incident stays at awaiting_approval until a human acts. Nothing kicks it back into the pipeline automatically.
Reading the shadow stats¶
Stage B adds a per-server mode stats card to the server detail
page, scoped to a rolling window (30 days by default, configurable).
It is the dashboard view operators use to decide when a (server,
recipe) pair has earned promotion out of shadow.
What each counter means¶
- Proposed. Recipes the agent recommended while the server was in shadow. Counted once per recipe proposal, not per recipe execution.
- Approved. Of those proposals, the count that an operator approved (the recipe ran). The headline ratio is approved / proposed.
- Rejected. Proposals an operator explicitly rejected, or that expired without being acted on within the approval window.
- Avg time to approve. Median wall-clock time between the agent emitting awaiting_approval and an operator clicking approve. Useful for spotting recipes that consistently get approved fast (low operational cost) vs. ones that stall.
- Audit observed. Incidents the server saw while in audit mode — classified but never advanced to diagnose. A high audit-observed count alongside zero shadow proposals means the server is generating signal but the operator hasn't yet allowed the agent to think about it.
The 30-day window¶
The card aggregates over the last 30 days by default. The window
is configurable on the endpoint (?window_days=N) so an operator
investigating a long-running pairing can widen it. Counters
outside the window age out, which is intentional — promotion
should reflect recent behaviour, not behaviour from before the
last prompt or recipe revision.
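For example, to widen the window when investigating a long-running pairing (the stats route itself is not documented here, so the path below is an assumption; only the window_days parameter comes from this doc):

```python
import requests

# Assumed path for the mode stats endpoint; adjust to your deployment.
resp = requests.get(
    "https://openremedy.example.com/api/v1/servers/42/mode-stats",
    params={"window_days": 90},   # widen the default 30-day rolling window
    headers={"Authorization": "Bearer <token>"},
)
stats = resp.json()
```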
Promotion heuristic¶
The rule of thumb operators should follow:
Once the approved / proposed ratio is steady at 90%+ over a meaningful sample, the (server, recipe) pair is a candidate for promotion.
"Meaningful sample" is judgement — a single approval is not a trend. In practice, look for at least a dozen proposals of the same recipe on the same server before treating the ratio as load-bearing. Stage D will automate this suggestion; Stage B just surfaces the inputs.
Historical incidents stay in their original bucket¶
Stage B introduces incidents.mode_at_open, snapshotted at
incident creation time. When the operator promotes a server from
shadow to live, the incidents that opened while it was in
shadow keep mode_at_open = 'shadow' forever. The incident list
filter (?mode_at_open=shadow) and the mode stats card both key
off this column, so historical bucketing survives mode flips.
The live server.mode column reflects only the current setting;
mode_at_open is the audit trail.
Tenant default mode¶
Stage B also adds a tenant-level default applied to newly-onboarded
servers. Find it under /settings → tenant settings; pick
live, shadow, or audit. New servers inherit this on creation
(existing servers are unaffected). The intended use is to set the
default to shadow for tenants whose onboarding policy is
"observe before acting," so operators don't have to remember to
flip every new host manually.
Per-execution dry-run¶
Stage C adds a per-execution preview flag, distinct from server
mode. A dry-run execution runs Ansible with --check, so tasks
report what would change without touching the host. No files are
written, no services restarted, no packages installed.
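Conceptually the worker only needs to append the flag when the execution row says so. A sketch, assuming the worker shells out to ansible-playbook (the real invocation carries more arguments):

```python
import subprocess

def build_playbook_command(playbook: str, inventory: str, dry_run: bool) -> list[str]:
    # Sketch: the real worker adds vault, limit, and verbosity flags as well.
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if dry_run:
        cmd.append("--check")   # report what would change without touching the host
    return cmd

# Example invocation with placeholder file names.
subprocess.run(build_playbook_command("restart_nginx.yml", "hosts.ini", dry_run=True), check=True)
```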
Enabling it¶
- UI — on the approval screen for a pending execution, tick Run as preview. The execution detail page renders a banner ("Preview run — no host changes") for the duration.
- API — POST /executions/{id}/approve with body {"dry_run": true}. The flag is persisted on the execution row (executions.dry_run) and read by the worker.
preview_completed status¶
A successful preview run terminates in preview_completed, not
success. They are deliberately different: success means the
recipe ran for real and the host was changed; preview_completed
means the recipe was rehearsed and nothing happened. Filtering and
metrics that count "applied changes" should ignore
preview_completed. Failures during a preview still go to
failed — the dry-run flag does not mask errors, only side
effects.
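The terminal-status decision therefore reduces to something like this (illustrative; the actual worker code is not shown in this doc):

```python
# Illustrative mapping from run outcome to terminal execution status.
def terminal_status(return_code: int, dry_run: bool) -> str:
    if return_code != 0:
        return "failed"   # dry-run never masks errors, only side effects
    return "preview_completed" if dry_run else "success"
```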
Interaction with server modes¶
Dry-run is orthogonal to mode. It works in any of live,
shadow, or audit:
- live — the natural use case: sanity-check a risky recipe on a production host before letting it apply.
- shadow — operators already see the proposal; ticking preview lets them rehearse the would-be command stream before promoting the server to live.
- audit — execute steps don't surface in audit mode in the first place (the pipeline closes after triage), so dry-run is rarely reachable here.
Rollback¶
Rollback is not applicable to preview executions — there is
nothing to undo. The Rollback button is hidden on executions where
dry_run = true, regardless of recipe.
Promotion ladder (Stage D)¶
Stage D automates the shadow → live step by accumulating
(server, recipe) decisions and surfacing suggestions when a pairing
has earned its way out of shadow.
How a suggestion is generated¶
Every approval, rejection, auto-execute, and post-run failure for a
recipe writes a row into recipe_outcomes. A scanner aggregates
those rows per (server, recipe) over a rolling window and
materialises a promotion_suggestions row when the pairing clears
the threshold.
Default thresholds:
- >= 10 approvals in the window
- 0 rejections in the window (a single rejection disqualifies)
- 30-day rolling window
- Safety floors: threshold cannot drop below 5; window cannot drop below 7 days
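A sketch of the per-pair threshold check using the defaults above (names are illustrative; the real query lives in services/promotion.py):

```python
# Illustrative: default thresholds with the safety floors clamped in.
APPROVAL_THRESHOLD = max(10, 5)    # default 10 approvals; floor of 5
WINDOW_DAYS = max(30, 7)           # default 30-day window; floor of 7 days

def qualifies_for_promotion(approved_in_window: int, rejected_in_window: int) -> bool:
    # Counters are aggregated per (server, recipe) over the last WINDOW_DAYS days.
    # A single rejection disqualifies the pair for this window.
    return approved_in_window >= APPROVAL_THRESHOLD and rejected_in_window == 0
```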
The scanner runs hourly inside the proactive container
(PromotionScanner, alongside CheckScheduler, CheckEvaluator,
IncidentWatcher, and MaintenanceScheduler). Re-scans are
idempotent — pending or accepted suggestions for the same pair are
not duplicated. Once a suggestion is dismissed, the scanner is free
to re-suggest the same pair on the next cycle if the metrics still
hold.
Accepting or dismissing a suggestion¶
Suggestions appear on the server detail page Overview tab as a card listing each candidate recipe. Operators can:
- Accept — flips the server from shadow to live (via the same set_mode helper used by the manual dropdown) and stamps the suggestion as accepted.
- Dismiss — marks the suggestion dismissed without changing the server mode. The pair becomes eligible again on the next scan.
Both actions require operator role; listing requires viewer.
Endpoints:
- GET /api/v1/servers/{id}/promotions
- POST /api/v1/promotions/{id}/accept
- POST /api/v1/promotions/{id}/dismiss
Audit trail¶
An accept writes two audit rows:
- server.mode_changed — emitted by set_mode, with payload {old, new}.
- server.promoted — emitted by the accept endpoint, with {from_mode, to_mode, recipe_slug, suggestion_id, metric_snapshot}.
The two-row pattern means the generic mode-flip filter still catches
every shadow → live transition, while server.promoted lets
reporting distinguish operator-initiated promotions (with the recipe
context) from manual dropdown flips.
The metric snapshot¶
Each suggestion stores the counters that triggered it as
metric_snapshot JSONB:
- approved_count — approvals in the window (the headline number).
- rejected_count — rejections (always 0 at suggestion time).
- auto_executed_count — recipes that ran without operator review because the server was already in live and trust × risk allowed it. Counted for context — does not factor into the threshold.
- failed_count — post-run failures of the recipe in the window. Surfaced for operator awareness; does not block the suggestion.
The snapshot is frozen at scanner time, so the audit row reflects the exact metrics the operator saw when they accepted.
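An example of the shape, with invented values for illustration:

```python
# Example metric_snapshot payload; the numbers are made up.
metric_snapshot = {
    "approved_count": 14,        # the headline number that cleared the threshold
    "rejected_count": 0,         # always 0 at suggestion time
    "auto_executed_count": 3,    # context only; not part of the threshold
    "failed_count": 1,           # surfaced for awareness; does not block the suggestion
}
```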
Out of scope¶
- Tenant-configurable thresholds. Defaults are hard-coded today; per-tenant overrides are a future iteration.
- Auto-promote. Promotion is operator-gated by design. The
scanner never flips a server's mode on its own — it only
materialises suggestions for a human to act on. This is the same
principle as the trust × risk gate on
live: the platform proposes, the operator decides.
Related¶
- docs/dashboard/servers.md — the Settings tab where the dropdown lives.
- backend/src/openremedy/enums.py — ServerMode enum.
- backend/src/openremedy/swarm/guardrails.py — should_request_approval early branches.
- backend/src/openremedy/swarm/manager.py — stages_for_mode and the audit close.
- backend/src/openremedy/services/server.py — set_mode helper, single source of truth for mode flips and audit emission.
- backend/src/openremedy/services/promotion.py — promotion candidate query + materialisation.
- backend/src/openremedy/proactive/promotion_scanner.py — the hourly scanner loop.
- backend/src/openremedy/api/promotion.py — list/accept/dismiss endpoints.