Proactive monitoring and early detection¶
OpenRemedy does not wait for your servers to fail. The platform runs six independent proactive mechanisms that watch the fleet continuously and create incidents — or prevent them — the moment a deviation is detected. Whether the trigger is an external alert manager, a daemon threshold, a scheduled health probe, an agent's unsolicited observation, or a maintenance window opening on schedule, the same downstream pipeline handles it from there.
This document covers what each mechanism does, where it runs, what it is good at, and how to tune it.
Where the loops live¶
Five of the loops run inside the proactive container; the patrol
loop runs inside the swarm container alongside the SwarmManager
because patrols are agent rounds and need direct access to the
manager's executor. Both containers share the same code base; the
split is a deployment concern, not an architectural one.
```mermaid
flowchart LR
    subgraph proactive["proactive container"]
        CS[CheckScheduler<br/>60 s sweep]
        CE[CheckEvaluator<br/>continuous LLM eval]
        IW[IncidentWatcher<br/>Redis pub/sub]
        MS[MaintenanceScheduler<br/>60 s sweep]
        PS[PromotionScanner<br/>1 h sweep]
    end
    subgraph swarm["swarm container"]
        PA[PatrolScheduler<br/>per-agent interval]
    end
    subgraph external["External"]
        W[Webhook<br/>Alertmanager / Grafana<br/>Datadog / PagerDuty]
    end
    subgraph onserver["Managed server"]
        D[Daemon monitors<br/>~15 s cadence]
    end
    W --> INC[Incident]
    D --> INC
    CS --> CE --> INC
    PA --> INC
    MS -. window opens .-> RUN[Maintenance run]
    PS -. surfaces .-> SUG[Promotion suggestion]
    IW <-. comments / approvals .-> INC
    INC --> PIPE[Pipeline<br/>triage → diagnose → execute → review]
```
Each loop is described below. Push sources (webhooks, daemon alerts) feed the same incident pipeline but are documented in Integrations and Daemon → Install.
1 · CheckScheduler¶
Container: proactive. Source:
backend/src/openremedy/proactive/scheduler.py. Cycle: 60 s.
Every minute the scheduler sweeps the database for alert_policies
rows whose flow_definition includes a recipe_check trigger
node. For each one whose interval has elapsed since its
last_check_at, it dispatches the recipe to the ARQ worker queue.
The worker runs the playbook (Ansible) over SSH against the target
servers and writes the result to check_results. Two things to
note:
- The "trigger" lives on a policy, not on a recipe. Policies define both what to check and who to notify; the recipe is the executable artifact the policy points at.
- The 60 s sweep is the floor. The per-policy interval can be longer (every 5 minutes, every hour) but not shorter; anything below 60 s would be limited by the sweep itself.
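For orientation, a minimal sketch of what one sweep cycle might look like. The helper names, policy fields, and the "run_recipe_check" job name are illustrative assumptions, not the actual symbols in scheduler.py.

```python
# Illustrative sketch of one CheckScheduler sweep. The db helpers, policy
# fields and the job name "run_recipe_check" are hypothetical stand-ins.
from datetime import datetime, timedelta, timezone


async def sweep(db, arq_pool) -> None:
    now = datetime.now(timezone.utc)
    for policy in await db.fetch_policies_with_recipe_check_trigger():
        interval = timedelta(minutes=policy.check_interval_minutes)
        if policy.last_check_at and now - policy.last_check_at < interval:
            continue  # not due yet
        if await db.has_active_maintenance_window(policy.target_server_ids):
            continue  # an open window on the host suppresses dispatch
        # Enqueue the recipe run; the ARQ worker executes the playbook over
        # SSH and writes a row to check_results.
        await arq_pool.enqueue_job("run_recipe_check", policy_id=policy.id)
        await db.set_last_check_at(policy.id, now)
```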
What this is for¶
Health checks that don't fit on the daemon:
- Servers without a daemon. Legacy boxes, third-party-managed hosts, agentless segments.
- Multi-step or stateful checks. Authenticated HTTP scrapes, database queries, validations that need values from multiple commands.
- Synthetic probes. "Hit this URL with this payload, expect this field in the response."
Tuning¶
| Knob | Where |
|---|---|
| Sweep interval | CYCLE_INTERVAL = 60 in scheduler.py. |
| Per-policy frequency | The recipe_check trigger in the policy's flow editor. |
| Worker concurrency | OREMEDY_WORKER_MAX_JOBS (ARQ). |
| Maintenance suppression | Active windows on a target host suppress dispatch. |
2 · CheckEvaluator¶
Container: proactive. Source:
backend/src/openremedy/proactive/evaluator.py.
Reads new rows from check_results as the worker writes them and
decides whether each check passed or failed. The decision uses:
- The recipe's structured success criteria (exit code, regex, expected JSON shape) when present.
- An LLM evaluation when the criteria are ambiguous or the recipe asks for context-aware judgement.
A failed result becomes an incident, published to the incidents
Redis channel, which the SwarmManager picks up.
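A rough sketch of that decision order, assuming hypothetical helpers matches_criteria and llm_judge; the real evaluator's API differs.

```python
# Illustrative decision order for one check_results row. matches_criteria
# and llm_judge are hypothetical helpers, not the evaluator's real API.
import json


async def evaluate(result, redis) -> None:
    criteria = result.recipe.success_criteria  # exit code / regex / JSON shape
    if criteria:
        passed = matches_criteria(result.output, result.exit_code, criteria)
    else:
        # Ambiguous or context-aware checks fall through to the LLM.
        passed = await llm_judge(result)
    if passed:
        return
    incident = {
        "server_id": result.server_id,
        "source": "check",
        "check_result_id": result.id,
    }
    # The SwarmManager subscribes to this channel and starts the pipeline.
    await redis.publish("incidents", json.dumps(incident))
```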
What this is for¶
Checks where the meaning of the output depends on context. "92 % disk" is not the same on a database box during nightly backup as it is on a cache box at 4 AM. The evaluator can apply that context without the operator having to encode every nuance into the recipe.
Tuning¶
| Knob | Where |
|---|---|
| Evaluation model | Per-tenant, in Settings → LLM. |
| Default-pass / default-fail | Per recipe. |
| Suppression windows | Active maintenance windows skip evaluation. |
3 · IncidentWatcher¶
Container: proactive. Source:
backend/src/openremedy/proactive/watcher.py.
Subscribes to the Redis incidents and approvals channels.
- Comments on escalated / monitoring incidents re-invoke the agent pipeline with the comment as added context. The agent gets a second turn, informed by what the human just said.
- Approval / rejection of a pending recipe execution publishes the decision and the worker either runs or cancels the playbook.
- A monitoring state with an interval_minutes payload schedules a re-evaluation after the interval elapses; the incident is re-opened in the pipeline if it is still in monitoring.
This is the channel that closes the loop between the human and the agent without forcing the operator to manually re-trigger anything.
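The subscription itself is ordinary Redis pub/sub. A sketch of its shape, assuming redis.asyncio and hypothetical handler functions:

```python
# Rough shape of the watcher's subscription loop. handle_approval,
# handle_comment and schedule_recheck are hypothetical handlers.
import json

import redis.asyncio as aioredis


async def watch(redis_url: str) -> None:
    client = aioredis.from_url(redis_url)
    pubsub = client.pubsub()
    await pubsub.subscribe("incidents", "approvals")
    async for message in pubsub.listen():
        if message["type"] != "message":
            continue
        event = json.loads(message["data"])
        if message["channel"] == b"approvals":
            await handle_approval(event)      # run or cancel the playbook
        elif event.get("kind") == "comment":
            await handle_comment(event)       # re-invoke the agent pipeline
        elif event.get("status") == "monitoring" and "interval_minutes" in event:
            await schedule_recheck(event)     # re-evaluate after the interval
```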
Tuning¶
There is nothing to configure on the watcher itself. It reacts to incident state. The behaviour is governed by:
- The agent's trust_level and assigned roles.
- The recipe's risk level (drives the approval gate; see Security → Approval gate).
- The incident status (escalated, monitoring, awaiting_approval).
- The interval_minutes payload on monitoring events.
4 · MaintenanceScheduler¶
Container: proactive. Source:
backend/src/openremedy/proactive/maintenance.py. Cycle: 60 s.
Sweeps maintenance_schedules for rows whose status = 'approved',
paused_at IS NULL, and scheduled_start <= now(). For each due
schedule it spawns a MaintenanceExecutor task (capped at
max_concurrent_schedules = 3 per process) that walks the target
servers in rolling order, executing each step and pausing on
human_gate or manual step types.
A maintenance run is the opposite of an incident: it is a planned, operator-authored mutation that the platform executes on its own. While a run is active on a server it also acts as a suppressor — webhook ingest, the CheckScheduler, and the CheckEvaluator all consult the active windows table before creating incidents on the same host.
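The sweep-and-spawn pattern, sketched with an asyncio semaphore standing in for max_concurrent_schedules; fetch_due_schedules and the executor call are assumptions, not the actual code in maintenance.py.

```python
# Sketch of the due-schedule sweep. fetch_due_schedules and
# MaintenanceExecutor.run are stand-ins for the real implementation.
import asyncio

semaphore = asyncio.Semaphore(3)  # max_concurrent_schedules


async def sweep(db) -> None:
    # status = 'approved' AND paused_at IS NULL AND scheduled_start <= now()
    for schedule in await db.fetch_due_schedules():
        asyncio.create_task(run_schedule(db, schedule))


async def run_schedule(db, schedule) -> None:
    async with semaphore:
        executor = MaintenanceExecutor(db, schedule)
        # Walks the target servers in rolling order, pausing on
        # human_gate / manual steps until the operator approves.
        await executor.run()
```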
What this is for¶
Recurring upgrades, OS patching, certificate rotations, capacity re-balances — anything you would otherwise script in cron but want to be approval-gated, audited, and aware of agent-detected incidents.
Tuning¶
| Knob | Where |
|---|---|
| Sweep interval | CYCLE_INTERVAL = 60 in maintenance.py. |
| Concurrent schedules | MaintenanceScheduler(max_concurrent_schedules=3). |
| Per-step approval | human_gate and manual step types pause the run. |
See Maintenance plans → Authoring for the markdown DSL and step types.
5 · PromotionScanner¶
Container: proactive. Source:
backend/src/openremedy/proactive/promotion_scanner.py. Cycle:
1 h.
Scans recipe_outcomes over a rolling 30-day window grouped by
(server role, recipe). When a (role, recipe) pair clears the trust
ladder — N consecutive successful, no-rollback executions — the
scanner materialises a row in promotion_suggestions. The
operator sees it on the Promotions page and either accepts
(creating a recipe_role_overrides row that auto-executes the
recipe on that role) or dismisses it.
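The ladder logic amounts to counting consecutive clean runs per (role, recipe) pair. A sketch with assumed thresholds; the real numbers and helpers live in services/promotion.py.

```python
# Illustrative trust-ladder scan. The threshold value and helper names are
# assumptions; the real values live in services/promotion.py.
from collections import defaultdict

REQUIRED_CONSECUTIVE = 5   # assumed; see services/promotion.py
LOOKBACK_DAYS = 30


async def scan(db) -> None:
    outcomes = await db.fetch_recipe_outcomes(days=LOOKBACK_DAYS)
    streaks = defaultdict(int)
    for outcome in sorted(outcomes, key=lambda o: o.finished_at):
        key = (outcome.server_role, outcome.recipe_id)
        if outcome.succeeded and not outcome.rolled_back:
            streaks[key] += 1
        else:
            streaks[key] = 0  # any failure or rollback resets the ladder
    for (role, recipe_id), streak in streaks.items():
        if streak >= REQUIRED_CONSECUTIVE:
            # Surfaces on the Promotions page for the operator to accept.
            await db.upsert_promotion_suggestion(role=role, recipe_id=recipe_id)
```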
What this is for¶
The platform's quiet path from supervised to autonomous. Most remediations a fleet runs are repetitive — restart this service, prune that container, vacuum that database. Accepting a promotion once skips the trust × risk gate for that role permanently, until the operator revokes it.
Tuning¶
| Knob | Where |
|---|---|
| Cadence | CYCLE_INTERVAL = 3600 in promotion_scanner.py. |
| Promotion thresholds | services/promotion.py (consecutive successes, lookback days, risk floor). |
| Override revocation | DELETE /api/v1/recipe-role-overrides/{id} or the Promotions UI. |
6 · PatrolScheduler¶
Container: swarm. Source:
backend/src/openremedy/swarm/patrol.py.
Every agent has a patrol_interval (minutes) on its
configuration. When greater than zero, the patrol scheduler
periodically asks the agent to perform a patrol — an
unscheduled round of diagnostic checks across the agent's assigned
servers.
The agent uses its built-in diagnostic verbs (run_diagnostic_command
with top_snapshot, process_list_filter, docker_* etc.) to
look for anomalies that might not have triggered any explicit
alarm: a load that suddenly dropped to zero on a normally-busy
server, a service that restarted three times in an hour, an
unusually large log file. If the agent finds something, it opens
an incident on its own (source = 'patrol') and the pipeline runs.
The patrol loop lives in the swarm container because it shares
the SwarmManager's executor and agent registry; spinning up a
second copy of those in the proactive container would duplicate
state and waste tokens.
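Conceptually the patrol loop is just a per-agent timer over the manager's executor. A sketch with a hypothetical run_patrol entry point:

```python
# Rough per-agent patrol loop. swarm_manager.run_patrol is a hypothetical
# entry point standing in for the real agent round.
import asyncio


async def patrol_loop(agent, swarm_manager) -> None:
    while agent.patrol_interval > 0:
        await asyncio.sleep(agent.patrol_interval * 60)
        # The agent runs its diagnostic verbs across its assigned servers;
        # anything anomalous becomes an incident with source='patrol'.
        await swarm_manager.run_patrol(agent)
```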
What this is for¶
Catching the deviations that are not explicit alarm conditions. Threshold-based monitors miss everything below the threshold; an agent rounding the fleet can notice patterns no one wrote a check for.
Tuning¶
| Knob | Where |
|---|---|
| Per-agent patrol_interval | Agent detail page. Zero disables patrols for that agent. Common values: 15-60 minutes. |
| Server scope | The agent's assigned servers. |
| Token budget | The agent's monthly token allowance — patrols draw from the same gauge. |
Mechanism selection¶
| You have… | Use |
|---|---|
| Daemon installed and standard system metrics | Daemon thresholds (push) |
| Server you cannot install an agent on | CheckScheduler with an Ansible recipe |
| Check whose pass/fail depends on context | CheckScheduler + CheckEvaluator |
| Existing monitoring stack | Webhook (Integrations) |
| High-value server you want continuously eyeballed | Patrol enabled on the assigned agent |
| Live incident where a human just added context | IncidentWatcher (automatic) |
| A recurring planned mutation | Maintenance plan + MaintenanceScheduler |
| Repeated successful auto-remediations on a role | Accept the PromotionScanner suggestion |
The mechanisms are additive — most production setups run all six simultaneously. They funnel into the same pipeline, so the downstream handling is uniform regardless of how the work was born.
Latency profile¶
| Source | Time from condition → incident in DB |
|---|---|
| Webhook | sub-second (push) |
| Daemon | ≤ 15 s (one report cycle) |
| CheckScheduler | up to (60 s sweep + per-policy interval) |
| CheckEvaluator | runtime of the recipe + evaluator latency |
| PatrolScheduler | up to patrol_interval minutes |
| MaintenanceScheduler | up to 60 s after scheduled_start |
| PromotionScanner | up to 1 h (this is a suggestion, not an incident) |
Practical detection floor in a default deployment: ~15 s via the daemon for threshold-based conditions; sub-second for push-based external alerts.
Shared lifecycle¶
All six loops follow the same shape:
- Start — asyncio.create_task(loop.start()) from proactive/main.py (or swarm/main.py for patrols).
- Cycle — sleep, sweep, sleep. The sweep is wrapped in try/except so a transient DB error never kills the loop; exceptions are logged and the next cycle proceeds.
- Stop — loop.stop() flips _running so the next sleep yields and the task exits. SIGTERM / SIGINT in the entrypoint trigger an orderly shutdown of all loops.
- Concurrency — each loop is a single coroutine. The MaintenanceScheduler additionally runs schedule executors concurrently behind a semaphore (default 3); everything else is serial per loop.
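In code, that shared shape reduces to roughly this minimal base class (a sketch, not the actual implementation):

```python
# Minimal sketch of the shared loop shape; the real classes differ in detail.
import asyncio
import logging

log = logging.getLogger(__name__)


class ProactiveLoop:
    CYCLE_INTERVAL = 60  # seconds

    def __init__(self) -> None:
        self._running = False

    async def start(self) -> None:
        self._running = True
        while self._running:
            try:
                await self.sweep()
            except Exception:
                # A transient DB error never kills the loop; log and retry
                # on the next cycle.
                log.exception("sweep failed, retrying next cycle")
            await asyncio.sleep(self.CYCLE_INTERVAL)

    def stop(self) -> None:
        # Flipping the flag lets the current sleep finish and the task exit.
        self._running = False

    async def sweep(self) -> None:
        raise NotImplementedError
```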
Failure handling¶
The loops are intentionally fail-loud, retry-soft:
- DB errors during a sweep → log, sleep, retry next cycle.
- LLM errors in CheckEvaluator → mark the row evaluation_failed, surface to the operator, do not auto-promote to incident.
- Maintenance executor failures → flip the run to failed, emit maintenance.failed for the plugin layer, do not retry. The operator decides whether to re-schedule.
- Patrol agent failures → log + count against the agent's failure metric; the agent stays patrol-eligible unless the operator disables it.
The proactive container itself restarts on the standard Docker
healthcheck failure path — Compose's restart: unless-stopped
catches process-level crashes.
Tunable parameters¶
| Variable / constant | Default | Effect |
|---|---|---|
| CheckScheduler.CYCLE_INTERVAL | 60 s | Sweep cadence. |
| MaintenanceScheduler.CYCLE_INTERVAL | 60 s | Sweep cadence for due windows. |
| MaintenanceScheduler(max_concurrent_schedules=…) | 3 | Parallel executors per process. |
| PromotionScanner.CYCLE_INTERVAL | 3600 s | How often the scanner looks at recipe outcomes. |
| OREMEDY_PATROL_DEFAULT_INTERVAL_MINUTES | 30 min | Default patrol cadence for newly-created agents. |
| OREMEDY_WORKER_MAX_JOBS | 10 | ARQ worker concurrency cap. |
| MaintenanceExecutor.POLL_INTERVAL | 5 s | Time between checks for human_gate / manual step approvals. |
Sweep cadences are constants in code rather than env vars on purpose: the floor is the same for every install, and per-policy frequency knobs cover the legitimate tuning surface.
Operational tips¶
- Start with the daemon and webhooks. Add CheckScheduler recipes for things the daemon cannot do.
- Enable patrols selectively. Patrolling every agent on every server burns tokens for marginal returns. Start with one agent patrolling the most critical servers at a 30-minute cadence.
- Tune thresholds in audit, not in panic. Every threshold change is logged in /audit; review false positives weekly and adjust.
- Use the CheckEvaluator's LLM judgement sparingly. It is the most expensive option. Reserve it for checks where structured pass/fail criteria genuinely cannot capture the intent.
- Promotions accumulate quietly. Visit the Promotions page weekly — accepting suggestions is the easiest way to grow the fleet's autonomy without touching the trust × risk gate globally.