
Proactive monitoring and early detection

OpenRemedy does not wait for your servers to fail. The platform runs six independent proactive mechanisms that watch the fleet continuously and create incidents — or prevent them — the moment a deviation is detected. Whether the incident comes from an external alert manager, a daemon threshold, a scheduled health probe, an agent's unsolicited observation, or a maintenance window opening on schedule, the same pipeline handles it from there.

This document covers what each mechanism does, where it runs, what it is good at, and how to tune it.


Where the loops live

Five of the loops run inside the proactive container; the patrol loop runs inside the swarm container alongside the SwarmManager because patrols are agent rounds and need direct access to the manager's executor. Both containers share the same code base; the split is a deployment concern, not an architectural one.

flowchart LR
    subgraph proactive["proactive container"]
      CS[CheckScheduler<br/>60 s sweep]
      CE[CheckEvaluator<br/>continuous LLM eval]
      IW[IncidentWatcher<br/>Redis pub/sub]
      MS[MaintenanceScheduler<br/>60 s sweep]
      PS[PromotionScanner<br/>1 h sweep]
    end
    subgraph swarm["swarm container"]
      PA[PatrolScheduler<br/>per-agent interval]
    end
    subgraph external["External"]
      W[Webhook<br/>Alertmanager / Grafana<br/>Datadog / PagerDuty]
    end
    subgraph onserver["Managed server"]
      D[Daemon monitors<br/>~15 s cadence]
    end

    W --> INC[Incident]
    D --> INC
    CS --> CE --> INC
    PA --> INC
    MS -. window opens .-> RUN[Maintenance run]
    PS -. surfaces .-> SUG[Promotion suggestion]
    IW <-. comments / approvals .-> INC
    INC --> PIPE[Pipeline<br/>triage → diagnose → execute → review]

Each loop is described below. Push sources (webhooks, daemon alerts) feed the same incident pipeline but are documented in Integrations and Daemon → Install.


1 · CheckScheduler

Container: proactive. Source: backend/src/openremedy/proactive/scheduler.py. Cycle: 60 s.

Every minute the scheduler sweeps the database for alert_policies rows whose flow_definition includes a recipe_check trigger node. For each one whose interval has elapsed since its last_check_at, it dispatches the recipe to the ARQ worker queue. The worker runs the playbook (Ansible) over SSH against the target servers and writes the result to check_results. Two things to note:

  • The "trigger" lives on a policy, not on a recipe. Policies define both what to check and who to notify; the recipe is the executable artifact the policy points at.
  • The 60 s sweep is the floor. The per-policy interval can be longer (every 5 minutes, every hour) but not shorter — anything below 60 s would be limited by the sweep itself.
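The sweep-and-dispatch step can be sketched as follows. This is a minimal illustration, not the actual scheduler code: the policy fields and the `enqueue` hook are assumptions standing in for the real database rows and the ARQ queue call.

```python
import asyncio
import time

CYCLE_INTERVAL = 60  # the sweep floor; per-policy intervals can only be longer


def due_policies(policies, now):
    """Return policies whose per-policy interval has elapsed since last_check_at."""
    return [
        p for p in policies
        if p["last_check_at"] is None
        or now - p["last_check_at"] >= p["interval_s"]
    ]


async def sweep_once(policies, enqueue, now=None):
    """One sweep: dispatch each due policy's recipe to the worker queue."""
    now = now if now is not None else time.time()
    for policy in due_policies(policies, now):
        await enqueue(policy["recipe_id"])  # hand off to the worker (ARQ in OpenRemedy)
        policy["last_check_at"] = now       # never re-dispatch within the interval
```

Because `last_check_at` is updated at dispatch time, a policy with a 5-minute interval is skipped on the next four sweeps even though the sweep itself runs every minute.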

What this is for

Health checks that don't fit on the daemon:

  • Servers without a daemon. Legacy boxes, third-party-managed hosts, agentless segments.
  • Multi-step or stateful checks. Authenticated HTTP scrapes, database queries, validations that need values from multiple commands.
  • Synthetic probes. "Hit this URL with this payload, expect this field in the response."

Tuning

Knob                      Where
Sweep interval            CYCLE_INTERVAL = 60 in scheduler.py.
Per-policy frequency      The recipe_check trigger in the policy's flow editor.
Worker concurrency        OREMEDY_WORKER_MAX_JOBS (ARQ).
Maintenance suppression   Active windows on a target host suppress dispatch.

2 · CheckEvaluator

Container: proactive. Source: backend/src/openremedy/proactive/evaluator.py.

Reads new rows from check_results as the worker writes them and decides whether each check passed or failed. The decision uses:

  1. The recipe's structured success criteria (exit code, regex, expected JSON shape) when present.
  2. An LLM evaluation when the criteria are ambiguous or the recipe asks for context-aware judgement.

A failed result becomes an incident, published to the incidents Redis channel, which the SwarmManager picks up.
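The precedence — structured criteria first, LLM only as a fallback — can be sketched like this. The criteria keys and the `llm_judge` hook are illustrative assumptions, not the evaluator's actual schema.

```python
import json
import re


def evaluate(result, criteria, llm_judge=None):
    """Decide pass/fail for one check result (sketch).

    Structured criteria win when present; the LLM is consulted
    only when no structured criterion applies.
    """
    if criteria.get("exit_code") is not None:
        return result["exit_code"] == criteria["exit_code"]
    if criteria.get("stdout_regex"):
        return re.search(criteria["stdout_regex"], result["stdout"]) is not None
    if criteria.get("json_field"):
        field, expected = criteria["json_field"]
        try:
            return json.loads(result["stdout"]).get(field) == expected
        except ValueError:
            return False
    if llm_judge is not None:
        return llm_judge(result)       # context-aware judgement as the fallback
    return criteria.get("default_pass", False)
```

Note the cheap deterministic checks short-circuit before any model call is made — the LLM path only pays its latency and token cost when the recipe genuinely asks for judgement.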

What this is for

Checks where the meaning of the output depends on context. "92 % disk" is not the same on a database box during nightly backup as it is on a cache box at 4 AM. The evaluator can apply that context without the operator having to encode every nuance into the recipe.

Tuning

Knob                          Where
Evaluation model              Per-tenant, in Settings → LLM.
Default-pass / default-fail   Per recipe.
Suppression windows           Active maintenance windows skip evaluation.

3 · IncidentWatcher

Container: proactive. Source: backend/src/openremedy/proactive/watcher.py.

Subscribes to the Redis incidents and approvals channels.

  • Comments on escalated / monitoring incidents re-invoke the agent pipeline with the comment as added context. The agent gets a second turn, informed by what the human just said.
  • Approval / rejection of a pending recipe execution publishes the decision and the worker either runs or cancels the playbook.
  • monitoring state with an interval_minutes payload schedules a re-evaluation after the interval elapses; the incident is re-opened in the pipeline if still in monitoring.

This is the channel that closes the loop between the human and the agent without forcing the operator to manually re-trigger anything.
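The routing the watcher performs on each message can be sketched as a dispatcher. The handler names (`reinvoke`, `decide`, `schedule_recheck`) and payload fields are illustrative assumptions, not the watcher's real API; the Redis subscription plumbing is omitted.

```python
import json


def handle_message(channel, payload, *, reinvoke, decide, schedule_recheck):
    """Route one pub/sub message to the right reaction (sketch)."""
    event = json.loads(payload)
    if channel == "incidents":
        if event.get("type") == "comment":
            # a human comment gives the agent a second, informed turn
            reinvoke(event["incident_id"], context=event["body"])
        elif event.get("status") == "monitoring" and "interval_minutes" in event:
            # re-open in the pipeline after the interval elapses
            schedule_recheck(event["incident_id"], event["interval_minutes"])
    elif channel == "approvals":
        # run or cancel the pending playbook based on the decision
        decide(event["execution_id"], approved=event["approved"])
```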

Tuning

There is nothing to configure on the watcher itself. It reacts to incident state. The behaviour is governed by:

  • The agent's trust_level and assigned roles.
  • The recipe's risk level (drives the approval gate; see Security → Approval gate).
  • The incident status (escalated, monitoring, awaiting_approval).
  • The interval_minutes payload on monitoring events.

4 · MaintenanceScheduler

Container: proactive. Source: backend/src/openremedy/proactive/maintenance.py. Cycle: 60 s.

Sweeps maintenance_schedules for rows whose status = 'approved', paused_at IS NULL, and scheduled_start <= now(). For each due schedule it spawns a MaintenanceExecutor task (capped at max_concurrent_schedules = 3 per process) that walks the target servers in rolling order, executing each step and pausing on human_gate or manual step types.
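The sweep predicate and the concurrency cap can be sketched as follows — a minimal illustration assuming in-memory rows in place of the real `maintenance_schedules` query and executor task.

```python
import asyncio
from datetime import datetime, timezone

MAX_CONCURRENT = 3  # mirrors max_concurrent_schedules per process


def due_schedules(rows, now=None):
    """Mirror the sweep predicate: approved, not paused, start time reached."""
    now = now or datetime.now(timezone.utc)
    return [
        r for r in rows
        if r["status"] == "approved"
        and r["paused_at"] is None
        and r["scheduled_start"] <= now
    ]


async def dispatch(rows, run_executor):
    """Spawn one executor per due schedule, capped by a semaphore."""
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def guarded(row):
        async with sem:
            await run_executor(row)

    await asyncio.gather(*(guarded(r) for r in due_schedules(rows)))
```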

A maintenance run is the opposite of an incident: it is a planned, operator-authored mutation that the platform executes on its own. While a run is active on a server it also acts as a suppressor — webhook ingest, the CheckScheduler, and the CheckEvaluator all consult the active windows table before creating incidents on the same host.

What this is for

Recurring upgrades, OS patching, certificate rotations, capacity re-balances — anything you would otherwise script in cron but want to be approval-gated, audited, and aware of agent-detected incidents.

Tuning

Knob                   Where
Sweep interval         CYCLE_INTERVAL = 60 in maintenance.py.
Concurrent schedules   MaintenanceScheduler(max_concurrent_schedules=3).
Per-step approval      human_gate and manual step types pause the run.

See Maintenance plans → Authoring for the markdown DSL and step types.


5 · PromotionScanner

Container: proactive. Source: backend/src/openremedy/proactive/promotion_scanner.py. Cycle: 1 h.

Scans recipe_outcomes over a rolling 30-day window grouped by (server role, recipe). When a (role, recipe) pair clears the trust ladder — N consecutive successful, no-rollback executions — the scanner materialises a row in promotion_suggestions. The operator sees it on the Promotions page and either accepts (creating a recipe_role_overrides row that auto-executes the recipe on that role) or dismisses it.
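The streak check at the heart of the ladder can be sketched like this. The streak length and outcome shape are illustrative — the real thresholds live in services/promotion.py.

```python
def clears_trust_ladder(outcomes, required_streak=5):
    """True when the most recent executions form an unbroken streak of
    successful, no-rollback runs of the required length (sketch).

    `outcomes` is newest-first: [{"success": bool, "rolled_back": bool}, ...].
    """
    streak = 0
    for o in outcomes:
        if o["success"] and not o["rolled_back"]:
            streak += 1
            if streak >= required_streak:
                return True
        else:
            return False  # any failure or rollback breaks the streak
    return False
```

The key property is that the streak is counted from the most recent execution backwards: one recent rollback resets eligibility, no matter how good the older history looks.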

What this is for

The platform's quiet path from supervised to autonomous. Most remediations a fleet runs are repetitive — restart this service, prune that container, vacuum that database. Accepting a promotion once skips the trust × risk gate for that role permanently, until the operator revokes it.

Tuning

Knob                   Where
Cadence                CYCLE_INTERVAL = 3600 in promotion_scanner.py.
Promotion thresholds   services/promotion.py (consecutive successes, lookback days, risk floor).
Override revocation    DELETE /api/v1/recipe-role-overrides/{id} or the Promotions UI.

6 · PatrolScheduler

Container: swarm. Source: backend/src/openremedy/swarm/patrol.py.

Every agent has a patrol_interval (minutes) in its configuration. When it is greater than zero, the patrol scheduler periodically asks the agent to perform a patrol — an unscheduled round of diagnostic checks across the agent's assigned servers.

The agent uses its built-in diagnostic verbs (run_diagnostic_command with top_snapshot, process_list_filter, docker_* etc.) to look for anomalies that might not have triggered any explicit alarm: a load that suddenly dropped to zero on a normally-busy server, a service that restarted three times in an hour, an unusually large log file. If the agent finds something, it opens an incident on its own (source = 'patrol') and the pipeline runs.

The patrol loop lives in the swarm container because it shares the SwarmManager's executor and agent registry; spinning up a second copy of those in the proactive container would duplicate state and waste tokens.
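A single patrol round can be sketched as below. The `run_patrol` and `open_incident` hooks are hypothetical stand-ins for the agent round and incident creation; only the `source = 'patrol'` tagging and the zero-disables behaviour come from the description above.

```python
import asyncio


async def patrol_once(agent, run_patrol, open_incident):
    """One patrol round for one agent (sketch)."""
    if agent.get("patrol_interval", 0) <= 0:
        return 0                                   # zero disables patrols
    findings = await run_patrol(agent)             # unscheduled diagnostic round
    for finding in findings:
        # the agent opens incidents on its own initiative
        open_incident(agent_id=agent["id"], source="patrol", detail=finding)
    return len(findings)
```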

What this is for

Catching the deviations that are not explicit alarm conditions. Threshold-based monitors miss everything below the threshold; an agent rounding the fleet can notice patterns no one wrote a check for.

Tuning

Knob                        Where
Per-agent patrol_interval   Agent detail page. Zero disables patrols for that agent. Common values: 15-60 minutes.
Server scope                The agent's assigned servers.
Token budget                The agent's monthly token allowance — patrols draw from the same gauge.

Mechanism selection

You have…                                           Use
Daemon installed and standard system metrics        Daemon thresholds (push)
Server you cannot install an agent on               CheckScheduler with an Ansible recipe
Check whose pass/fail depends on context            CheckScheduler + CheckEvaluator
Existing monitoring stack                           Webhook (Integrations)
High-value server you want continuously eyeballed   Patrol enabled on the assigned agent
Live incident where a human just added context      IncidentWatcher (automatic)
A recurring planned mutation                        Maintenance plan + MaintenanceScheduler
Repeated successful auto-remediations on a role     Accept the PromotionScanner suggestion

The mechanisms are additive — most production setups run all six simultaneously. They funnel into the same pipeline, so the downstream handling is uniform regardless of how the work was born.


Latency profile

Source                 Time from condition → incident in DB
Webhook                sub-second (push)
Daemon                 ≤ 15 s (one report cycle)
CheckScheduler         up to (60 s sweep + per-policy interval)
CheckEvaluator         runtime of the recipe + evaluator latency
PatrolScheduler        up to patrol_interval minutes
MaintenanceScheduler   up to 60 s after scheduled_start
PromotionScanner       up to 1 h (this is a suggestion, not an incident)

Practical detection floor in a default deployment: ~15 s via the daemon for threshold-based conditions; sub-second for push-based external alerts.


Shared lifecycle

All six loops follow the same shape:

  • Start — asyncio.create_task(loop.start()) from proactive/main.py (or swarm/main.py for patrols).
  • Cycle — sweep, then sleep until the next cycle. The sweep is wrapped in try/except so a transient DB error never kills the loop; exceptions are logged and the next cycle proceeds.
  • Stop — loop.stop() flips _running so the next sleep yields and the task exits. SIGTERM / SIGINT in the entrypoint trigger an orderly shutdown of all loops.
  • Concurrency — each loop is a single coroutine. The MaintenanceScheduler additionally runs schedule executors concurrently behind a semaphore (default 3); everything else is serial per loop.
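That shared shape can be sketched as a small base class — an illustration of the start / cycle / stop contract described above, not the actual source.

```python
import asyncio
import logging

log = logging.getLogger("proactive")


class ProactiveLoop:
    """Common shape shared by the loops (sketch)."""
    CYCLE_INTERVAL = 60

    def __init__(self):
        self._running = False

    async def sweep(self):
        raise NotImplementedError  # each loop supplies its own sweep

    async def start(self):
        self._running = True
        while self._running:
            try:
                await self.sweep()
            except Exception:
                # a transient error never kills the loop; next cycle retries
                log.exception("sweep failed; retrying next cycle")
            await asyncio.sleep(self.CYCLE_INTERVAL)

    def stop(self):
        self._running = False  # the next wakeup sees this and the task exits
```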

Failure handling

The loops are intentionally fail-loud, retry-soft:

  • DB errors during a sweep → log, sleep, retry next cycle.
  • LLM errors in CheckEvaluator → mark the row evaluation_failed, surface to the operator, do not auto-promote to incident.
  • Maintenance executor failures → flip the run to failed, emit maintenance.failed for the plugin layer, do not retry. The operator decides whether to re-schedule.
  • Patrol agent failures → log + count against the agent's failure metric; the agent stays patrol-eligible unless the operator disables it.

If the proactive container itself dies, the standard Docker path applies: Compose's restart: unless-stopped brings it back after process-level crashes.


Tunable parameters

Variable / constant                                Default   Effect
CheckScheduler.CYCLE_INTERVAL                      60 s      Sweep cadence.
MaintenanceScheduler.CYCLE_INTERVAL                60 s      Sweep cadence for due windows.
MaintenanceScheduler(max_concurrent_schedules=…)   3         Parallel executors per process.
PromotionScanner.CYCLE_INTERVAL                    3600 s    How often the scanner looks at recipe outcomes.
OREMEDY_PATROL_DEFAULT_INTERVAL_MINUTES            30        Default patrol cadence for newly-created agents.
OREMEDY_WORKER_MAX_JOBS                            10        ARQ worker concurrency cap.
MaintenanceExecutor.POLL_INTERVAL                  5 s       Time between checks for human_gate / manual step approvals.

Sweep cadences are constants in code rather than env vars on purpose: the floor is the same for every install, and per-policy frequency knobs cover the legitimate tuning surface.


Operational tips

  • Start with the daemon and webhooks. Add CheckScheduler recipes for things the daemon cannot do.
  • Enable patrols selectively. Patrolling every agent on every server burns tokens for marginal returns. Start with one agent patrolling the most critical servers at a 30-minute cadence.
  • Tune thresholds in audit, not in panic. Every threshold change is logged in /audit; review false positives weekly and adjust.
  • Use the CheckEvaluator's LLM judgement sparingly. It is the most expensive option. Reserve for checks where structured pass/fail criteria genuinely cannot capture the intent.
  • Promotions accumulate quietly. Visit the Promotions page weekly — accepting suggestions is the easiest way to grow the fleet's autonomy without touching the trust × risk gate globally.