Maintenance plan authoring¶
A maintenance plan is a reusable, markdown-authored runbook: front-matter declares the plan's risk and rollout strategy, the "Steps" section is a sequence of typed YAML blocks the platform executes one server at a time (or in parallel, depending on the strategy you pick).
Plans cover the scheduled side of the platform — patches, kernel rollouts, TLS renewals, software installs. The reactive side is recipes bound to incidents.
The shape of a plan¶
Every plan is a row in maintenance_plans whose canonical form is
its markdown column. Saving the plan parses the markdown and
caches the structured steps in parsed_steps (JSONB), but the
markdown is the source of truth — if the two ever drift, the
markdown wins on the next save.
Operator-relevant fields:
| Field | Meaning |
|---|---|
markdown |
The full plan text (front-matter + steps). Edited via the dashboard or AI editor. |
risk_level |
low / medium / high. Influences default approval gating. |
strategy |
rolling / parallel / parallel_batched / rings. How runs are dispatched across targets. |
strategy_config |
JSONB. Strategy-specific knobs (batch size, ring percentages). |
snapshot_dimensions |
What system state to snapshot before/after the run, for after-action review. Subset of: packages, kernel, systemd, configs, disk, containers. |
auto_approve_rules |
List of rules that pre-approve specific step types (skip the human pause). |
target_selector |
Default target query (e.g. {by: "role", value: "web"}). Usually overridden per schedule. |
A plan becomes a MaintenanceSchedule when scheduled. The schedule
freezes the markdown (frozen_markdown) at approve time so later
plan edits don't affect a run already greenlit.
Markdown DSL¶
The format is YAML front-matter + level-3 step headings, each with a fenced YAML block:
---
subject: nginx
risk_level: medium
snapshot_dimensions: [configs, systemd]
strategy: rolling
---
# Maintenance: rolling nginx config push
What this plan does, why, when to run it.
## Steps
### Step 1: pre-flight checks
\`\`\`yaml
type: validate
requires_approval: false
checks:
- "nginx -t"
- "test $(df / --output=avail | tail -1) -gt 1048576"
\`\`\`
### Step 2: deploy new config
\`\`\`yaml
type: custom_tool
requires_approval: true
command: "cp /opt/nginx/staged.conf /etc/nginx/nginx.conf && nginx -s reload"
timeout: 60
\`\`\`
Front-matter fields are required (subject, risk_level); the rest
have sensible defaults.
Step types¶
Seven types are supported. Each step is one fenced YAML block.
validate¶
Runs one or more shell checks; the step fails if any returns a non-zero exit. Auto-approved by default.
type: validate
requires_approval: false
checks:
- "test $(df / --output=avail | tail -1) -gt 524288"
- "systemctl is-active --quiet cron"
custom_tool¶
Free-form shell command on the target host. This is the workhorse —
most steps are custom_tool. Default: requires approval (you can
toggle per-step).
type: custom_tool
requires_approval: true
command: "apt-get install -y --only-upgrade nginx"
timeout: 600
recipe¶
Invokes a recipe by slug. The recipe's own risk level still drives the approval gate.
wait¶
Sleeps for N seconds. Auto-approved.
wait_until¶
Polls a check until it returns 0 or the timeout fires. Auto-approved.
type: wait_until
requires_approval: false
condition: "systemctl is-system-running --wait"
timeout_seconds: 300
interval_seconds: 15
notify¶
Fires a fleet notification (Discord webhook etc.). Auto-approved.
human_gate¶
A hard pause until an operator clicks Continue. Useful before destructive steps in an otherwise auto-approved plan.
Strategies¶
strategy (in front-matter) decides how the plan is dispatched
across the schedule's target servers.
| Strategy | Behaviour | strategy_config |
|---|---|---|
rolling |
One target at a time, sequentially. Stop on the first failure. Use this for high-risk ops where one bad host should halt the fleet. | none |
parallel |
All targets run concurrently. Use only for low-risk ops that are safe to fan out (disk cleanup, cache flush). | none |
parallel_batched |
Fixed-size batches; wait for one batch to finish before starting the next. | {batch_size: 5} |
rings |
Staged: canary 10% → pilot 30% → broad 100%, with optional soak windows and approval gates between rings. | {rings: [{name, percent, soak_minutes, approval_between}, …]} |
Example for kernel rollouts:
strategy: rings
strategy_config:
rings:
- name: canary
percent: 10
soak_minutes: 15
approval_between: false
- name: pilot
percent: 30
soak_minutes: 15
approval_between: true
- name: broad
percent: 100
soak_minutes: 0
Auto-approve rules¶
Plans of any risk level can skip the human pause for specific step
types using auto_approve_rules:
auto_approve_rules:
- step_type: validate
auto_approve: true
- step_type: wait
auto_approve: true
- step_type: notify
auto_approve: true
- step_type: custom_tool
recipe_slug: disk-cleanup # optional: scope to one recipe
auto_approve: true
By default validate, wait, wait_until, and notify only run
without approval if their step also has requires_approval: false.
Auto-approve rules let you say "all validate steps in this plan
are pre-approved" without setting it on each step.
The AI markdown editor¶
Authoring a plan from a blank page is friction. The dashboard's schedule detail page has a Maintenance Assistant panel where you chat with an LLM that drafts and revises plan markdown.
The contract:
- Your prompt → the LLM responds conversationally and may stream a
proposed markdown block enclosed in
===PROPOSED-MARKDOWN-BEGIN===/===PROPOSED-MARKDOWN-END===. - The proposal does not auto-apply. You click "Apply to editor"
to load it into the textarea, review it, then "Save markdown" to
persist it to the schedule's
frozen_markdown. - Once you approve a schedule, the markdown is frozen — later plan edits don't retroactively affect that scheduled run, and post-approve edits to the schedule itself are preserved through approval (this is the #68 fix; before it, approving wiped the operator's edits).
The AI is given the current markdown as context, so iterative edits ("add a 30-second wait between step 2 and 3") work cleanly.
Variables and placeholders¶
Plans can be parameterised with {{ variable }} markers. At
schedule-create time, the operator passes a variables_override
dict that the platform substitutes into the markdown before
execution.
Example template:
### Step 3: install {{ package_name }}
\`\`\`yaml
type: custom_tool
command: "apt-get install -y {{ package_name }}"
\`\`\`
The schedule POST body:
This is how the bundled "Install Software" plan works — one template, many invocations with different package names.
Jinja-template gotcha (already handled)¶
If your custom_tool command contains literal {{ ... }} (Docker
--format '{{.Names}}', kubectl -o jsonpath, Helm), you might
worry that Ansible's Jinja layer will mangle it. It would, but the
platform's command runner escapes the operator's literal at dispatch
time so Ansible renders the right thing on the target. Author your
commands as you'd write them in a shell — no extra escaping required.
This was a real bug (#69) fixed before this documentation existed.
The shipped catalogue¶
OpenRemedy seeds these plans at tenant creation:
| Plan | Risk | Strategy | Subject |
|---|---|---|---|
| Monthly Security Patches | medium | rolling | os |
| Kernel Patch — Rings with Reboot | high | rings | kernel |
| Disk Cleanup Cycle | low | parallel | os |
| TLS Certificate Renewal | low | parallel_batched | tls |
| Nginx Config Deployment | medium | rolling | nginx |
| MySQL Minor Version Upgrade | high | rolling | mysql |
| Install Software | medium | rolling | os (parametric — set the package per schedule) |
All seven are is_builtin=true. Customise by duplicating one and
editing — the originals are idempotently re-seeded on tenant
creation, so don't edit them in place.
Authoring workflow¶
- Create a plan from a blank template or duplicate a built-in:
Plans → New planin the dashboard. - Author the markdown (or generate via the AI editor and apply the proposal).
- Save — the parser surfaces schema errors (missing front-matter
key, invalid
step_type, malformed YAML in a step block) before the row commits. - Schedule the plan against target servers:
Plans → <plan> → Schedule. Pick the start time, optional cron, and the targets. - Approve the schedule. Approval freezes the markdown.
- Run is dispatched at
scheduled_startby theMaintenanceSchedulerproactive loop. - Review runs in
Maintenances → Schedules → <id>: per-server run status, per-step output, before/after snapshots.
What not to put in a plan¶
- Unbounded shell loops. Always use
wait_untilwith a realtimeout_secondsif you're polling. - Destructive
custom_toolsteps withrequires_approval: false. Default totrueand let auto-approve rules opt specific recipes into auto-execution if you've vetted them. - Hardcoded credentials. Use a recipe that pulls from an encrypted SSH config row, or pull from your secrets manager via a discovery step.
validatesteps that always pass. A check thatecho okexits 0 is theatre. Validate something real (disk space, service state, port reachable).
See also¶
- Recipe authoring — when to write a
recipe and reference it from a
recipestep rather than reaching forcustom_tool. - Server modes —
live/shadow/auditaffects how the schedule's runs are gated.