Maintenance plan authoring

A maintenance plan is a reusable, markdown-authored runbook. Front-matter declares the plan's risk and rollout strategy, and the "Steps" section is a sequence of typed YAML blocks that the platform executes one server at a time or in parallel, depending on the strategy you pick.

Plans cover the scheduled side of the platform — patches, kernel rollouts, TLS renewals, software installs. The reactive side is recipes bound to incidents.

The shape of a plan

Every plan is a row in maintenance_plans whose canonical form is its markdown column. Saving the plan parses the markdown and caches the structured steps in parsed_steps (JSONB), but the markdown is the source of truth — if the two ever drift, the markdown wins on the next save.

Operator-relevant fields:

| Field | Meaning |
| --- | --- |
| `markdown` | The full plan text (front-matter + steps). Edited via the dashboard or AI editor. |
| `risk_level` | `low` / `medium` / `high`. Influences default approval gating. |
| `strategy` | `rolling` / `parallel` / `parallel_batched` / `rings`. How runs are dispatched across targets. |
| `strategy_config` | JSONB. Strategy-specific knobs (batch size, ring percentages). |
| `snapshot_dimensions` | What system state to snapshot before/after the run, for after-action review. Subset of: `packages`, `kernel`, `systemd`, `configs`, `disk`, `containers`. |
| `auto_approve_rules` | List of rules that pre-approve specific step types (skip the human pause). |
| `target_selector` | Default target query (e.g. `{by: "role", value: "web"}`). Usually overridden per schedule. |

A plan becomes a MaintenanceSchedule when scheduled. The schedule freezes the markdown (frozen_markdown) at approve time so later plan edits don't affect a run already greenlit.

Markdown DSL

The format is YAML front-matter + level-3 step headings, each with a fenced YAML block:

---
subject: nginx
risk_level: medium
snapshot_dimensions: [configs, systemd]
strategy: rolling
---

# Maintenance: rolling nginx config push

What this plan does, why, when to run it.

## Steps

### Step 1: pre-flight checks
\`\`\`yaml
type: validate
requires_approval: false
checks:
  - "nginx -t"
  - "test $(df / --output=avail | tail -1) -gt 1048576"
\`\`\`

### Step 2: deploy new config
\`\`\`yaml
type: custom_tool
requires_approval: true
command: "cp /opt/nginx/staged.conf /etc/nginx/nginx.conf && nginx -s reload"
timeout: 60
\`\`\`

Two front-matter fields are required (subject and risk_level); the rest have sensible defaults.
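
To make the format concrete, here is a minimal Python sketch of how plan markdown could be split into front-matter and step blocks. It is not the platform's parser: `split_plan`, `REQUIRED_FRONT_MATTER`, and `FENCE` are hypothetical names, the front-matter is read as plain `key: value` strings rather than parsed as YAML, and each step's YAML is returned as raw text.

```python
import re

# Assumption: only the two keys the text names as required are checked.
REQUIRED_FRONT_MATTER = ("subject", "risk_level")
FENCE = "`" * 3  # three backticks, built at runtime so this sketch can sit inside a fence


def split_plan(markdown: str):
    """Return (front_matter, steps) for a plan.

    front_matter maps keys to raw string values; steps is a list of
    (heading_title, yaml_text) tuples in document order.
    """
    match = re.match(r"\A---\n(.*?)\n---\n", markdown, re.DOTALL)
    if match is None:
        raise ValueError("plan has no front-matter block")

    front_matter = {}
    for line in match.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            front_matter[key.strip()] = value.strip()

    missing = [key for key in REQUIRED_FRONT_MATTER if key not in front_matter]
    if missing:
        raise ValueError(f"missing front-matter keys: {missing}")

    # A step is a level-3 heading followed by a fenced yaml block.
    step_re = re.compile(
        r"^### (?P<title>[^\n]+)\n\s*" + FENCE + r"yaml\n(?P<yaml>.*?)\n" + FENCE,
        re.DOTALL | re.MULTILINE,
    )
    body = markdown[match.end():]
    steps = [(m.group("title").strip(), m.group("yaml")) for m in step_re.finditer(body)]
    return front_matter, steps
```

A real parser would hand each `yaml_text` to a YAML library and validate the `type` field; the sketch stops at the structural split.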

Step types

Seven types are supported. Each step is one fenced YAML block.

`validate`

Runs one or more shell checks; the step fails if any returns a non-zero exit. Auto-approved by default.

type: validate
requires_approval: false
checks:
  - "test $(df / --output=avail | tail -1) -gt 524288"
  - "systemctl is-active --quiet cron"
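
The pass/fail rule can be sketched in a few lines of Python. This is an illustration only: `run_validate` is a hypothetical helper that runs the checks on the local machine, whereas the platform runs them on the target host, and stopping at the first failing check is an assumption rather than documented behaviour.

```python
import subprocess


def run_validate(checks, timeout=60):
    """Run each shell check in order.

    Returns (True, None) when every check exits 0, otherwise
    (False, <the first check that exited non-zero>).
    """
    for check in checks:
        result = subprocess.run(check, shell=True, capture_output=True, timeout=timeout)
        if result.returncode != 0:
            return False, check
    return True, None
```

For example, `run_validate(["exit 0", "exit 3"])` reports the second check as the failure.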

`custom_tool`

Free-form shell command on the target host. This is the workhorse — most steps are custom_tool. Default: requires approval (you can toggle per-step).

type: custom_tool
requires_approval: true
command: "apt-get install -y --only-upgrade nginx"
timeout: 600

`recipe`

Invokes a recipe by slug. The recipe's own risk level still drives the approval gate.

type: recipe
requires_approval: true
recipe: db-backup
variables:
  db_name: production

`wait`

Sleeps for N seconds. Auto-approved.

type: wait
requires_approval: false
seconds: 60

`wait_until`

Polls a check until it returns 0 or the timeout fires. Auto-approved.

type: wait_until
requires_approval: false
condition: "systemctl is-system-running --wait"
timeout_seconds: 300
interval_seconds: 15
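
The polling behaviour can be sketched as a bounded loop. The function below is a hypothetical illustration, not the platform's runner: `check` stands in for running the `condition` command and returning its exit code, and the `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.

```python
import time


def wait_until(check, timeout_seconds, interval_seconds,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns 0.

    Returns True on success and False once another interval would
    overrun timeout_seconds.
    """
    deadline = clock() + timeout_seconds
    while True:
        if check() == 0:
            return True
        if clock() + interval_seconds > deadline:
            return False
        sleep(interval_seconds)
```

With `timeout_seconds: 300` and `interval_seconds: 15`, this sketch makes at most 21 attempts (one at the start, then one every 15 seconds) before giving up.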

`notify`

Fires a fleet notification (Discord webhook etc.). Auto-approved.

`human_gate`

A hard pause until an operator clicks Continue. Useful before destructive steps in an otherwise auto-approved plan.

type: human_gate
requires_approval: true

Strategies

strategy (in front-matter) decides how the plan is dispatched across the schedule's target servers.

| Strategy | Behaviour | `strategy_config` |
| --- | --- | --- |
| `rolling` | One target at a time, sequentially. Stop on the first failure. Use this for high-risk ops where one bad host should halt the fleet. | none |
| `parallel` | All targets run concurrently. Use only for low-risk ops that are safe to fan out (disk cleanup, cache flush). | none |
| `parallel_batched` | Fixed-size batches; wait for one batch to finish before starting the next. | `{batch_size: 5}` |
| `rings` | Staged: canary 10% → pilot 30% → broad 100%, with optional soak windows and approval gates between rings. | `{rings: [{name, percent, soak_minutes, approval_between}, …]}` |
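
The difference between `rolling` and `parallel_batched` dispatch can be sketched as follows. Both helpers are hypothetical illustrations of the behaviour described in the table, not the scheduler's code; `run_one` stands in for "execute the plan on this host and report success".

```python
from typing import Callable, Iterator, List, Sequence


def batches(targets: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield fixed-size batches; the last batch may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(targets), batch_size):
        yield list(targets[start:start + batch_size])


def run_rolling(targets: Sequence[str], run_one: Callable[[str], bool]) -> List[str]:
    """Run hosts one at a time and stop after the first failure.

    Returns the hosts that were attempted, including the failing one.
    """
    attempted = []
    for host in targets:
        attempted.append(host)
        if not run_one(host):
            break
    return attempted
```

With seven targets and `batch_size: 5`, `batches` yields one batch of five followed by one batch of two.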

Example for kernel rollouts:

strategy: rings
strategy_config:
  rings:
    - name: canary
      percent: 10
      soak_minutes: 15
      approval_between: false
    - name: pilot
      percent: 30
      soak_minutes: 15
      approval_between: true
    - name: broad
      percent: 100
      soak_minutes: 0
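
To see how many hosts each ring would cover, here is a small sketch. It assumes the percentages are cumulative (canary is the first 10% of targets, pilot extends coverage to 30%, broad covers the rest) and rounds up so a small fleet still gets a canary host; both are assumptions, and `ring_slices` is a hypothetical helper rather than a platform function.

```python
def ring_slices(targets, rings):
    """Split targets into consecutive ring groups using cumulative percentages.

    Returns a list of (ring_name, hosts) tuples in ring order.
    """
    groups = []
    start = 0
    total = len(targets)
    for ring in rings:
        # Ceiling division: e.g. 10% of 5 hosts rounds up to 1 host.
        end = -(-total * ring["percent"] // 100)
        end = max(end, start)
        groups.append((ring["name"], list(targets[start:end])))
        start = end
    return groups
```

Under these assumptions, a 20-host fleet with the rings above splits into 2 canary, 4 pilot, and 14 broad hosts.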

Auto-approve rules

Plans of any risk level can skip the human pause for specific step types using auto_approve_rules:

auto_approve_rules:
  - step_type: validate
    auto_approve: true
  - step_type: wait
    auto_approve: true
  - step_type: notify
    auto_approve: true
  - step_type: custom_tool
    recipe_slug: disk-cleanup     # optional: scope to one recipe
    auto_approve: true

By default validate, wait, wait_until, and notify only run without approval if their step also has requires_approval: false. Auto-approve rules let you say "all validate steps in this plan are pre-approved" without setting it on each step.
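
One way to read the interaction between per-step `requires_approval` and plan-level rules is sketched below. The precedence is an assumption, not documented behaviour: a matching rule wins over the step's own flag, `human_gate` always pauses, a missing `requires_approval` is treated as `true`, and an optional `recipe_slug` on a rule is compared against the step's `recipe` key. `needs_approval` is a hypothetical name.

```python
def needs_approval(step, auto_approve_rules):
    """Return True if the step should pause for a human."""
    if step["type"] == "human_gate":
        return True  # assumption: a hard gate cannot be auto-approved

    for rule in auto_approve_rules:
        if not rule.get("auto_approve"):
            continue
        if rule["step_type"] != step["type"]:
            continue
        slug = rule.get("recipe_slug")
        if slug is not None and slug != step.get("recipe"):
            continue  # rule is scoped to a different recipe
        return False

    # No rule matched: fall back to the step's own flag, defaulting to a pause.
    return step.get("requires_approval", True)
```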

The AI markdown editor

Authoring a plan from a blank page is friction. The dashboard's schedule detail page has a Maintenance Assistant panel where you chat with an LLM that drafts and revises plan markdown.

The contract:

- Your prompt → the LLM responds conversationally and may stream a proposed markdown block enclosed in ===PROPOSED-MARKDOWN-BEGIN=== / ===PROPOSED-MARKDOWN-END===.
- The proposal does not auto-apply. You click "Apply to editor" to load it into the textarea, review it, then "Save markdown" to persist it to the schedule's frozen_markdown.
- Once you approve a schedule, the markdown is frozen: later plan edits don't retroactively affect that scheduled run, and edits made to the schedule's markdown before approval are preserved through approval (this is the #68 fix; before it, approving wiped the operator's edits).

The AI is given the current markdown as context, so iterative edits ("add a 30-second wait between step 2 and 3") work cleanly.
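
If you consume the assistant's replies outside the dashboard, the proposal markers are easy to pick out. The sketch below is a hypothetical client-side helper (`extract_proposal` is not a platform API); it returns `None` while a streamed reply has not yet produced the closing marker.

```python
import re
from typing import Optional

BEGIN = "===PROPOSED-MARKDOWN-BEGIN==="
END = "===PROPOSED-MARKDOWN-END==="


def extract_proposal(reply: str) -> Optional[str]:
    """Return the markdown between the proposal markers, or None if absent."""
    pattern = re.escape(BEGIN) + r"\n?(.*?)\n?" + re.escape(END)
    match = re.search(pattern, reply, re.DOTALL)
    return match.group(1) if match else None
```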

Variables and placeholders

Plans can be parameterised with {{ variable }} markers. At schedule-create time, the operator passes a variables_override dict that the platform substitutes into the markdown before execution.

Example template:

### Step 3: install {{ package_name }}
\`\`\`yaml
type: custom_tool
command: "apt-get install -y {{ package_name }}"
\`\`\`

The schedule POST body:

{"plan_id": "...", "variables_override": {"package_name": "nginx"}}

This is how the bundled "Install Software" plan works — one template, many invocations with different package names.
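
The substitution step can be approximated with a small regular expression. This is a sketch under stated assumptions, not the platform's renderer: `render_plan` is a hypothetical name, only simple identifier placeholders are recognised, and names missing from `variables_override` are left untouched rather than raising an error.

```python
import re

# Matches {{ name }} where name is a plain identifier; things like {{.Names}} do not match.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render_plan(markdown: str, variables_override: dict) -> str:
    """Substitute {{ variable }} markers with values from variables_override."""
    def replace(match):
        name = match.group(1)
        if name in variables_override:
            return str(variables_override[name])
        return match.group(0)  # unknown placeholder: keep the original text

    return PLACEHOLDER.sub(replace, markdown)
```

Values are inserted verbatim into shell commands in this sketch, so anything passed through `variables_override` should be treated as trusted operator input.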

Jinja-template gotcha (already handled)

If your custom_tool command contains literal {{ ... }} (Docker --format '{{.Names}}', kubectl -o jsonpath, Helm), you might worry that Ansible's Jinja layer will mangle it. It would, but the platform's command runner escapes the operator's literal braces at dispatch time so that Ansible renders the intended command on the target. Author your commands as you'd write them in a shell; no extra escaping is required.

This was a real bug (#69) fixed before this documentation existed.
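
For readers curious what such escaping can look like, one common technique is to wrap the command in Jinja's `{% raw %}` block so the template engine passes it through untouched. The sketch below illustrates that technique only; the text does not say which mechanism the command runner uses, and `escape_for_jinja` is a hypothetical name.

```python
import re

# A literal endraw tag inside the command would close the raw block early.
END_RAW = re.compile(r"\{%-?\s*endraw\s*-?%\}")


def escape_for_jinja(command: str) -> str:
    """Wrap a command in a raw block if it contains Jinja-looking delimiters."""
    if END_RAW.search(command):
        raise ValueError("command contains a literal endraw tag")
    if not any(marker in command for marker in ("{{", "{%", "{#")):
        return command  # nothing Jinja would interpret
    return "{% raw %}" + command + "{% endraw %}"
```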

The shipped catalogue

OpenRemedy seeds these plans at tenant creation:

| Plan | Risk | Strategy | Subject |
| --- | --- | --- | --- |
| Monthly Security Patches | medium | rolling | os |
| Kernel Patch — Rings with Reboot | high | rings | kernel |
| Disk Cleanup Cycle | low | parallel | os |
| TLS Certificate Renewal | low | parallel_batched | tls |
| Nginx Config Deployment | medium | rolling | nginx |
| MySQL Minor Version Upgrade | high | rolling | mysql |
| Install Software | medium | rolling | os (parametric: set the package per schedule) |

All seven are is_builtin=true. Customise by duplicating one and editing — the originals are idempotently re-seeded on tenant creation, so don't edit them in place.

Authoring workflow

1. Create a plan from a blank template or duplicate a built-in: Plans → New plan in the dashboard.
2. Author the markdown (or generate it via the AI editor and apply the proposal).
3. Save. The parser surfaces schema errors (missing front-matter key, invalid step_type, malformed YAML in a step block) before the row commits.
4. Schedule the plan against target servers: Plans → <plan> → Schedule. Pick the start time, optional cron, and the targets.
5. Approve the schedule. Approval freezes the markdown.
6. The run is dispatched at scheduled_start by the MaintenanceScheduler proactive loop.
7. Review runs in Maintenances → Schedules → <id>: per-server run status, per-step output, before/after snapshots.

What not to put in a plan

- Unbounded shell loops. Always use wait_until with a real timeout_seconds if you're polling.
- Destructive custom_tool steps with requires_approval: false. Default to true and let auto-approve rules opt specific recipes into auto-execution if you've vetted them.
- Hardcoded credentials. Use a recipe that pulls from an encrypted SSH config row, or pull from your secrets manager via a discovery step.
- validate steps that always pass. A check such as echo ok always exits 0 and is theatre. Validate something real (disk space, service state, port reachable).
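
These rules lend themselves to a pre-save lint. The sketch below is a hypothetical reviewer aid, not a platform feature: `lint_step` works on an already-parsed step dict, and its pattern lists are deliberately crude heuristics that will miss cases and occasionally flag harmless commands.

```python
import re

# All three pattern sets are illustrative assumptions, not an exhaustive policy.
TRIVIAL_CHECKS = {"true", ":", "echo ok", "exit 0"}
SECRET_HINT = re.compile(r"(password|passwd|secret|token)\s*=\s*\S+", re.IGNORECASE)
DESTRUCTIVE_HINT = re.compile(r"\b(?:rm\s+-rf|mkfs|dd\s+if=|drop\s+database)", re.IGNORECASE)
UNBOUNDED_LOOP = re.compile(r"\bwhile\s+(?:true|:)\s*;")


def lint_step(step: dict) -> list:
    """Return a list of warning strings for one parsed step."""
    warnings = []
    kind = step.get("type")
    if kind == "validate":
        for check in step.get("checks", []):
            if check.strip() in TRIVIAL_CHECKS:
                warnings.append(f"validate check always passes: {check!r}")
    if kind == "custom_tool":
        command = step.get("command", "")
        if DESTRUCTIVE_HINT.search(command) and step.get("requires_approval") is False:
            warnings.append("destructive command runs without approval")
        if SECRET_HINT.search(command):
            warnings.append("command appears to contain a hardcoded credential")
        if UNBOUNDED_LOOP.search(command):
            warnings.append("unbounded shell loop; consider wait_until with timeout_seconds")
    return warnings
```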