Skip to content

Maintenance plan authoring

A maintenance plan is a reusable, markdown-authored runbook: front-matter declares the plan's risk and rollout strategy, the "Steps" section is a sequence of typed YAML blocks the platform executes one server at a time (or in parallel, depending on the strategy you pick).

Plans cover the scheduled side of the platform — patches, kernel rollouts, TLS renewals, software installs. The reactive side is recipes bound to incidents.

The shape of a plan

Every plan is a row in maintenance_plans whose canonical form is its markdown column. Saving the plan parses the markdown and caches the structured steps in parsed_steps (JSONB), but the markdown is the source of truth — if the two ever drift, the markdown wins on the next save.

Operator-relevant fields:

Field Meaning
markdown The full plan text (front-matter + steps). Edited via the dashboard or AI editor.
risk_level low / medium / high. Influences default approval gating.
strategy rolling / parallel / parallel_batched / rings. How runs are dispatched across targets.
strategy_config JSONB. Strategy-specific knobs (batch size, ring percentages).
snapshot_dimensions What system state to snapshot before/after the run, for after-action review. Subset of: packages, kernel, systemd, configs, disk, containers.
auto_approve_rules List of rules that pre-approve specific step types (skip the human pause).
target_selector Default target query (e.g. {by: "role", value: "web"}). Usually overridden per schedule.

A plan becomes a MaintenanceSchedule when scheduled. The schedule freezes the markdown (frozen_markdown) at approve time so later plan edits don't affect a run already greenlit.

Markdown DSL

The format is YAML front-matter + level-3 step headings, each with a fenced YAML block:

---
subject: nginx
risk_level: medium
snapshot_dimensions: [configs, systemd]
strategy: rolling
---

# Maintenance: rolling nginx config push

What this plan does, why, when to run it.

## Steps

### Step 1: pre-flight checks
\`\`\`yaml
type: validate
requires_approval: false
checks:
  - "nginx -t"
  - "test $(df / --output=avail | tail -1) -gt 1048576"
\`\`\`

### Step 2: deploy new config
\`\`\`yaml
type: custom_tool
requires_approval: true
command: "cp /opt/nginx/staged.conf /etc/nginx/nginx.conf && nginx -s reload"
timeout: 60
\`\`\`

Front-matter fields are required (subject, risk_level); the rest have sensible defaults.

Step types

Seven types are supported. Each step is one fenced YAML block.

validate

Runs one or more shell checks; the step fails if any returns a non-zero exit. Auto-approved by default.

type: validate
requires_approval: false
checks:
  - "test $(df / --output=avail | tail -1) -gt 524288"
  - "systemctl is-active --quiet cron"

custom_tool

Free-form shell command on the target host. This is the workhorse — most steps are custom_tool. Default: requires approval (you can toggle per-step).

type: custom_tool
requires_approval: true
command: "apt-get install -y --only-upgrade nginx"
timeout: 600

recipe

Invokes a recipe by slug. The recipe's own risk level still drives the approval gate.

type: recipe
requires_approval: true
recipe: db-backup
variables:
  db_name: production

wait

Sleeps for N seconds. Auto-approved.

type: wait
requires_approval: false
seconds: 60

wait_until

Polls a check until it returns 0 or the timeout fires. Auto-approved.

type: wait_until
requires_approval: false
condition: "systemctl is-system-running --wait"
timeout_seconds: 300
interval_seconds: 15

notify

Fires a fleet notification (Discord webhook etc.). Auto-approved.

human_gate

A hard pause until an operator clicks Continue. Useful before destructive steps in an otherwise auto-approved plan.

type: human_gate
requires_approval: true

Strategies

strategy (in front-matter) decides how the plan is dispatched across the schedule's target servers.

Strategy Behaviour strategy_config
rolling One target at a time, sequentially. Stop on the first failure. Use this for high-risk ops where one bad host should halt the fleet. none
parallel All targets run concurrently. Use only for low-risk ops that are safe to fan out (disk cleanup, cache flush). none
parallel_batched Fixed-size batches; wait for one batch to finish before starting the next. {batch_size: 5}
rings Staged: canary 10% → pilot 30% → broad 100%, with optional soak windows and approval gates between rings. {rings: [{name, percent, soak_minutes, approval_between}, …]}

Example for kernel rollouts:

strategy: rings
strategy_config:
  rings:
    - name: canary
      percent: 10
      soak_minutes: 15
      approval_between: false
    - name: pilot
      percent: 30
      soak_minutes: 15
      approval_between: true
    - name: broad
      percent: 100
      soak_minutes: 0

Auto-approve rules

Plans of any risk level can skip the human pause for specific step types using auto_approve_rules:

auto_approve_rules:
  - step_type: validate
    auto_approve: true
  - step_type: wait
    auto_approve: true
  - step_type: notify
    auto_approve: true
  - step_type: custom_tool
    recipe_slug: disk-cleanup     # optional: scope to one recipe
    auto_approve: true

By default validate, wait, wait_until, and notify only run without approval if their step also has requires_approval: false. Auto-approve rules let you say "all validate steps in this plan are pre-approved" without setting it on each step.

The AI markdown editor

Authoring a plan from a blank page is friction. The dashboard's schedule detail page has a Maintenance Assistant panel where you chat with an LLM that drafts and revises plan markdown.

The contract:

  • Your prompt → the LLM responds conversationally and may stream a proposed markdown block enclosed in ===PROPOSED-MARKDOWN-BEGIN=== / ===PROPOSED-MARKDOWN-END===.
  • The proposal does not auto-apply. You click "Apply to editor" to load it into the textarea, review it, then "Save markdown" to persist it to the schedule's frozen_markdown.
  • Once you approve a schedule, the markdown is frozen — later plan edits don't retroactively affect that scheduled run, and post-approve edits to the schedule itself are preserved through approval (this is the #68 fix; before it, approving wiped the operator's edits).

The AI is given the current markdown as context, so iterative edits ("add a 30-second wait between step 2 and 3") work cleanly.

Variables and placeholders

Plans can be parameterised with {{ variable }} markers. At schedule-create time, the operator passes a variables_override dict that the platform substitutes into the markdown before execution.

Example template:

### Step 3: install {{ package_name }}
\`\`\`yaml
type: custom_tool
command: "apt-get install -y {{ package_name }}"
\`\`\`

The schedule POST body:

{"plan_id": "...", "variables_override": {"package_name": "nginx"}}

This is how the bundled "Install Software" plan works — one template, many invocations with different package names.

Jinja-template gotcha (already handled)

If your custom_tool command contains literal {{ ... }} (Docker --format '{{.Names}}', kubectl -o jsonpath, Helm), you might worry that Ansible's Jinja layer will mangle it. It would, but the platform's command runner escapes the operator's literal at dispatch time so Ansible renders the right thing on the target. Author your commands as you'd write them in a shell — no extra escaping required.

This was a real bug (#69) fixed before this documentation existed.

The shipped catalogue

OpenRemedy seeds these plans at tenant creation:

Plan Risk Strategy Subject
Monthly Security Patches medium rolling os
Kernel Patch — Rings with Reboot high rings kernel
Disk Cleanup Cycle low parallel os
TLS Certificate Renewal low parallel_batched tls
Nginx Config Deployment medium rolling nginx
MySQL Minor Version Upgrade high rolling mysql
Install Software medium rolling os (parametric — set the package per schedule)

All seven are is_builtin=true. Customise by duplicating one and editing — the originals are idempotently re-seeded on tenant creation, so don't edit them in place.

Authoring workflow

  1. Create a plan from a blank template or duplicate a built-in: Plans → New plan in the dashboard.
  2. Author the markdown (or generate via the AI editor and apply the proposal).
  3. Save — the parser surfaces schema errors (missing front-matter key, invalid step_type, malformed YAML in a step block) before the row commits.
  4. Schedule the plan against target servers: Plans → <plan> → Schedule. Pick the start time, optional cron, and the targets.
  5. Approve the schedule. Approval freezes the markdown.
  6. Run is dispatched at scheduled_start by the MaintenanceScheduler proactive loop.
  7. Review runs in Maintenances → Schedules → <id>: per-server run status, per-step output, before/after snapshots.

What not to put in a plan

  • Unbounded shell loops. Always use wait_until with a real timeout_seconds if you're polling.
  • Destructive custom_tool steps with requires_approval: false. Default to true and let auto-approve rules opt specific recipes into auto-execution if you've vetted them.
  • Hardcoded credentials. Use a recipe that pulls from an encrypted SSH config row, or pull from your secrets manager via a discovery step.
  • validate steps that always pass. A check that echo ok exits 0 is theatre. Validate something real (disk space, service state, port reachable).

See also

  • Recipe authoring — when to write a recipe and reference it from a recipe step rather than reaching for custom_tool.
  • Server modeslive / shadow / audit affects how the schedule's runs are gated.