
Recipe authoring

A recipe is the unit of safe, repeatable remediation in OpenRemedy: an Ansible playbook bundled with metadata (slug, risk level, incident type, parameter schema) that the agent can propose and execute on your behalf.

This page is for operators who want to write or modify recipes for their tenant. The shipped catalogue covers the common cases — start there before authoring custom ones.

Recipe shape

Every recipe is a row in the recipes table backed by a YAML playbook on disk. The fields that matter to authors:

| Field | Type | Meaning |
|---|---|---|
| slug | unique string | Machine name. The agent calls execute_recipe(slug=...) with this. Stable across versions. |
| name | string | Human-readable name shown in the UI ("Safe Systemd Service Restart"). |
| description | text | What the recipe does, when to use it, when not to. Operator-facing. |
| incident_type | string | The Incident.incident_type this recipe applies to (service_down, disk_full, port_unavailable, custom, …). The agent filters by this when listing candidates. |
| risk_level | enum | One of none / low / medium / high. Drives the trust × risk gate. |
| playbook_path | string | Path to the .yml playbook on disk inside the backend image (recipes/<file>.yml). |
| variables | JSONB | Parameter schema — the keys the agent can pass via execute_recipe(variables=...). |
| pre_checks / post_checks | text | Human description of what the playbook validates before / after. Not executed by the platform — they're for the operator reviewing the recipe. |
| prerequisites | text | OS/software dependencies. Informational. |
| os_family | JSONB array | Allowed OS families (e.g. ["debian", "rhel"]). Used at execution time to filter incompatible servers. |
| tags | array | Free-form filters for search. |
| category | string | Loose grouping (diagnostic, remediation, …). |
| version | semver string | Operator-managed. Bump when you change the playbook. |
| is_parameterized | bool | True if the recipe accepts variables. |
| is_proactive | bool | True if the recipe is safe to schedule before an incident (e.g. log rotation). |

The catalogue is global by default — every tenant sees the same recipes. Per-tenant variants and per-tenant overrides (recipe_role_overrides) live in their own tables; see below.

Risk levels

The risk_level you pick is the single most consequential authoring decision. It determines whether the agent can auto-execute the recipe on its own or has to wait for a human.

| Level | Examples | Auto-executes for trust ∈ |
|---|---|---|
| none | system-info, disk-usage, log-read | autonomous, supervised, manual |
| low | systemd-restart, port-validation, config-validation | autonomous, supervised |
| medium | disk-cleanup, firewall-allow, systemd-override | (always requires approval) |
| high | (rare) | (always requires approval) |

Rule of thumb:

  • none = pure read. No become: true. No file writes. No service changes. If your playbook only runs command: calls that gather data, it's none.
  • low = idempotent writes with proper pre/post checks. Restart a service that's already-supposed-to-be-up. Validate config without reloading. Anything where retrying is safe and a partial run can't corrupt state.
  • medium = destructive but recoverable. Clean a cache. Vacuum logs. Remove orphaned containers. Failure mode is "I deleted something I shouldn't have"; data loss is bounded but real.
  • high = rare — rebuilds, rolling reboots, partition resizes, schema migrations on production. Anything where a wrong invocation is a Saturday-morning incident.

When in doubt, pick the higher level. You can always loosen later.
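
As a concrete anchor for the none tier, a read-only diagnostic playbook might look like this — a sketch, not a shipped recipe; the task names and command are illustrative:

```yaml
# Hypothetical "none"-risk diagnostic playbook: pure read, no become, no writes.
- name: Gather disk usage facts
  hosts: all
  gather_facts: false
  tasks:
    - name: Report filesystem usage
      ansible.builtin.command: df -h   # command never invokes a shell
      register: df_out
      changed_when: false              # a diagnostic should never report "changed"

    - name: Show the result
      ansible.builtin.debug:
        var: df_out.stdout_lines
```

The changed_when: false marker is the tell of a true none-risk playbook: every task reports "ok", never "changed".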

Lifecycle

```mermaid
sequenceDiagram
    autonumber
    participant Agent
    participant Gate as guardrails
    participant Op as Operator
    participant Worker as ARQ worker
    participant Host as Target server

    Agent->>Gate: execute_recipe(slug, variables)
    Gate->>Gate: trust × risk + safety classifier + role override
    alt auto-execute (low risk + autonomous, or override)
      Gate->>Worker: dispatch (status=approved)
    else approval required
      Gate->>Op: status=awaiting_approval
      Op->>Gate: approve / reject
      Gate->>Worker: dispatch
    end
    Worker->>Host: ansible-playbook --extra-vars '{...}'
    Host-->>Worker: stdout / stderr / rc
    Worker->>Worker: persist output to S3, mark execution
    Worker-->>Agent: execution.completed
```

The full path through the code is documented in architecture flow E.

Variables and parameter substitution

If your recipe needs runtime parameters, declare them in the variables JSONB field as a flat dict of defaults:

```json
{
  "service_name": "nginx",
  "timeout_seconds": 30
}
```

The agent's call site looks like:

execute_recipe(slug="systemd-restart", variables='{"service_name":"redis"}')

Variables are passed to Ansible as extravars. Inside the playbook, reference them with "{{ service_name }}" Jinja syntax. Ansible escapes them at template time, so an LLM-supplied parameter value cannot break out of an argument slot.
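
Inside a playbook, the declared defaults and the agent-supplied extra-vars meet like this. A sketch of what a systemd-restart-style playbook's tasks might contain — the task names and retry logic are illustrative, not the shipped recipe:

```yaml
# service_name and timeout_seconds arrive as extra-vars; extra-vars
# override any defaults declared in the recipe's variables field.
- name: Restart {{ service_name }}
  ansible.builtin.systemd:
    name: "{{ service_name }}"
    state: restarted
  become: true

- name: Post-check that the unit is active again
  ansible.builtin.command: systemctl is-active "{{ service_name }}"
  register: active_check
  changed_when: false
  retries: "{{ (timeout_seconds | int) // 5 }}"
  delay: 5
  until: active_check.rc == 0
```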

A related note: literal {{ ... }} in a recipe's command string (Docker --format '{{.Names}}', kubectl -o jsonpath=..., Helm) needs no special handling — the platform escapes the operator's literal at dispatch time and Ansible renders the right thing on the target. This was a real bug (#69), fixed in v0.1.x.
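
For instance, a recipe task whose command legitimately contains Docker's own Go-template syntax can be written as-is — an illustrative task, assuming the platform's dispatch-time escaping described above:

```yaml
# The {{.Names}} here is Docker's template syntax, not an Ansible variable;
# the platform escapes the literal at dispatch time (see #69), so the author
# writes it naturally.
- name: List running containers by name
  ansible.builtin.command: docker ps --format '{{.Names}}'
  register: container_names
  changed_when: false
```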

Recipe role overrides

The recipe_role_overrides table lets a tenant admin promote a specific (recipe_slug, server_role) tuple out of the trust × risk gate. The next time the agent proposes that recipe on a server with that role, the gate is short-circuited and the call auto-executes even though the agent's trust_level would normally require approval.

Use this when:

  • You've manually approved the same recipe on the same role enough times that the dashboard surfaces a "Promote?" suggestion.
  • You're confident the recipe is safe on this role specifically and you want to stop being asked.

Don't use this when:

  • The recipe is medium-or-higher risk and you haven't run it many times. The override skips the safety classifier too — there's no fallback layer.

Revoke an override from the Agents page in the dashboard; on the next agent run, the gate re-engages.

The shipped catalogue

OpenRemedy seeds these recipes at install time. Most operator needs are covered; check here before authoring:

| Slug | Risk | Type | What it does |
|---|---|---|---|
| systemd-restart | low | service_down | Restarts a systemd unit with pre/post checks. |
| service-restart | low | service_down | Generic service restart (init.d / systemd / OpenRC). |
| config-validation | low | service_down | Runs nginx -t / apachectl configtest / mysqld --validate-config before any restart-based recipe. |
| port-validation | low | port_unavailable | Confirms a port is listening and reports the owning process. |
| firewall-allow | medium | port_unavailable | UFW or firewalld rule to allow a TCP/UDP port. |
| log-cleanup | low | disk_full | Vacuum journald + rotate / delete old /var/log/*.log.*. |
| disk-cleanup | medium | disk_full | Aggressive: apt cache, dnf cache, /tmp, large *.log files. |
| systemd-override | medium | service_down | Edits a unit's [Service] block (e.g., Restart=always). Preventive; not a first-line fix. |
| system-info, disk-usage, log-read, log-search, service-status, … | none | (diagnostic) | Read-only fact gathering. |

Full list: seed.py:STARTER_RECIPES in the backend.

Authoring workflow

Recipes are managed via the platform's REST API (UI for editing is on the roadmap). Only superadmin can create / update / delete; any tenant admin can read and execute.

```bash
# Create
curl -X POST https://app.example.com/api/v1/recipes \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "slug": "redis-flush-cache",
    "name": "Redis FLUSHDB on a single key namespace",
    "description": "Clears redis keys matching a prefix without touching other DBs.",
    "incident_type": "custom",
    "risk_level": "medium",
    "playbook_path": "recipes/redis_flush_namespace.yml",
    "variables": {"namespace": "session:"},
    "tags": ["redis", "cache"],
    "category": "remediation",
    "version": "1.0.0",
    "is_parameterized": true
  }'

# Update — bump version explicitly when the playbook changes
curl -X PATCH https://app.example.com/api/v1/recipes/redis-flush-cache \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"version": "1.0.1", "description": "...updated..."}'

# Delete
curl -X DELETE https://app.example.com/api/v1/recipes/redis-flush-cache \
  -H "Authorization: Bearer $TOKEN"
```

The playbook file (recipes/redis_flush_namespace.yml) lives in the backend image — operators ship it via a custom backend image build or a volume mount on top of the GHCR image.
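
With a Compose-based install, the volume-mount route might look like this — a sketch in which the service name, image reference, and in-container path are assumptions about your deployment, not prescribed by the platform:

```yaml
# docker-compose.override.yml (hypothetical layout)
services:
  backend:
    image: ghcr.io/example/openremedy-backend:latest   # placeholder image ref
    volumes:
      # Overlay custom playbooks onto the image's recipes/ directory so that
      # playbook_path values like recipes/redis_flush_namespace.yml resolve.
      - ./custom-recipes:/app/recipes:ro
```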

Testing a recipe before exposing it

Set dry_run=true when you call POST /api/v1/incidents/{id}/executions. The worker passes --check to Ansible, no changes are applied, and the execution lands in status preview_completed (distinct from success). The output you get back is what the recipe would have done. Promote the recipe only after a clean dry run on a real target.

What not to do

Patterns that look harmless but break the security model:

  • Shell module with un-quoted variables. shell: "rm {{ path }}" is shell-injection-prone if path is operator-controlled. Use command: rm "{{ path }}" (the command module never invokes a shell) or ansible.builtin.file: state=absent (ideal).
  • Missing become: true on tasks that need root. The playbook silently runs as the daemon user, fails cryptically. Default to become: true and only drop it on diagnostic playbooks.
  • Hardcoded paths that vary by distro. path: /var/log/apache2/ exists on Debian; on RHEL the same logs live in /var/log/httpd/, and on Alpine services are managed by OpenRC (/etc/init.d/) rather than systemd. Use when: ansible_os_family == "Debian" guards or set the recipe's os_family field to a single family.
  • Unbounded command: loops. No timeout:, no register:, no failed_when:. The worker's overall timeout will eventually kill the playbook, but you'll lose the partial state.
  • Recipes that mutate state without a corresponding rollback. If your recipe edits /etc/systemd/system/foo.service.d/override.conf, ship a sibling recipe (systemd-override-revert) that removes it. Operators will need to reverse course at some point.
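
The first pitfall above, side by side — illustrative tasks in which path is an operator- or agent-supplied variable:

```yaml
# Unsafe: the shell sees the expanded value, so a path like
# "/tmp/x; rm -rf /" injects a second command.
# - name: Delete a file (DON'T)
#   ansible.builtin.shell: "rm {{ path }}"

# Safe: the file module takes the value as data; no shell is ever involved.
- name: Delete a file
  ansible.builtin.file:
    path: "{{ path }}"
    state: absent
  become: true
```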

See also