Security model¶
This document consolidates the security posture of OpenRemedy: the boundaries the platform enforces, the credentials it requires, and the mitigations applied to its agentic surface. It is intended for operators deploying OpenRemedy and for auditors reviewing the platform.
The model has been hardened through a focused security pass; every mitigation below corresponds to a closed finding. Where relevant, the source file and behaviour are cited.
Required environment variables¶
The deploy aborts at compose interpolation time if any of these are missing. The application also validates the values on boot and refuses to start with known dev defaults.
| Variable | Purpose | Constraints |
|---|---|---|
| `OREMEDY_SECRET_KEY` | JWT signing key (HS256) | At least 32 characters. Rejected if it matches the historical dev defaults (`changeme-dev-secret-key-32chars!!`, `dev-secret-key-change-in-production`, `changeme`). Generate with `openssl rand -base64 48`. |
| `OREMEDY_ENCRYPTION_KEY` | AES-256-GCM data key for stored secrets | Exactly 64 hex characters (32 bytes). Rejected if it matches dev placeholders such as 64×`a` or the example `0123…`. Generate with `openssl rand -hex 32`. |
| `POSTGRES_PASSWORD` | Database password | Required, no default. |
In production:
- `OREMEDY_ENV=production` activates the production validators.
- `OREMEDY_DEBUG=true` is rejected when `OREMEDY_ENV=production`.
- `OREMEDY_CORS_ORIGINS` must not contain `*`. The default in the production compose file is `https://${DOMAIN}`; override per deployment to add origins.
Authentication¶
Web sessions: HttpOnly cookies¶
Browser sessions use two cookies, both HttpOnly + Secure +
SameSite=strict, set on a successful POST /auth/login or
POST /auth/register:
| Cookie | Lifetime | Purpose |
|---|---|---|
| `access_token` | `OREMEDY_ACCESS_TOKEN_EXPIRE_MINUTES` (default 480) | API auth |
| `refresh_token` | `OREMEDY_REFRESH_TOKEN_EXPIRE_DAYS` (default 30) | Refresh path |
POST /auth/refresh reads the refresh cookie, validates the JWT, and
re-issues a new pair via Set-Cookie. POST /auth/logout clears
both. Tokens never appear in JavaScript scope, so an XSS payload
cannot read them out of localStorage or out of a fetch() response
body.
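A minimal sketch of the cookie issuance, assuming FastAPI's `Response.set_cookie` (the helper name and wiring are illustrative, not the literal handler):

```python
from fastapi import Response

def set_session_cookies(response: Response, access_jwt: str, refresh_jwt: str) -> None:
    # Both cookies are HttpOnly + Secure + SameSite=strict, so scripts never see them.
    response.set_cookie(
        "access_token", access_jwt,
        httponly=True, secure=True, samesite="strict",
        max_age=480 * 60,  # OREMEDY_ACCESS_TOKEN_EXPIRE_MINUTES default
    )
    response.set_cookie(
        "refresh_token", refresh_jwt,
        httponly=True, secure=True, samesite="strict",
        max_age=30 * 24 * 3600,  # OREMEDY_REFRESH_TOKEN_EXPIRE_DAYS default
    )
```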
Programmatic clients: Bearer header¶
The get_current_user dependency reads the JWT from the
access_token cookie first and falls back to
Authorization: Bearer <jwt>. CLI tools, the Go daemon, and any
non-browser caller can keep using Bearer.
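The dual-source lookup reduces to a few lines; a sketch, not the literal dependency:

```python
from fastapi import HTTPException, Request

def extract_token(request: Request) -> str:
    # Cookie first (browsers), Authorization: Bearer as the fallback (CLI, daemon).
    token = request.cookies.get("access_token")
    if token:
        return token
    auth = request.headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        return auth.removeprefix("Bearer ")
    raise HTTPException(status_code=401, detail="Not authenticated")
```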
WebSocket handshake¶
/ws/incidents and /ws/executions/{id} accept the cookie (the
browser sends it automatically on a same-origin upgrade) or, as a
fallback for non-browser clients, the
Sec-WebSocket-Protocol: bearer, <jwt> slot. URL query params are
not supported because they leak into proxy access logs. Pre-handshake
auth failures close the WS with policy-violation status.
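For a non-browser client, a sketch using the third-party `websockets` package (hostname is a placeholder):

```python
import asyncio
import websockets

async def watch_incidents(jwt: str) -> None:
    # The token rides in the Sec-WebSocket-Protocol header, never the URL.
    async with websockets.connect(
        "wss://oremedy.example.com/ws/incidents",
        subprotocols=["bearer", jwt],
    ) as ws:
        async for event in ws:
            print(event)

# asyncio.run(watch_incidents("<jwt>"))
```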
Login rate limiting¶
POST /auth/login is rate-limited at 10 requests per minute per
client IP via slowapi (core/rate_limit.py). The bucket key uses
the leftmost X-Forwarded-For value only when the immediate TCP peer
is in trusted (RFC1918 / loopback) space; otherwise the actual peer
address.
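A sketch of that key function wired into slowapi (names illustrative; the real one lives in `core/rate_limit.py`):

```python
import ipaddress
from slowapi import Limiter
from starlette.requests import Request

def client_ip(request: Request) -> str:
    peer = request.client.host
    if ipaddress.ip_address(peer).is_private or ipaddress.ip_address(peer).is_loopback:
        # Immediate TCP peer is a trusted proxy: take the leftmost forwarded hop.
        forwarded = request.headers.get("X-Forwarded-For", "")
        if forwarded:
            return forwarded.split(",")[0].strip()
    return peer

limiter = Limiter(key_func=client_ip)
# The login route is then decorated with @limiter.limit("10/minute").
```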
Webhook authentication¶
POST /api/v1/webhooks/alerts/{tenant_slug} requires every request
to carry an HMAC-SHA256 signature of the raw body, computed against
the tenant's `webhook_secret`.
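Verification reduces to recomputing the digest over the raw bytes; a sketch (the header that carries the signature is covered in integrations.md):

```python
import hashlib
import hmac

def verify_signature(raw_body: bytes, received_hex: str, webhook_secret: str) -> bool:
    expected = hmac.new(webhook_secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest keeps the check constant-time, as in the real handler.
    return hmac.compare_digest(expected, received_hex)
```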
Each tenant has a unique 32-byte URL-safe webhook_secret,
auto-generated at tenant creation (or backfilled by Alembic
migration m9c2e8f1a4d3 for pre-existing tenants). Verification
uses hmac.compare_digest for constant-time comparison.
The endpoint is also rate-limited at 60 requests per minute per client IP.
Signing examples in bash, Python, and Node.js are in
integrations.md.
Daemon authentication and command signing¶
Session token¶
The Go daemon authenticates every call with its session token. On
/daemon/v1/heartbeat and /daemon/v1/evidence the token sits in
the JSON body. On /daemon/v1/tasks the token is sent in the
Authorization: Bearer header. The legacy query-string form
(?session_token=…) is still accepted for backwards compatibility
but logs a deprecation warning on every call — tokens leak into
reverse-proxy access logs and the migration to header-based auth is
in progress.
Custom monitor command signatures¶
Monitors of type=custom carry an HMAC-SHA256 signature in the
/daemon/v1/tasks response. The signature is keyed by the daemon's
own session token.
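On the backend side this is a plain keyed digest; a sketch (the field layout in the task payload is omitted):

```python
import hashlib
import hmac

def sign_command(command: str, session_token: str) -> str:
    # HMAC-SHA256 over the exact command string, keyed by the session token.
    return hmac.new(session_token.encode(), command.encode(), hashlib.sha256).hexdigest()
```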
The daemon recomputes the HMAC before exec and refuses to run unsigned or mismatched commands.
agent_version >= 0.2.0 enforcement¶
A daemon below 0.2.0 decodes the heartbeat / task JSON with a
strict schema and silently drops unknown fields — including
signature. If the backend handed it a custom monitor anyway, the
daemon would compute an empty HMAC, mismatch, and refuse to run —
but the failure mode would look like a permanent regression.
To avoid that, /daemon/v1/tasks checks server.agent_version
before emitting a type=custom monitor. NULL is treated as
0.0.0. If the version is below 0.2.0 the backend returns
HTTP 426 Upgrade Required instead of the task list, with a
body that names the daemon's current version and the required
floor. Operators see the failure on the Servers page; non-custom
monitors are unaffected by the gate, so the daemon keeps reporting
its other checks while the upgrade is scheduled.
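The gate itself is a tuple comparison; a sketch (names illustrative):

```python
from fastapi import HTTPException

CUSTOM_MONITOR_FLOOR = (0, 2, 0)

def require_custom_monitor_support(agent_version: str | None) -> None:
    # NULL agent_version is treated as 0.0.0 and therefore fails the floor.
    current = tuple(int(part) for part in (agent_version or "0.0.0").split("."))
    if current < CUSTOM_MONITOR_FLOOR:
        raise HTTPException(
            status_code=426,  # Upgrade Required
            detail={"agent_version": agent_version or "0.0.0", "required": "0.2.0"},
        )
```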
The threat closed: an attacker with DB write access (SQL injection,
leaked credentials) who flips a custom monitor's command no longer
gets RCE. The platform-computed HMAC will not match their tampered
command and the daemon catches the mismatch before exec.
Tenant isolation¶
Database scoping¶
Most resources carry a non-nullable tenant_id column with an index.
The exceptions:
- `audit_logs.tenant_id` is nullable (Alembic `k4f7a3b2c8d9`) so system events such as failed logins from unknown emails can be recorded without inventing a placeholder UUID.
- The `recipes` table is global by design: recipes are a curated catalog. Read and execute are open to all tenant roles. Write (create, update, delete) requires `superadmin` so a tenant admin cannot inject `playbook_path`s that other tenants would execute.
WebSocket fanout¶
Both real-time channels are tenant-scoped server-side. swarm/events.py
and worker/notify.py resolve tenant_id from the incident before
publishing; the WS handler then drops messages that do not match the
connection's JWT-bound tenant_id. Superadmin connections see all
tenants.
/ws/executions/{id} additionally verifies, before subscribing, that
the connection's tenant owns the execution.
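The fanout filter is deliberately simple; a sketch:

```python
def should_deliver(message_tenant_id: str, conn_tenant_id: str, superadmin: bool) -> bool:
    # Messages that don't match the connection's JWT-bound tenant are dropped.
    return superadmin or message_tenant_id == conn_tenant_id
```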
Cross-tenant server_id from request body¶
API endpoints that accept a server_id in the request body —
incident creation and maintenance schedule creation — validate that
the supplied UUID belongs to the caller's tenant before persisting.
Without that check, a malicious caller could craft a request that
makes the agent pipeline (or maintenance executor) act on a foreign
tenant's host, using that host's stored SSH credentials.
The fix lives in two places:
- Service layer guard (sketched below): `services/incident.py:create_incident` and `services/maintenance.py:create_schedule` raise `ValueError` when the caller's tenant doesn't own the supplied server. The API layer translates that into a generic `404 Server not found` response that doesn't echo the UUID back (no probe-based discovery).
- Defense-in-depth: `swarm/manager.py:build_incident_context` refuses to load an `IncidentContext` whose `incident.tenant_id` doesn't match `server.tenant_id`. The maintenance executor's `_run_one` does the same check before SSH-ing into the host.
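A sketch of the service-layer guard, assuming an async SQLAlchemy session and a `Server` ORM model (simplified from the real services):

```python
import uuid
from sqlalchemy.ext.asyncio import AsyncSession

async def require_owned_server(db: AsyncSession, server_id: uuid.UUID, tenant_id: uuid.UUID):
    server = await db.get(Server, server_id)  # Server: the ORM model (assumed here)
    if server is None or server.tenant_id != tenant_id:
        # One error for both "missing" and "foreign" so UUIDs can't be probed.
        raise ValueError("Server not found")
    return server
```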
Regression coverage: backend/tests/test_cross_tenant_isolation.py
exercises both critical surfaces with foreign and mixed-tenant inputs.
Impersonation¶
Superadmin can POST /admin/impersonate/{tenant_id} to switch their
session into another tenant. The endpoint writes:
- a fresh tenant-scoped `access_token` cookie (30-minute TTL),
- the original superadmin token preserved in `original_access_token` (also `HttpOnly`).
POST /admin/stop-impersonating swaps them back. The frontend banner
reads the impersonated tenant name from sessionStorage (cosmetic
only); the tokens themselves are never visible to JavaScript.
Approval gate¶
The platform uses a two-stage gate in swarm/guardrails.py. A
recipe auto-executes only when both stages clear it; either stage
can escalate to human approval; neither can relax the other.
Stage 1 — trust × risk × mode¶
| Trust × risk | `none` | `low` | `medium` | `high` |
|---|---|---|---|---|
| `autonomous` | auto | auto | approval | approval |
| `supervised` | auto | auto | approval | approval |
| `manual` | auto | approval | approval | approval |
The mapping lives in TRUST_RISK_MAP: autonomous → {none, low},
supervised → {none, low}, manual → {none}. Anything outside the
allowed set escalates. The LLM cannot self-approve, and the recipe's
risk_level is operator-controlled at create time — the agent at
runtime cannot rewrite it.
Server mode takes precedence over the trust × risk grid:
- `shadow` mode → always request approval (the operator decides every action, regardless of trust).
- `audit` mode → defensive default; the manager skips the execute stage entirely, so this branch should never run.
- `live` mode → trust × risk applies as above.
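Putting the mode override and the grid together, stage 1 reduces to a sketch like:

```python
TRUST_RISK_MAP = {
    "autonomous": {"none", "low"},
    "supervised": {"none", "low"},
    "manual": {"none"},
}

def stage1_requires_approval(mode: str, trust: str, risk_level: str) -> bool:
    if mode == "shadow":
        return True   # operator decides every action
    if mode == "audit":
        return True   # defensive default; the execute stage is skipped anyway
    # live: anything outside the allowed risk set escalates
    return risk_level not in TRUST_RISK_MAP.get(trust, set())
```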
Stage 2 — safety classifier¶
When stage 1 clears (auto), should_request_approval_with_safety
runs an LLM-based safety classifier with the incident, recipe, and
server as input. Only a safe verdict keeps auto-execute.
unsafe, abstain, classifier timeout, missing inputs, and any
other error all escalate to human approval (fail-closed). The
escalation source is recorded on the timeline as safety_unsafe,
safety_abstain, or safety_error so an operator can tell which
gate fired.
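The fail-closed mapping, as a sketch:

```python
def stage2_auto_execute(verdict: str | None, errored: bool) -> tuple[bool, str | None]:
    """Returns (auto_execute, timeline_source). Only 'safe' keeps auto-execute."""
    if errored or verdict is None:
        return False, "safety_error"      # timeout, missing inputs, any exception
    if verdict == "safe":
        return True, None
    if verdict == "abstain":
        return False, "safety_abstain"
    return False, "safety_unsafe"
```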
Recipe role override (issue #68)¶
Operators can grant a per-(tenant, recipe, role) auto-execute
exemption — for instance "always auto-run nginx-restart on
servers tagged webserver" — without dialing the whole tenant up
to autonomous. An active row in recipe_role_overrides
short-circuits the trust × risk gate before stage 1 even runs (see
swarm/tools/recipes.py:execute_recipe and manager.py validate
stage). Granting flows through the promotion accept UI; revoking
is one API call (DELETE /api/v1/recipe-role-overrides/{id}).
Custom tool sandbox¶
Tenant operators can define custom tools the agents may call. Each type has explicit guardrails.
shell_command¶
Operator template + LLM-supplied parameters, executed via Ansible's
shell module on the target server.
_render_shell_template (in swarm/tools/custom.py) wraps every
parameter value with shlex.quote before substituting it into the
template. Operator-controlled shell features (|, &&, redirects)
keep working as written; LLM-supplied values cannot break out of
their argument slot. A value such as "nginx; rm -rf /" becomes the
literal argument 'nginx; rm -rf /' and is rejected by the target
binary as invalid input.
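The quoting approach, as a sketch (the real `_render_shell_template` may use a different substitution syntax):

```python
import shlex

def render_shell_template(template: str, params: dict[str, str]) -> str:
    # Every LLM-supplied value is quoted into a single shell argument;
    # operator-authored pipes and redirects in the template are untouched.
    safe = {key: shlex.quote(str(value)) for key, value in params.items()}
    return template.format(**safe)

# render_shell_template("systemctl restart {service}", {"service": "nginx; rm -rf /"})
# -> "systemctl restart 'nginx; rm -rf /'"
```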
http_request¶
Outbound HTTP call.
_is_safe_public_url resolves the URL host and rejects anything that
falls into:
- RFC1918 (`10/8`, `172.16/12`, `192.168/16`)
- Loopback (`127/8`, `::1`)
- Link-local / cloud metadata (`169.254/16`)
- IPv6 ULA (`fc00::/7`) and link-local (`fe80::/10`)
This blocks SSRF to internal services and to cloud-provider metadata endpoints. Header values containing CRLF are rejected. TLS verification is enabled.
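A sketch of the resolution-based check (simplified; the real `_is_safe_public_url` lives in `swarm/tools/custom.py`):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_public_url(url: str) -> bool:
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    for *_, sockaddr in infos:
        addr = ipaddress.ip_address(sockaddr[0])
        # is_private covers RFC1918 and fc00::/7; is_link_local covers 169.254/16 and fe80::/10.
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            return False
    return True
```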
python_script¶
Disabled. Running LLM-supplied Python on the API container is a
remote shell with a JSON Schema; there is no safe lightweight
sandbox available. Existing tools of this type return a clear error
pointing at the migration paths (a fixed shell_command template, or
an Ansible playbook recipe).
run_diagnostic_command (built-in)¶
The most-used built-in diagnostic tool no longer accepts a free-form
shell command. It accepts an enum verb plus a regex-validated
argument (see dashboard/tools.md for the
verb list). Anything outside the enum requires the agent to propose
a recipe — a curated, operator-reviewed playbook — instead of
improvising a shell command.
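In outline (verb names here are illustrative; the real list is in dashboard/tools.md):

```python
import re

DIAGNOSTIC_VERBS = {
    # verb -> pattern its single argument must fully match (illustrative entries)
    "systemctl_status": re.compile(r"[A-Za-z0-9@._-]+"),
    "journalctl_unit": re.compile(r"[A-Za-z0-9@._-]+"),
}

def validate_diagnostic(verb: str, argument: str) -> None:
    pattern = DIAGNOSTIC_VERBS.get(verb)
    if pattern is None:
        raise ValueError(f"unknown diagnostic verb: {verb!r}")
    if not pattern.fullmatch(argument):
        raise ValueError(f"argument rejected for {verb!r}")
```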
Maintenance ad-hoc commands — Jinja escape¶
Maintenance plans expose a `command:` step type that Ansible's
shell module runs against the target host. Ansible templates
`module_args` with Jinja2 before exec, which means an operator command
containing Go-template syntax such as `{{ .State.Status }}` (or any
`kubectl -o jsonpath`, Helm, Hugo, or Consul-template invocation) was
being parsed as a Jinja expression and dying with `rc=2` / templating
error long before the shell module ever saw it (issue #69).
_escape_jinja_for_ansible (in proactive/maintenance.py) replaces
{{ and }} with their Jinja-escaped form ({{ '{{' }} /
{{ '}}' }}) before handing the command to ansible_runner. The
shell module then receives the literal Go-template syntax and
forwards it intact. Result: the Jinja layer cannot be smuggled
through an operator-supplied command, and Go-template syntax
round-trips losslessly.
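The escape itself is a three-step replace with a sentinel so the inserted braces are never re-processed; a sketch:

```python
def escape_jinja_for_ansible(command: str) -> str:
    sentinel = "\x00OPEN\x00"  # byte sequence that won't appear in a real command
    return (
        command.replace("{{", sentinel)            # park the opens first
               .replace("}}", "{{ '}}' }}")        # escape the closes
               .replace(sentinel, "{{ '{{' }}")    # then escape the opens
    )

# "echo '{{ .Status }}'" -> "echo '{{ '{{' }} .Status {{ '}}' }}'"
# After Ansible's Jinja pass the shell sees the original {{ .Status }} again.
```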
LLM client TLS¶
build_client in services/llm/clients.py disables TLS verification
only when the resolved base URL points at a literal local host
(localhost, 127.0.0.1, ::1, 0.0.0.0). A provider configured
with verify_ssl: false still gets full certificate validation if
the URL points anywhere else.
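The decision, as a sketch:

```python
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1", "0.0.0.0"}

def effective_tls_verify(base_url: str, verify_ssl: bool) -> bool:
    host = (urlparse(base_url).hostname or "").lower()
    if host in LOCAL_HOSTS:
        return verify_ssl   # opting out is honoured only for literal local hosts
    return True             # everything else always gets certificate validation
```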
Encryption at rest¶
SSH private keys, bearer tokens, and other secrets stored in the
secrets table are encrypted with AES-256-GCM keyed by
OREMEDY_ENCRYPTION_KEY. A fresh 12-byte nonce is generated for
every encryption operation. The implementation lives in
encryption.py; the key never appears in the database.
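A sketch of the encrypt path using the `cryptography` package's AESGCM (the nonce-plus-ciphertext storage layout is illustrative):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_secret(key_hex: str, plaintext: bytes) -> bytes:
    key = bytes.fromhex(key_hex)   # OREMEDY_ENCRYPTION_KEY: 64 hex chars -> 32 bytes
    nonce = os.urandom(12)         # fresh 12-byte nonce per encryption operation
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    return nonce + ciphertext      # nonce stored alongside; never reused
```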
Audit log¶
Every state-changing action writes a row to audit_logs:
- Resource type and ID.
- Action (`created`, `updated`, `deleted`, `executed`, `approved`, `auth.login`, `auth.login_failed`, etc.).
- Actor (user email, agent name, or `NULL` for unauthenticated events such as failed logins).
- Tenant (nullable for system events).
- Timestamp (UTC, second precision).
- Detail JSON.
- IP address: captured from `X-Forwarded-For` only when the immediate TCP peer is in trusted (RFC1918 / loopback) space; otherwise the actual peer is used. Pure header trust (the unconditional pre-hardening behaviour) is gone.
The table is append-only; the application never updates or deletes rows. Read access is tenant-scoped.
Phoenix tracing¶
Phoenix (Arize) ingests every LLM prompt and response for debugging and observability. Because that data is sensitive, the Phoenix container is on the internal Docker network only (it has no public Caddy route). Operator access is via SSH tunnel.
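For example (port 6006, Phoenix's default, is an assumption; adjust host and port to the deployment):

```bash
# Forward the internal Phoenix UI to the operator's machine.
ssh -N -L 6006:localhost:6006 operator@your-oremedy-host
# Phoenix is then reachable at http://localhost:6006
```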
CORS¶
The API uses CORSMiddleware with explicit origins. The production
compose file passes https://${DOMAIN} as the default; multiple
origins can be supplied as a comma-separated list via
OREMEDY_CORS_ORIGINS. * is rejected when OREMEDY_ENV=production
because wildcard origins paired with credentialed cookies would let any
site issue authenticated cross-origin requests and read the responses.
Threat model recap¶
What is mitigated:
- Tenant admin compromise injecting cross-tenant playbook paths.
- DB tampering injecting daemon shell commands.
- LLM prompt injection escalating to RCE on the API container or managed servers.
- XSS exfiltrating session tokens.
- Cross-tenant leakage on real-time channels.
- Credential stuffing on `/auth/login`.
- Webhook spam and unauthenticated alert injection.
- SSRF from custom HTTP tools to internal services.
- TLS bypass on remote LLM providers.
What is not mitigated by this layer (out of scope, requires the infrastructure underneath):
- Compromise of the host running the API process.
- Compromise of the deployed `OREMEDY_SECRET_KEY` or `OREMEDY_ENCRYPTION_KEY`.
- Compromise of the underlying PostgreSQL / Redis / SeaweedFS.
- Network-level attacks against Caddy.
These are deployment-layer concerns and require the same standard hardening as any other production service (private networking, disk-level encryption, OS hardening, runtime EDR, etc.).