Security model

This document consolidates the security posture of OpenRemedy: the boundaries the platform enforces, the credentials it requires, and the mitigations applied to its agentic surface. It is intended for operators deploying OpenRemedy and for auditors reviewing the platform.

The model has been hardened through a focused security pass; every mitigation below corresponds to a closed finding. Where relevant, the source file and behaviour are cited.


Required environment variables

The deploy aborts at compose interpolation time if any of these are missing. The application also validates the values on boot and refuses to start with known dev defaults.

  • OREMEDY_SECRET_KEY: JWT signing key (HS256). At least 32 characters; rejected if it matches the historical dev defaults (changeme-dev-secret-key-32chars!!, dev-secret-key-change-in-production, changeme). Generate with openssl rand -base64 48.
  • OREMEDY_ENCRYPTION_KEY: AES-256-GCM data key for stored secrets. Exactly 64 hex characters (32 bytes); rejected if it matches dev placeholders such as 64 repeated a characters or the sequential 0123… example. Generate with openssl rand -hex 32.
  • POSTGRES_PASSWORD: database password. Required, no default.

In production:

  • OREMEDY_ENV=production activates the production validators.
  • OREMEDY_DEBUG=true is rejected when OREMEDY_ENV=production.
  • OREMEDY_CORS_ORIGINS must not contain *. Default in the production compose file is https://${DOMAIN}; override per deployment to add origins.
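As a reading aid, here is a minimal Python sketch of those boot-time checks; the function and constant names are illustrative, not OpenRemedy's actual validator code:

import os
import re
import sys

KNOWN_DEV_SECRETS = {
    "changeme-dev-secret-key-32chars!!",
    "dev-secret-key-change-in-production",
    "changeme",
}

def validate_settings() -> None:
    secret = os.environ.get("OREMEDY_SECRET_KEY", "")
    if len(secret) < 32 or secret in KNOWN_DEV_SECRETS:
        sys.exit("OREMEDY_SECRET_KEY missing, too short, or a known dev default")

    enc = os.environ.get("OREMEDY_ENCRYPTION_KEY", "")
    if not re.fullmatch(r"[0-9a-fA-F]{64}", enc) or enc == "a" * 64:
        sys.exit("OREMEDY_ENCRYPTION_KEY must be 64 hex chars and not a placeholder")

    if os.environ.get("OREMEDY_ENV") == "production":
        if os.environ.get("OREMEDY_DEBUG", "").lower() == "true":
            sys.exit("OREMEDY_DEBUG=true is rejected in production")
        if "*" in os.environ.get("OREMEDY_CORS_ORIGINS", ""):
            sys.exit("OREMEDY_CORS_ORIGINS must not contain * in production")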

Authentication

Web sessions: HttpOnly cookies

Browser sessions use two cookies, both HttpOnly + Secure + SameSite=strict, set on a successful POST /auth/login or POST /auth/register:

Cookie          Lifetime                                            Purpose
access_token    OREMEDY_ACCESS_TOKEN_EXPIRE_MINUTES (default 480)   API auth
refresh_token   OREMEDY_REFRESH_TOKEN_EXPIRE_DAYS (default 30)      Refresh path

POST /auth/refresh reads the refresh cookie, validates the JWT, and issues a fresh pair via Set-Cookie. POST /auth/logout clears both. Tokens never appear in JavaScript scope, so an XSS payload cannot read them from localStorage or from a fetch() response body.
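For orientation, a sketch of the cookie issuance in FastAPI terms; the attribute values mirror the table above, but the helper itself is illustrative:

from fastapi import Response

def set_session_cookies(resp: Response, access_jwt: str, refresh_jwt: str) -> None:
    # HttpOnly + Secure + SameSite=strict on both cookies, per the policy above.
    common = dict(httponly=True, secure=True, samesite="strict")
    resp.set_cookie("access_token", access_jwt, max_age=480 * 60, **common)          # default 480 minutes
    resp.set_cookie("refresh_token", refresh_jwt, max_age=30 * 24 * 3600, **common)  # default 30 days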

Programmatic clients: Bearer header

The get_current_user dependency reads the JWT from the access_token cookie first and falls back to Authorization: Bearer <jwt>. CLI tools, the Go daemon, and any non-browser caller can keep using Bearer.

WebSocket handshake

/ws/incidents and /ws/executions/{id} accept the cookie (the browser sends it automatically on a same-origin upgrade) or, as a fallback for non-browser clients, the Sec-WebSocket-Protocol: bearer, <jwt> slot. URL query params are not supported because they leak into proxy access logs. Pre-handshake auth failures close the WS with policy-violation status.
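A sketch of that handshake order in FastAPI terms; decode_jwt is a hypothetical stand-in for the platform's JWT validation:

from fastapi import WebSocket

def decode_jwt(token: str) -> dict | None:
    # Hypothetical: verify signature and expiry against the signing key; None on failure.
    ...

async def authenticate_ws(ws: WebSocket) -> dict | None:
    token = ws.cookies.get("access_token")              # browser path: same-origin cookie
    if token is None:                                   # non-browser fallback
        proto = ws.headers.get("sec-websocket-protocol", "")
        parts = [p.strip() for p in proto.split(",")]
        if len(parts) == 2 and parts[0] == "bearer":
            token = parts[1]
    claims = decode_jwt(token) if token else None
    if claims is None:
        await ws.close(code=1008)                       # 1008 = policy violation
        return None
    return claims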

Login rate limiting

POST /auth/login is rate-limited at 10 requests per minute per client IP via slowapi (core/rate_limit.py). The bucket key uses the leftmost X-Forwarded-For value only when the immediate TCP peer is in trusted (RFC1918 / loopback) space; otherwise the actual peer address.
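A sketch of that keying rule (the same rule reappears for the audit-log IP field below); names are illustrative:

import ipaddress

def rate_limit_key(peer_addr: str, xff_header: str | None) -> str:
    peer = ipaddress.ip_address(peer_addr)
    trusted_peer = peer.is_private or peer.is_loopback   # approximates the RFC1918 / loopback check
    if trusted_peer and xff_header:
        return xff_header.split(",")[0].strip()          # leftmost X-Forwarded-For value
    return peer_addr                                     # untrusted peer: use it directly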


Webhook authentication

POST /api/v1/webhooks/alerts/{tenant_slug} requires every request to carry an HMAC-SHA256 signature of the raw body, computed against the tenant's webhook_secret:

X-OpenRemedy-Signature: sha256=<lowercase hex digest>

Each tenant has a unique 32-byte URL-safe webhook_secret, auto-generated at tenant creation (or backfilled by Alembic migration m9c2e8f1a4d3 for pre-existing tenants). Verification uses hmac.compare_digest for constant-time comparison.
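A minimal signing-and-verification sketch matching that header format (client examples in more languages live in integrations.md):

import hashlib
import hmac

def sign(webhook_secret: str, raw_body: bytes) -> str:
    digest = hmac.new(webhook_secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"

def verify(webhook_secret: str, raw_body: bytes, header_value: str) -> bool:
    # Constant-time comparison, as the endpoint does server-side.
    return hmac.compare_digest(sign(webhook_secret, raw_body), header_value)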

The endpoint is also rate-limited at 60 requests per minute per client IP.

Signing examples in bash, Python, and Node.js are in integrations.md.


Daemon authentication and command signing

Session token

The Go daemon authenticates every call with its session token. On /daemon/v1/heartbeat and /daemon/v1/evidence the token sits in the JSON body. On /daemon/v1/tasks the token is sent in the Authorization: Bearer header. The legacy query-string form (?session_token=…) is still accepted for backwards compatibility but logs a deprecation warning on every call — tokens leak into reverse-proxy access logs and the migration to header-based auth is in progress.

Custom monitor command signatures

Monitors of type=custom carry an HMAC-SHA256 signature in the /daemon/v1/tasks response. The signature is keyed by the daemon's own session token:

signature = HMAC-SHA256(session_token, command).hex()

The daemon recomputes the HMAC before exec and refuses to run unsigned or mismatched commands.
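The same computation, sketched in Python for reference (the daemon itself is Go; this mirrors the formula above, not the daemon's source):

import hashlib
import hmac

def sign_command(session_token: str, command: str) -> str:
    return hmac.new(session_token.encode(), command.encode(), hashlib.sha256).hexdigest()

def should_exec(session_token: str, command: str, signature: str | None) -> bool:
    if not signature:                                  # unsigned custom command: refuse
        return False
    return hmac.compare_digest(sign_command(session_token, command), signature)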

agent_version >= 0.2.0 enforcement

A daemon below 0.2.0 decodes the heartbeat / task JSON with a strict schema and silently drops unknown fields — including signature. If the backend handed it a custom monitor anyway, the daemon would compute an empty HMAC, mismatch, and refuse to run — but the failure mode would look like a permanent regression.

To avoid that, /daemon/v1/tasks checks server.agent_version before emitting a type=custom monitor. NULL is treated as 0.0.0. If the version is below 0.2.0 the backend returns HTTP 426 Upgrade Required instead of the task list, with a body that names the daemon's current version and the required floor. Operators see the failure on the Servers page; non-custom monitors are unaffected by the gate, so the daemon keeps reporting its other checks while the upgrade is scheduled.
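A sketch of the gate's decision, with illustrative names and naive version parsing:

MIN_CUSTOM_MONITOR_VERSION = (0, 2, 0)

def parse_version(v: str | None) -> tuple[int, ...]:
    if not v:                                  # NULL agent_version is treated as 0.0.0
        return (0, 0, 0)
    return tuple(int(part) for part in v.split("."))

def custom_monitors_allowed(agent_version: str | None) -> bool:
    # Below the floor the endpoint answers HTTP 426 Upgrade Required,
    # naming the daemon's current version and the required minimum.
    return parse_version(agent_version) >= MIN_CUSTOM_MONITOR_VERSION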

The threat closed: an attacker with DB write access (SQL injection, leaked credentials) who flips a custom monitor's command no longer gets RCE. The platform-computed HMAC will not match their tampered command and the daemon catches the mismatch before exec.


Tenant isolation

Database scoping

Most resources carry a non-nullable tenant_id column with an index. The exceptions:

  • audit_logs.tenant_id is nullable (Alembic k4f7a3b2c8d9) so system events such as failed logins from unknown emails can be recorded without inventing a placeholder UUID.
  • The recipes table is global by design — recipes are a curated catalog. Read and execute are open to all tenant roles. Write (create, update, delete) requires superadmin so a tenant admin cannot inject playbook_paths that other tenants would execute.

WebSocket fanout

Both real-time channels are tenant-scoped server-side. swarm/events.py and worker/notify.py resolve tenant_id from the incident before publishing; the WS handler then drops messages that do not match the connection's JWT-bound tenant_id. Superadmin connections see all tenants.

/ws/executions/{id} additionally verifies, before subscribing, that the connection's tenant owns the execution.

Cross-tenant server_id from request body

API endpoints that accept a server_id in the request body — incident creation and maintenance schedule creation — validate that the supplied UUID belongs to the caller's tenant before persisting. Without that check, a malicious caller could craft a request that makes the agent pipeline (or maintenance executor) act on a foreign tenant's host, using that host's stored SSH credentials.

The fix lives in two places:

  • Service-layer guard: services/incident.py:create_incident and services/maintenance.py:create_schedule raise ValueError when the caller's tenant doesn't own the supplied server. The API layer translates that into a generic 404 Server not found response that doesn't echo the UUID back (no probe-based discovery); a minimal sketch follows this list.
  • Defense in depth: swarm/manager.py:build_incident_context refuses to load an IncidentContext whose incident.tenant_id doesn't match server.tenant_id. The maintenance executor's _run_one does the same check before SSH-ing into the host.
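The service-layer guard, sketched with simplified types (the shipped code works against the ORM session and request payload):

from dataclasses import dataclass
from uuid import UUID

@dataclass
class Server:
    id: UUID
    tenant_id: UUID

def guard_server_ownership(server: Server | None, caller_tenant_id: UUID) -> Server:
    # Missing and foreign servers raise the same generic error, which the API
    # layer maps to "404 Server not found" without echoing the supplied UUID.
    if server is None or server.tenant_id != caller_tenant_id:
        raise ValueError("Server not found")
    return server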

Regression coverage: backend/tests/test_cross_tenant_isolation.py exercises both critical surfaces with foreign and mixed-tenant inputs.

Impersonation

Superadmin can POST /admin/impersonate/{tenant_id} to switch their session into another tenant. The endpoint writes:

  • a fresh tenant-scoped access_token cookie (30-minute TTL),
  • the original superadmin token preserved in original_access_token (also HttpOnly).

POST /admin/stop-impersonating swaps them back. The frontend banner reads the impersonated tenant name from sessionStorage (cosmetic only); the tokens themselves are never visible to JavaScript.


Approval gate

The platform uses a two-stage gate in swarm/guardrails.py. A recipe auto-executes only when both stages clear it; either stage can escalate to human approval, and neither can relax the other.

Stage 1 — trust × risk × mode

Trust × risk   none   low        medium     high
autonomous     auto   auto       approval   approval
supervised     auto   auto       approval   approval
manual         auto   approval   approval   approval

The mapping lives in TRUST_RISK_MAP: autonomous → {none, low}, supervised → {none, low}, manual → {none}. Anything outside the allowed set escalates. The LLM cannot self-approve, and the recipe's risk_level is operator-controlled at create time — the agent at runtime cannot rewrite it.

Server mode takes precedence over the trust × risk grid:

  • shadow mode → always request approval (the operator decides every action, regardless of trust).
  • audit mode → defensive default; the manager skips the execute stage entirely, so this branch should never run.
  • live mode → trust × risk applies as above.
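Stage 1 reduces to a small pure function; TRUST_RISK_MAP mirrors the grid above, while the wiring around it is illustrative:

TRUST_RISK_MAP = {
    "autonomous": {"none", "low"},
    "supervised": {"none", "low"},
    "manual": {"none"},
}

def stage1_requires_approval(trust: str, risk: str, server_mode: str) -> bool:
    if server_mode == "shadow":     # operator decides every action
        return True
    if server_mode == "audit":      # defensive default; execute stage is skipped anyway
        return True
    # live mode: anything outside the allowed risk set escalates (unknown trust fails closed)
    return risk not in TRUST_RISK_MAP.get(trust, set())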

Stage 2 — safety classifier

When stage 1 clears (auto), should_request_approval_with_safety runs an LLM-based safety classifier with the incident, recipe, and server as input. Only a safe verdict keeps auto-execute. unsafe, abstain, classifier timeout, missing inputs, and any other error all escalate to human approval (fail-closed). The escalation source is recorded on the timeline as safety_unsafe, safety_abstain, or safety_error so an operator can tell which gate fired.
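The fail-closed shape of stage 2, sketched around the stage-1 helper above (the real should_request_approval_with_safety takes the incident, recipe, and server as input):

def requires_approval(trust: str, risk: str, mode: str, classify) -> tuple[bool, str | None]:
    if stage1_requires_approval(trust, risk, mode):
        return True, "trust_risk"
    try:
        verdict = classify()            # LLM safety classifier: "safe" / "unsafe" / "abstain"
    except Exception:
        return True, "safety_error"     # timeout, missing inputs, any other failure
    if verdict == "safe":
        return False, None              # the only verdict that keeps auto-execute
    return True, f"safety_{verdict}"    # recorded on the timeline as the escalation source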

Recipe role override (issue #68)

Operators can grant a per-(tenant, recipe, role) auto-execute exemption — for instance "always auto-run nginx-restart on servers tagged webserver" — without dialing the whole tenant up to autonomous. An active row in recipe_role_overrides short-circuits the trust × risk gate before stage 1 even runs (see swarm/tools/recipes.py:execute_recipe and manager.py validate stage). Granting flows through the promotion accept UI; revoking is one API call (DELETE /api/v1/recipe-role-overrides/{id}).


Custom tool sandbox

Tenant operators can define custom tools the agents may call. Each type has explicit guardrails.

shell_command

Operator template + LLM-supplied parameters, executed via Ansible's shell module on the target server.

_render_shell_template (in swarm/tools/custom.py) wraps every parameter value with shlex.quote before substituting it into the template. Operator-controlled shell features (|, &&, redirects) keep working as written; LLM-supplied values cannot break out of their argument slot. A value such as "nginx; rm -rf /" becomes the literal argument 'nginx; rm -rf /' and is rejected by the target binary as invalid input.
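A sketch of the quoting step; the str.format substitution here is illustrative, and the real _render_shell_template may differ in signature:

import shlex

def render_shell_template(template: str, params: dict[str, str]) -> str:
    quoted = {k: shlex.quote(v) for k, v in params.items()}
    return template.format(**quoted)

render_shell_template("systemctl restart {service} && systemctl status {service}",
                      {"service": "nginx; rm -rf /"})
# -> systemctl restart 'nginx; rm -rf /' && systemctl status 'nginx; rm -rf /'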

http_request

Outbound HTTP call.

_is_safe_public_url resolves the URL host and rejects anything that falls into:

  • RFC1918 (10/8, 172.16/12, 192.168/16)
  • Loopback (127/8, ::1)
  • Link-local / cloud metadata (169.254/16)
  • IPv6 ULA (fc00::/7) and link-local (fe80::/10)

This blocks SSRF to internal services and to cloud-provider metadata endpoints. Header values containing CRLF are rejected. TLS verification is enabled.
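A stdlib sketch of the resolve-then-check rule; the listed ranges map onto ipaddress attributes (is_private covers RFC1918 and fc00::/7, is_link_local covers 169.254/16 and fe80::/10):

import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_public_url(url: str) -> bool:
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False                       # unresolvable: reject
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False                   # internal, loopback, metadata, or local range
    return True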

python_script

Disabled. Running LLM-supplied Python on the API container is a remote shell with a JSON Schema; there is no safe lightweight sandbox available. Existing tools of this type return a clear error pointing at the migration paths (a fixed shell_command template, or an Ansible playbook recipe).

run_diagnostic_command (built-in)

The most-used built-in diagnostic tool no longer accepts a free-form shell command. It accepts an enum verb plus a regex-validated argument (see dashboard/tools.md for the verb list). Anything outside the enum requires the agent to propose a recipe — a curated, operator-reviewed playbook — instead of improvising a shell command.

Maintenance ad-hoc commands — Jinja escape

Maintenance plans expose a command: step type that Ansible's shell module runs against the target host. Ansible templates module_args with Jinja2 before exec, which means an operator command of the form

docker ps --format '{{.Names}}'

— or any kubectl -o jsonpath, Helm, Hugo, or Consul-template invocation — was being parsed as a Jinja expression and dying with rc=2 / templating error long before the shell module ever saw it (issue #69).

_escape_jinja_for_ansible (in proactive/maintenance.py) replaces {{ and }} with their Jinja-escaped forms ({{ '{{' }} and {{ '}}' }}) before handing the command to ansible_runner. The shell module then receives the literal Go-template syntax and forwards it intact. Result: an operator-supplied command cannot smuggle live Jinja through the templating layer, and Go-template syntax round-trips losslessly.
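A sketch of the escape; a sentinel pass keeps the braces we insert from being re-escaped (the shipped function may differ in detail):

def escape_jinja_for_ansible(command: str) -> str:
    # Swap brace pairs for sentinels first, then expand the sentinels,
    # so the escapes we insert are never matched by a later pass.
    OPEN, CLOSE = "\x00OPEN\x00", "\x00CLOSE\x00"
    command = command.replace("{{", OPEN).replace("}}", CLOSE)
    return command.replace(OPEN, "{{ '{{' }}").replace(CLOSE, "{{ '}}' }}")

escape_jinja_for_ansible("docker ps --format '{{.Names}}'")
# Jinja renders the escaped form back to the original command before exec.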

LLM client TLS

build_client in services/llm/clients.py disables TLS verification only when the resolved base URL points at a literal local host (localhost, 127.0.0.1, ::1, 0.0.0.0). A provider configured with verify_ssl: false still gets full certificate validation if the URL points anywhere else.
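The rule as a predicate, with illustrative names:

from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1", "0.0.0.0"}

def effective_verify(base_url: str, verify_ssl: bool) -> bool:
    host = (urlparse(base_url).hostname or "").lower()
    if host in LOCAL_HOSTS:
        return verify_ssl      # only a literal local host may opt out of verification
    return True                # anywhere else: full certificate validation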


Encryption at rest

SSH private keys, bearer tokens, and other secrets stored in the secrets table are encrypted with AES-256-GCM keyed by OREMEDY_ENCRYPTION_KEY. A fresh 12-byte nonce is generated for every encryption operation. The implementation lives in encryption.py; the key never appears in the database.
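A sketch using the cryptography package's AESGCM primitive; the storage framing here (nonce prepended to ciphertext) is an assumption, not necessarily what encryption.py does:

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_secret(hex_key: str, plaintext: bytes) -> bytes:
    key = bytes.fromhex(hex_key)       # OREMEDY_ENCRYPTION_KEY decodes to 32 bytes
    nonce = os.urandom(12)             # fresh 12-byte nonce per encryption operation
    return nonce + AESGCM(key).encrypt(nonce, plaintext, None)

def decrypt_secret(hex_key: str, blob: bytes) -> bytes:
    key = bytes.fromhex(hex_key)
    return AESGCM(key).decrypt(blob[:12], blob[12:], None)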


Audit log

Every state-changing action writes a row to audit_logs:

  • Resource type and ID.
  • Action (created, updated, deleted, executed, approved, auth.login, auth.login_failed, etc.).
  • Actor (user email, agent name, or NULL for unauthenticated events such as failed logins).
  • Tenant (nullable for system events).
  • Timestamp (UTC, second precision).
  • Detail JSON.
  • IP address — captured from X-Forwarded-For only when the immediate TCP peer is in trusted (RFC1918 / loopback) space. Otherwise the actual peer is used. Pure header trust (the unconditional pre-hardening behaviour) is gone.

The table is append-only; the application never updates or deletes rows. Read access is tenant-scoped.


Phoenix tracing

Phoenix (Arize) ingests every LLM prompt and response for debugging and observability. Because that data is sensitive, the Phoenix container is on the internal Docker network only — it has no public Caddy route. Operator access is via SSH tunnel:

ssh -L 6006:phoenix:6006 alberto@<host>
# then open http://localhost:6006

CORS

The API uses CORSMiddleware with explicit origins. The production compose file passes https://${DOMAIN} as the default; multiple origins can be supplied as a comma-separated list via OREMEDY_CORS_ORIGINS. * is rejected when OREMEDY_ENV=production because a wildcard origin paired with credentialed cookies would let any site drive authenticated cross-origin requests.


Threat model recap

What is mitigated:

  • Tenant admin compromise injecting cross-tenant playbook paths.
  • DB tampering injecting daemon shell commands.
  • LLM prompt injection escalating to RCE on the API container or managed servers.
  • XSS exfiltrating session tokens.
  • Cross-tenant leakage on real-time channels.
  • Credential stuffing on /auth/login.
  • Webhook spam and unauthenticated alert injection.
  • SSRF from custom HTTP tools to internal services.
  • TLS bypass on remote LLM providers.

What is not mitigated by this layer (out of scope, requires the infrastructure underneath):

  • Compromise of the host running the API process.
  • Compromise of the deployed OREMEDY_SECRET_KEY or OREMEDY_ENCRYPTION_KEY.
  • Compromise of the underlying PostgreSQL / Redis / SeaweedFS.
  • Network-level attacks against Caddy.

These are deployment-layer concerns and require the same standard hardening as any other production service (private networking, disk-level encryption, OS hardening, runtime EDR, etc.).