# Architecture
This document describes the runtime architecture of OpenRemedy: how
the components are deployed, how a request travels through them, and
how each major flow is implemented. It is the technical companion to
the home page and proactive.md.
## Service map
```mermaid
flowchart LR
    subgraph customer["Customer server"]
        D[openremedy-client<br/>Go daemon]
    end
    subgraph public["Public network"]
        USR[Browser]
        EXT[External monitoring stack]
    end
    subgraph platform["OpenRemedy VPS — docker-compose"]
        CADDY[caddy<br/>TLS, reverse proxy]
        UI[ui<br/>Next.js]
        API[api<br/>FastAPI]
        WORKER[worker<br/>ARQ async]
        SWARM[swarm<br/>SwarmManager + PatrolScheduler]
        PROACT[proactive<br/>CheckScheduler + Evaluator + IncidentWatcher]
        PG[(postgres<br/>PostgreSQL 16)]
        REDIS[(redis<br/>pub/sub + ARQ queue)]
        S3[(seaweedfs<br/>S3-compatible)]
        PHX[phoenix<br/>LLM tracing<br/>internal-only]
    end
    USR --> CADDY
    EXT --> CADDY
    D --> CADDY
    CADDY --> API
    CADDY --> UI
    API --> PG
    WORKER --> PG
    SWARM --> PG
    PROACT --> PG
    API <--> REDIS
    WORKER <--> REDIS
    SWARM <--> REDIS
    PROACT <--> REDIS
    API --> S3
    WORKER --> S3
    WORKER -- SSH + Ansible --> D
    SWARM -. tracing .-> PHX
```
Containers per `docker-compose.prod.yml`:

| Container | Image source | Role |
|---|---|---|
| `caddy` | `caddy:2-alpine` | TLS, automatic Let's Encrypt, reverse proxy |
| `ui` | `frontend/Dockerfile` | Next.js dashboard |
| `api` | `backend/Dockerfile` | FastAPI REST + WebSocket |
| `worker` | `backend/Dockerfile` | ARQ workers (recipe execution, classification) |
| `swarm` | `backend/Dockerfile` | Swarm Manager + Patrol Scheduler |
| `proactive` | `backend/Dockerfile` | Check Scheduler + Evaluator + Incident Watcher + Maintenance Scheduler |
| `postgres` | `postgres:16-alpine` | Primary datastore |
| `redis` | `redis:7-alpine` | Pub/sub + ARQ queue |
| `seaweedfs` | `chrislusf/seaweedfs` | S3-compatible object storage for evidence blobs |
| `phoenix` | `arizephoenix/phoenix:latest` | LLM tracing, internal-only network |
The four backend services (api, worker, swarm, proactive)
share the same image — they differ only in entrypoint command.
Networks: public (caddy + ui + api + worker + swarm + proactive),
internal (postgres, redis, seaweedfs, phoenix, plus the backend
containers). Phoenix is internal-only by design (see
security.md).
## Flow A — Daemon → backend
The daemon runs three loops on the customer server.
```mermaid
sequenceDiagram
    autonumber
    participant Agent as openremedy-client
    participant Caddy as caddy
    participant API as api (FastAPI)
    participant DB as postgres
    loop every 15 s
        Agent->>Caddy: POST /daemon/v1/heartbeat (token in body)
        Caddy->>API: forward
        API->>DB: update servers.last_seen
        API-->>Agent: 200
    end
    loop every 15 s
        Agent->>Caddy: POST /daemon/v1/evidence (token in body, monitors, alerts)
        Caddy->>API: forward
        API->>DB: write evidence + create incidents for new alerts
        API-->>Agent: 200
    end
    loop every cycle
        Agent->>Caddy: GET /daemon/v1/tasks (Authorization: Bearer)
        Caddy->>API: forward
        API->>DB: compile_monitors_for_server
        API->>API: HMAC-sign each type=custom command
        API-->>Agent: monitors[] with signature on customs
    end
```
All three calls are wrapped in httputil.DoWithRetry (capped
exponential backoff 1 s → 30 s + 25 % jitter, four attempts) so a
transient platform 5xx does not cost a cycle.
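The shape of that retry loop is worth seeing once. Below is a Python sketch of the same policy; the real helper is the Go `httputil` package, and whether the 25 % jitter is applied as roughly ±25 % (as here) is an assumption:

```python
import asyncio
import random

import aiohttp


async def do_with_retry(session: aiohttp.ClientSession, method: str, url: str, **kwargs):
    """Retry policy from the prose above: four attempts, exponential backoff
    starting at 1 s and capped at 30 s, with ~25 % jitter; only transport
    errors and 5xx responses are retried."""
    delay = 1.0
    for attempt in range(4):
        try:
            resp = await session.request(method, url, **kwargs)
            if resp.status < 500:
                return resp                     # 2xx/3xx/4xx are returned as-is
        except aiohttp.ClientError:
            if attempt == 3:
                raise                           # out of attempts: surface the error
        if attempt < 3:
            await asyncio.sleep(min(delay, 30.0) * random.uniform(0.75, 1.25))
            delay *= 2
    raise RuntimeError(f"{method} {url}: still failing after 4 attempts")
```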
Implementation:

- Daemon: `daemon/internal/{heartbeat,reporter,httputil}/`, `daemon/cmd/openremedy-client/main.go`.
- Backend: `backend/src/openremedy/api/daemon.py`.
## Flow B — Incident creation and fanout
```mermaid
flowchart LR
    EV[Evidence POST] --> CREATE[create_incident]
    WH[Webhook POST<br/>HMAC-verified] --> CREATE
    SCH[CheckEvaluator<br/>fail decision] --> CREATE
    PAT[Agent patrol<br/>finds anomaly] --> CREATE
    MAN[Manual UI] --> CREATE
    CREATE --> DB[(postgres<br/>incidents table)]
    CREATE --> PUB[publish_incident_event<br/>tenant_id resolved]
    PUB --> REDIS[(redis 'incidents' channel)]
    REDIS --> SWARM_M[swarm Manager<br/>spawn pipeline]
    REDIS --> WATCH[IncidentWatcher<br/>re-invoke on comments]
    REDIS --> WSAPI[api WS handler<br/>filter by tenant_id]
    WSAPI --> WS[/ws/incidents to UI/]
```
Key file: `swarm/events.py:publish_incident_event`. The publisher
resolves the incident's `tenant_id` (or accepts it via kwargs) and
attaches it to every payload so subscribers can filter out cross-tenant
messages. `worker/notify.py:notify_incident_update` does the same
lookup.
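A minimal sketch of that publisher, assuming a redis.asyncio client and a hypothetical `db.get_incident` helper (the real signature lives in `swarm/events.py`):

```python
import json

from redis.asyncio import Redis


async def publish_incident_event(
    redis: Redis, db, incident_id: str, event: str, tenant_id: str | None = None
) -> None:
    """Resolve tenant_id if the caller did not pass it, then attach it to the
    payload so every subscriber (SwarmManager, IncidentWatcher, the
    /ws/incidents handler) can filter without another database round-trip."""
    if tenant_id is None:
        incident = await db.get_incident(incident_id)   # hypothetical lookup helper
        tenant_id = str(incident.tenant_id)
    payload = {"incident_id": incident_id, "event": event, "tenant_id": tenant_id}
    await redis.publish("incidents", json.dumps(payload))
```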
## Flow C — Swarm pipeline
When the SwarmManager picks up a new incident from Redis, it spawns an async pipeline task.
```mermaid
sequenceDiagram
    autonumber
    participant Mgr as SwarmManager
    participant Atlas as Triage agent
    participant Forge as Diagnose agent
    participant Gate as guardrails
    participant Exec as Execute agent
    participant Rev as Review agent
    participant DB as postgres
    participant SDK as Agents SDK
    Mgr->>DB: load incident + assigned agent
    Mgr->>SDK: Runner.run(triage stage prompt, ROLE_TOOLS["triage"])
    SDK->>Atlas: tool calls (search_past_resolutions, record_event, ...)
    Atlas-->>Mgr: stage_complete
    Mgr->>SDK: Runner.run(diagnose stage prompt, ROLE_TOOLS["diagnose"])
    SDK->>Forge: tool calls (run_diagnostic_command, propose_recipe, ...)
    Forge-->>Mgr: stage_complete (with proposed recipe)
    Mgr->>Gate: should_request_approval(trust, risk)
    alt low risk + autonomous
        Mgr->>SDK: Runner.run(execute stage prompt, ROLE_TOOLS["execute"])
        SDK->>Exec: execute_recipe(slug)
    else medium+ or supervised/manual
        Mgr->>DB: status = awaiting_approval
        Note over Mgr,DB: human approves via UI<br/>or rejects
    end
    Mgr->>SDK: Runner.run(review stage prompt, ROLE_TOOLS["review"])
    SDK->>Rev: tool calls (mark_resolved, generate_report, ...)
    Rev-->>Mgr: stage_complete
    Mgr->>DB: status = resolved
```
Each stage:

- Renders the matching prompt template (`templates/prompts/stage_*.jinja2`), optionally overridden per tenant via the `prompt_templates` table.
- Builds the agent with the role's tools (`builder.py:ROLE_TOOLS`).
- Runs the SDK Runner with a tool-call budget (sketched below).
- Persists the trace as `agent_events` rows.
Custom-type incidents skip the validate / execute / review stages and resolve immediately after diagnose with the requested data.
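The per-stage invocation can be pictured with the OpenAI Agents SDK; the function name, parameters, and `max_turns` default below are illustrative, not the actual `swarm/manager.py` code:

```python
from agents import Agent, Runner  # OpenAI Agents SDK


async def run_stage(stage: str, prompt: str, tools: list, task: str, budget: int = 10):
    """One pipeline stage: build an agent limited to the stage's tools,
    run it under a tool-call budget, and return the items that the
    manager would persist as agent_events rows."""
    agent = Agent(
        name=f"{stage}-agent",
        instructions=prompt,   # rendered from templates/prompts/stage_*.jinja2
        tools=tools,           # ROLE_TOOLS[stage] from builder.py
    )
    result = await Runner.run(agent, input=task, max_turns=budget)
    return result.final_output, result.new_items
```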
### Mode-based gating
The server's mode column shapes which stages run before the
trust × risk gate even gets evaluated:
| Mode | Pipeline shape |
|---|---|
| `audit` | Triage runs only. Incident auto-resolves with `resolution_summary='audit-mode classification'`. The agent never proposes a recipe. |
| `shadow` | Full pipeline. Every execute call requires human approval, regardless of trust × risk. |
| `live` | Full pipeline. Trust × risk gate decides per-call (see below). |
### Trust × risk gate
Inside live mode, the gate compares the assigned agent's
trust_level to the proposed recipe's risk_level:
| Agent trust | Auto-executes risk levels |
|---|---|
| `autonomous` | `none`, `low` |
| `supervised` | `none`, `low` |
| `manual` | `none` (read-only only) |
Anything not in the auto-execute set requires human approval. After the trust × risk gate passes, an optional ML safety classifier may veto auto-execution (it can only tighten, never relax).
A recipe_role_override short-circuits the gate entirely for one
specific (tenant, recipe_slug, server_role) tuple — used when an
operator has manually approved a recipe enough times on a given role
that they want it to auto-execute thereafter. See
recipe authoring.
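Putting the pieces together (mode, trust × risk, role override, classifier veto), here is a sketch of the decision. The real gate lives in `swarm/guardrails.py`; the exact ordering of the override check relative to the classifier is an assumption here:

```python
# Risk levels an agent may auto-execute in live mode (from the table above).
AUTO_EXECUTE = {
    "autonomous": {"none", "low"},
    "supervised": {"none", "low"},
    "manual": {"none"},
}


def should_request_approval(
    mode: str,
    trust: str,
    risk: str,
    has_role_override: bool = False,
    classifier_veto: bool = False,
) -> bool:
    """Return True when a human must approve before the execute stage runs."""
    if has_role_override:
        return False                 # recipe_role_override short-circuits the gate
    if mode == "shadow":
        return True                  # shadow mode: every execute call needs approval
    # audit mode never reaches this point: the pipeline stops after triage
    allowed = risk in AUTO_EXECUTE.get(trust, set())
    if classifier_veto:
        allowed = False              # safety classifier can only tighten, never relax
    return not allowed
```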
Implementation:

- Pipeline: `swarm/manager.py`, `swarm/builder.py`.
- Stages: `templates/prompts/stage_*.jinja2` + DB overrides via `services/prompts.py`.
- Approval gate: `swarm/guardrails.py`.
## Flow D — Tool call execution
When the LLM emits a tool_call, the SDK invokes the matching Python
function in swarm/tools/.
```mermaid
flowchart LR
    LLM[LLM emits tool_call<br/>e.g. run_diagnostic_command<br/>verb=docker_disk_usage] --> SDK[Agents SDK<br/>on_invoke_tool]
    SDK --> FN[Python function<br/>swarm/tools/diagnostic.py]
    FN --> ADHOC[_run_adhoc<br/>module=shell, args=...]
    ADHOC --> AR[ansible_runner.run]
    AR --> SSH[SSH to target server<br/>SSHConfig from DB<br/>decrypted with encryption_key]
    SSH --> CAPTURE[capture output]
    CAPTURE --> TRUNC[_truncate<br/>5000 char cap]
    TRUNC --> LLM
```
Custom tools follow the same path with extra guardrails:

- `shell_command` → `_render_shell_template` (shlex-quoted params) → `_run_adhoc(shell, ...)`.
- `http_request` → `_is_safe_public_url` (SSRF block) → aiohttp.
- `python_script` → disabled, returns a clear error.
Built-in run_diagnostic_command accepts only the curated verb enum;
free-form shell is rejected.
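Two of those guardrails are small enough to sketch inline. The function names mirror the prose above, but the `$`-placeholder template syntax is an assumption about how operator-defined commands are written:

```python
import shlex
import string

MAX_TOOL_OUTPUT = 5_000  # character cap applied before output returns to the LLM


def render_shell_template(template: str, params: dict[str, str]) -> str:
    """Sketch of the shell_command guardrail: every parameter is shell-quoted
    before substitution, so a value like "nginx; reboot" stays one literal
    argument instead of becoming a second command."""
    quoted = {key: shlex.quote(str(value)) for key, value in params.items()}
    return string.Template(template).substitute(quoted)


def truncate(output: str, limit: int = MAX_TOOL_OUTPUT) -> str:
    """Cap captured output, mirroring the _truncate step in the diagram."""
    return output if len(output) <= limit else output[:limit] + "\n... [truncated]"
```

For example, `render_shell_template("systemctl status $service", {"service": "nginx; reboot"})` yields a command whose argument is the quoted string, not a second command.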
## Flow E — Recipe execution (worker path)
Recipes are executed asynchronously, separate from the in-process swarm tool path.
```mermaid
sequenceDiagram
    autonumber
    participant Agent as Agent (execute stage)
    participant API as api
    participant DB as postgres
    participant Q as ARQ queue (redis)
    participant W as worker
    participant Tgt as Target server
    participant WS as WS /ws/executions/{id}
    Agent->>API: tool: execute_recipe(slug)
    API->>DB: create_execution row
    API->>Q: enqueue dispatch_recipe_execution
    Q->>W: pull task
    W->>DB: load execution + recipe + server
    W->>Tgt: ansible_runner.run(playbook=...)
    loop streaming
        Tgt-->>W: stdout chunks
        W->>WS: publish to redis 'execution:{id}'
    end
    W->>DB: update execution status (success/failed)
    W->>API: notify_incident_update
```
Live output reaches the browser through /ws/executions/{id}. The
WS handler verifies tenant ownership of the execution before
subscribing to the channel.
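The enqueue side is a one-liner with ARQ. A sketch, assuming the job name from the diagram and the `redis` compose hostname (the real dispatch code is in `worker/dispatch.py`, and a long-lived pool would normally be reused rather than created per call):

```python
from arq import create_pool
from arq.connections import RedisSettings


async def enqueue_recipe_execution(execution_id: str) -> None:
    """Sketch of the API-side handoff: the executions row is written first,
    then the ARQ job is enqueued; the worker pulls it, runs ansible_runner
    against the target, and streams stdout to the 'execution:{id}' channel."""
    pool = await create_pool(RedisSettings(host="redis"))  # 'redis' is the compose service name
    await pool.enqueue_job("dispatch_recipe_execution", execution_id)
```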
Implementation:

- Dispatch: `worker/dispatch.py`.
- Worker: `worker/tasks.py`.
- Playbooks: `backend/src/openremedy/playbooks/`.
## Flow F — Frontend ↔ backend
```mermaid
sequenceDiagram
    autonumber
    participant Browser
    participant Caddy
    participant API
    participant Redis
    Browser->>Caddy: HTTP / WS request<br/>Cookie: access_token
    Caddy->>API: forward
    alt REST /api/v1/...
        API->>API: get_current_user (cookie or Bearer)
        API-->>Caddy: 200 + JSON
    else WS /ws/incidents
        API->>API: _authenticate_ws (cookie or subprotocol)
        API->>Redis: SUBSCRIBE 'incidents'
        loop fanout
            Redis-->>API: message
            API->>API: filter by tenant_id
            API->>Browser: send_json (if tenant matches)
        end
    end
```
The frontend never sees the JWTs. Login sets the cookies; refresh is
automatic on a 401 (`tryRefresh` in `frontend/src/api/client.ts`);
logout calls `/auth/logout` to clear them. WebSocket connections use
the cookie path by default: `new WebSocket(url)` sends the cookie
on the upgrade automatically.
Impersonation runs through /admin/impersonate/{tenant_id} and
/admin/stop-impersonating. The cookie swap is server-side; JS only
holds the impersonated tenant name in sessionStorage for the
banner.
Implementation:

- Frontend client: `frontend/src/api/client.ts`, `frontend/src/api/auth.ts`, `frontend/src/hooks/useAuth.ts`, `frontend/src/hooks/useWebSocket.ts`.
- Backend deps: `backend/src/openremedy/dependencies.py`, `backend/src/openremedy/api/auth.py`, `backend/src/openremedy/api/admin.py`.
## Flow G — Schedulers and watchers
The async background loops, all running in their respective containers.
```mermaid
flowchart TB
    subgraph swarm_ctr["swarm container"]
        SM[SwarmManager<br/>subscribes 'incidents', 'approvals']
        PS[PatrolScheduler<br/>every patrol_interval per agent]
    end
    subgraph proactive_ctr["proactive container"]
        CS[CheckScheduler<br/>60 s sweep of recipe_check policies]
        CE[CheckEvaluator<br/>processes check_results]
        IW[IncidentWatcher<br/>subscribes 'incidents', 'approvals']
        MS[MaintenanceScheduler<br/>activates pending_approval/approved schedules]
        PM[PromotionScanner<br/>surfaces auto-execute promotion suggestions]
        TL[TelemetryLoop<br/>24 h anonymous metrics + version check]
    end
    SM -. via Redis .-> SM
    PS -. creates incidents .-> SM
    CS -. enqueues to ARQ .-> Worker[worker]
    Worker -. writes .-> CR[(check_results)]
    CR --> CE
    CE -. creates incident .-> SM
    IW -. re-invokes .-> SM
    MS -. starts schedule .-> Worker
    PM -. surfaces in UI .-> UI[Dashboard]
    TL -. POST /v1/ping .-> TR[telemetry.openremedy.io]
    TR -. GET /v1/latest .-> TL
```
Six loops run in the proactive container today (CheckScheduler,
CheckEvaluator, IncidentWatcher, MaintenanceScheduler, PromotionScanner,
TelemetryLoop), plus SwarmManager and PatrolScheduler in the swarm
container. The telemetry loop is opt-out via
`OREMEDY_OFFLINE_MODE=true` (turns everything off) or
`OREMEDY_TELEMETRY_DISABLED=true` (turns off metrics only; the version
check still runs).
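These loops share one shape: a sleep-and-sweep coroutine that delegates the real work to the ARQ queue. A sketch of the CheckScheduler variant, with the query helper and job name assumed rather than taken from the code:

```python
import asyncio


async def check_scheduler_loop(db, arq_pool) -> None:
    """Sketch of the CheckScheduler sweep: every 60 s, find recipe_check
    policies that are due and enqueue the actual run on the ARQ queue so
    the worker container does the SSH work and writes check_results."""
    while True:
        due = await db.fetch_due_recipe_checks()                       # hypothetical query helper
        for policy in due:
            await arq_pool.enqueue_job("run_recipe_check", policy.id)  # job name assumed
        await asyncio.sleep(60)
```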
Detailed mechanism descriptions in proactive.md. Privacy
posture for telemetry in privacy.md.
## Flow H — Sidechain transcripts (per-stage reasoning record)
Every stage of the agent pipeline streams its full reasoning to a
JSONL file in SeaweedFS, separate from the summary that lands in
agent_events. The dashboard's incident page renders the summary;
the JSONL is the audit-grade detail.
- Layout: `tenants/{slug}/incidents/{incident_id}/transcripts/{stage}.jsonl` where `{stage}` ∈ {triage, diagnose, validate, execute, review}.
- Lines: each is a JSON object of shape `{ts, stage, type, payload}` where `type` is one of `model_call`, `tool_call`, `tool_result`, `reasoning`, `stage_end`.
- Writer: `services/sidechain.py`'s `SidechainWriter` is instantiated once per stage and finalised with a `stage_end` line when the stage completes (success or failure).
- Reader: `GET /api/v1/incidents/{id}/transcripts` (lists available stages) and `/transcripts/{stage}` (streams the JSONL).
- Retention: tied to incident retention; not pruned independently.
The transcripts are never sent off-host (telemetry payloads contain counters only, never transcript content). The separate JSONL exists so that post-incident debugging — an operator asking "why did the agent propose this recipe?" — has a single file to grep, regardless of which summary the timeline view chose to show.
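A sketch of the writer side, assuming the line shape listed above; buffering in memory and returning bytes for upload are simplifications of whatever `services/sidechain.py` actually does:

```python
import json
import time


class SidechainWriter:
    """Per-stage transcript writer sketch: one JSON object per line,
    finalised with a stage_end record whether the stage succeeded or failed."""

    def __init__(self, stage: str) -> None:
        self.stage = stage
        self._lines: list[str] = []

    def emit(self, type_: str, payload: dict) -> None:
        # type_ is one of: model_call, tool_call, tool_result, reasoning, stage_end
        self._lines.append(json.dumps(
            {"ts": time.time(), "stage": self.stage, "type": type_, "payload": payload}
        ))

    def finalize(self, status: str) -> bytes:
        # Uploaded to tenants/{slug}/incidents/{incident_id}/transcripts/{stage}.jsonl
        self.emit("stage_end", {"status": status})
        return ("\n".join(self._lines) + "\n").encode()
```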
## Data model highlights
The full schema lives in Alembic migrations (backend/migrations/versions/).
Tables worth knowing:
| Table | Notes |
|---|---|
| `tenants` | Includes `webhook_secret` (auto-generated 32-byte URL-safe). |
| `users` | Roles: `user` / `admin` / `superadmin`. Argon2id hashes via `argon2-cffi`. |
| `servers` | `last_seen` driven by daemon heartbeats. `labels` JSONB stores the daemon report (`_daemon_report`) and discovery (`_discovery`) blobs, capped at 256 KiB. |
| `incidents` | Lifecycle in `status`. `suppressed_by_maintenance_id` links to an active maintenance schedule when set. |
| `executions` | One row per recipe execution. WS streams reference `execution.id`. |
| `recipes` | Global catalog. No `tenant_id`; superadmin-only writes. |
| `policies` | `flow_definition` JSONB holds the visual flow (trigger → action). |
| `agents`, `skills`, `agent_skills` | Agent configuration + assigned skills. |
| `custom_tools` | Operator-defined tools. `tool_type` is `shell_command` or `http_request`. `python_script` is disabled. |
| `audit_logs` | Append-only. `tenant_id` is nullable for system events. |
| `agent_events` | Pipeline trace per incident. Foreign-key `incident_id`, indexed. |
| `check_results` | Output of CheckScheduler runs, fed to the Evaluator. |
| `maintenance_*` | Plans, schedules, runs, step executions. |
| `prompt_templates` | Per-tenant overrides of the shipped Jinja stage prompts. |
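The `webhook_secret` column pairs with the HMAC-verified webhook ingest from Flow B. A sketch of how such a secret can be generated and a signature checked; the header format and SHA-256 digest choice here are assumptions, not the documented wire format:

```python
import hashlib
import hmac
import secrets


def new_webhook_secret() -> str:
    # matches the "auto-generated 32-byte URL-safe" note in the table above
    return secrets.token_urlsafe(32)


def verify_webhook_signature(secret: str, body: bytes, signature_header: str) -> bool:
    # constant-time comparison of the received hex digest against our own
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```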
## Deployment topology (production)
The reference deployment is a single VPS running the
docker-compose.prod.yml stack. Caddy fronts the public network;
all data services and Phoenix sit on the internal network. The
managed servers run only the Go daemon, communicating outbound to
the VPS over HTTPS.
Migrations run separately via:
```bash
ssh alberto@<host> "cd /home/alberto/openremedy && \
  source .env.prod && \
  docker compose -f docker-compose.prod.yml --env-file .env.prod \
    exec -e OREMEDY_DB_URL_SYNC=postgresql://openremedy:${POSTGRES_PASSWORD}@postgres:5432/openremedy \
    api alembic upgrade heads"
```
The startup hook in main.py will silently skip in-process
migration when alembic.ini is not present in the wheel (the
production case) and raise on any actual upgrade error so the
orchestrator restart loop fires instead of silently booting on a
half-migrated database.
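The described startup behaviour, sketched below; the function name and the package-root argument are illustrative, not the actual `main.py` hook:

```python
from pathlib import Path

from alembic import command
from alembic.config import Config


def run_startup_migrations(package_root: Path) -> None:
    """Skip quietly when alembic.ini is not shipped with the wheel (the
    production case), but let a real upgrade failure propagate so the
    container exits and the orchestrator's restart loop takes over."""
    ini = package_root / "alembic.ini"
    if not ini.exists():
        return                                   # production: migrations run via the ssh command above
    command.upgrade(Config(str(ini)), "heads")   # deliberately not wrapped in try/except
```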
## Where to look in the code
| Concern | Path |
|---|---|
| Routes | backend/src/openremedy/api/ |
| Business logic | backend/src/openremedy/services/ |
| ORM models | backend/src/openremedy/models/ |
| Pydantic schemas | backend/src/openremedy/schemas/ |
| Swarm pipeline | backend/src/openremedy/swarm/ |
| Built-in tools | backend/src/openremedy/swarm/tools/ |
| Prompt templates | backend/src/openremedy/templates/prompts/ |
| Migrations | backend/migrations/versions/ |
| Worker tasks | backend/src/openremedy/worker/ |
| Proactive loops | backend/src/openremedy/proactive/ |
| Daemon | daemon/cmd/openremedy-client/, daemon/internal/ |
| Frontend pages | frontend/src/app/(dashboard)/ |
| Frontend API client | frontend/src/api/ |