
Architecture

This document describes the runtime architecture of OpenRemedy: how the components are deployed, how a request travels through them, and how each major flow is implemented. It is the technical companion to the home page and proactive.md.


Service map

flowchart LR
    subgraph customer["Customer server"]
      D[openremedy-client<br/>Go daemon]
    end

    subgraph public["Public network"]
      USR[Browser]
      EXT[External monitoring stack]
    end

    subgraph platform["OpenRemedy VPS — docker-compose"]
      CADDY[caddy<br/>TLS, reverse proxy]
      UI[ui<br/>Next.js]
      API[api<br/>FastAPI]
      WORKER[worker<br/>ARQ async]
      SWARM[swarm<br/>SwarmManager + PatrolScheduler]
      PROACT[proactive<br/>CheckScheduler + Evaluator + IncidentWatcher]
      PG[(postgres<br/>PostgreSQL 16)]
      REDIS[(redis<br/>pub/sub + ARQ queue)]
      S3[(seaweedfs<br/>S3-compatible)]
      PHX[phoenix<br/>LLM tracing<br/>internal-only]
    end

    USR --> CADDY
    EXT --> CADDY
    D --> CADDY
    CADDY --> API
    CADDY --> UI
    API --> PG
    WORKER --> PG
    SWARM --> PG
    PROACT --> PG
    API <--> REDIS
    WORKER <--> REDIS
    SWARM <--> REDIS
    PROACT <--> REDIS
    API --> S3
    WORKER --> S3
    WORKER -- SSH + Ansible --> D
    SWARM -. tracing .-> PHX

Containers per docker-compose.prod.yml:

Container   Image source                  Role
caddy       caddy:2-alpine                TLS, automatic Let's Encrypt, reverse proxy
ui          frontend/Dockerfile           Next.js dashboard
api         backend/Dockerfile            FastAPI REST + WebSocket
worker      backend/Dockerfile            ARQ workers (recipe execution, classification)
swarm       backend/Dockerfile            Swarm Manager + Patrol Scheduler
proactive   backend/Dockerfile            Check Scheduler + Evaluator + Incident Watcher + Maintenance Scheduler
postgres    postgres:16-alpine            Primary datastore
redis       redis:7-alpine                Pub/sub + ARQ queue
seaweedfs   chrislusf/seaweedfs           S3-compatible object storage for evidence blobs
phoenix     arizephoenix/phoenix:latest   LLM tracing, internal-only network

The four backend services (api, worker, swarm, proactive) share the same image — they differ only in entrypoint command.

Networks: public (caddy + ui + api + worker + swarm + proactive), internal (postgres, redis, seaweedfs, phoenix, plus the backend containers). Phoenix is internal-only by design (see security.md).


Flow A — Daemon → backend

The daemon runs three loops on the customer server.

sequenceDiagram
    autonumber
    participant Agent as openremedy-client
    participant Caddy as caddy
    participant API as api (FastAPI)
    participant DB as postgres

    loop every 15 s
      Agent->>Caddy: POST /daemon/v1/heartbeat (token in body)
      Caddy->>API: forward
      API->>DB: update servers.last_seen
      API-->>Agent: 200
    end

    loop every 15 s
      Agent->>Caddy: POST /daemon/v1/evidence (token in body, monitors, alerts)
      Caddy->>API: forward
      API->>DB: write evidence + create incidents for new alerts
      API-->>Agent: 200
    end

    loop every cycle
      Agent->>Caddy: GET /daemon/v1/tasks (Authorization: Bearer)
      Caddy->>API: forward
      API->>DB: compile_monitors_for_server
      API->>API: HMAC-sign each type=custom command
      API-->>Agent: monitors[] with signature on customs
    end

All three calls are wrapped in httputil.DoWithRetry (capped exponential backoff 1 s → 30 s + 25 % jitter, four attempts) so a transient platform 5xx does not cost a cycle.
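
As a rough illustration, the schedule looks like this (a Python sketch of the Go logic; the real implementation is daemon/internal/httputil/, and the exact jitter shape here is an assumption):

import random
import time

def do_with_retry(call, attempts=4, base=1.0, cap=30.0, jitter=0.25):
    # Capped exponential backoff: 1 s, 2 s, 4 s, ... capped at 30 s, plus
    # up to 25 % jitter so retries from many daemons do not align.
    for attempt in range(attempts):
        resp = call()
        if resp.status_code < 500:
            return resp                      # success, or a client error worth surfacing
        if attempt == attempts - 1:
            return resp                      # out of attempts: report the last 5xx
        delay = min(cap, base * 2 ** attempt)
        time.sleep(delay * (1 + random.uniform(0, jitter)))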

Implementation:

  • Daemon: daemon/internal/{heartbeat,reporter,httputil}/, daemon/cmd/openremedy-client/main.go.
  • Backend: backend/src/openremedy/api/daemon.py.
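
One step worth a sketch is the HMAC signing of type=custom commands (step 9 in the diagram): the API signs each command so the daemon can verify it before running it. Key source and exact message framing below are assumptions; the real scheme is in backend/src/openremedy/api/daemon.py.

import hashlib
import hmac

def sign_custom_command(secret: bytes, command: str) -> str:
    # HMAC-SHA256 over the command text, hex-encoded; the signature travels
    # alongside the monitor entry in the /daemon/v1/tasks response.
    return hmac.new(secret, command.encode(), hashlib.sha256).hexdigest()

def verify_custom_command(secret: bytes, command: str, signature: str) -> bool:
    # Daemon side: constant-time comparison before the command is accepted.
    return hmac.compare_digest(sign_custom_command(secret, command), signature)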

Flow B — Incident creation and fanout

flowchart LR
    EV[Evidence POST] --> CREATE[create_incident]
    WH[Webhook POST<br/>HMAC-verified] --> CREATE
    SCH[CheckEvaluator<br/>fail decision] --> CREATE
    PAT[Agent patrol<br/>finds anomaly] --> CREATE
    MAN[Manual UI] --> CREATE

    CREATE --> DB[(postgres<br/>incidents table)]
    CREATE --> PUB[publish_incident_event<br/>tenant_id resolved]
    PUB --> REDIS[(redis 'incidents' channel)]

    REDIS --> SWARM_M[swarm Manager<br/>spawn pipeline]
    REDIS --> WATCH[IncidentWatcher<br/>re-invoke on comments]
    REDIS --> WSAPI[api WS handler<br/>filter by tenant_id]
    WSAPI --> WS[/ws/incidents to UI/]

Key file: swarm/events.py:publish_incident_event. The publisher resolves the incident's tenant_id (or accepts it via kwargs) and attaches it to every payload so subscribers can filter out cross-tenant messages. worker/notify.py:notify_incident_update does the same lookup.
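
A minimal sketch of that publisher, assuming redis.asyncio and illustrative payload fields:

import json
import redis.asyncio as aioredis

async def publish_incident_event(r: aioredis.Redis, incident, tenant_id=None) -> None:
    # Resolve tenant_id from the incident when the caller did not pass it,
    # and stamp it on the payload so subscribers can filter by tenant.
    payload = {
        "incident_id": str(incident.id),
        "tenant_id": str(tenant_id or incident.tenant_id),
        "status": incident.status,
    }
    await r.publish("incidents", json.dumps(payload))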


Flow C — Swarm pipeline

When the SwarmManager picks up a new incident from Redis, it spawns an async pipeline task.

sequenceDiagram
    autonumber
    participant Mgr as SwarmManager
    participant Atlas as Triage agent
    participant Forge as Diagnose agent
    participant Gate as guardrails
    participant Exec as Execute agent
    participant Rev as Review agent
    participant DB as postgres
    participant SDK as Agents SDK

    Mgr->>DB: load incident + assigned agent
    Mgr->>SDK: Runner.run(triage stage prompt, ROLE_TOOLS["triage"])
    SDK->>Atlas: tool calls (search_past_resolutions, record_event, ...)
    Atlas-->>Mgr: stage_complete
    Mgr->>SDK: Runner.run(diagnose stage prompt, ROLE_TOOLS["diagnose"])
    SDK->>Forge: tool calls (run_diagnostic_command, propose_recipe, ...)
    Forge-->>Mgr: stage_complete (with proposed recipe)
    Mgr->>Gate: should_request_approval(trust, risk)
    alt low risk + autonomous
      Mgr->>SDK: Runner.run(execute stage prompt, ROLE_TOOLS["execute"])
      SDK->>Exec: execute_recipe(slug)
    else medium+ or supervised/manual
      Mgr->>DB: status = awaiting_approval
      Note over Mgr,DB: human approves via UI<br/>or rejects
    end
    Mgr->>SDK: Runner.run(review stage prompt, ROLE_TOOLS["review"])
    SDK->>Rev: tool calls (mark_resolved, generate_report, ...)
    Rev-->>Mgr: stage_complete
    Mgr->>DB: status = resolved

Each stage (see the sketch after this list):

  • Renders the matching prompt template (templates/prompts/stage_*.jinja2), optionally overridden per tenant via prompt_templates table.
  • Builds the agent with the role's tools (builder.py:ROLE_TOOLS).
  • Runs the SDK Runner with a tool-call budget.
  • Persists the trace as agent_events rows.
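
In sketch form, assuming the OpenAI Agents SDK surface the diagram already names (Agent, Runner.run, ROLE_TOOLS) plus a hypothetical save_agent_events helper:

from jinja2 import Environment, FileSystemLoader
from agents import Agent, Runner  # OpenAI Agents SDK

env = Environment(loader=FileSystemLoader("templates/prompts"))

async def run_stage(stage: str, incident, role_tools: dict, db):
    # 1. Render the stage prompt (per-tenant overrides via services/prompts.py
    #    are elided here).
    prompt = env.get_template(f"stage_{stage}.jinja2").render(incident=incident)
    # 2. Build the agent with the role's tools (builder.py:ROLE_TOOLS).
    agent = Agent(name=f"{stage}-agent", instructions=prompt, tools=role_tools[stage])
    # 3. Run with a tool-call budget; max_turns caps the loop.
    result = await Runner.run(agent, input=incident.summary, max_turns=12)
    # 4. Persist the trace as agent_events rows (save_agent_events is hypothetical).
    await save_agent_events(db, incident.id, stage, result.new_items)
    return result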

Custom-type incidents skip the validate / execute / review stages and resolve immediately after diagnose with the requested data.

Mode-based gating

The server's mode column shapes which stages run before the trust × risk gate even gets evaluated:

Mode     Pipeline shape
audit    Triage runs only. Incident auto-resolves with resolution_summary='audit-mode classification'. The agent never proposes a recipe.
shadow   Full pipeline. Every execute call requires human approval, regardless of trust × risk.
live     Full pipeline. Trust × risk gate decides per-call (see below).

Trust × risk gate

Inside live mode, the gate compares the assigned agent's trust_level to the proposed recipe's risk_level:

Agent trust   Auto-executes risk levels
autonomous    none, low
supervised    none, low
manual        none (read-only only)

Anything not in the auto-execute set requires human approval. After the trust × risk gate passes, an optional ML safety classifier may veto auto-execution (it can only tighten, never relax).
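
In sketch form the gate is a set lookup (taking the table above as the mapping; the real logic is swarm/guardrails.py):

AUTO_EXECUTE = {
    "autonomous": {"none", "low"},
    "supervised": {"none", "low"},
    "manual": {"none"},  # read-only only
}

def should_request_approval(trust_level: str, risk_level: str) -> bool:
    # True means a human must approve before the execute stage runs.
    # The recipe_role_override short-circuit and the ML safety classifier
    # veto wrap this check in the real code; the classifier can only turn
    # False into True, never the reverse.
    return risk_level not in AUTO_EXECUTE.get(trust_level, set())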

A recipe_role_override short-circuits the gate entirely for one specific (tenant, recipe_slug, server_role) tuple — used when an operator has manually approved a recipe enough times on a given role that they want it to auto-execute thereafter. See recipe authoring.

Implementation:

  • Pipeline: swarm/manager.py, swarm/builder.py.
  • Stages: templates/prompts/stage_*.jinja2 + DB overrides via services/prompts.py.
  • Approval gate: swarm/guardrails.py.

Flow D — Tool call execution

When the LLM emits a tool_call, the SDK invokes the matching Python function in swarm/tools/.

flowchart LR
    LLM[LLM emits tool_call<br/>e.g. run_diagnostic_command<br/>verb=docker_disk_usage] --> SDK[Agents SDK<br/>on_invoke_tool]
    SDK --> FN[Python function<br/>swarm/tools/diagnostic.py]
    FN --> ADHOC[_run_adhoc<br/>module=shell, args=...]
    ADHOC --> AR[ansible_runner.run]
    AR --> SSH[SSH to target server<br/>SSHConfig from DB<br/>decrypted with encryption_key]
    SSH --> CAPTURE[capture output]
    CAPTURE --> TRUNC[_truncate<br/>5000 char cap]
    TRUNC --> LLM

Custom tools follow the same path with extra guardrails:

  • shell_command → render_shell_template (shlex-quoted params) → _run_adhoc(shell, ...).
  • http_request → is_safe_public_url (SSRF block) → aiohttp.
  • python_script → disabled, returns a clear error.

Built-in run_diagnostic_command accepts only the curated verb enum; free-form shell is rejected.
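
A sketch of the two guardrails named above, with the template substitution mechanism as an assumption:

import shlex

MAX_TOOL_OUTPUT = 5000  # character cap applied before output returns to the LLM

def render_shell_template(template: str, params: dict[str, str]) -> str:
    # Quote every parameter before substitution so a param value can never
    # break out of the operator-authored template.
    quoted = {key: shlex.quote(value) for key, value in params.items()}
    return template.format(**quoted)

def _truncate(output: str, cap: int = MAX_TOOL_OUTPUT) -> str:
    return output if len(output) <= cap else output[:cap] + "\n[truncated]"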


Flow E — Recipe execution (worker path)

Recipes are executed asynchronously, separate from the in-process swarm tool path.

sequenceDiagram
    autonumber
    participant Agent as Agent (execute stage)
    participant API as api
    participant DB as postgres
    participant Q as ARQ queue (redis)
    participant W as worker
    participant Tgt as Target server
    participant WS as WS /ws/executions/{id}

    Agent->>API: tool: execute_recipe(slug)
    API->>DB: create_execution row
    API->>Q: enqueue dispatch_recipe_execution
    Q->>W: pull task
    W->>DB: load execution + recipe + server
    W->>Tgt: ansible_runner.run(playbook=...)
    loop streaming
      Tgt-->>W: stdout chunks
      W->>WS: publish to redis 'execution:{id}'
    end
    W->>DB: update execution status (success/failed)
    W->>API: notify_incident_update

Live output reaches the browser through /ws/executions/{id}. The WS handler verifies tenant ownership of the execution before subscribing to the channel.
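
The enqueue and the streaming publish are each one call; a sketch assuming arq's standard API and the channel name from the diagram:

import redis.asyncio as aioredis
from arq import create_pool
from arq.connections import RedisSettings

async def dispatch(execution_id: str) -> None:
    # api side: the execution row already exists in postgres; the worker
    # loads execution + recipe + server from it by id.
    pool = await create_pool(RedisSettings(host="redis"))
    await pool.enqueue_job("dispatch_recipe_execution", execution_id)

async def publish_chunk(r: aioredis.Redis, execution_id: str, chunk: str) -> None:
    # worker side: each stdout chunk goes to the channel that the
    # /ws/executions/{id} handler relays to the browser.
    await r.publish(f"execution:{execution_id}", chunk)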

Implementation:

  • Dispatch: worker/dispatch.py.
  • Worker: worker/tasks.py.
  • Playbooks: backend/src/openremedy/playbooks/.

Flow F — Frontend ↔ backend

sequenceDiagram
    autonumber
    participant Browser
    participant Caddy
    participant API
    participant Redis

    Browser->>Caddy: HTTP / WS request<br/>Cookie: access_token
    Caddy->>API: forward
    alt REST /api/v1/...
      API->>API: get_current_user (cookie or Bearer)
      API-->>Caddy: 200 + JSON
    else WS /ws/incidents
      API->>API: _authenticate_ws (cookie or subprotocol)
      API->>Redis: SUBSCRIBE 'incidents'
      loop fanout
        Redis-->>API: message
        API->>API: filter by tenant_id
        API->>Browser: send_json (if tenant matches)
      end
    end
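
The WS branch reduces to a subscribe-filter-forward loop. A minimal sketch, assuming FastAPI and redis.asyncio; _authenticate_ws and the user object's fields are stand-ins for the real handler's code:

import json
import redis.asyncio as aioredis
from fastapi import WebSocket

async def ws_incidents(websocket: WebSocket, r: aioredis.Redis) -> None:
    user = await _authenticate_ws(websocket)  # cookie or subprotocol, as above
    await websocket.accept()
    pubsub = r.pubsub()
    await pubsub.subscribe("incidents")
    async for message in pubsub.listen():
        if message["type"] != "message":
            continue  # skip subscribe confirmations
        event = json.loads(message["data"])
        # Every payload carries tenant_id (Flow B); drop cross-tenant events.
        if event.get("tenant_id") == str(user.tenant_id):
            await websocket.send_json(event)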

The frontend never sees the JWTs. Login sets the cookies; refresh is automatic on a 401 (tryRefresh in frontend/src/api/client.ts); logout calls /auth/logout to clear them. WebSocket connections take the cookie path by default: new WebSocket(url) sends the cookie on the upgrade automatically.

Impersonation runs through /admin/impersonate/{tenant_id} and /admin/stop-impersonating. The cookie swap is server-side; JS only holds the impersonated tenant name in sessionStorage for the banner.

Implementation:

  • Frontend client: frontend/src/api/client.ts, frontend/src/api/auth.ts, frontend/src/hooks/useAuth.ts, frontend/src/hooks/useWebSocket.ts.
  • Backend deps: backend/src/openremedy/dependencies.py, backend/src/openremedy/api/auth.py, backend/src/openremedy/api/admin.py.

Flow G — Schedulers and watchers

The async background loops, all running in their respective containers.

flowchart TB
    subgraph swarm_ctr["swarm container"]
      SM[SwarmManager<br/>subscribes 'incidents', 'approvals']
      PS[PatrolScheduler<br/>every patrol_interval per agent]
    end
    subgraph proactive_ctr["proactive container"]
      CS[CheckScheduler<br/>60 s sweep of recipe_check policies]
      CE[CheckEvaluator<br/>processes check_results]
      IW[IncidentWatcher<br/>subscribes 'incidents', 'approvals']
      MS[MaintenanceScheduler<br/>activates pending_approval/approved schedules]
      PM[PromotionScanner<br/>surfaces auto-execute promotion suggestions]
      TL[TelemetryLoop<br/>24 h anonymous metrics + version check]
    end

    SM -. via Redis .-> SM
    PS -. creates incidents .-> SM
    CS -. enqueues to ARQ .-> Worker[worker]
    Worker -. writes .-> CR[(check_results)]
    CR --> CE
    CE -. creates incident .-> SM
    IW -. re-invokes .-> SM
    MS -. starts schedule .-> Worker
    PM -. surfaces in UI .-> UI[Dashboard]
    TL -. POST /v1/ping .-> TR[telemetry.openremedy.io]
    TR -. GET /v1/latest .-> TL

Eight loops in total today: CheckScheduler, CheckEvaluator, IncidentWatcher, MaintenanceScheduler, PromotionScanner, and TelemetryLoop in the proactive container, plus SwarmManager and PatrolScheduler in the swarm container. The telemetry loop is opt-out via OREMEDY_OFFLINE_MODE=true (everything off) or OREMEDY_TELEMETRY_DISABLED=true (metrics off, version check still runs).
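
The loops share a simple shape; a sketch of the CheckScheduler sweep, with load_due_policies and the job name as assumptions:

import asyncio

async def check_scheduler_loop(load_due_policies, enqueue) -> None:
    # 60 s sweep: find recipe_check policies whose interval has elapsed and
    # hand each to the ARQ queue; the worker then writes check_results rows
    # that the CheckEvaluator consumes.
    while True:
        for policy in await load_due_policies():
            await enqueue("run_recipe_check", policy.id)  # job name is illustrative
        await asyncio.sleep(60)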

Detailed mechanism descriptions in proactive.md. Privacy posture for telemetry in privacy.md.


Flow H — Sidechain transcripts (per-stage reasoning record)

Every stage of the agent pipeline streams its full reasoning to a JSONL file in SeaweedFS, separate from the summary that lands in agent_events. The dashboard's incident page renders the summary; the JSONL is the audit-grade detail.

  • Layout: tenants/{slug}/incidents/{incident_id}/transcripts/{stage}.jsonl, where {stage} is one of triage, diagnose, validate, execute, review.
  • Lines: each is a JSON object of shape {ts, stage, type, payload} where type is one of model_call, tool_call, tool_result, reasoning, stage_end.
  • Writer: services/sidechain.py's SidechainWriter is instantiated once per stage and finalised with a stage_end line when the stage completes (success or failure).
  • Reader: GET /api/v1/incidents/{id}/transcripts (lists available stages) and /transcripts/{stage} (streams the JSONL).
  • Retention: tied to incident retention; not pruned independently.

The transcripts are never sent off-host (telemetry payloads contain counters only, never these). The JSONL is kept separate so that post-incident debugging — an operator asking "why did the agent propose this recipe?" — has a single file to grep, regardless of which summary the timeline view chose to show.
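
A sketch of the writer contract, with illustrative method names (the real SidechainWriter in services/sidechain.py also handles the S3 upload to SeaweedFS):

import json
import time

class SidechainWriter:
    def __init__(self, stage: str):
        self.stage = stage
        self.lines: list[str] = []

    def record(self, type_: str, payload: dict) -> None:
        # type_ is one of: model_call, tool_call, tool_result, reasoning, stage_end.
        self.lines.append(json.dumps(
            {"ts": time.time(), "stage": self.stage, "type": type_, "payload": payload}
        ))

    def finalize(self, status: str) -> bytes:
        # Always append a stage_end line, on success or failure, then emit the
        # JSONL body for tenants/{slug}/incidents/{id}/transcripts/{stage}.jsonl.
        self.record("stage_end", {"status": status})
        return ("\n".join(self.lines) + "\n").encode()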


Data model highlights

The full schema lives in Alembic migrations (backend/migrations/versions/). Tables worth knowing:

Table                          Notes
tenants                        Includes webhook_secret (auto-generated 32-byte URL-safe).
users                          Roles: user / admin / superadmin. Argon2id hashes via argon2-cffi.
servers                        last_seen driven by daemon heartbeats. labels JSONB stores the daemon report (_daemon_report) and discovery (_discovery) blobs, capped at 256 KiB.
incidents                      Lifecycle in status. suppressed_by_maintenance_id links to an active maintenance schedule when set.
executions                     One row per recipe execution. WS streams reference execution.id.
recipes                        Global catalog. No tenant_id; superadmin-only writes.
policies                       flow_definition JSONB holds the visual flow (trigger → action).
agents, skills, agent_skills   Agent configuration + assigned skills.
custom_tools                   Operator-defined tools. tool_type is shell_command or http_request. python_script is disabled.
audit_logs                     Append-only. tenant_id is nullable for system events.
agent_events                   Pipeline trace per incident. Foreign-key incident_id, indexed.
check_results                  Output of CheckScheduler runs, fed to the Evaluator.
maintenance_*                  Plans, schedules, runs, step executions.
prompt_templates               Per-tenant overrides of the shipped Jinja stage prompts.

Deployment topology (production)

The reference deployment is a single VPS running the docker-compose.prod.yml stack. Caddy fronts the public network; all data services and Phoenix sit on the internal network. The managed servers run only the Go daemon, communicating outbound to the VPS over HTTPS.

Migrations run separately via:

ssh alberto@<host> "cd /home/alberto/openremedy && \
  source .env.prod && \
  docker compose -f docker-compose.prod.yml --env-file .env.prod \
    exec -e OREMEDY_DB_URL_SYNC=postgresql://openremedy:${POSTGRES_PASSWORD}@postgres:5432/openremedy \
    api alembic upgrade heads"

The startup hook in main.py silently skips in-process migration when alembic.ini is not present in the wheel (the production case), but raises on any actual upgrade error so that the orchestrator's restart loop fires instead of the service silently booting against a half-migrated database.
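
In sketch form, assuming stock Alembic APIs (the real hook wires in the configured database URL):

from pathlib import Path
from alembic import command
from alembic.config import Config

def run_startup_migrations() -> None:
    ini = Path("alembic.ini")
    if not ini.exists():
        # Production wheel: no alembic.ini, so migrations run out-of-band
        # via the ssh command above.
        return
    # Any upgrade failure propagates: the container exits and the orchestrator
    # restarts it, rather than booting against a half-migrated schema.
    command.upgrade(Config(str(ini)), "heads")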


Where to look in the code

Concern               Path
Routes                backend/src/openremedy/api/
Business logic        backend/src/openremedy/services/
ORM models            backend/src/openremedy/models/
Pydantic schemas      backend/src/openremedy/schemas/
Swarm pipeline        backend/src/openremedy/swarm/
Built-in tools        backend/src/openremedy/swarm/tools/
Prompt templates      backend/src/openremedy/templates/prompts/
Migrations            backend/migrations/versions/
Worker tasks          backend/src/openremedy/worker/
Proactive loops       backend/src/openremedy/proactive/
Daemon                daemon/cmd/openremedy-client/, daemon/internal/
Frontend pages        frontend/src/app/(dashboard)/
Frontend API client   frontend/src/api/