How a one-architect shop runs a 24/7 multi-agent operation: a laptop that steers, an edge that gates, a three-node fleet that executes, and GPU that appears only when summoned.
The doctrine in one line: “If Roberto closes the laptop, does it keep working?” — anything that must answer yes lives on the Fleet or the edge, never on the laptop.
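The doctrine can be read as a placement check. What follows is a minimal sketch, not the shop's actual tooling: the `Service` record, the `must_survive_laptop_close` flag, and the sample inventory are all hypothetical, and only the host name AirJA1 comes from this page.

```python
from dataclasses import dataclass

# Hosts that are allowed to go dark at night (the cockpit laptop).
LAPTOP_HOSTS = {"AirJA1"}


@dataclass(frozen=True)
class Service:
    name: str
    host: str
    must_survive_laptop_close: bool  # hypothetical flag set by whoever owns the service


def doctrine_violations(services: list[Service]) -> list[str]:
    """Return names of services that must stay up but are placed on a laptop."""
    return [
        s.name
        for s in services
        if s.must_survive_laptop_close and s.host in LAPTOP_HOSTS
    ]


# Illustrative inventory only; the entries are made up for the example.
inventory = [
    Service("task board", "iron-pc", True),
    Service("review session", "AirJA1", False),
    Service("nightly report", "AirJA1", True),  # misplaced on purpose
]
```

Running `doctrine_violations(inventory)` flags only the misplaced service, which is the one that would stop when the laptop closes.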
1
Operator & Direction
The human-in-the-loop layer. The architect designs and reviews; he doesn’t execute. Work is relayed down to agents, so the laptop can close at night without anything stopping.
not 24/7 · steers, doesn't serve
AirJA1 · MacBook
The cockpit. Where direction, review, and Claude Code sessions happen. It relays every substantive command down to the CEO agent — it is not a server.
Claude Code sessions & QA
Relay-first: hands work to the agent fleet
Closes at night — nothing breaks
MiniJA2 · Mac M4 Pro
An always-on local-AI inference node. Bare metal (no containers) running open models for cheap, private, high-volume work.
Ollama serving Gemma / Qwen
Embeddings, transcription, draft generation
Health heartbeats & canaries — token-free
Internal/dev inference — not client production
Air → relay → CEO · Mini → inference → Fleet
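As a rough illustration of how a caller might reach the Mini's Ollama server, here is a sketch using only the standard library. It assumes Ollama's default port 11434 and its `/api/generate` endpoint. The host and the `gemma3` model tag are placeholders, not values taken from this setup.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # assumed: Ollama's default port on the Mini
MODEL = "gemma3"  # assumed tag; use whatever `ollama list` reports


def build_generate_request(
    prompt: str, model: str = MODEL, host: str = OLLAMA_HOST
) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(prompt: str, timeout: float = 30.0) -> str:
    """Send the prompt to the local model and return its text reply."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Because the call never leaves the local network, a canary such as `ask_local_model("Reply with OK")` costs no paid tokens.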
2
Edge · Security · Data Plane
A zero-trust front door. No server port is open to the internet: every request enters through an encrypted Tunnel and is authenticated by Access (Google SSO) before it reaches the Fleet. The edge also holds state (D1/KV/R2) and fronts the AI gateway.
a first-class dependency, not just DNS
Cloudflare · the spine
Every internal UI and origin sits behind it. Nothing on the Fleet is exposed to the raw internet — traffic enters through Tunnels and is gated by Access (Google SSO).
DNS · WAF
Access (SSO)
Tunnels
Workers · Pages
D1 · KV · R2
AI Gateway → OpenRouter
cron audit worker
cross-agent sync bus
Cloudflare → Tunnel + Access → every Fleet node
3
The Fleet
Three inexpensive CPU VPSes, each with one job: control plane, core runtime, and executive agents. Separation of concerns means a problem on one node doesn’t cascade, and the whole production surface runs 24/7 independent of any laptop.
CPU VPSes · all production · 24/7
iron-pc
Control Plane / CEO
The brain that routes work. Holds the task board and the agent org.
Paperclip app + Postgres + watchdog
CEO + 17 agents (migrating to Hermes)
Label reconciler cron · SRE canary
iron-worker-02
Core Runtime / NB OS
The workhorse. Runs the product and the client-facing services.
Dashboard + voice proxy · voice core
MCP bridge · Graphiti / FalkorDB
CMO marketing swarm · client app containers
Outbound success monitor · scheduled audits
iron-vps-01
Broker + Hermes C-Suite
Locked-down egress broker plus the executive agent pods.
GPU is the priciest resource, so it’s never left idling. Rented per-job for media, 3D, video, and one-off model evals, then terminated: heavy compute on tap without a standing monthly bill.
on-demand · ephemeral · never standing
Runpod · rented GPU
Spun up for jobs too big for the Mini — media, 3D, video, one-off open-model evals. Pay-per-hour, terminated after each job.
FAL · managed media
Managed media generation on the owned pipeline. Pay-per-job — no always-on GPU bill.
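The "rent, run, terminate" pattern maps naturally onto a context manager, so the teardown happens even when a job crashes. This is a sketch under assumptions: `start` and `terminate` stand in for whatever the GPU provider's client exposes, and `FakeProvider` is a hypothetical in-memory stub, not Runpod's or FAL's real API.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def ephemeral_gpu(
    start: Callable[[], str], terminate: Callable[[str], None]
) -> Iterator[str]:
    """Rent a GPU for one job and always terminate it afterwards."""
    pod_id = start()  # billing would begin here
    try:
        yield pod_id
    finally:
        terminate(pod_id)  # runs on success and on failure alike


class FakeProvider:
    """In-memory stand-in for a GPU rental API (hypothetical)."""

    def __init__(self) -> None:
        self.running: set[str] = set()
        self._count = 0

    def start(self) -> str:
        self._count += 1
        pod_id = f"pod-{self._count}"
        self.running.add(pod_id)
        return pod_id

    def terminate(self, pod_id: str) -> None:
        self.running.discard(pod_id)


def render_job(provider: FakeProvider) -> str:
    """Example job: the pod exists only for the duration of the with-block."""
    with ephemeral_gpu(provider.start, provider.terminate) as pod_id:
        return f"rendered on {pod_id}"
```

The `finally` clause is the point of the design: no code path out of the job leaves a pod running and billing.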
∞
Memory — the persistent brain
The difference between a chatbot and an operation. Four layers keep context across every session, and an automated sync loop makes any memory edit searchable by every agent within minutes, with freshness heartbeats running on local Gemma so the monitoring itself costs nothing.
cross-cutting · survives every session
Agents forget when the context window closes; persistent memory is what stops that. Four layers, one source of truth, kept in sync automatically.
irt-vault (Git) · source of truth
Human- & agent-readable memory as Markdown — standing orders + topic files — versioned in a private Git repo. Every change is committed & pushed; nothing is lost, everything is diffable.
Edge memory · Tier 1 · Cloudflare
Fast recall at the edge: D1 (session state) · KV (heartbeats) · Vectorize (embeddings), fronted by a Worker.
Graphiti + FalkorDB · Tier 2 · knowledge graph
Entities & relationships with semantic search. Auto-ingests the vault, so agents can ask “what do we know about X?” — not just grep files.
Paperclip · Tier 3 · orchestration
Operational memory: the task board, agent roster, who-owns-what, on Postgres. The “what’s happening right now” layer.
Sync checks, freshness heartbeats, and health canaries run on the Mini’s local Gemma — not a paid frontier model. Constant background monitoring with effectively zero token cost; it only escalates to the cloud when something’s actually wrong.
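A freshness heartbeat of this kind can be sketched as a pure staleness check that escalates only when a layer has fallen behind. The layer names, the ten-minute threshold in the usage note, and the `escalate` callback are illustrative assumptions; the page does not specify the real sync loop's thresholds or alert path.

```python
from datetime import datetime, timedelta
from typing import Callable


def stale_layers(
    last_sync: dict[str, datetime], now: datetime, max_age: timedelta
) -> list[str]:
    """Return the memory layers whose last successful sync is older than max_age."""
    return sorted(name for name, ts in last_sync.items() if now - ts > max_age)


def heartbeat(
    last_sync: dict[str, datetime],
    now: datetime,
    max_age: timedelta,
    escalate: Callable[[list[str]], None],
) -> list[str]:
    """Check freshness locally; call escalate only if something is stale."""
    stale = stale_layers(last_sync, now, max_age)
    if stale:
        escalate(stale)  # the only step that would cost anything
    return stale
```

With `max_age=timedelta(minutes=10)`, a layer last synced 45 minutes ago is reported and escalated, while a fully fresh set of layers produces no escalation call at all.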
Model Routing: two-tier, via AI Gateway
Cost engineering in one diagram. Cheap and local models do the high-volume work; frontier models make the decisions and sign off anything client-facing. Quality where it counts, pennies everywhere else, with automatic failover across three frontier vendors.
Tier 1 — decisions & client-facing
Opus 4.8 (default) → failover GPT-5.5 / Codex → Grok
Tier 2 — high-volume execution
open-weight default (Mini local / OpenRouter), Opus on escalation
Utility — embeddings / transcription / heartbeats
Mini local Gemma, always (token-free monitoring)
Media / 3D / video
Runpod / FAL (ephemeral GPU)
The rule: client-facing output is always frontier-model quality, regardless of which model drafted it. Cheap models do volume; the frontier signs off.
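The routing rule and failover chain could be sketched as below. Several things are assumptions here: the model strings are placeholder labels rather than real API identifiers, `call` is a hypothetical function that sends a prompt to a named model, and "Opus on escalation" is simplified to a fallback on failure.

```python
from typing import Callable

# Ordered preference per tier; earlier entries are tried first.
ROUTES: dict[str, list[str]] = {
    "decision": ["opus-4.8", "gpt-5.5", "grok"],
    "execution": ["open-weight", "opus-4.8"],
    "utility": ["mini-gemma"],
}

CallFn = Callable[[str, str], str]  # (model, prompt) -> reply


def call_with_failover(chain: list[str], prompt: str, call: CallFn) -> tuple[str, str]:
    """Try each model in order; return (model, reply) from the first that succeeds."""
    last_error: Exception | None = None
    for model in chain:
        try:
            return model, call(model, prompt)
        except Exception as exc:  # any failure moves on to the next vendor
            last_error = exc
    raise RuntimeError(f"all models failed: {chain}") from last_error


def produce(kind: str, prompt: str, client_facing: bool, call: CallFn) -> str:
    """Draft on the tier's own chain; add a frontier sign-off if client-facing."""
    _, text = call_with_failover(ROUTES[kind], prompt, call)
    if client_facing and kind != "decision":
        review_prompt = f"Review and correct:\n{text}"
        _, text = call_with_failover(ROUTES["decision"], review_prompt, call)
    return text
```

Keeping the sign-off as a second pass lets a cheap model write the draft while a frontier model still has the last word on anything a client sees.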