IRON Fleet — AI Operations Architecture

The doctrine in one line: “If Roberto closes the laptop, does it keep working?” — anything that must answer yes lives on the Fleet or the edge, never on the laptop.

Operator & Direction

The human-in-the-loop layer. The architect designs and reviews — he doesn’t execute. Work is relayed down to agents, so the laptop can close at night without anything stopping.

not 24/7 · steers, doesn't serve

Air: JA1 · MacBook

The cockpit. Where direction, review, and Claude Code sessions happen. It relays every substantive command down to the CEO agent — it is not a server.

Claude Code sessions & QA
Relay-first: hands work to the agent fleet
Closes at night — nothing breaks

Mini: JA2 · Mac M4 Pro

An always-on local-AI inference node. Bare metal (no containers) running open models for cheap, private, high-volume work.

Ollama serving Gemma / Qwen
Embeddings, transcription, draft generation
Health heartbeats & canaries — token-free
Internal/dev inference — not client production
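
A minimal sketch of how a caller on the tailnet might ask the Mini for a draft, assuming Ollama is listening on its default port and using its documented /api/generate endpoint. The host, the model tag, and both function names are placeholders, not taken from the Fleet's actual setup.

```python
import json
import urllib.request

# Assumption: Ollama's default local address; the real Mini hostname is not given.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

Calling generate() with whichever Gemma or Qwen tag has been pulled on the Mini returns the draft text. No paid tokens are involved, which is why this path suits high-volume internal work.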

Air (relay) → CEO · Mini (inference) → Fleet

Edge · Security · Data Plane

A zero-trust front door. No server port is open to the internet — every request enters through an encrypted Tunnel and is authenticated by Access (Google SSO) before it reaches the Fleet. The edge also holds state (D1/KV/R2) and fronts the AI gateway.

a first-class dependency, not just DNS

Cloudflare: the spine

Every internal UI and origin sits behind it. Nothing on the Fleet is exposed to the raw internet — traffic enters through Tunnels and is gated by Access (Google SSO).

DNS · WAF
Access (SSO)
Tunnels
Workers · Pages
D1 · KV · R2
AI Gateway → OpenRouter
cron audit worker
cross-agent sync bus

Cloudflare (Tunnel + Access) → every Fleet node
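
As defence in depth, an origin behind Access can refuse any request that lacks the Cf-Access-Jwt-Assertion header Cloudflare attaches. The sketch below is a pre-check only: it reads the token's claims without verifying the signature, so a real origin would still need to validate the JWT against the team's Access signing keys. The function names and the expected_aud value are illustrative assumptions.

```python
import base64
import json
import time
from typing import Mapping, Optional

# Header that Cloudflare Access adds to requests it has authenticated.
ACCESS_HEADER = "Cf-Access-Jwt-Assertion"


def _decode_claims(token: str) -> Optional[dict]:
    """Decode the JWT payload segment WITHOUT verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    payload = parts[1] + "=" * (-len(parts[1]) % 4)  # restore base64url padding
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return None
    return claims if isinstance(claims, dict) else None


def precheck_access(headers: Mapping[str, str], expected_aud: str,
                    now: Optional[float] = None) -> bool:
    """Cheap gate: header present, audience matches, token not expired.

    Assumes `headers` is keyed by the exact header name above.
    This is NOT a substitute for signature verification.
    """
    token = headers.get(ACCESS_HEADER)
    if not token:
        return False
    claims = _decode_claims(token)
    if claims is None:
        return False
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else (aud if isinstance(aud, list) else [])
    exp = claims.get("exp")
    current = time.time() if now is None else now
    return expected_aud in audiences and isinstance(exp, (int, float)) and exp > current
```

A request that somehow reached a Fleet node without passing through Access would carry no such header and would be rejected before any application code runs.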

The Fleet

Three inexpensive CPU VPSes, each with one job: control plane, core runtime, and executive agents. Separation of concerns means a problem on one node doesn’t cascade — and the whole production surface runs 24/7 independent of any laptop.

CPU VPSes · all production · 24/7

iron-pc

Control Plane / CEO

The brain that routes work. Holds the task board and the agent org.

Paperclip app + Postgres + watchdog
CEO + 17 agents (migrating to Hermes)
Label reconciler cron · SRE canary

iron-worker-02

Core Runtime / NB OS

The workhorse. Runs the product and the client-facing services.

Dashboard + voice proxy · voice core
MCP bridge · Graphiti / FalkorDB
CMO marketing swarm · client app containers
Outbound success monitor · scheduled audits

iron-vps-01

Broker + Hermes C-Suite

Locked-down egress broker plus the executive agent pods.

Egress broker (tailnet only)
Hermes Engineering / CTO
Hermes HR · Hermes COO · Hermes CFO

org wiring: CEO → COO → { CMO · CTO · HR } · CFO → CEO
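
The wiring line can be read as a small directed graph. The sketch below encodes one possible reading, in which every arrow is a directed edge; the source does not say whether an arrow means delegation or reporting, so both the edge table and the helper name are assumptions.

```python
from collections import deque

# One possible reading of the wiring line: each arrow is a directed edge.
ORG_EDGES = {
    "CEO": ["COO"],
    "COO": ["CMO", "CTO", "HR"],
    "CFO": ["CEO"],
    "CMO": [],
    "CTO": [],
    "HR": [],
}


def downstream(start: str) -> list[str]:
    """Return every role reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        role = queue.popleft()
        for nxt in ORG_EDGES.get(role, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order
```

Under this reading, downstream("CEO") yields COO first and then the three pods beneath it, while downstream("HR") is empty.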

Elastic GPU

GPU is the priciest resource, so it’s never left idling. Rented per-job for media, 3D, video, and one-off model evals, then terminated — heavy compute on tap without a standing monthly bill.

on-demand · ephemeral · never standing

Runpod: rented GPU

Spun up for jobs too big for the Mini — media, 3D, video, one-off open-model evals. Pay-per-hour, terminated after each job.
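
The "terminated after each job" guarantee is the kind of thing a try/finally wrapper can enforce. The sketch below uses a hypothetical GpuProvider interface and an in-memory FakeProvider so it runs anywhere; none of these names come from the Runpod API.

```python
from contextlib import contextmanager
from typing import Iterator, Protocol


class GpuProvider(Protocol):
    """Hypothetical provider interface; not the Runpod API."""

    def launch(self, gpu_type: str) -> str: ...

    def terminate(self, pod_id: str) -> None: ...


class FakeProvider:
    """In-memory stand-in so the sketch runs without any cloud account."""

    def __init__(self) -> None:
        self.running: set[str] = set()
        self._counter = 0

    def launch(self, gpu_type: str) -> str:
        self._counter += 1
        pod_id = f"{gpu_type}-{self._counter}"
        self.running.add(pod_id)
        return pod_id

    def terminate(self, pod_id: str) -> None:
        self.running.discard(pod_id)


@contextmanager
def ephemeral_gpu(provider: GpuProvider, gpu_type: str) -> Iterator[str]:
    """Launch a pod for one job and always terminate it, even if the job fails."""
    pod_id = provider.launch(gpu_type)
    try:
        yield pod_id
    finally:
        provider.terminate(pod_id)
```

Wrapping each job in `with ephemeral_gpu(provider, gpu_type) as pod:` means a crashed render cannot leave a pod billing by the hour.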

FAL: managed media

Managed media generation on the owned pipeline. Pay-per-job — no always-on GPU bill.

Memory — the persistent brain

The difference between a chatbot and an operation. Four layers keep context across every session, and an automated sync loop makes any memory edit searchable by every agent within minutes — with freshness heartbeats running on local Gemma so the monitoring itself costs nothing.

cross-cutting · survives every session

Agents forget when the context window closes. This layer is what makes them remember: four layers, one source of truth, kept in sync automatically.

irt-vault (Git): source of truth

Human- & agent-readable memory as Markdown — standing orders + topic files — versioned in a private Git repo. Every change is committed & pushed; nothing is lost, everything is diffable.

Edge memory: Tier 1 · Cloudflare

Fast recall at the edge: D1 (session state) · KV (heartbeats) · Vectorize (embeddings), fronted by a Worker.

Graphiti + FalkorDB: Tier 2 · knowledge graph

Entities & relationships with semantic search. Auto-ingests the vault, so agents can ask “what do we know about X?” — not just grep files.

Paperclip: Tier 3 · orchestration

Operational memory: the task board, agent roster, who-owns-what, on Postgres. The “what’s happening right now” layer.

sync loop: edit → git push (vault) → fleet pulls → Graphiti ingests → R2 backup

Heartbeats on local Gemma: token-saver

Sync checks, freshness heartbeats, and health canaries run on the Mini’s local Gemma — not a paid frontier model. Constant background monitoring with effectively zero token cost; it only escalates to the cloud when something’s actually wrong.
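
The "only escalates when something's actually wrong" gate can be sketched as a plain freshness check. The 600-second threshold is an assumption standing in for "within minutes", and the Heartbeat type and check_freshness name are illustrative; how the Mini's Gemma model is actually wired into the check is not specified in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heartbeat:
    status: str         # "ok" or "stale"
    age_seconds: float  # time since the last successful sync
    escalate: bool      # True only when a cloud escalation is warranted


def check_freshness(last_sync_ts: float, now_ts: float,
                    max_age_seconds: float = 600.0) -> Heartbeat:
    """Compare the last sync time to now and flag staleness.

    max_age_seconds is a hypothetical threshold, not a documented value.
    """
    age = max(0.0, now_ts - last_sync_ts)
    stale = age > max_age_seconds
    return Heartbeat("stale" if stale else "ok", age, stale)
```

While the result stays "ok", nothing leaves the Mini. Only a stale result would justify spending tokens on a frontier model.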

Model Routing — two-tier, via AI Gateway

Cost engineering in one diagram. Cheap and local models do the high-volume work; frontier models make the decisions and sign off anything client-facing. Quality where it counts, pennies everywhere else — with automatic failover across three frontier vendors.

Tier 1 — decisions & client-facing	Opus 4.8 (default) → failover GPT-5.5 / Codex → Grok
Tier 2 — high-volume execution	open-weight default (Mini local / OpenRouter), Opus on escalation
Utility — embeddings / transcription / heartbeats	Mini local Gemma, always (token-free monitoring)
Media / 3D / video	Runpod / FAL (ephemeral GPU)

The rule: client-facing output is always frontier-model quality, regardless of which model drafted it. Cheap models do volume; the frontier signs off.
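
A minimal sketch of the routing table and the sign-off rule, assuming each tier is an ordered chain tried front to back. The model labels are copied from the table; the tier keys and function names are hypothetical, and Tier 2's "Opus on escalation" is modeled as a simple fallback for brevity.

```python
from typing import Callable, Dict, List

# Ordered chains per tier; labels come from the routing table.
ROUTES: Dict[str, List[str]] = {
    "tier1": ["Opus 4.8", "GPT-5.5 / Codex", "Grok"],
    # Assumption: escalation is treated as the next entry in the chain.
    "tier2": ["open-weight (Mini local / OpenRouter)", "Opus 4.8"],
    "utility": ["Mini local Gemma"],
}


def call_with_failover(tier: str, call: Callable[[str], str]) -> tuple[str, str]:
    """Try each model in the tier's chain; return (model, output) from the first success."""
    errors = []
    for model in ROUTES[tier]:
        try:
            return model, call(model)
        except Exception as exc:
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def needs_frontier_signoff(client_facing: bool, drafted_by_tier: str) -> bool:
    """Client-facing output drafted below Tier 1 needs a frontier review pass."""
    return client_facing and drafted_by_tier != "tier1"
```

Here `call` stands in for whatever function sends a prompt through the AI Gateway. If the first Tier 1 vendor raises, the next one in the chain is tried with the same prompt.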

