Iron Noodle · NB OS

IRON Fleet — AI Operations Architecture

How a one-architect shop runs a 24/7 multi-agent operation: a laptop that steers, an edge that gates, a three-node fleet that executes, and GPU that appears only when summoned.

Conceptual view · vendor/tech names shown · IPs & ports redacted for sharing
The doctrine in one line: “If Roberto closes the laptop, does it keep working?” — anything that must answer yes lives on the Fleet or the edge, never on the laptop.
1. Operator & Direction

The human-in-the-loop layer. The architect designs and reviews; he doesn’t execute. Work is relayed down to agents, so the laptop can close at night without anything stopping.
not 24/7 · steers, doesn’t serve

Air
JA1 · MacBook

The cockpit. Where direction, review, and Claude Code sessions happen. It relays every substantive command down to the CEO agent — it is not a server.

  • Claude Code sessions & QA
  • Relay-first: hands work to the agent fleet
  • Closes at night — nothing breaks
Mini
JA2 · Mac M4 Pro

An always-on local-AI inference node. Bare metal (no containers) running open models for cheap, private, high-volume work.

  • Ollama serving Gemma / Qwen
  • Embeddings, transcription, draft generation
  • Health heartbeats & canaries — token-free
  • Internal/dev inference — not client production
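
As a rough illustration of what "Ollama serving Gemma / Qwen" looks like from a caller's side, here is a minimal sketch against Ollama's HTTP generate endpoint. The URL is Ollama's default local address and is an assumption, since the real host and port are redacted. Any model tag passed in (for example `gemma3`) is also hypothetical, because the source does not say which Gemma or Qwen builds are installed.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint. The real address is redacted in the source.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 30.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Building the request separately from sending it keeps the payload easy to inspect without a running server.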
Air (relay) → CEO · Mini (inference) → Fleet
2. Edge · Security · Data Plane

A zero-trust front door. No server port is open to the internet: every request enters through an encrypted Tunnel and is authenticated by Access (Google SSO) before it reaches the Fleet. The edge also holds state (D1/KV/R2) and fronts the AI gateway.
a first-class dependency, not just DNS

Cloudflare
the spine

Every internal UI and origin sits behind it. Nothing on the Fleet is exposed to the raw internet — traffic enters through Tunnels and is gated by Access (Google SSO).

  • DNS · WAF
  • Access (SSO)
  • Tunnels
  • Workers · Pages
  • D1 · KV · R2
  • AI Gateway → OpenRouter
  • cron audit worker
  • cross-agent sync bus
Cloudflare (Tunnel + Access) → every Fleet node
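
The two-gate rule (Tunnel first, then Access) can be pictured as a deny-by-default check. This is a toy model of the policy, not Cloudflare's API. The `Request` type, the `admit` function, and the `example.com` allow-list are all hypothetical, since the real Access policy is not given here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    via_tunnel: bool                 # arrived through the encrypted tunnel, not a raw open port
    sso_email: Optional[str] = None  # identity asserted by the SSO step, if any


# Hypothetical allow-list; the real Access policy is not described in the source.
ALLOWED_DOMAIN = "example.com"


def admit(req: Request) -> bool:
    """Deny by default: a request reaches the Fleet only if both gates pass."""
    if not req.via_tunnel:
        return False  # nothing is served on a raw internet-facing port
    if req.sso_email is None:
        return False  # unauthenticated traffic stops at the edge
    return req.sso_email.lower().endswith("@" + ALLOWED_DOMAIN)
```

The point of the shape is that every failure path returns `False`; only a request that clears both checks is let through.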
3. The Fleet

Three inexpensive CPU VPSes, each with one job: control plane, core runtime, and executive agents. Separation of concerns means a problem on one node doesn’t cascade, and the whole production surface runs 24/7 independent of any laptop.
CPU VPSes · all production · 24/7
iron-pc
Control Plane / CEO

The brain that routes work. Holds the task board and the agent org.

  • Paperclip app + Postgres + watchdog
  • CEO + 17 agents (migrating to Hermes)
  • Label reconciler cron · SRE canary
iron-worker-02
Core Runtime / NB OS

The workhorse. Runs the product and the client-facing services.

  • Dashboard + voice proxy · voice core
  • MCP bridge · Graphiti / FalkorDB
  • CMO marketing swarm · client app containers
  • Outbound success monitor · scheduled audits
iron-vps-01
Broker + Hermes C-Suite

Locked-down egress broker plus the executive agent pods.

  • Egress broker (tailnet only)
  • Hermes Engineering / CTO
  • Hermes HR · Hermes COO · Hermes CFO
org wiring: CEO → COO → { CMO · CTO · HR } · CFO → CEO
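
The wiring line can be read as a small directed graph. The sketch below copies the arrows literally. What an arrow means (delegation or reporting) is not stated, so the `downstream` helper is only an illustrative way to walk the edges, and its name is hypothetical.

```python
from collections import deque

# Directed edges copied literally from the wiring line. Whether an arrow
# means "delegates to" or "reports to" is an open assumption.
WIRING = {
    "CEO": ["COO"],
    "COO": ["CMO", "CTO", "HR"],
    "CFO": ["CEO"],
}


def downstream(start: str) -> list[str]:
    """Return every role reachable from `start` by following arrows, breadth-first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        role = queue.popleft()
        for nxt in WIRING.get(role, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order
```

For example, `downstream("CEO")` walks to the COO first and then to the three roles beneath it, while a leaf such as `"HR"` has nothing downstream.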
4. Elastic GPU

GPU is the priciest resource, so it’s never left idling. It is rented per-job for media, 3D, video, and one-off model evals, then terminated: heavy compute on tap without a standing monthly bill.
on-demand · ephemeral · never standing

Runpod
rented GPU

Spun up for jobs too big for the Mini — media, 3D, video, one-off open-model evals. Pay-per-hour, terminated after each job.

FAL
managed media

Managed media generation on the owned pipeline. Pay-per-job — no always-on GPU bill.
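
One way to make "terminated after each job" hard to forget is to tie the rental to a scope that always cleans up. The sketch below uses a stand-in provider. `FakeGpuProvider` and `rented_gpu` are hypothetical names and do not reflect the Runpod or FAL SDKs.

```python
from contextlib import contextmanager


class FakeGpuProvider:
    """Stand-in for a GPU rental API. Hypothetical: not Runpod's or FAL's SDK."""

    def __init__(self):
        self.running = set()
        self._next_id = 0

    def launch(self) -> str:
        self._next_id += 1
        pod_id = f"pod-{self._next_id}"
        self.running.add(pod_id)
        return pod_id

    def terminate(self, pod_id: str) -> None:
        self.running.discard(pod_id)


@contextmanager
def rented_gpu(provider):
    """Rent a pod for one job and always terminate it, even if the job raises."""
    pod_id = provider.launch()
    try:
        yield pod_id
    finally:
        provider.terminate(pod_id)
```

A job then runs inside `with rented_gpu(provider) as pod_id:`, and the `finally` clause ends the rental whether the job succeeds or fails.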

Memory — the persistent brain

The difference between a chatbot and an operation. Four layers keep context across every session, and an automated sync loop makes any memory edit searchable by every agent within minutes, with freshness heartbeats running on local Gemma so the monitoring itself costs nothing.
cross-cutting · survives every session

Agents forget when the context window closes. This layer is what stops that from happening: four layers, one source of truth, kept in sync automatically.

irt-vault (Git)
source of truth

Human- & agent-readable memory as Markdown — standing orders + topic files — versioned in a private Git repo. Every change is committed & pushed; nothing is lost, everything is diffable.

Edge memory
Tier 1 · Cloudflare

Fast recall at the edge: D1 (session state) · KV (heartbeats) · Vectorize (embeddings), fronted by a Worker.

Graphiti + FalkorDB
Tier 2 · knowledge graph

Entities & relationships with semantic search. Auto-ingests the vault, so agents can ask “what do we know about X?” — not just grep files.

Paperclip
Tier 3 · orchestration

Operational memory: the task board, agent roster, who-owns-what, on Postgres. The “what’s happening right now” layer.

sync loop: edit → git push (vault) → fleet pulls → Graphiti ingests → R2 backup

Heartbeats on local Gemma
token-saver

Sync checks, freshness heartbeats, and health canaries run on the Mini’s local Gemma — not a paid frontier model. Constant background monitoring with effectively zero token cost; it only escalates to the cloud when something’s actually wrong.
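
The "only escalates when something’s actually wrong" behavior can be sketched as a cheap local freshness check with an escalation hook. This is a minimal sketch under assumptions: the 15-minute threshold is a guess at "within minutes", the `Heartbeat` type and function names are hypothetical, and the local-model step is left out so that only the escalation logic is shown.

```python
from dataclasses import dataclass

# Assumed threshold; the source only says edits become searchable "within minutes".
MAX_AGE_SECONDS = 15 * 60


@dataclass(frozen=True)
class Heartbeat:
    component: str
    last_seen: float  # Unix timestamp of the most recent successful sync


def stale_components(beats, now, max_age=MAX_AGE_SECONDS):
    """Return the names of components whose last heartbeat is older than max_age."""
    return [b.component for b in beats if now - b.last_seen > max_age]


def check(beats, now, escalate):
    """Run the cheap local check; call `escalate` only when something is stale."""
    stale = stale_components(beats, now)
    if stale:
        escalate(stale)  # the only path that would spend paid tokens
    return stale
```

When every component is fresh, `escalate` is never called, which is what keeps the steady-state cost near zero.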

Model Routing: two-tier, via AI Gateway

Cost engineering in one diagram. Cheap and local models do the high-volume work; frontier models make the decisions and sign off anything client-facing. Quality where it counts, pennies everywhere else, with automatic failover across three frontier vendors.

Tier 1 (decisions & client-facing): Opus 4.8 (default) → failover GPT-5.5 / Codex → Grok
Tier 2 (high-volume execution): open-weight default (Mini local / OpenRouter), Opus on escalation
Utility (embeddings / transcription / heartbeats): Mini local Gemma, always (token-free monitoring)
Media / 3D / video: Runpod / FAL (ephemeral GPU)
The rule: client-facing output is always frontier-model quality, regardless of which model drafted it. Cheap models do volume; the frontier signs off.
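
The Tier 1 failover chain can be sketched as "try each model in order, return the first success". The model strings below are illustrative labels, not real API model IDs. `route` and `call_model` are hypothetical names, and Tier 2's escalation to Opus is left out because the source does not say what triggers it.

```python
# Chains taken from the routing table; the strings are illustrative labels,
# not real API model identifiers.
ROUTES = {
    "tier1": ["opus-4.8", "gpt-5.5", "grok"],
    "utility": ["local-gemma"],
}


def route(task_kind: str, prompt: str, call_model):
    """Try each model in the chain for task_kind.

    Returns (model, output) from the first model that succeeds and raises
    RuntimeError if every model in the chain fails.
    """
    errors = []
    for model in ROUTES[task_kind]:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # a real router would catch narrower error types
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Passing `call_model` in as a parameter keeps the routing logic independent of any particular gateway client.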

Want to remix it?

Download the editable source and drag it onto excalidraw.com — fully editable, no account needed.