Building Bosun: a chief-of-staff for a crew of coding agents

Tags: agents, orchestration, omnigent, ai

I’ve been having a lot of fun with agentic coding lately, honestly a bit too much. Somewhere along the way I looked up and realised I was running four projects at once: GuideX, a travel-tech marketplace; DXJ, a weekly stock-signal engine; Cygnal, a real-time fraud platform; and the agent harness sitting underneath all of them. Each had its own repo, its own coding agents, its own slice of context living in my head.

That last part, the context that lived only in my head, is where it started to hurt.

The overload nobody warns you about

The dirty secret of agentic coding is that the model stops being the bottleneck almost immediately. Claude, Codex, Gemini: they’re all good enough to do real work. The bottleneck becomes you.

With one agent you’re a pair programmer. With a dozen you’re an air-traffic controller with no radar. My day had become a carousel of terminal tabs: this one’s waiting on my approval, that one finished ten minutes ago and is sitting idle, a third has quietly gone off the rails and I won’t notice for an hour. I was spending more time remembering what each agent was doing than actually directing the work.

The worst of it was the handoffs. I’d want one agent to pick up where another left off: take this plan, here’s the context, go. But there was no clean way to pass the baton. I was copy-pasting state between windows like it was 2010.

Enter Omnigent

Around then, Databricks announced Omnigent, an open-source control plane for coding agents, and the pitch landed for me immediately. Instead of N disconnected sessions, you get one substrate that can spawn, isolate and track them. Each agent runs as a session in its own git worktree, so they never step on each other, and you can reach the whole fleet from anywhere, including your phone.

I got my hands dirty fast, building my own agent pack on top of it: Cynthia, a governed multi-model workflow where Claude plans and reviews, Codex writes, and nothing ships without clearing an external gate. Omnigent gave me the isolation and the plumbing, but it also surfaced a new question: who’s driving?

Omnigent could run a crew, but on its own it didn’t give me a single place to talk to it from. I was still attaching to each session by hand to see what each one was doing.

The chief-of-staff idea

That’s the gap Bosun fills. The name is deliberate: a bosun (boatswain) is the officer who runs the ship’s crew, organising the work on deck rather than steering from the wheel. I wanted exactly that: one always-on agent I talk to, that turns around and directs everything else.

I borrowed the shape openly from Kun Chen’s Firstmate and its line, “talk to one agent, ship with a crew.” Firstmate is a prompt-OS chief-of-staff; Bosun is my own take on the same instinct, wired specifically to drive Omnigent crewmates from where I actually live: a chat thread on my phone.

Here’s the whole system at a glance.

[Figure: You (one chat thread), Bosun (deterministic loop), Kanban (the state), and three crewmates (GuideX, DXJ, Cygnal).]
You talk to Bosun · Bosun dispatches crewmates & gates the risky moves · the crew reports back

You talk to Bosun. Bosun reads Kanban boards, one per project, then dispatches crewmates (Omnigent worktree sessions) to do the work, watches them, and reports back. That’s the whole loop. The rest of this post is about the parts I deliberately kept simple, starting with the orchestration itself.

What’s actually under the hood

Bosun isn’t an agent I wrote from scratch. It’s a thin layer over two things that already existed.

The runtime is Hermes, NousResearch’s agent framework. Hermes already knew how to be an always-on assistant: a Telegram front-door so I can drive it from my phone, a tool-calling loop, persistent memory, and a kanban dispatcher I could bend to my needs. Bosun is essentially a Hermes profile with a custom outer loop bolted on top.

The brain is GLM-5.2, running inside Hermes over OpenRouter. I picked it for two things: a 1M-token context window, so a crewmate’s whole world fits in a single prompt, and genuinely strong tool use. (When spend spiked, I temporarily dropped to a much cheaper DeepSeek model as a stop-gap; more on that below.)

So the stack stays deliberately small: Hermes for the runtime, GLM-5.2 for the reasoning, Omnigent for the isolated crewmates, and a few hundred lines of plain Python for the outer loop that ties them together.

The trick: keep the loop dumb

Here’s the decision everything else hangs on: the orchestration loop is plain code, not an LLM.

It’s tempting to make the orchestrator itself an agent and let a model decide what runs next. I tried it. It’s expensive, non-deterministic, and it fails in ways you can’t debug. So Bosun’s outer loop is boring on purpose: a deterministic supervisor that wakes every 90 seconds, does two things, and goes back to sleep.

[Figure: a cycle that runs every 90s through two stages, WATCH (every crewmate) and DISPATCH (claim ready work).]
One deterministic loop, plain code, no magic: WATCH then DISPATCH, every 90s, forever

WATCH: look at every crewmate. Did one finish? Fail? Go quiet? Is one waiting on me? Write the answer back to its card, and ping me only if I’m actually needed.

DISPATCH: look at the boards. Any card marked ready that nobody’s working? Claim it, spawn a crewmate, point it at the repo.

The intelligence, the part that reasons and writes code, lives in the inner loops, inside each crewmate. The outer loop just schedules. Keeping autonomy in the inner loop and determinism in the outer one is the single most important thing I got right. It means I can reason about scheduling and cost at the outer layer without a model quietly changing the plan underneath me.
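To make the shape of that outer loop concrete, here is a minimal sketch of a WATCH-then-DISPATCH tick in plain Python. It is an illustration under stated assumptions, not Bosun's actual source: `Card`, `Crewmate`, `watch`, `dispatch`, `tick` and `run_forever` are hypothetical names, the board is modelled as an in-memory dict, and `spawn` and `notify` stand in for whatever really starts an Omnigent session and sends a chat message. Detecting a crewmate that has gone quiet is left out for brevity.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Card:
    id: str
    status: str = "todo"  # todo -> ready -> claimed -> done
    notes: list = field(default_factory=list)


@dataclass
class Crewmate:
    card_id: str
    state: str = "running"  # running | finished | failed | waiting (assumed states)


def watch(cards, crew, notify):
    """WATCH: record each crewmate's state on its card; ping only when needed."""
    for mate in list(crew):  # iterate over a copy because we remove entries
        card = cards[mate.card_id]
        changed = not card.notes or card.notes[-1] != mate.state
        if changed:
            card.notes.append(mate.state)
        if mate.state == "finished":
            card.status = "done"
            crew.remove(mate)
        elif mate.state == "failed":
            notify(f"{card.id}: crewmate failed")
            crew.remove(mate)  # card stays 'claimed' until a human decides
        elif mate.state == "waiting" and changed:
            notify(f"{card.id}: crewmate is waiting on you")  # ping once, not every tick


def dispatch(cards, crew, spawn):
    """DISPATCH: claim every ready card and spawn a crewmate for it."""
    for card in cards.values():
        if card.status == "ready":
            card.status = "claimed"
            crew.append(spawn(card))


def tick(cards, crew, notify, spawn):
    """One pass of the loop: no model calls, just bookkeeping."""
    watch(cards, crew, notify)
    dispatch(cards, crew, spawn)


def run_forever(cards, crew, notify, spawn, interval_s=90):
    """Wake, tick, sleep. The 90-second interval comes from the post."""
    while True:
        tick(cards, crew, notify, spawn)
        time.sleep(interval_s)
```

Because `tick` is an ordinary function with no model in it, the same inputs always produce the same scheduling decisions, which is what makes the outer layer testable.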

The board is the database

Bosun has no separate state store. The Kanban board is the state. Each project gets a board; each unit of work is a card that moves todo → ready → claimed → done. A crewmate writes its progress straight onto its card.

[Figure: board columns todo (queued), ready (claimable), claimed (in-flight) and done (shipped), with cards moving across them.]
The board is the durable state · a card is cargo that slides todo → ready → claimed → done

Two properties fall out of this for free. First, only ready cards are ever claimed, so the backlog can never auto-run. Nothing starts without me deliberately marking it ready. Second, if Bosun crashes, it rebuilds its entire worldview from card status on the next tick. There’s no separate in-memory plan or cache that can drift out of sync with the board, so a restart costs a tick, not a recovery procedure. This is the Karpathy idea taken literally: the repo (here, the board) is the source of truth; the agent is allowed to forget.
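A small sketch can make both properties visible. It assumes a board is just a mapping from card id to status and that cards only ever move one step forward; `ALLOWED`, `move` and `rebuild_worldview` are hypothetical names for illustration, not part of Hermes or Omnigent.

```python
# Forward-only transitions taken from the post: todo -> ready -> claimed -> done.
ALLOWED = {
    "todo": {"ready"},
    "ready": {"claimed"},
    "claimed": {"done"},
    "done": set(),
}


def move(board, card_id, new_status):
    """Advance a card, refusing any jump the state machine does not allow."""
    current = board[card_id]
    if new_status not in ALLOWED[current]:
        raise ValueError(f"{card_id}: cannot move {current} -> {new_status}")
    board[card_id] = new_status


def rebuild_worldview(board):
    """After a restart, derive everything the loop needs from card status alone."""
    return {
        "in_flight": sorted(c for c, s in board.items() if s == "claimed"),
        "claimable": sorted(c for c, s in board.items() if s == "ready"),
    }


board = {"fix-login": "todo", "add-tests": "claimed", "bump-deps": "ready"}
view = rebuild_worldview(board)
```

Here `view` lists `add-tests` as in flight and `bump-deps` as claimable, while `fix-login` stays untouched because a `todo` card cannot jump straight to `claimed`.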

The part that lets me sleep

Handing autonomous agents a shell is exactly as scary as it sounds. So every crewmate runs behind a tiered, fail-closed approval gate. Routine, worktree-local work runs free. But anything with blast radius (force-push, push to main, rm -r, a deploy, anything touching secrets or the network) pauses and surfaces the actual command to my phone for a tap-to-approve.

[Figure: a phone showing a tap-to-approve prompt for a force-push, first held and waiting, then approved.]
High-blast-radius commands halt at the gate & pulse · they pass only on your tap · fail-closed

“Fail-closed” is the important half: if the gate can’t be attached, the crewmate doesn’t spawn. I live-verified both paths, approve and deny, end-to-end before I ever let it dispatch on its own: I watched it actually halt a force-push and wait for my tap, rather than trusting that it would.
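As a rough illustration of the two ideas in this section, tiering and failing closed, here is a sketch in Python. Everything in it is an assumption made for the example: the regex list is a deliberately small stand-in for a real policy, and `needs_approval`, `run_gated` and `spawn_crewmate` are hypothetical names rather than Omnigent's API.

```python
import re

# Illustrative patterns only; a real policy would be broader and more careful.
HIGH_RISK_PATTERNS = [
    r"\bgit\s+push\b.*(--force|-f\b)",  # force-push
    r"\bgit\s+push\b.*\bmain\b",        # push to main
    r"\brm\s+-\w*r",                    # rm -r, rm -rf, rm -fr
    r"\bdeploy\b",                      # anything that looks like a deploy
]


def needs_approval(command):
    """Tiering: routine commands run free, high-blast-radius ones need a tap."""
    return any(re.search(pattern, command) for pattern in HIGH_RISK_PATTERNS)


def run_gated(command, ask_human, execute):
    """Run a command, pausing for approval when it is high risk."""
    if needs_approval(command):
        try:
            approved = ask_human(command)  # surfaces the actual command
        except Exception:
            approved = False  # fail-closed: no answer means no
        if not approved:
            return "denied"
    execute(command)
    return "ran"


def spawn_crewmate(attach_gate, start):
    """Fail-closed at spawn time: no gate, no crewmate."""
    try:
        gate = attach_gate()
    except Exception as exc:
        raise RuntimeError("gate unavailable; refusing to spawn") from exc
    return start(gate)
```

The design choice worth noticing is that every error path resolves to "do nothing": an unreachable approver counts as a denial, and a gate that cannot attach stops the spawn rather than letting the crewmate run ungated.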

Don’t let the crew mark its own homework

An agent grading its own work almost always passes itself: in practice it reports “done” on code that doesn’t compile, or quietly skips the test that fails. So Bosun splits the roles: a generator crewmate does the work, and a separate evaluator crewmate reviews it, primed to assume it’s broken until proven otherwise and to verify with tools and tests before a card is allowed to move on.

[Figure: Generator (makes the work), Evaluator (second officer, runs the tests), and a card that advances only when proven. Default verdict: REJECT, assume broken until proven.]
Maker builds · an independent checker runs the tests · only a PASS lets the card advance, a REJECT loops it back

It’s the maker–checker pattern from finance, which is where I spend my day job. You don’t let the person who moved the money also sign off on it.
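The default-REJECT stance translates naturally into code. The sketch below is a hedged illustration, not Bosun's evaluator: `evaluate` and `advance` are hypothetical names, each check is assumed to be a callable that returns True on success (for example, a wrapper around a test run), and sending a rejected card back to the generator is modelled simply as an attempt counter.

```python
def evaluate(checks):
    """Return 'PASS' only if there is evidence and every check succeeds.

    The verdict starts at REJECT. No checks, a failing check, or a check
    that crashes all leave it there.
    """
    try:
        results = [bool(check()) for check in checks]
    except Exception:
        return "REJECT"  # a crashing check is not proof of anything
    if results and all(results):
        return "PASS"
    return "REJECT"


def advance(card, verdict):
    """Move the card on PASS; on REJECT, loop it back for another attempt."""
    if verdict == "PASS":
        card["status"] = "done"
    else:
        card["attempts"] = card.get("attempts", 0) + 1  # assumed bookkeeping
    return card


card = {"id": "add-tests", "status": "claimed"}
advance(card, evaluate([]))                          # no evidence: REJECT
advance(card, evaluate([lambda: True, lambda: True]))  # proven: PASS
```

Note that an empty list of checks is a REJECT, which is the difference between "nothing failed" and "it was proven to work".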

Watching the meter

Cost isn’t what Bosun is about, but the same long-lived agents that do the work can also run up a bill, re-sending a growing context to a pricey model over and over. So spend gets a guardrail of its own, much like the approval gate.

[Figure: spend ($) plotted over time, with an alert at the spike and a daily cap line.]
Spend climbs, a 🔔 pings at the spike, then telemetry & a daily cap flatten it out

The controls are deliberately boring: real-time spend telemetry that DMs me on a spike, a far cheaper default model with the expensive one reserved for hard tasks, opt-in dispatch so nothing auto-runs, and a hard daily cap on the API key as the backstop.
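Those controls could be sketched roughly as follows. This is an assumption-laden illustration: the post puts the hard cap on the API key itself, so the local `SpendGuard` here is only a mirror of that backstop, and the class, its thresholds and the placeholder model names in `pick_model` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SpendGuard:
    daily_cap_usd: float
    spike_usd_per_hour: float
    events: list = field(default_factory=list)  # (timestamp_s, cost_usd) for today

    def total(self):
        return sum(cost for _, cost in self.events)

    def record(self, now_s, cost_usd):
        """Log one API call and return any alerts it triggers."""
        self.events.append((now_s, cost_usd))
        alerts = []
        last_hour = sum(c for t, c in self.events if now_s - t < 3600)
        if last_hour > self.spike_usd_per_hour:
            alerts.append("spike")  # would become a DM
        if self.total() >= self.daily_cap_usd:
            alerts.append("cap")
        return alerts

    def may_dispatch(self):
        """Stop claiming new work once the day's budget is used up."""
        return self.total() < self.daily_cap_usd


def pick_model(hard_task):
    """Cheap by default; the expensive model is reserved for hard tasks."""
    return "expensive-model" if hard_task else "cheap-model"  # placeholder names


guard = SpendGuard(daily_cap_usd=10.0, spike_usd_per_hour=2.0)
first = guard.record(now_s=0, cost_usd=1.0)
second = guard.record(now_s=60, cost_usd=1.5)
```

In this run `first` is empty and `second` contains a spike alert, because the two calls together exceed the hourly threshold while staying well under the daily cap.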

What I actually learned

Most of what makes Bosun work isn’t novel; it’s loop engineering, a discipline a lot of smart people have been writing down. Anthropic’s Building effective agents makes the case for simple, composable patterns over clever ones. The agentic-loop field guides trace the lineage from ReAct to AutoGPT to today’s orchestrators. The throughline is unglamorous: bound your loops (max iterations, no-progress detection, a spend ceiling), make a human the checker on anything irreversible, and keep the context lean. Bosun is mostly me applying those rules with a steady hand rather than inventing anything.
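For readers who want the "bound your loops" rule in concrete form, here is one way the three bounds might be combined around an inner loop. It is a generic sketch rather than anything taken from Bosun or the cited guides: `run_bounded` is a hypothetical helper, and `step` is assumed to return a tuple of (observable state, cost in dollars, done flag).

```python
def run_bounded(step, max_iterations=20, max_stalls=3, spend_ceiling_usd=5.0):
    """Drive step() until it finishes or one of three bounds trips.

    Returns (reason, iterations_used), where reason is one of
    'done', 'spend_ceiling', 'no_progress' or 'max_iterations'.
    """
    last_state = None
    stalls = 0
    spent = 0.0
    for i in range(1, max_iterations + 1):
        state, cost_usd, done = step()
        spent += cost_usd
        if done:
            return ("done", i)
        if spent >= spend_ceiling_usd:
            return ("spend_ceiling", i)
        # No-progress detection: count consecutive steps that change nothing.
        stalls = stalls + 1 if state == last_state else 0
        if stalls >= max_stalls:
            return ("no_progress", i)
        last_state = state
    return ("max_iterations", max_iterations)


# A step that never changes anything is cut off by the stall counter.
stuck = run_bounded(lambda: ("same", 0.0, False))
```

Returning the reason alongside the iteration count gives the outer loop something specific to write on the card, instead of a bare failure.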

Where it’s going

Bosun runs on a Mac mini in my flat right now, dispatching across all my projects from a single Telegram thread. The next steps are the obvious ones: turn the evaluator on by default, extend the gate to more harnesses, and keep widening the set of work I can hand off without thinking about it. The goal was never full autonomy. It’s to hand the day-to-day running of my projects to something I just talk to, and keep myself for the calls that actually need me: what’s worth doing, what’s risky, and whether a result is any good.


References & inspiration

  • Omnigent: the open-source agent control plane Bosun runs crewmates on (Databricks announced it; I’ve been contributing).
  • Firstmate: Kun Chen’s prompt-OS chief-of-staff; the project Bosun is modelled after.
  • Andrej Karpathy: for the knowledge-base-as-context idea (the repo is the source of truth; let the agent forget).
  • Anthropic, Building effective agents (Dec 2024).
  • The Agentic Loop: A Practical Field Guide: on the lineage from ReAct to modern orchestration.
  • My other projects, for the curious: GuideX, DXJ Signal Engine, Cygnal, and Cynthia. More at davidtandoh.com.