Multi-machine fleet — darkmux user guide

The big picture

darkmux's multi-machine substrate has three layers:

Identity: each machine declares a logical fleet name (DARKMUX_MACHINE_ID). That name is what flow records carry as machine_id and what --machine hints reference.
Coordination: a Redis instance running on the always-on member. Two stream classes: darkmux:work — one global work queue (publishers XADD; the first available runner claims via a shared consumer group) — and darkmux:flow for the fleet-wide event log (every machine writes; any daemon's /flow endpoint reads).
Observability: the daemon hosts the live viewer at http://localhost:8765/; the demo at darkmux.com/demo shows the bundled showcase scenario. The viewer aggregates events from every machine writing to the shared stream. For fleet-wide access from any peer's browser, see always-on hub → upgrading to substrate + viewer (Tailscale Serve recommended). From the terminal, darkmux fleet status --deep gives a per-machine specs view (free RAM, loaded models, version).

Redis is optional. Without it, single-machine usage works exactly as before; flow records still land on disk per-machine via LocalFileSink. With it, the fleet members start seeing each other.

Scope: single operator, multiple machines.

The trust boundary is your tailnet (your call: Tailscale, ZeroTier, WireGuard), not enforcement in darkmux's code. DARKMUX_REDIS_URL carries no auth beyond what your mesh VPN + Redis ACLs already provide; provenance fields are operator-asserted. This works because everyone on the substrate is you. Multi-tenant deployment is explicitly out of scope. See DESIGN.md for the rationale and fork-friendly invitation.

Before you start: harden your hub

One of the most painful failure modes in a multi-machine darkmux deployment comes from leaving the hub on macOS defaults: the hub sleeps overnight and drops Tailscale, dispatch from your other Macs starts timing out, and you don't notice until you try to use it. That happened in the maintainer's own setup; the lesson is in the field notebook.

Going deeper: the checklist below is the minimum (sleep, autostart). For a production-grade hub — Redis with AOF persistence, darkmux serve under launchd KeepAlive, the audit substrate enabled, log rotation, daily integrity checks — see always-on hub operations. The two pages compose: this page tells you how to wire up the fleet; that page tells you how to make the hub at the center of it survive crashes, reboots, and weeks of operator absence.

Before joining a Mac as your hub, walk through this checklist on it once:

# Stop sleeping. Mac defaults assume "laptop closed = sleep"; that's wrong for a 24/7 hub.
sudo pmset -a sleep 0 disksleep 0 autorestart 1

# Verify the settings landed.
pmset -g

# Tailscale: System Settings → Tailscale → enable "Run at login" + "Always on"
# Auto-login: System Settings → Users & Groups → Login Options → Automatic login
# Software updates: turn "Install in background" OFF and schedule installs manually.

The pmset config keeps the machine awake; autorestart 1 recovers from a brief power blip. Tailscale "Run at login" + an auto-login user means: after a reboot (intentional or accidental), the machine boots straight into a logged-in session where Tailscale + your daemon services start automatically. Without that, the machine sits at the login screen with Tailscale down and your other Macs see "the hub is offline" with no remote way to recover.
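To confirm the three values later without reading the whole pmset -g listing, a small parser can help. This is a minimal sketch, assuming the usual layout of that output (one setting per line, name followed by value); check_hub_power is a hypothetical helper name and not part of darkmux.

```shell
# check_hub_power: read `pmset -g` output on stdin and verify the three
# hub settings from the checklist. Hypothetical helper, not part of darkmux.
check_hub_power() {
  awk '
    # Each settings line is assumed to look like " sleep    0" (name, value).
    $1 == "sleep"       { seen["sleep"] = $2 }
    $1 == "disksleep"   { seen["disksleep"] = $2 }
    $1 == "autorestart" { seen["autorestart"] = $2 }
    END {
      bad = 0
      if (seen["sleep"] != "0")       { print "sleep should be 0";       bad = 1 }
      if (seen["disksleep"] != "0")   { print "disksleep should be 0";   bad = 1 }
      if (seen["autorestart"] != "1") { print "autorestart should be 1"; bad = 1 }
      if (!bad) print "hub power settings OK"
      exit bad
    }
  '
}

# Usage on the hub: pmset -g | check_hub_power
```

The function exits non-zero when any setting is off, so it can also sit in a periodic check on the hub.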

The hardening matters more than the model choice. Same hardware class, same OS. Different config produces different reliability. Apply it once when you set the machine up.

Set up the fleet

On the hub machine (always-on, runs Redis)

Install Redis if you don't already have one. macOS:

brew install redis
brew services start redis

Decide on a Redis URL: typically redis://default:<password>@<tailnet-addr>:6379, where <tailnet-addr> is the hub's Tailscale IP or MagicDNS name. Set a password (requirepass in redis.conf) even on a trusted tailnet. Defense-in-depth is cheap.
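The URL shape is easy to mistype, so it can help to assemble it once and test it before exporting it anywhere. The sketch below is an illustration only: darkmux_redis_url is a hypothetical helper, the password and host in the usage comment are placeholders, and it assumes the password contains no URL-reserved characters.

```shell
# darkmux_redis_url: assemble the URL shape described above.
# Hypothetical helper; assumes the password has no URL-reserved characters
# (such as "@", ":" or "/"), which would need percent-encoding.
darkmux_redis_url() {
  password=$1
  host=$2
  port=${3:-6379}
  printf 'redis://default:%s@%s:%s\n' "$password" "$host" "$port"
}

# Example with placeholder values; substitute your own password and address.
# url=$(darkmux_redis_url 'example-password' '<hub-tailnet-addr>')
# redis-cli -u "$url" PING    # a reachable, authenticated server replies PONG
```

If the ping works on the hub but times out from a peer, check the bind directive in redis.conf: a Redis that listens only on loopback is unreachable over the tailnet.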

Then in the hub's shell rc (~/.zshrc):

# darkmux fleet membership — always-on hub member
export DARKMUX_MACHINE_ID=studio        # operator-named, not hostname
export DARKMUX_REDIS_URL=redis://default:<password>@<hub-tailnet-addr>:6379
export DARKMUX_ORCHESTRATOR=claude-code   # or whichever frontier you use

Reload your shell and verify with darkmux doctor. machine_id and flow sink health should both read ✓.

On each peer

Same shape, different id:

export DARKMUX_MACHINE_ID=laptop
export DARKMUX_REDIS_URL=redis://default:<password>@<hub-tailnet-addr>:6379
export DARKMUX_ORCHESTRATOR=claude-code

Same Redis URL on every peer. That's how they coordinate. Same orchestrator string per session (lets you trace which frontier drove which dispatch in flow records).

Reload + darkmux doctor on each peer. All checks should be green.
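When a peer's doctor run is not green, the first thing worth ruling out is a variable that never made it into the current shell. A minimal sketch of that pre-check follows; darkmux_env_check is a hypothetical helper that only looks at the three variables exported above, and darkmux doctor remains the real health check.

```shell
# darkmux_env_check: report which fleet variables are missing from the
# current shell. Hypothetical helper; it never prints the values themselves,
# so the Redis password stays out of your terminal scrollback.
darkmux_env_check() {
  missing=0
  for name in DARKMUX_MACHINE_ID DARKMUX_REDIS_URL DARKMUX_ORCHESTRATOR; do
    # Indirect lookup that works in POSIX sh as well as bash and zsh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] && echo "fleet env looks complete"
  return "$missing"
}
```

Run it in a fresh terminal after editing ~/.zshrc; a "missing" line usually means the rc file was not reloaded.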

Add peers to each machine's roster

Each machine has its own ~/.darkmux/fleet.json roster: the list of machines it knows about. Today each roster is maintained by hand, per machine (cross-machine state replication is tracked in #280).

From the hub, register every peer:

darkmux fleet add laptop --address <laptop-tailnet-addr>:8765
darkmux fleet add ipad-pi --address <ipad-tailnet-addr>:8765

From each peer, register the hub + any peers it should know about:

darkmux fleet add studio --address <hub-tailnet-addr>:8765
darkmux fleet add laptop --address 127.0.0.1:8765   # self

Verify with darkmux fleet status on each machine. Reachability probes confirm the addresses resolve and the daemon ports are open.
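Because every machine needs its own copy of the roster, it can be convenient to keep one plain list of ids and addresses and generate the fleet add lines from it. The sketch below assumes such a list (the "id address" format and the roster_commands name are inventions for illustration); it uses only the darkmux fleet add form shown above, and the addresses are the example ones from this guide.

```shell
# roster_commands: print one `darkmux fleet add` line per "id address" pair
# read from stdin. It only prints; nothing runs until you pipe the result to sh.
roster_commands() {
  while read -r id address; do
    # Skip blank lines and comments in the input list.
    case $id in ''|'#'*) continue ;; esac
    printf 'darkmux fleet add %s --address %s\n' "$id" "$address"
  done
}

# Example run on the laptop: the hub plus the laptop itself.
roster_commands <<'EOF'
# id     address
studio   100.64.2.1:8765
laptop   127.0.0.1:8765
EOF
```

Printing instead of executing keeps the step reviewable: read the generated lines first, then run them by hand or pipe them to sh.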

Add a new machine to an existing fleet

Once the initial fleet is up, joining the next Mac is a 10-step walkthrough that lives in the /darkmux-add-machine Claude Code skill:

# From your frontier orchestrator (Claude Code, etc.):
/darkmux-add-machine

The skill reads your existing fleet state, asks you a handful of questions (what id, which Redis URL), and proposes the configuration. It then waits for you to apply each step and verifies the outcome before continuing. It only reads and proposes throughout: every state-mutating command runs at your hand, not the skill's.

If you don't use a Claude Code orchestrator: the steps are mechanical and the same as the "Set up the fleet" section above. The skill is a convenience, not a requirement.

Use the fleet

Dispatch a role across the fleet

From any machine in the fleet:

darkmux crew dispatch coder --message "implement the X feature"

With no --machine, the dispatch runs locally. To hand the work to the fleet queue instead, name a target machine:

darkmux crew dispatch coder --machine laptop --message "implement the X feature"

This publishes the work to the single global darkmux:work stream; the first available runner claims it. The --machine id is an advisory hint (#590): any runner may claim the job, and a non-target runner logs a soft warning and proceeds (no requeue). Requires DARKMUX_REDIS_URL set on the dispatching machine and darkmux serve running on the runner.
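The advisory nature of the hint is easy to misread as pinning, so here is the behaviour described above written out as a tiny model. It is illustrative only and not darkmux's runner code; claim_decision and the warning text are made up for the example.

```shell
# claim_decision: what a runner does with a job it has just claimed.
# Illustrative model of the advisory hint, not darkmux's implementation.
#   $1 = this runner's machine id, $2 = the job's --machine hint
claim_decision() {
  self=$1
  hint=$2
  if [ "$hint" != "$self" ]; then
    # Soft warning on stderr; the job still runs and is never requeued.
    echo "warning: job hinted for '$hint' claimed by '$self'" >&2
  fi
  echo "run"
}
```

Either way the outcome is "run": the hint changes what gets logged, not who is allowed to execute the job.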

Capability-based auto-routing — match work to the machine best suited to run it, without naming a target — is the planned successor (#590).

See what's on every machine: `fleet status --deep`

darkmux fleet status --deep

Fans out across every reachable peer's /machine/specs endpoint and renders a table:

MACHINE        ADDRESS                PROBE      RAM-FREE    OS              VERSION  MODELS
laptop         100.64.1.5:8765        ✓ 23ms     78 GB       macos aarch64   1.9.0    darkmux:qwen3.6-35b-a3b-mlx
studio         100.64.2.1:8765        ✓ 45ms     12 GB       macos aarch64   1.9.0    darkmux:qwen3-4b-instruct

Bounded at 1s per peer; degraded peers render with specs? in the RAM-FREE column rather than failing the whole command. This is the "what's actually going on right now?" view across your fleet: the answer to "is the laptop's RAM full?" or "did the studio reboot and lose the model I had loaded?"
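For the "is the laptop's RAM full?" question, the table can be filtered from a script. The sketch below is a best-effort filter, assuming the column layout shown above stays as printed; low_ram_peers is a hypothetical helper, and parsing a human-readable table is inherently brittle, so treat it as a convenience rather than an interface.

```shell
# low_ram_peers: read `darkmux fleet status --deep` output on stdin and print
# machines whose RAM-FREE value is below a threshold in GB (default 16).
# Rows without a numeric "<n> GB" pair (the header, or degraded peers
# showing "specs?") are skipped.
low_ram_peers() {
  awk -v min="${1:-16}" '
    {
      # Find the "<number> GB" pair instead of relying on a fixed column.
      for (i = 2; i <= NF; i++) {
        if ($i == "GB" && $(i - 1) ~ /^[0-9]+(\.[0-9]+)?$/) {
          if ($(i - 1) + 0 < min + 0) print $1, $(i - 1) " GB free"
          break
        }
      }
    }
  '
}

# Usage: darkmux fleet status --deep | low_ram_peers 16
```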

Watch fleet activity from any machine: the live viewer

The daemon serves the live viewer at its own origin: open http://localhost:8765/ in any browser on a machine running darkmux serve. When DARKMUX_REDIS_URL is set on that daemon, the viewer's /flow/<date> + /flow/<date>/stream endpoints aggregate events from every machine writing to the shared darkmux:flow stream: fleet-wide events, not just the host's local file.

For loading the hub's viewer from a peer's browser (without each peer running its own daemon), see always-on hub → upgrading to substrate + viewer. Tailscale Serve is the recommended path. It proxies HTTPS to the hub's daemon without exposing the daemon to the tailnet interface.

The source pill on the toolbar tells you the truth about the data path:

source: live: Redis aggregation is working; you're seeing the whole fleet
source: no daemon: can't reach the daemon at daemon-base; check that the daemon is running
source: replay (<fixture>): you're viewing a static fixture, not live data

The viewer is a static HTML page. Nothing uploaded; all rendering happens client-side from the daemon's JSON.

When things go wrong

The substrate degrades gracefully. Knowing the modes saves you from confused debugging.

Hub machine drops off the network

The most common failure. Other Macs' flow writes fall back to LocalFileSink only (records still land on disk; Redis sink errors are logged + skipped). Cross-machine dispatch fails loudly with an operator-actionable hint. darkmux doctor reports the Redis sink as unreachable.

If you've left a topology viewer tab open, the SSE Redis tail retries for a bounded budget (MAX_CONSECUTIVE_XREAD_FAILURES, ≈5 seconds wall-clock), then emits a synthetic stream.error record and exits cleanly. The viewer sees the channel close, not a forever-spinning loader.
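The bounded-retry behaviour can be pictured with a small model. This is a sketch, not darkmux's code: tail_with_budget is a hypothetical name, the outcomes are passed in as words instead of coming from Redis, and it assumes (as the word "consecutive" suggests) that a successful read resets the failure counter.

```shell
# tail_with_budget: walk a sequence of read outcomes ("ok" or "fail") and
# give up after N consecutive failures. Illustrative model only; the real
# limit is MAX_CONSECUTIVE_XREAD_FAILURES inside darkmux.
#   $1 = budget, remaining args = outcomes in order
tail_with_budget() {
  budget=$1
  shift
  failures=0
  for outcome in "$@"; do
    if [ "$outcome" = "ok" ]; then
      failures=0             # assumed: a successful read resets the counter
      echo "record"
    else
      failures=$((failures + 1))
      if [ "$failures" -ge "$budget" ]; then
        echo "stream.error"  # synthetic record, then a clean exit
        return 0
      fi
    fi
  done
}

# Usage: tail_with_budget 3 ok fail fail ok fail fail fail ok
```

In that usage line the two early failures are absorbed, the third consecutive one ends the stream, and the trailing "ok" is never read.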

Recovery: bring the hub back up (physical access usually required). Once Redis is reachable again, the substrate self-heals on the next operation; no manual intervention is needed.

Slow / paused viewer tab

The SSE channel between the daemon's Redis-tail task and the SSE stream is bounded at SSE_MPSC_CAPACITY (256 records, ~256 KB worst-case per stream). When the consumer falls behind, the daemon drops the newest records and logs the drop with a running total. Operator-visible: viewer might miss a few recent events; reconnect or refresh to recover.
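The drop-newest policy means a stalled tab keeps the oldest buffered records and loses whatever arrives after the channel fills. The model below illustrates that under simplifying assumptions: the consumer is fully stalled for the whole run, bounded_feed is a hypothetical name, and none of it is darkmux's implementation. As a side note, 256 records at roughly 256 KB works out to about 1 KB per record in the worst case.

```shell
# bounded_feed: simulate a bounded channel with a drop-newest policy.
# Illustrative model only, not darkmux's implementation.
#   $1 = capacity; records arrive one per line on stdin while the consumer
#   is stalled, so nothing is drained during the run.
bounded_feed() {
  capacity=$1
  queued=0
  dropped=0
  while read -r record; do
    if [ "$queued" -lt "$capacity" ]; then
      queued=$((queued + 1))
      echo "queued: $record"
    else
      # Channel full: the incoming (newest) record is discarded and counted.
      dropped=$((dropped + 1))
      echo "dropped: $record (total dropped: $dropped)" >&2
    fi
  done
  echo "in channel: $queued, dropped: $dropped"
}

# Usage: printf 'a\nb\nc\nd\n' | bounded_feed 2
```

With a capacity of 2 and four incoming records, the first two are kept and the last two are dropped, which matches what a refreshed viewer would be missing.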

"Why didn't my dispatch land where I expected?"

A --machine dispatch emits a dispatch route flow record. Tail the flow stream or open the topology viewer; filter for action: dispatch route. The record's payload shows target_machine and decision (pinned for --machine, or local for a local fall-through). The substrate's reasoning is in the audit chain, not hidden.

Two operators on the same machine

Out of scope. DARKMUX_MACHINE_ID is per-machine, not per-user. If you ever need per-user provenance on a shared Mac, darkmux has outgrown its target. Fork it.

What's not yet built

The substrate works; some operator-experience polish is still ahead. Watch these issues if they sound relevant:

#280: cross-machine mission / sprint state replication. Today missions you create on machine A are not visible from machine B. The fix is event-sourced state on Redis; tracked as the next architectural arc.
#282: mission priority + cross-fleet pause/resume from any node. Today's mission lifecycle verbs only affect future dispatches on the machine where they ran.
#302: topology viewer surface for stream.error records. The synthetic-record substrate is in place; the viewer needs a UI element (pill or toast) to surface it explicitly.
Elastic-hub failover: promote any peer to hub when the current one drops. Not yet filed; the workflow today is operator-manual.

If any of these block a real workflow, comment on the issue and that work will move up the queue.
