The lab harness: darkmux user guide

The point of the lab

Local AI claims are usually unfalsifiable. "This profile is faster." Measured how? On which workload? With which compaction settings? With which loaded state at start?

The lab harness exists to make those questions answerable on your own hardware:

Reproducible workloads: named JSON manifests under templates/builtin/workloads/ (and your own under ~/.darkmux/workloads/).
On-disk artifacts: every run writes a manifest, trajectory, and verify outcome to .darkmux/runs/<id>/. No "I think I saw 30 seconds" hand-waving. The numbers are on disk, in JSON, comparable.
Comparison primitives: lab compare A B diffs two runs by wall-clock, compaction events, and per-turn behavior.
Trait-based extension: add your own workload kinds via the WorkloadProvider trait in src/workloads/types.rs.

Prerequisites for lab

By default, darkmux lab run uses the same internal Docker-bounded runtime as darkmux crew dispatch. No openclaw install needed. Workloads dispatch through a per-invocation darkmux-runtime container; build the image once with docker build -t darkmux-runtime:latest runtime/ from the darkmux repo root.

Operators who already have openclaw installed (or who want the runtime the lab-series numbers were measured against) can opt in per-run with darkmux lab run <workload> --runtime openclaw. The openclaw shell-out binary defaults to openclaw on PATH; override with --runtime-cmd <path> to point at Aider, Cline, or anything else with a <cmd> agent --message surface. The flag is only consulted under --runtime openclaw; pure-internal operators never touch it.

If you have neither Docker nor an external runtime, the swap / status / profiles / doctor verbs still work. You just can't run lab dispatches.

The smoke workload

quick-q is a single-turn smoke prompt against the active profile. It's the "is this thing wired up at all" test: it should complete in roughly 6–10 seconds if a model is loaded.

darkmux lab run quick-q
# …
# Captured run quick-q-deep-1778302418-1 in .darkmux/runs/quick-q-deep-1778302418-1/

Each run prints its id; that's your handle for inspect/compare later.

What the run wrote to disk

ls .darkmux/runs/quick-q-deep-1778302418-1/
# manifest.json     trajectory.jsonl   verify.json

manifest.json: workload id, profile, loaded model state, hardware fingerprint, start/end time.
trajectory.jsonl: per-turn detail, including prompt, response, tool calls, timing, and compaction events.
verify.json: verify-command outcome. Only coding-task workloads define a verify command; a prompt workload like quick-q has none, so its verify is just "a response came back" and reports pass without running any check. Don't read a prompt workload's "pass" as a test result: there's nothing to fail. The real pass/fail signal lives in coding-task verify commands.
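Because the artifacts are plain JSON and JSONL, they can be read with a few lines of script. The following is a minimal sketch, not part of darkmux: summarize_run is a hypothetical helper, and it assumes only that manifest.json holds a JSON object and that trajectory.jsonl holds one JSON value per non-empty line. It deliberately reads no specific field names, since the schema is not spelled out here.

```python
import json
from pathlib import Path


def summarize_run(run_dir):
    """Return the manifest's top-level keys and the number of trajectory records."""
    run = Path(run_dir)
    manifest = json.loads((run / "manifest.json").read_text())
    # JSONL: one JSON value per non-empty line.
    with (run / "trajectory.jsonl").open() as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    # verify.json is loaded only if present, so the helper works on partial runs too.
    verify_path = run / "verify.json"
    verify = json.loads(verify_path.read_text()) if verify_path.exists() else None
    return {
        "manifest_keys": sorted(manifest),
        "trajectory_records": len(records),
        "verify": verify,
    }
```

Calling summarize_run(".darkmux/runs/quick-q-deep-1778302418-1") on a captured run is a quick way to see which keys your darkmux version actually writes before scripting against them.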

Cross-layer telemetry (always-on)

Cross-layer telemetry is captured automatically (#557): no flag, no sidecar file. The internal runtime and crew dispatch emit it as category=telemetry flow records on the flow stream (sources: lms, process, detector, runtime, context, compaction), so you can see what LMStudio had loaded, where the runtime process sat across the run, detector signals, and compaction events. That's useful for diagnosing where wall-clock latency is concentrated.

View it in the daemon's observability viewer: run darkmux serve and open http://localhost:8765/. The viewer reads live flow records straight from the daemon. A demo instance lives at darkmux.com/demo.

Inspect a run

darkmux lab inspect quick-q-deep-1778302418-1

Shows: total wall-clock, turn count, compaction events, mode (fast or slow), and notes from the trajectory.

The mode classification (fast/slow) reflects the bimodal wall-clock distribution observed in empirical testing. Wall-clock per turn splits into two clusters based on whether the prompt happened to hit a state that triggers heavy compaction or not. Inspect tells you which cluster a given run landed in. The output is a small report, best seen by running it on your own machine after a quick-q or long-agentic run. The methodology behind this is documented in Part 2 of the lab series if you want the full empirical grounding.

Characterize your hardware

darkmux lab characterize is the one-command "QA my Mac": it dispatches a representative smoke workload, captures the run, and emits a verdict.

darkmux lab characterize

Output is a structured JSON verdict plus a human-readable summary. Use it to:

Compare two operators' machines against the same benchmark.
Catch drift after a profile change or LMStudio upgrade.
Establish a baseline before tuning (see below).

Tune: detect bimodal variance

Single runs lie about local-AI behavior. Wall-clock varies enough that one run isn't a reliable signal. lab tune dispatches N runs against the same workload and clusters the results, surfacing the bimodal "fast" vs "slow" mode shape if it's there.

darkmux lab tune long-agentic --runs 6

Output: a per-run table + a cluster verdict ("looks bimodal at fast=<mean>s, slow=<mean>s" or "single-mode").

Use this after changing a profile setting (context length, compaction mode, compactor model) to see whether the change actually shifted the distribution, rather than a single run just getting lucky.
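To make the idea of a cluster verdict concrete, here is one simple way to split a handful of wall-clock samples into two groups. This is an illustrative sketch only, not the algorithm lab tune uses (which is not described here): split_modes cuts the sorted samples at the largest gap and calls the result bimodal when the slow mean is at least min_ratio times the fast mean. Both the function name and the min_ratio threshold are hypothetical.

```python
from statistics import mean


def split_modes(wall_clock_s, min_ratio=1.5):
    """Split samples at the largest gap between sorted neighbours.

    min_ratio is an illustrative threshold, not a darkmux setting.
    """
    xs = sorted(wall_clock_s)
    if len(xs) < 2:
        return {"verdict": "single-mode", "fast": xs, "slow": []}
    # Index of the sample that follows the largest gap.
    cut = max(range(1, len(xs)), key=lambda i: xs[i] - xs[i - 1])
    fast, slow = xs[:cut], xs[cut:]
    # Require a clear separation between the two cluster means.
    if mean(fast) > 0 and mean(slow) / mean(fast) >= min_ratio:
        return {"verdict": "bimodal", "fast": fast, "slow": slow}
    return {"verdict": "single-mode", "fast": xs, "slow": []}
```

With only a few runs any such verdict is fragile, which is one reason to raise --runs when the split looks borderline.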

Compare runs

darkmux lab compare A B diffs two runs, typically a "before tuning" baseline vs "after" verification.

darkmux lab compare quick-q-deep-1778302418-1 quick-q-deep-1778466601-1

Reports: wall-clock delta, compaction-event delta, mode classification per side, any structural differences in the trajectory.

The discipline behind it: baseline → single-variable change → re-measure → compare → record in notebook. Each step has a darkmux primitive. Don't skip the baseline. Don't change two variables at once. Without this discipline, the comparison is uninterpretable.

List your runs

darkmux lab runs --limit 10        # most recent 10
darkmux lab runs --limit 50

Each row shows: run id, workload, profile, wall-clock, mode, verify outcome. Quick way to spot which run id to inspect or compare.

Reproducible sandboxes: fixtures

A single number isn't a measurement; a number you can reproduce is. The fixture system (landed across #487, phases 1–5) makes every coding-task run start from a known state and end with a verifiable one, so "this profile is faster" becomes a claim you can re-run on demand.

Per-run copy-on-write isolation

Every darkmux lab run works in its own sandbox, a copy-on-write clone of the source, not the source itself. The source fixture directory is never touched; cross-run contamination is eliminated by construction. The clone is near-instant where the filesystem supports it (clonefile on APFS, --reflink on btrfs/xfs/zfs) and falls back to a deep copy everywhere else.

~/.darkmux/runs/<run-id>/
  sandbox/          # this run's isolated COW clone — the model works here
  manifest.json     # now carries baseline_hash + final_hash (schema_version 4)
  trajectory.jsonl  # per-turn detail
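The clone-then-fall-back behavior can be approximated outside darkmux for experimentation. The sketch below is an assumption-laden stand-in, not darkmux's implementation: cow_clone is a hypothetical helper that shells out to cp (the -c flag on macOS, --reflink=auto with GNU cp elsewhere) and falls back to a plain recursive copy when cp is missing or fails.

```python
import os
import shutil
import subprocess
import sys


def cow_clone(src, dst):
    """Clone src to dst, preferring a copy-on-write clone, else a deep copy."""
    if os.path.exists(dst):
        # cp would copy *into* an existing directory, so refuse up front.
        raise FileExistsError(dst)
    if sys.platform == "darwin":
        cmd = ["cp", "-cR", src, dst]  # -c asks cp to use clonefile(2)
    else:
        cmd = ["cp", "-R", "--reflink=auto", src, dst]  # GNU cp flag
    try:
        subprocess.run(cmd, check=True, capture_output=True)
    except (OSError, subprocess.CalledProcessError):
        # No usable cp, or it failed part-way: do a plain recursive copy.
        shutil.copytree(src, dst, dirs_exist_ok=True)
```

Either way the source directory is only read, never written, which is the property the sandbox relies on.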

The manifest gained two content hashes (BLAKE3, formatted blake3:<hex>):

fixture.baseline_hash: the source state before the run, captured at clone time. Proves two runs started from the same place.
final_hash: the sandbox state after dispatch, excluding derived dirs (.git, node_modules, target, __pycache__, .coverage, .darkmux-runtime). Two runs with the same final_hash left bitwise-identical output, the strongest reproducibility signal the lab emits.
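To illustrate what a content hash with excluded derived directories means, here is a rough sketch. It is not darkmux's hashing scheme: tree_hash is a hypothetical helper, the way paths and bytes are fed into the digest is an assumption, and it uses BLAKE2b from Python's standard library as a stand-in because BLAKE3 is not available there. Its output will therefore never match a real baseline_hash or final_hash.

```python
import hashlib
from pathlib import Path

# Exclusion names taken from the list above.
DERIVED = {".git", "node_modules", "target", "__pycache__", ".coverage", ".darkmux-runtime"}


def tree_hash(root, exclude=DERIVED):
    """Hash relative paths and file bytes under root, skipping excluded names."""
    root = Path(root)
    h = hashlib.blake2b(digest_size=32)  # stand-in for BLAKE3
    for path in sorted(root.rglob("*")):  # sorted for a stable order
        rel = path.relative_to(root)
        if any(part in exclude for part in rel.parts):
            continue
        if path.is_file():
            data = path.read_bytes()
            h.update(rel.as_posix().encode() + b"\0")
            h.update(len(data).to_bytes(8, "big"))  # length prefix avoids ambiguity
            h.update(data)
    return "blake2b:" + h.hexdigest()
```

Two trees that differ only in excluded directories hash the same, while any change to a tracked file changes the digest.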

The fixture manifest

A fixture is a self-contained directory with a .fixture.json manifest at its root. Only name is required:

{
  "name": "demo-tiny-py",          // registry key — no path separators
  "version": "1.0",
  "satisfies": "tiny-python-suite@1.0", // what abstract requirement this fills
  "language": "python",
  "verify_command": "python3 -m unittest discover -s tests",
  "baseline": { "test_count": 5 },   // free-form expectations, surfaced by doctor
  "required_files": ["src/parser.py", "tests/test_parser.py"],
  "hash_exclude": ["__pycache__", ".pytest_cache"]
}

hash_include / hash_exclude layer on top of the defaults if you need to pull in out-of-tree files or ignore derived ones.
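A quick pre-flight check of a fixture directory can be sketched from the rules stated above: name is required, it must not contain path separators, and every entry in required_files should exist. check_fixture is a hypothetical helper, not a darkmux API, and it assumes the on-disk .fixture.json is strict JSON (the // annotations in the example above are explanatory and would not parse with json.loads).

```python
import json
from pathlib import Path


def check_fixture(fixture_dir):
    """Return a list of problems found in fixture_dir/.fixture.json (empty if none)."""
    root = Path(fixture_dir)
    manifest_path = root / ".fixture.json"
    if not manifest_path.is_file():
        return ["missing .fixture.json"]
    try:
        manifest = json.loads(manifest_path.read_text())
    except json.JSONDecodeError as exc:
        return [f"manifest does not parse: {exc}"]
    if not isinstance(manifest, dict):
        return ["manifest must be a JSON object"]
    problems = []
    name = manifest.get("name")
    if not isinstance(name, str) or not name:
        problems.append("name is required")
    elif "/" in name or "\\" in name:
        problems.append("name must not contain path separators")
    # required_files are resolved relative to the fixture root.
    for rel in manifest.get("required_files", []):
        if not (root / rel).is_file():
            problems.append(f"required file missing: {rel}")
    return problems
```

An empty list means the fixture passes these basic checks; it says nothing about hash drift.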

Register once, resolve by requirement

The fixture directory stays wherever it lives; the registry (~/.darkmux/lab-registry.json, or .darkmux/lab-registry.json project-scoped) is just a name → path lookup with integrity metadata. Registering computes and records the content hash so later drift is detectable.

darkmux lab register ./my-fixture            # add by path (reads .fixture.json)
darkmux lab register ./my-fixture --force    # replace an existing entry
darkmux lab fixtures                         # list registered fixtures
darkmux lab unregister demo-tiny-py          # drop the entry (never deletes the dir)
darkmux lab doctor                           # offline integrity check (see below)

A workload opts into a fixture by declaring requires_fixture in its manifest, a <name>@<version> string. At run time the resolver finds a registered fixture whose satisfies matches and uses it as the source; with no requires_fixture set, the run falls back to {sandboxes}/<workload-id>/ as before. (This replaces the old DARKMUX_SANDBOX_<ID> env-var binding; the registry is now the only persistent fixture binding.)
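The resolution step can be pictured as a lookup over the registry. The sketch below is hypothetical throughout: the in-memory registry shape (name mapped to a dict with path and satisfies), the exact-string match, and the resolve_fixture name are all assumptions for illustration, since the real registry file format and matching rules are not specified here.

```python
def resolve_fixture(requires_fixture, registry):
    """Return the path of the first registered fixture that satisfies the requirement.

    registry is an assumed shape: {name: {"path": ..., "satisfies": ...}}.
    """
    if requires_fixture is None:
        # No requirement declared: the caller falls back to the per-workload sandbox dir.
        return None
    for name in sorted(registry):  # sorted so the result is deterministic
        entry = registry[name]
        if entry.get("satisfies") == requires_fixture:
            return entry["path"]
    raise LookupError(f"no registered fixture satisfies {requires_fixture!r}")
```

Failing loudly when nothing matches mirrors the intent of catching a missing fixture before a dispatch is spent on it.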

`darkmux lab doctor`

A cheap, offline check that catches a broken fixture before you waste a dispatch on it. For every registered fixture it verifies the path still exists, the manifest still loads, the required_files are present, and the content hash + manifest version haven't drifted since registration. If there's no registry at all, it points you at scripts/lab-init.sh.

Bootstrap the built-ins

A built-in demo-tiny-py fixture (a trivial Python module + a 5-test suite) ships in the repo under templates/builtin/lab-fixtures/. Populate your registry with it, and with any future built-ins, via the standalone init script (idempotent; safe to re-run after a git pull):

scripts/lab-init.sh          # register all built-ins
scripts/lab-init.sh --dry    # print what would be registered; no writes
scripts/lab-init.sh --force  # re-register, accepting upstream drift

It's a plain script, not a CLI verb: run it once, fork it, or skip it entirely. The discoverability path lives in darkmux lab doctor's "no registry" hint.

Adding a workload

Built-in workloads live in templates/builtin/workloads/*.json and are embedded into the binary at compile time via include_str!. They work from any directory without the source tree.

To add your own:

Drop a JSON manifest at ~/.darkmux/workloads/<id>.json (user-local), picked up automatically on next lab run.
For deeper extension (new workload kinds, not just new prompts), implement the WorkloadProvider trait in src/workloads/types.rs and register it in src/workloads/registry.rs::register_builtins(). Requires a code change + reinstall.

Two provider kinds ship out of the box: prompt (single prompt → response) and coding-task (sandbox + verify-command pattern).

Notebook entries from runs

darkmux notebook draft <run-id> dispatches the active role to author a lab-style notebook entry from the run's manifest + trajectory.

darkmux notebook draft quick-q-deep-1778302418-1

The output is markdown: Action/Why/Result/Next blocks suitable for pasting into your lab notebook. If you have DARKMUX_NOTEBOOK_DIR set (typically to an iCloud-synced path), darkmux notebook list enumerates entries across machines.