Methodology

How we measure. What we don’t measure. How you can reproduce.

What we measure

Cold-start latency. Steady-state tokens per second. Tokens per second under concurrent load (1, 5, 10, 20 simultaneous users). Tool-call success rate. Multi-turn coherence. Memory footprint. Watts at the wall, where we can capture them. Heat under sustained load.
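A minimal sketch of how the throughput figures above could be derived from per-request timings. The `RequestTiming` fields and the function names are assumptions made for illustration, not the harness's actual schema. The point is the distinction between a single request's decode rate and the aggregate rate under concurrent load.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    # Hypothetical field names; the real harness may record these differently.
    start_s: float        # wall-clock time the request was sent
    first_token_s: float  # wall-clock time the first token arrived
    end_s: float          # wall-clock time the last token arrived
    output_tokens: int    # number of tokens generated


def time_to_first_token(t: RequestTiming) -> float:
    """Latency from sending the request to receiving the first token."""
    return t.first_token_s - t.start_s


def decode_tokens_per_second(t: RequestTiming) -> float:
    """Per-request decode rate, excluding the wait before the first token."""
    window = t.end_s - t.first_token_s
    if window <= 0 or t.output_tokens < 2:
        return 0.0
    # The first token marks the start of the window, so it is not counted.
    return (t.output_tokens - 1) / window


def aggregate_tokens_per_second(timings: list[RequestTiming]) -> float:
    """Total tokens across concurrent requests over the overall wall-clock span."""
    if not timings:
        return 0.0
    span = max(t.end_s for t in timings) - min(t.start_s for t in timings)
    total = sum(t.output_tokens for t in timings)
    return total / span if span > 0 else 0.0
```

Under this sketch, a 20-user run reports one aggregate number plus a per-request distribution, so a slowdown for individual users is visible even when total throughput rises.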

What we don’t measure

Vibes. “Feels fast.” Marketing-grade comparisons against models we haven’t actually run. Synthetic benchmarks that don’t map to workloads we run in production. Anything we can’t reproduce from a JSONL plus the harness.

Fairness rules

Same prompt corpus across comparable runs. NTP-synced timestamps. No warm-cache cheating: every cold-start measurement is genuinely cold. Model checkpoints declared with their hash. The git SHA of the harness committed alongside every run.
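One way to capture the two identifiers named above, as a hedged sketch. The function names are hypothetical and SHA-256 is an assumption, since the hash algorithm is not specified here. `git rev-parse HEAD` is the standard git command for the current commit.

```python
import hashlib
import subprocess
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a checkpoint in chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def harness_git_sha(repo_dir="."):
    """Return the full commit SHA of the harness checkout at repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if repo_dir is not a git repository
    )
    return result.stdout.strip()
```

Recording both values at the start of a run, before any measurement, means a result can never be attributed to a checkpoint or harness revision that changed mid-run.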

Reproducibility

Every run we publish has its raw JSONL, its tee’d log, a human-readable summary, and a metadata file declaring device, model, parameters, environment, and the SHA of the harness. Clone the public archive; re-run; tell us where we got it wrong.
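A rough sketch of a first sanity check after cloning: load a run's raw JSONL and confirm the metadata declares everything listed above. The key names in `REQUIRED_KEYS` and the file layout are assumptions for illustration; the published archive may name them differently.

```python
import json
from pathlib import Path

# Hypothetical key names mirroring the declared items: device, model,
# parameters, environment, and the SHA of the harness.
REQUIRED_KEYS = ("device", "model", "parameters", "environment", "harness_sha")


def read_jsonl(path):
    """Parse one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def missing_metadata_keys(metadata):
    """Return the required keys that the metadata dict does not declare."""
    return [key for key in REQUIRED_KEYS if key not in metadata]
```

An empty list from `missing_metadata_keys` says only that the declaration is complete, not that the run reproduces; that still takes a re-run against the same harness SHA.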

Mistakes we’ve published and corrected

The Predator Qwen rerun set started life as a Mission 1 v0.5 with a methodology bug: per-model swap pressure inflated the mid-size overhead. The post-mortem explains what we got wrong and what we changed. The v3 rerun (`ad057f5b`) is the cleanest weeyuga-routing baseline we have today. A llama.cpp-vs-Ollama engine-direct comparison stayed parked when Pavilion’s Ollama wrapper gap blocked the Phase 2 run.

More entries land here as Janie and Vera voice-pass them. Honesty is how trust gets built.