16/16 models attempted. 12 successes, 4 timeouts (qwen2.5-coder:14b, qwen3:8b, qwen3:14b — same 14B+ tier that timed out on 5Q).
First reproducible benchmark of the Weeyuga cluster's Pavilion peer
when driven through weeyuga serve at
http://[weeyuga-cluster-host]:[cluster-port]/v1/chat/completions instead of
Ollama's native /api/generate. The methodology mirrors
Sloba's canonical 2026-04-11/12 Pavilion benchmark
(temperature=0.1, num_ctx=4096,
num_predict=2048, single-loaded-model
single-parallel-slot, frozen 5Q + Hello suite). The number worth
publishing is the difference: what does going through the
weeyuga routing layer cost vs hitting Ollama directly?
Phase 1 of a 3-phase plan. The 14B+ class regressed (see §3.3); 9b/35b ran clean.
16/16 models attempted. 12 successes, 4 timeouts (qwen2.5-coder:14b, qwen3:8b, qwen3:14b — same 14B+ tier that timed out on 5Q).
96 calls total (16 × 6). 81 successes, 15 timeouts. Format-OK + marker rates within ±5% of prior baseline.
Deferred per HARNESS.md plan. Trinity dispatch follows; recommended scope: 7 working models × 20 prompts.
Deferred to Phase 3. Will run after Phase 2 lands and we know which models can hold real-context coherently.
qwen2.5-coder:0.5b / 3b / 14b all have direct 2026-04-11/12 Ollama-direct numbers to diff against (see §3).
The numbers that frame this baseline.
All 8 variants (0.8b → 35b-a3b) ran clean: 22-32s avg / 30-41 tok/s on 5Q. Zero timeouts.
5/5 timeouts at 6-min wall. Was 54.9s/call on GPU on 2026-04-12. Triage queued for Bane.
vs Ollama-direct GPU on qwen2.5-coder:0.5b (10.4s vs 7.99s). Quality unchanged.
qwen3.5:0.8b on nginx_safe_reload via llama.cpp :[cluster-port].
qwen2.5-coder:3b: 70.5s vs 13.0s prior. Likely Ollama swap-thrashing under multi-model sweep.
num_predict=2048 bounded every call. Worst case (qwen3:4b): 263s / 2048 tokens.
Methodology delta from prior 2026-04-11/12 baseline.
Endpoint: weeyuga :[cluster-port]/v1/chat/completions (OpenAI-compat) — prior runs hit Ollama :11434/api/generate direct.
Routing layer: through Mila's R-QWEN35-DEFAULTS Phase 0+1 sampling defaults + thinking-bypass classifier + Atlas's Phase-2 llama.cpp branch for qwen3.5:0.8b.
Sampling: identical (temperature=0.1, num_ctx=4096, num_predict=2048).
Lanes: single (whichever weeyuga picked) — prior ran CPU + GPU separately and we compare to the faster lane.
Prompts: identical, sourced from ~/Documents/MyServers/.../small_model_eval_questions.json verbatim.
The headline of the run. Three models have direct comparison data
from
~/Documents/MyServers/instances/pavilion-windows-laptop/telemetry/archive/raw-run-artifacts/windows-local/.
| Lane | 5Q avg dur | tok/s | Format-OK | Marker-hit |
|---|---|---|---|---|
| 2026-04-11 Ollama direct, GPU | 7.99 s | 62.65 | 3/5 | 0.80 |
| 2026-04-11 Ollama direct, CPU | 6.72 s | 63.62 | 2/5 | 0.69 |
| 2026-04-28 weeyuga (this run) | 10.40 s | 37.6 | 3/5 | 0.82 |
Verdict: weeyuga adds ~30% wall-time vs prior GPU lane and ~55% vs prior CPU lane. tok/s drops from ~63 → ~38. Quality is unchanged (marker 0.82 vs 0.80). This is the routing-layer cost — Mila's sampling defaults + thinking-bypass classifier + future personality engine all live on this path.
| Lane | 5Q avg dur | tok/s | Format-OK | Marker-hit |
|---|---|---|---|---|
| 2026-04-11 Ollama direct, GPU | 13.00 s | 21.46 | 3/5 | 0.87 |
| 2026-04-11 Ollama direct, CPU | 13.10 s | 21.40 | 3/5 | 0.90 |
| 2026-04-28 weeyuga (this run) | 70.50 s | 3.20 | 3/5 | 0.90 |
Verdict: 5.4× slower through weeyuga. tok/s collapsed from 21 → 3.2. Quality identical.
Most likely root cause: Pavilion's Ollama is configured with OLLAMA_MAX_LOADED_MODELS=3 and OLLAMA_NUM_PARALLEL=3 (Bane 2026-04-26 reconfigure). When weeyuga drives a sequence of calls across 16 different models, Ollama has to evict + reload models, paying disk → VRAM cost on every swap. The prior 2026-04-11 run was single-model so didn't hit this.
Recommended verification: re-run with OLLAMA_MAX_LOADED_MODELS=1 / NUM_PARALLEL=1 on Pavilion (Bane). If 3b drops back near 13s, we've isolated the cause.
| Lane | 5Q avg dur | tok/s | Format-OK | Marker-hit |
|---|---|---|---|---|
| 2026-04-12 Ollama direct, GPU | 54.92 s | 3.08 | 4/5 | 0.90 |
| 2026-04-12 Ollama direct, CPU | 80.15 s | 3.10 | 4/5 | 0.90 |
| 2026-04-28 weeyuga (this run) | TIMEOUT 5/5 | — | 0/5 | 0.00 |
Verdict: previously-usable model is now unusable through weeyuga on Pavilion's current config. Worst single finding in the run.
The 2026-04-12 GPU lane finished 5Q in 4.5 minutes total. Today the 6-min hard cap fires on every individual call. The total wall just on 14b was ~36 minutes (6 calls × 6 min) all wasted.
Action: Bane verifies with single-model config; if it's swap thrashing, the §3.2 test fixes both. If not, becomes a real triage.
n=5 for 5Q phase. TIMEOUT = harness 6-min hard wall fired.
Format-OK and marker-hit scored against canonical small_model_eval_questions.json rules.
| Model | Hello | 5Q avg dur | avg tok-out | tok/s | 5Q errors | Format-OK | Marker |
|---|---|---|---|---|---|---|---|
| qwen3.5 family — thinking; Ollama or llama.cpp | |||||||
| qwen3.5:0.8b via llama.cpp :[cluster-port] | 3.5 s | 26.8 s | 1106 | 38.2 | 0/5 | 3/5 | 0.72 |
| qwen3.5:2b | 4.7 s | 28.5 s | 1173 | 37.9 | 0/5 | 4/5 | 0.85 |
| qwen3.5:4b | 4.0 s | 31.9 s | 1237 | 31.7 | 0/5 | 3/5 | 0.72 |
| qwen3.5:9b | 26.5 s | 25.0 s | 1029 | 40.8 | 0/5 | 4/5 | 0.85 |
| qwen3.5:9b-q4km | 3.7 s | 27.8 s | 1131 | 38.8 | 0/5 | 3/5 | 0.72 |
| qwen3.5:9b-q6k | 3.8 s | 29.9 s | 1216 | 37.0 | 0/5 | 3/5 | 0.72 |
| qwen3.5:35b-a3b-iq2s | 4.0 s | 30.2 s | 1228 | 31.2 | 0/5 | 3/5 | 0.78 |
| qwen3.5:35b-a3b-uncensored-iq1m | 6.0 s | 22.4 s | 890 | 37.7 | 0/5 | 3/5 | 0.75 |
| qwen3 family — thinking; older generation | |||||||
| qwen3:4b | 276.4 s | 262.9 s | 2048 | 1.65 | 0/5 | 0/5 | 0.00 |
| qwen3:8b | TIMEOUT | 157.0 s* | 667* | — | 3/5 | 2/5 | 0.40 |
| qwen3:14b | TIMEOUT | 352.7 s* | 682* | — | 4/5 | 1/5 | 0.16 |
| qwen2.5 / qwen2.5-coder family — non-thinking | |||||||
| qwen2.5-coder:0.5b | 13.4 s | 10.4 s | 390 | 37.6 | 0/5 | 3/5 | 0.82 |
| qwen2.5-coder:1.5b | 20.6 s | 13.9 s | 247 | 17.8 | 0/5 | 3/5 | 0.90 |
| qwen2.5-coder:3b | 269.6 s | 70.5 s | 227 | 3.2 | 0/5 | 3/5 | 0.90 |
| qwen2.5-coder:14b | TIMEOUT | — | — | — | 5/5 | 0/5 | 0.00 |
| qwen2.5:3b | 31.3 s | 28.0 s | 293 | 10.5 | 0/5 | 3/5 | 0.90 |
* averaged over non-timeout calls only.
By audience — what each agent owes from this run.
:[cluster-port] is the right "cheap fast micro-call" pick (3.5s hello, 26.8s 5Q avg).<think> and never produces a usable answer (0/5 marker, 0/5 format-OK).OLLAMA_MAX_LOADED_MODELS=1 / NUM_PARALLEL=1. If 3b drops to ~13s and 14b to ~55s, swap-thrashing under multi-model sweep is the cause — we plan around it (single-model lanes).think:false, OR the path differs on Ollama vs llama.cpp.
Pavilion through weeyuga ran 81 of 96 benchmark calls cleanly —
the 8-model qwen3.5 family is now the cluster's most usable tier,
finishing the canonical 5-question coding eval in 22-32 seconds
per call with 30-40 tokens per second sustained. Where direct
comparison against the prior 2026-04-11 Ollama-direct run is
possible, small models pay a manageable 30% overhead for going
through the weeyuga routing layer (which buys you Mila's sampling
defaults and thinking-bypass classifier), but mid-size and large
models are 5× slower or fully timing out — the most likely cause
is Ollama's multi-model parallelism config thrashing under the
harness's serial all-models sweep. The cap that held everywhere
was num_predict=2048 — even thinking models that
wanted to reason forever were bounded into the 263-second worst
case, no infinite loops. The two real problems to fix are
qwen2.5-coder:14b regressing from 55 s/call to "won't
finish" and qwen3:4b burning every token of budget on
<think> and never producing a markable answer.
Quality rates (format-OK, marker-hit) are unchanged from the
prior baseline.
The harness's primary JSONL ledger (ff1131ca-…-d54e97e.jsonl) only captured the meta + 17 of 96 call records — 18 lines total. Stdout logged all 96 cleanly; Trinity reconstructed the missing 79 records into ff1131ca-…-d54e97e-reconstructed-from-log.jsonl which is the canonical source for this report.
Reconstruction has: duration_seconds, completion_tokens, tokens_per_second, marker_hit_rate, format_ok, error.
Reconstruction missing: prompt_tokens, response_chars, response_preview, exact markers_hit list.
Fix: harness v2 (this commit) adds os.fsync() after every write_jsonl() call. Future runs land complete on disk.
OLLAMA_MAX_LOADED_MODELS=3OLLAMA_NUM_PARALLEL=3OLLAMA_KEEP_ALIVE=2400hWeeyugaServe Scheduled Task):[cluster-port] with --reasoning-budget 1024 --reasoning-format deepseek --ctx-size 8192 --jinja --n-gpu-layers 999Prior 2026-04-11/12 lanes ran with MAX_LOADED_MODELS=1 / NUM_PARALLEL=1 per LOCAL_LINUX_GPU_BENCHMARK_ROLLOUT.md. That config delta is the leading hypothesis for the mid/large-model regressions.
metadata.test=true brain telemetry tagging — N/A (Mode A direct-engine, zero brain side effect)cd [user-path]
python3 scripts/benchmarks/run_pavilion_weeyuga.py --probe # health-check
python3 scripts/benchmarks/run_pavilion_weeyuga.py \
--phase=hello+5q --models=auto --weeyuga-url=http://[weeyuga-cluster-host]:[cluster-port]
python3 -c "
import json
def stats(fp, model, phase):
rows = [json.loads(l) for l in open(fp) if l.strip()]
rows = [r for r in rows if r.get('type')=='call'
and r.get('model')==model and r.get('phase')==phase
and not r.get('error')]
if not rows: return None
return round(sum(r['duration_seconds'] for r in rows)/len(rows), 2)
base = stats('docs/BENCHMARKS/runs/ff1131ca-d021-4e06-8616-4b4cdb54e97e-reconstructed-from-log.jsonl',
'qwen3.5:0.8b', '5q')
new = stats('docs/BENCHMARKS/runs/<new-uuid>.jsonl', 'qwen3.5:0.8b', '5q')
print(f'baseline: {base}s, new: {new}s, delta: {round((new-base)/base*100,1)}%')
"
A delta ≥ +30% is a regression per HARNESS.md §7.
~/Documents/MyServers/instances/pavilion-windows-laptop/reports/2026-04-11-windows-vs-mac-qwen0_5b-summary.md