Pavilion via Weeyuga — Mission 1 Phase 1 Baseline

§3.1 qwen2.5-coder:0.5b (the cleanest comparison)

Lane	5Q avg dur	tok/s	Format-OK	Marker-hit
2026-04-11 Ollama direct, GPU	7.99 s	62.65	3/5	0.80
2026-04-11 Ollama direct, CPU	6.72 s	63.62	2/5	0.69
2026-04-28 weeyuga (this run)	10.40 s	37.6	3/5	0.82

§3.2 qwen2.5-coder:3b (the surprise regression)

Lane	5Q avg dur	tok/s	Format-OK	Marker-hit
2026-04-11 Ollama direct, GPU	13.00 s	21.46	3/5	0.87
2026-04-11 Ollama direct, CPU	13.10 s	21.40	3/5	0.90
2026-04-28 weeyuga (this run)	70.50 s	3.20	3/5	0.90

§3.3 qwen2.5-coder:14b (the hard regression)

Lane	5Q avg dur	tok/s	Format-OK	Marker-hit
2026-04-12 Ollama direct, GPU	54.92 s	3.08	4/5	0.90
2026-04-12 Ollama direct, CPU	80.15 s	3.10	4/5	0.90
2026-04-28 weeyuga (this run)	TIMEOUT 5/5	—	0/5	0.00
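
The routing overhead in each comparison can be read as the ratio of the weeyuga 5Q average to the Ollama-direct GPU average. The sketch below does that arithmetic with only the durations from the three tables above, matched to qwen2.5-coder:0.5b, :3b and :14b through their weeyuga rows in the per-model table. The dictionary layout and the function name are illustrative and not part of the harness.

```python
# 5Q average durations in seconds, copied from the comparison tables.
# None marks the lane that timed out on all five questions.
LANES = {
    "qwen2.5-coder:0.5b": {"direct_gpu": 7.99, "weeyuga": 10.40},
    "qwen2.5-coder:3b": {"direct_gpu": 13.00, "weeyuga": 70.50},
    "qwen2.5-coder:14b": {"direct_gpu": 54.92, "weeyuga": None},
}

def slowdown(direct_s, routed_s):
    """Return routed/direct as a ratio, or None if the routed lane timed out."""
    if routed_s is None:
        return None
    return routed_s / direct_s

for model, lane in LANES.items():
    ratio = slowdown(lane["direct_gpu"], lane["weeyuga"])
    label = "timeout" if ratio is None else f"{ratio:.2f}x"
    print(f"{model}: {label}")
```

This prints 1.30x for 0.5b, 5.42x for 3b, and "timeout" for 14b.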

Model	Hello	5Q avg dur	avg tok-out	tok/s	5Q errors	Format-OK	Marker
qwen3.5 family — thinking; Ollama or llama.cpp
qwen3.5:0.8b via llama.cpp :[cluster-port]	3.5 s	26.8 s	1106	38.2	0/5	3/5	0.72
qwen3.5:2b	4.7 s	28.5 s	1173	37.9	0/5	4/5	0.85
qwen3.5:4b	4.0 s	31.9 s	1237	31.7	0/5	3/5	0.72
qwen3.5:9b	26.5 s	25.0 s	1029	40.8	0/5	4/5	0.85
qwen3.5:9b-q4km	3.7 s	27.8 s	1131	38.8	0/5	3/5	0.72
qwen3.5:9b-q6k	3.8 s	29.9 s	1216	37.0	0/5	3/5	0.72
qwen3.5:35b-a3b-iq2s	4.0 s	30.2 s	1228	31.2	0/5	3/5	0.78
qwen3.5:35b-a3b-uncensored-iq1m	6.0 s	22.4 s	890	37.7	0/5	3/5	0.75
qwen3 family — thinking; older generation
qwen3:4b	276.4 s	262.9 s	2048	1.65	0/5	0/5	0.00
qwen3:8b	TIMEOUT	157.0 s*	667*	—	3/5	2/5	0.40
qwen3:14b	TIMEOUT	352.7 s*	682*	—	4/5	1/5	0.16
qwen2.5 / qwen2.5-coder family — non-thinking
qwen2.5-coder:0.5b	13.4 s	10.4 s	390	37.6	0/5	3/5	0.82
qwen2.5-coder:1.5b	20.6 s	13.9 s	247	17.8	0/5	3/5	0.90
qwen2.5-coder:3b	269.6 s	70.5 s	227	3.2	0/5	3/5	0.90
qwen2.5-coder:14b	TIMEOUT	—	—	—	5/5	0/5	0.00
qwen2.5:3b	31.3 s	28.0 s	293	10.5	0/5	3/5	0.90
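
The per-model table can be reduced to a single completion count. The sketch below assumes each of the 16 models made 6 calls (1 hello plus 5 questions), that a TIMEOUT in the Hello column counts as one failed call, and that the "5Q errors" column counts failed question calls. Those readings are assumptions about the harness, not something the table states.

```python
# (model, hello_timed_out, failed_5q_calls) copied from the per-model table.
# Only rows with at least one failure are listed; the other 13 models
# show no TIMEOUT and 0/5 errors.
FAILURES = [
    ("qwen3:8b", True, 3),
    ("qwen3:14b", True, 4),
    ("qwen2.5-coder:14b", True, 5),
]
TOTAL_MODELS = 16
CALLS_PER_MODEL = 6  # assumed: 1 hello + 5 questions

def clean_calls(failures, total_models, calls_per_model):
    """Return (clean, total) call counts for the whole sweep."""
    total = total_models * calls_per_model
    failed = sum(int(hello) + errors for _, hello, errors in failures)
    return total - failed, total

clean, total = clean_calls(FAILURES, TOTAL_MODELS, CALLS_PER_MODEL)
print(f"{clean} of {total} calls completed")
```

Under those assumptions the count comes to 81 of 96.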

For Atlas 🪐 (personality engine)

Default to qwen3.5 family

All 8 qwen3.5 variants ran clean (22-32 s per 5Q call, 31-41 tok/s, 0 timeouts), which makes this the cluster's most usable thinking-model tier.
0.8b through llama.cpp :[cluster-port] is the right "cheap fast micro-call" pick (3.5 s hello, 26.8 s 5Q avg).
Avoid qwen3:4b unless num_predict is set much lower: it spends the entire 2048-token budget inside <think> and never produces a usable answer (0.00 marker-hit, 0/5 format-OK).
Sub-500 ms target: not reachable on Pavilion alone with current models. The qwen3.5:0.8b 5Q average of 26.8 s is about 54× the budget, and even its 3.5 s hello is 7×. Closing the gap needs a much smaller model, prefix caching across micro-calls, or stronger hardware (Predator 1070 next).
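
The num_predict cap mentioned above, and the think switch, are per-request settings in Ollama's HTTP API. The sketch below only builds a request body for POST /api/generate and does not send it. It targets Ollama directly: whether weeyuga forwards these fields unchanged is not covered here, and the 256-token cap is a hypothetical value, not a tested recommendation.

```python
import json

def build_generate_request(model, prompt, num_predict=2048, think=None):
    """Build a JSON body for Ollama's POST /api/generate.

    num_predict caps the number of generated tokens; think (when not None)
    asks a thinking-capable model to enable or skip its reasoning phase.
    """
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }
    if think is not None:
        body["think"] = think
    return body

# Hypothetical micro-call with a much tighter cap than the 2048 used in this run.
request = build_generate_request("qwen3:4b", "Say hello.", num_predict=256, think=False)
print(json.dumps(request, indent=2))
```

Leaving think as None omits the field, so the model's default behavior applies.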

For Bane 🦇 (Pavilion ops)

Verify swap-thrashing hypothesis

Re-run the qwen2.5-coder:3b and :14b 5Q with OLLAMA_MAX_LOADED_MODELS=1 and OLLAMA_NUM_PARALLEL=1. If 3b drops to ~13 s and 14b to ~55 s (their 2026-04-11/12 Ollama-direct GPU averages), swap-thrashing under the multi-model sweep is the cause, and we plan around it with single-model lanes.
If those numbers don't recover, the regression is independent of the sweep and worth real triage.
Suggested: run the test in a quiet window; ping Ben when done so I can re-render the comparison row.
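
The re-run and its pass/fail reading can be sketched as follows. OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL are Ollama server environment variables. The 25% recovery margin, the function names, and the verdict strings are assumptions made for illustration.

```python
import os

# 5Q averages from the 2026-04-11/12 Ollama-direct GPU lanes, in seconds.
DIRECT_BASELINE_S = {"qwen2.5-coder:3b": 13.00, "qwen2.5-coder:14b": 54.92}

def single_model_env(base_env=None):
    """Return a copy of the environment limiting Ollama to one model, one slot."""
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_MAX_LOADED_MODELS"] = "1"
    env["OLLAMA_NUM_PARALLEL"] = "1"
    return env

def verdict(rerun_s, tolerance=0.25):
    """Classify a re-run; rerun_s maps model -> 5Q avg seconds (None = timeout).

    tolerance is an assumed margin: a model "recovers" if it lands within
    25% of its Ollama-direct baseline.
    """
    recovered = {
        model: rerun_s.get(model) is not None
        and rerun_s[model] <= baseline * (1 + tolerance)
        for model, baseline in DIRECT_BASELINE_S.items()
    }
    if all(recovered.values()):
        return "swap-thrashing confirmed"
    if not any(recovered.values()):
        return "independent regression"
    return "mixed: triage the model that did not recover"
```

If Ollama is started by hand rather than by a service manager, the environment can be passed to the server process, for example `subprocess.Popen(["ollama", "serve"], env=single_model_env())`. Feeding the current weeyuga numbers (70.5 s and a timeout) into `verdict` returns "independent regression", which is the state the re-run is meant to overturn.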

For Mila 🔬 (R-QWEN35-DEFAULTS)

Classifier observation

Trivial-bypass fired correctly on 16/16 hello calls. qwen3.5:0.8b returned in 3.5s with 133 tokens — short-circuit working.
However, qwen3.5:9b's hello took 26.5 s with 1081 tokens, which means the 9b model was thinking on a borderline prompt. Either the classifier doesn't route 9b to think:false, or the path differs between Ollama and llama.cpp.
Worth a Mila pass to confirm intent: should 9b/35b also auto-bypass on greetings, or is "thinking-by-default for substantive models" the design?
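
One way to spot this pattern across a sweep is to flag hello calls whose output is too long to be a bypassed greeting. The sketch below uses only the two hello token counts reported above. The 256-token cutoff and the function name are hypothetical, and this is an after-the-fact audit, not the classifier itself.

```python
# Hello-call token counts reported above; counts for other models are not
# given in this report.
HELLO_TOKENS = {"qwen3.5:0.8b": 133, "qwen3.5:9b": 1081}

# Assumed cutoff: a bypassed greeting should stay well under this many tokens.
BYPASS_TOKEN_LIMIT = 256

def suspected_thinking(hello_tokens, limit=BYPASS_TOKEN_LIMIT):
    """Return models whose hello output is too long to be a bypassed reply."""
    return sorted(m for m, n in hello_tokens.items() if n > limit)

print(suspected_thinking(HELLO_TOKENS))
```

With the two data points available, only qwen3.5:9b is flagged.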

For Trinity 🌐 (Phase 2 driver)

Recommended Phase 2 scope

Skip the models that time out on 5Q (qwen2.5-coder:14b, qwen3:8b, qwen3:14b); the 20Q run is heavier still.
Skip qwen3:4b — it produces unusable output even when not timing out.
Run on the 12 working models: full qwen3.5 family (8) + qwen2.5-coder:0.5b/1.5b/3b + qwen2.5:3b.
Estimated wall time: ~3-5 hours. Run with the v2 harness (post-fsync fix) so the JSONL stays complete this time.
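
For context on the fsync fix: a JSONL log survives an interrupted run only if each record is pushed to disk as it is written. The sketch below shows the general flush-then-fsync pattern. It is not the v2 harness code, and the record fields and file name are hypothetical.

```python
import json
import os
import tempfile

def append_record(path, record):
    """Append one JSON line and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        fh.flush()             # push Python's buffer to the OS
        os.fsync(fh.fileno())  # ask the OS to commit the write to storage

def load_records(path):
    """Read back every non-empty line as a JSON object."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

# Hypothetical result rows; the real harness schema is not shown in this report.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "phase2.jsonl")
rows = [
    {"model": "qwen3.5:2b", "question": 1, "status": "ok"},
    {"model": "qwen3.5:2b", "question": 2, "status": "ok"},
]
for row in rows:
    append_record(demo_path, row)
print(len(load_records(demo_path)), "records on disk")
```

Syncing after every record costs some throughput, which is negligible next to calls that take tens of seconds each.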

Pavilion through weeyuga completed 81 of 96 benchmark calls cleanly. The 8-model qwen3.5 family is now the cluster's most usable tier, finishing the canonical 5-question coding eval in 22-32 seconds per call at a sustained 31-41 tokens per second. Where direct comparison against the prior 2026-04-11/12 Ollama-direct run is possible, the small model (qwen2.5-coder:0.5b) pays a manageable 30% overhead for going through the weeyuga routing layer (which buys you Mila's sampling defaults and thinking-bypass classifier), but the mid-size model (qwen2.5-coder:3b) is about 5× slower and the large one (qwen2.5-coder:14b) times out on every call. The most likely cause is Ollama's multi-model parallelism config thrashing under the harness's serial all-models sweep. The num_predict=2048 cap held everywhere: qwen3:4b, which used the full budget on every question, still finished at a 263-second 5Q average, with no infinite loops. The two real problems to fix are qwen2.5-coder:14b regressing from 55 s/call to "won't finish" and qwen3:4b burning every token of budget on <think> and never producing a markable answer. For the models that completed, quality rates (format-OK, marker-hit) are essentially unchanged from the prior baseline (marker-hit 0.82 vs 0.80 for 0.5b, 0.90 vs 0.87 for 3b).

Run Coverage

Headline Metrics

What this measures

§3. Comparison vs prior 2026-04-11/12 Pavilion-Ollama-direct

§3.1 qwen2.5-coder:0.5b — the cleanest comparison

§3.2 qwen2.5-coder:3b — the surprise regression

§3.3 qwen2.5-coder:14b — the hard regression

§4. Per-model results (Hello + 5Q)

§5. Findings + recommendations

Default to qwen3.5 family

Verify swap-thrashing hypothesis

Classifier observation

Recommended Phase 2 scope

§5.5 1-paragraph framing for Janie 📝

§6. Caveats + audit

§6.1 JSONL flush bug (fixed in harness v2)

§6.2 Pavilion config at run time

§6.3 What this baseline does NOT cover

§7. Re-run + cross-references

Re-run (harness v2, future baseline)

Compare against this baseline

Cross-references