Archive · Pavilion windows-laptop · 2026-04-29-pavilion-weeyuga-v1-benchmark.html. Originally rendered 2026-04-29. Re-hosted from MyServers on 2026-05-06. Methodology and harness conventions may differ from what we use today; see /methodology.html for current standards. ← back to all benchmarks

Pavilion via Weeyuga — Mission 1 Phase 1 Baseline

First reproducible benchmark of the Weeyuga cluster's Pavilion peer when driven through weeyuga serve at http://[weeyuga-cluster-host]:[cluster-port]/v1/chat/completions instead of Ollama's native /api/generate. The methodology mirrors Sloba's canonical 2026-04-11/12 Pavilion benchmark (temperature=0.1, num_ctx=4096, num_predict=2048, single-loaded-model single-parallel-slot, frozen 5Q + Hello suite). The number worth publishing is the difference: what does going through the weeyuga routing layer cost vs hitting Ollama directly?

Run ID: ff1131ca-…-d54e97e Date: 2026-04-28 Phase: Hello + 5Q Wall: 2h 43min Models: 16 (Pavilion-resident) Driver: Trinity 🌐 Synthesis: Ben ⚡

Run Coverage

Phase 1 of a 3-phase plan. The 14B+ class regressed (see §3.3); 9b/35b ran clean.

Hello checkComplete

16/16 models attempted. 12 successes, 4 timeouts (qwen2.5-coder:14b, qwen3:8b, qwen3:14b — same 14B+ tier that timed out on 5Q).

5-question suiteComplete

96 calls total (16 × 6). 81 successes, 15 timeouts. Format-OK + marker rates within ±5% of prior baseline.

20-question Python suitePhase 2

Deferred per HARNESS.md plan. Trinity dispatch follows; recommended scope: 7 working models × 20 prompts.

10-question real-contextPhase 3

Deferred to Phase 3. Will run after Phase 2 lands and we know which models can hold real-context coherently.

Vs prior baseline3 cells

qwen2.5-coder:0.5b / 3b / 14b all have direct 2026-04-11/12 Ollama-direct numbers to diff against (see §3).

Headline Metrics

The numbers that frame this baseline.

Sweet spot tier qwen3.5

All 8 variants (0.8b → 35b-a3b) ran clean: 22-32s avg / 30-41 tok/s on 5Q. Zero timeouts.

Hard regression qwen2.5-coder:14b

5/5 timeouts at 6-min wall. Was 54.9s/call on GPU on 2026-04-12. Triage queued for Bane.

weeyuga overhead +30%

vs Ollama-direct GPU on qwen2.5-coder:0.5b (10.4s vs 7.99s). Quality unchanged.

Fastest 5Q call 2.4s

qwen3.5:0.8b on nginx_safe_reload via llama.cpp :[cluster-port].

Worst slowdown 5.4×

qwen2.5-coder:3b: 70.5s vs 13.0s prior. Likely Ollama swap-thrashing under multi-model sweep.

Thinking-loop cap Held ✓

num_predict=2048 bounded every call. Worst case (qwen3:4b): 263s / 2048 tokens.

What this measures

Methodology delta from prior 2026-04-11/12 baseline.

Endpoint: weeyuga :[cluster-port]/v1/chat/completions (OpenAI-compat) — prior runs hit Ollama :11434/api/generate direct.

Routing layer: through Mila's R-QWEN35-DEFAULTS Phase 0+1 sampling defaults + thinking-bypass classifier + Atlas's Phase-2 llama.cpp branch for qwen3.5:0.8b.

Sampling: identical (temperature=0.1, num_ctx=4096, num_predict=2048).

Lanes: single (whichever weeyuga picked) — prior ran CPU + GPU separately and we compare to the faster lane.

Prompts: identical, sourced from ~/Documents/MyServers/.../small_model_eval_questions.json verbatim.

§3. Comparison vs prior 2026-04-11/12 Pavilion-Ollama-direct

The headline of the run. Three models have direct comparison data from ~/Documents/MyServers/instances/pavilion-windows-laptop/telemetry/archive/raw-run-artifacts/windows-local/.

§3.1 qwen2.5-coder:0.5b — the cleanest comparison

Lane5Q avg durtok/sFormat-OKMarker-hit
2026-04-11 Ollama direct, GPU7.99 s62.653/50.80
2026-04-11 Ollama direct, CPU6.72 s63.622/50.69
2026-04-28 weeyuga (this run)10.40 s37.63/50.82

Verdict: weeyuga adds ~30% wall-time vs prior GPU lane and ~55% vs prior CPU lane. tok/s drops from ~63 → ~38. Quality is unchanged (marker 0.82 vs 0.80). This is the routing-layer cost — Mila's sampling defaults + thinking-bypass classifier + future personality engine all live on this path.

§3.2 qwen2.5-coder:3b — the surprise regression

Lane5Q avg durtok/sFormat-OKMarker-hit
2026-04-11 Ollama direct, GPU13.00 s21.463/50.87
2026-04-11 Ollama direct, CPU13.10 s21.403/50.90
2026-04-28 weeyuga (this run)70.50 s3.203/50.90

Verdict: 5.4× slower through weeyuga. tok/s collapsed from 21 → 3.2. Quality identical.

Most likely root cause: Pavilion's Ollama is configured with OLLAMA_MAX_LOADED_MODELS=3 and OLLAMA_NUM_PARALLEL=3 (Bane 2026-04-26 reconfigure). When weeyuga drives a sequence of calls across 16 different models, Ollama has to evict + reload models, paying disk → VRAM cost on every swap. The prior 2026-04-11 run was single-model so didn't hit this.

Recommended verification: re-run with OLLAMA_MAX_LOADED_MODELS=1 / NUM_PARALLEL=1 on Pavilion (Bane). If 3b drops back near 13s, we've isolated the cause.

§3.3 qwen2.5-coder:14b — the hard regression

Lane5Q avg durtok/sFormat-OKMarker-hit
2026-04-12 Ollama direct, GPU54.92 s3.084/50.90
2026-04-12 Ollama direct, CPU80.15 s3.104/50.90
2026-04-28 weeyuga (this run)TIMEOUT 5/50/50.00

Verdict: previously-usable model is now unusable through weeyuga on Pavilion's current config. Worst single finding in the run.

The 2026-04-12 GPU lane finished 5Q in 4.5 minutes total. Today the 6-min hard cap fires on every individual call. The total wall just on 14b was ~36 minutes (6 calls × 6 min) all wasted.

Action: Bane verifies with single-model config; if it's swap thrashing, the §3.2 test fixes both. If not, becomes a real triage.

§4. Per-model results (Hello + 5Q)

n=5 for 5Q phase. TIMEOUT = harness 6-min hard wall fired. Format-OK and marker-hit scored against canonical small_model_eval_questions.json rules.

Model Hello 5Q avg dur avg tok-out tok/s 5Q errors Format-OK Marker
qwen3.5 family — thinking; Ollama or llama.cpp
qwen3.5:0.8b via llama.cpp :[cluster-port] 3.5 s26.8 s110638.2 0/53/50.72
qwen3.5:2b 4.7 s28.5 s117337.9 0/54/50.85
qwen3.5:4b 4.0 s31.9 s123731.7 0/53/50.72
qwen3.5:9b 26.5 s25.0 s102940.8 0/54/50.85
qwen3.5:9b-q4km 3.7 s27.8 s113138.8 0/53/50.72
qwen3.5:9b-q6k 3.8 s29.9 s121637.0 0/53/50.72
qwen3.5:35b-a3b-iq2s 4.0 s30.2 s122831.2 0/53/50.78
qwen3.5:35b-a3b-uncensored-iq1m 6.0 s22.4 s89037.7 0/53/50.75
qwen3 family — thinking; older generation
qwen3:4b 276.4 s262.9 s20481.65 0/50/50.00
qwen3:8b TIMEOUT157.0 s*667* 3/52/50.40
qwen3:14b TIMEOUT352.7 s*682* 4/51/50.16
qwen2.5 / qwen2.5-coder family — non-thinking
qwen2.5-coder:0.5b 13.4 s10.4 s39037.6 0/53/50.82
qwen2.5-coder:1.5b 20.6 s13.9 s24717.8 0/53/50.90
qwen2.5-coder:3b 269.6 s70.5 s2273.2 0/53/50.90
qwen2.5-coder:14b TIMEOUT 5/50/50.00
qwen2.5:3b 31.3 s28.0 s29310.5 0/53/50.90

* averaged over non-timeout calls only.

§5. Findings + recommendations

By audience — what each agent owes from this run.

For Atlas 🪐 (personality engine)

Default to qwen3.5 family

  • All 8 qwen3.5 variants ran clean (22-32s/call, 30-40 tok/s, 0 timeouts) — this is the cluster's most usable thinking-model tier.
  • 0.8b through llama.cpp :[cluster-port] is the right "cheap fast micro-call" pick (3.5s hello, 26.8s 5Q avg).
  • Avoid qwen3:4b unless num_predict is much smaller — it eats every token of 2048 on <think> and never produces a usable answer (0/5 marker, 0/5 format-OK).
  • Sub-500ms target: not reachable on Pavilion alone with current models. qwen3.5:0.8b 5Q avg is 26.8s = 53× over budget. Either much smaller model, prefix caching across micro-calls, or stronger hardware (Predator 1070 next).
For Bane 🦇 (Pavilion ops)

Verify swap-thrashing hypothesis

  • Re-run qwen2.5-coder:3b and :14b 5Q with OLLAMA_MAX_LOADED_MODELS=1 / NUM_PARALLEL=1. If 3b drops to ~13s and 14b to ~55s, swap-thrashing under multi-model sweep is the cause — we plan around it (single-model lanes).
  • If those numbers don't recover, the 14b regression is independent and worth real triage.
  • Suggested: run the test in a quiet window; ping Ben when done so I can re-render the comparison row.
For Mila 🔬 (R-QWEN35-DEFAULTS)

Classifier observation

  • Trivial-bypass fired correctly on 16/16 hello calls. qwen3.5:0.8b returned in 3.5s with 133 tokens — short-circuit working.
  • BUT qwen3.5:9b's hello took 26.5s with 1081 tokens — that's the 9b model thinking on a borderline prompt. Either the classifier doesn't route 9b to think:false, OR the path differs on Ollama vs llama.cpp.
  • Worth a Mila pass to confirm intent: should 9b/35b also auto-bypass on greetings, or is "thinking-by-default for substantive models" the design?
For Trinity 🌐 (Phase 2 driver)

Recommended Phase 2 scope

  • Skip the 14B+ tier entirely (qwen2.5-coder:14b, qwen3:8b/14b — they timeout on 5Q and 20Q is heavier).
  • Skip qwen3:4b — it produces unusable output even when not timing out.
  • Run on the 12 working models: full qwen3.5 family (8) + qwen2.5-coder:0.5b/1.5b/3b + qwen2.5:3b.
  • Estimated wall: ~3-5 hours. Run with the v2 harness (post-fsync fix) so the JSONL stays complete this time.

§5.5 1-paragraph framing for Janie 📝

Pavilion through weeyuga ran 81 of 96 benchmark calls cleanly — the 8-model qwen3.5 family is now the cluster's most usable tier, finishing the canonical 5-question coding eval in 22-32 seconds per call with 30-40 tokens per second sustained. Where direct comparison against the prior 2026-04-11 Ollama-direct run is possible, small models pay a manageable 30% overhead for going through the weeyuga routing layer (which buys you Mila's sampling defaults and thinking-bypass classifier), but mid-size and large models are 5× slower or fully timing out — the most likely cause is Ollama's multi-model parallelism config thrashing under the harness's serial all-models sweep. The cap that held everywhere was num_predict=2048 — even thinking models that wanted to reason forever were bounded into the 263-second worst case, no infinite loops. The two real problems to fix are qwen2.5-coder:14b regressing from 55 s/call to "won't finish" and qwen3:4b burning every token of budget on <think> and never producing a markable answer. Quality rates (format-OK, marker-hit) are unchanged from the prior baseline.

§6. Caveats + audit

§6.1 JSONL flush bug (fixed in harness v2)

The harness's primary JSONL ledger (ff1131ca-…-d54e97e.jsonl) only captured the meta + 17 of 96 call records — 18 lines total. Stdout logged all 96 cleanly; Trinity reconstructed the missing 79 records into ff1131ca-…-d54e97e-reconstructed-from-log.jsonl which is the canonical source for this report.

Reconstruction has: duration_seconds, completion_tokens, tokens_per_second, marker_hit_rate, format_ok, error.

Reconstruction missing: prompt_tokens, response_chars, response_preview, exact markers_hit list.

Fix: harness v2 (this commit) adds os.fsync() after every write_jsonl() call. Future runs land complete on disk.

§6.2 Pavilion config at run time

  • OLLAMA_MAX_LOADED_MODELS=3
  • OLLAMA_NUM_PARALLEL=3
  • OLLAMA_KEEP_ALIVE=2400h
  • weeyuga.exe PID 19068 (post-Bane restart via WeeyugaServe Scheduled Task)
  • llama.cpp on :[cluster-port] with --reasoning-budget 1024 --reasoning-format deepseek --ctx-size 8192 --jinja --n-gpu-layers 999

Prior 2026-04-11/12 lanes ran with MAX_LOADED_MODELS=1 / NUM_PARALLEL=1 per LOCAL_LINUX_GPU_BENCHMARK_ROLLOUT.md. That config delta is the leading hypothesis for the mid/large-model regressions.

§6.3 What this baseline does NOT cover

  • Phase 2 (20Q Python suite × small/mid models) — Trinity dispatch following
  • Phase 3 (10Q real-context + long-context) — later
  • Cold-vs-warm split — needs forced model unload between runs
  • Parallel-thread capacity — v2 of HARNESS.md
  • Cross-node routing cost (Mac tools + Pavilion inference) — v2
  • GPU memory peak / CPU utilization — v2 (needs SSH-driven on-target samplers)
  • metadata.test=true brain telemetry tagging — N/A (Mode A direct-engine, zero brain side effect)

§7. Re-run + cross-references

Re-run (harness v2, future baseline)

cd [user-path]
python3 scripts/benchmarks/run_pavilion_weeyuga.py --probe        # health-check
python3 scripts/benchmarks/run_pavilion_weeyuga.py \
  --phase=hello+5q --models=auto --weeyuga-url=http://[weeyuga-cluster-host]:[cluster-port]

Compare against this baseline

python3 -c "
import json
def stats(fp, model, phase):
    rows = [json.loads(l) for l in open(fp) if l.strip()]
    rows = [r for r in rows if r.get('type')=='call'
            and r.get('model')==model and r.get('phase')==phase
            and not r.get('error')]
    if not rows: return None
    return round(sum(r['duration_seconds'] for r in rows)/len(rows), 2)
base = stats('docs/BENCHMARKS/runs/ff1131ca-d021-4e06-8616-4b4cdb54e97e-reconstructed-from-log.jsonl',
             'qwen3.5:0.8b', '5q')
new  = stats('docs/BENCHMARKS/runs/<new-uuid>.jsonl', 'qwen3.5:0.8b', '5q')
print(f'baseline: {base}s, new: {new}s, delta: {round((new-base)/base*100,1)}%')
"

A delta ≥ +30% is a regression per HARNESS.md §7.

Cross-references