Predator Qwen rerun matrix
Three Qwen variants on Predator: 36 calls, ~10.9 tok/s mean, p50 24s. The single-machine flagship for the rerun campaign.
What Janie says
The qwen3.5 family on Predator under three different inference shapes — the budget=500 thinking config, the no-think config, and the same model's 14B dense sibling for the contrast. Three variants, twelve prompts each, thirty-six calls. The headline: ten-point-nine tokens per second mean across the matrix, p50 twenty-four seconds per call.
The 14B dense model is the casualty; it sits in this matrix at roughly one token per second (1.0 tok/s on hello, 1.2 tok/s on P-MEDIUM, and no P-HARD entry at all) and serves as the empirical floor of "what a six-gigabyte VRAM GPU can't do gracefully." The thinking config (budget=500) and the no-think baseline come within striking distance of each other on throughput (14.2 vs 12.6 tok/s mean). On wall-time the thinking config still pays for its reasoning trace (p50 33.7s vs 7.2s), but the reasoning-budget cap keeps that cost bounded. Same machine, same model across two of the three variants, different deployment knobs. This matrix is what tuning looks like in practice on one machine.
Methodology
See A3B_AND_CPU_OVERNIGHT_2026-05-05 for the full procedure.
Reproducible at git SHA ddbaaf46.
Results
| Cell | tok/s mean | tok/s p50 | tok/s p95 | duration p50 | calls |
|---|---|---|---|---|---|
| qwen3.5:9b-q4km-think500 | 14.2 | 14.9 | 15.3 | 33.7s | 14 |
| qwen3.5:9b-q4km-nothink | 12.6 | 14.4 | 14.9 | 7.2s | 14 |
| qwen3:14b-q4km | 1.1 | 1.1 | 1.3 | 1m18s | 8 |
Chart: tokens per second per cell (mean, p50, p95).
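As a rough illustration of how the summary columns in the results table could be derived from per-call throughput, here is a minimal sketch. The report does not say which percentile method it uses, so the nearest-rank rule below is an assumption, and the sample values are invented for illustration rather than taken from this run.

```python
import math
from statistics import mean, median

def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(tok_per_s):
    """Collapse a list of per-call tok/s values into the columns of the results table."""
    return {
        "mean": mean(tok_per_s),
        "p50": median(tok_per_s),
        "p95": percentile(tok_per_s, 95),
        "calls": len(tok_per_s),
    }

# Made-up samples for illustration only; not data from this run.
sample = [10.0, 12.0, 14.0, 15.0, 15.0]
summary = summarize(sample)
print(summary)
```

With only 8 to 14 calls per cell, p95 under this rule is simply the largest or second-largest sample, so it is best read as "near the fastest call" rather than as a stable tail estimate.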
Cold start vs warm
Cold-start measurements are the first call into a model after it loads from disk; warm calls are everything after. The ratio shows how much of the deployment’s wall-time cost is one-time vs steady-state.
| Cell | cold n | cold tok/s | cold p50 | warm n | warm tok/s | warm p50 | warm/cold |
|---|---|---|---|---|---|---|---|
| predator:llamacpp:qwen3.5:9b-q4km… | 3 | 12.9 | 35.9s | 9 | 14.7 | 33.6s | 1.14× |
| predator:llamacpp:qwen3.5:9b-q4km… | 3 | 11.6 | 7.3s | 9 | 12.9 | 7.1s | 1.12× |
| predator:llamacpp:qwen3:14b-q4km | 2 | 0.9 | 3m18s | 4 | 1.1 | 59.4s | 1.27× |
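The warm/cold column appears to be warm tok/s divided by cold tok/s; that reading is an assumption, checked here against the rounded values in the table. A minimal sketch:

```python
def warm_cold_ratio(cold_tok_s, warm_tok_s):
    """Ratio of steady-state throughput to first-call throughput."""
    if cold_tok_s <= 0:
        raise ValueError("cold throughput must be positive")
    return warm_tok_s / cold_tok_s

# (cold tok/s, warm tok/s) copied from the three table rows, in order.
rows = [(12.9, 14.7), (11.6, 12.9), (0.9, 1.1)]
ratios = [warm_cold_ratio(cold, warm) for cold, warm in rows]

for (cold, warm), ratio in zip(rows, ratios):
    print(f"cold={cold} warm={warm} warm/cold={ratio:.2f}x")
```

Recomputing from the rounded figures reproduces the first row and lands close to the other two. The small differences are probably because the published ratios were computed from unrounded throughput, which matters most for the 14B row where both inputs are near 1 tok/s.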
By prompt difficulty
Tokens per second by prompt class. hello is a trivial one-line prompt; P-MEDIUM and P-HARD are the deeper questions in the suite. The shape of the gap tells you whether the model is bottlenecked on parsing or on generation.
| Cell | hello | P-MEDIUM | P-HARD |
|---|---|---|---|
| predator:llamacpp:qwen3.5:9b-q4km… | 12.7 tok/s 4.7s · n=4 | 15.0 tok/s 33.7s · n=4 | 15.1 tok/s 51.8s · n=4 |
| predator:llamacpp:qwen3.5:9b-q4km… | 8.5 tok/s 1.4s · n=4 | 14.3 tok/s 7.2s · n=4 | 14.9 tok/s 22.1s · n=4 |
| predator:llamacpp:qwen3:14b-q4km | 1.0 tok/s 59.4s · n=4 | 1.2 tok/s 4m14s · n=2 | — |
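A table like the one above can be built by grouping per-call records by prompt class. The sketch below assumes each record carries `prompt_class`, `tok_per_s`, and `duration_s` fields; those names are hypothetical, since the published schema is not shown here. It also assumes the duration in each cell is a median. The sample records are invented.

```python
from collections import defaultdict
from statistics import mean, median

def by_prompt_class(calls):
    """Group call records by prompt class and summarize each group."""
    groups = defaultdict(list)
    for call in calls:
        groups[call["prompt_class"]].append(call)
    return {
        cls: {
            "tok_per_s": mean(c["tok_per_s"] for c in members),
            "duration_p50_s": median(c["duration_s"] for c in members),
            "n": len(members),
        }
        for cls, members in groups.items()
    }

# Invented records for illustration; field names are assumptions.
calls = [
    {"prompt_class": "hello", "tok_per_s": 8.0, "duration_s": 1.0},
    {"prompt_class": "hello", "tok_per_s": 9.0, "duration_s": 2.0},
    {"prompt_class": "P-HARD", "tok_per_s": 15.0, "duration_s": 50.0},
]
table = by_prompt_class(calls)
print(table)
```

A class with no records simply has no key in the result, which corresponds to the "—" cell for the 14B model on P-HARD.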
Reasoning vs answer
Thinking models split their output into a hidden reasoning trace and a visible answer. The ratio shows how much of the budget the model spent thinking vs answering.
| Cell | reasoning chars | answer chars | reasoning / answer |
|---|---|---|---|
| predator:llamacpp:qwen3.5:9b-q4km… | 1457 | 434 | 3.36× |
| predator:llamacpp:qwen3:14b-q4km | 504 | 201 | 2.51× |
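The ratio column follows directly from the two character counts. A minimal sketch that recomputes it from the table values, and also expresses the same split as the share of all output characters spent on reasoning (that second figure is an addition for illustration, not something the report publishes):

```python
def reasoning_split(reasoning_chars, answer_chars):
    """Return (reasoning / answer ratio, reasoning share of total output characters)."""
    if answer_chars <= 0:
        raise ValueError("answer must be non-empty")
    ratio = reasoning_chars / answer_chars
    share = reasoning_chars / (reasoning_chars + answer_chars)
    return ratio, share

# Character counts copied from the table above; cell names kept as truncated in the source.
cells = {
    "predator:llamacpp:qwen3.5:9b-q4km…": (1457, 434),
    "predator:llamacpp:qwen3:14b-q4km": (504, 201),
}
results = {name: reasoning_split(r, a) for name, (r, a) in cells.items()}

for name, (ratio, share) in results.items():
    print(f"{name}: {ratio:.2f}x, {share:.0%} of output is reasoning")
```

Both cells spend roughly three quarters of their output characters on the reasoning trace, which is one way to read the 3.36× and 2.51× ratios.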
Per-call timeline
Every call placed during this run, in order, colored by phase. Width is proportional to the call’s share of the cell’s wall-time.
Raw data
Every run gets its JSONL, log, summary, and metadata published. Clone the archive; re-run it; tell us where we got it wrong.
Cite
Margetic, S. et al. (2026). Public benchmarks of the Weeyuga cluster. benchmarks.weeyuga.com/benchmarks/fba9d9b1.html. Run id: fba9d9b1-cc5d-40bc-9e21-beafbb72c65d. SHA ddbaaf46.