Predator trio bench
42 calls across 3 cells; ~16.5 tok/s mean; p50 duration 9.6s
What Janie says
This is the clearest snapshot of what the bigger of the two consumer-GPU machines tested actually does. Three different model families ran on the same Predator hardware, all-GPU, Q4_K_M quant: gemma-4-E4B-it at twenty-three-and-a-half tokens per second on the hard prompt; granite-4.1-8B at just under sixteen; qwen3.5-9B (no-think) at about fourteen-and-a-half.
None of this is a frontier-model GPU running frontier-model parameter counts. It's a six-year-old GTX 1060 running the smartest models we could fit at Q4 quant. The result is fast enough for assistant-style chat: all three generate faster than typing speed, the three model families land in roughly the same usable performance band, and there are no thirty-second pauses while you wait for a response. This is the "what works" tier on this hardware: the floor of useful, not the ceiling of impressive.
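The "faster than typing speed" comparison can be checked with a quick unit conversion. This is a rough sketch, not part of the benchmark: the 0.75 words-per-token factor and the 40 words-per-minute typing speed are assumed rules of thumb, and the real ratio depends on each model's tokenizer. The tok/s figures are the P-HARD values from the tables below.

```python
# Assumption: ~0.75 English words per token (rough heuristic, tokenizer-dependent).
WORDS_PER_TOKEN = 0.75
# Assumption: a typical typing speed of about 40 words per minute.
TYPING_WPM = 40


def tok_s_to_wpm(tok_s: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert generation speed in tokens/second to approximate words/minute."""
    return tok_s * words_per_token * 60


# P-HARD tok/s per model, copied from the report's prompt-difficulty table.
p_hard_tok_s = {
    "gemma-4-E4B-it": 23.5,
    "granite-4.1-8B": 15.7,
    "qwen3.5-9B": 14.4,
}

wpm = {model: tok_s_to_wpm(speed) for model, speed in p_hard_tok_s.items()}

for model, words_per_minute in wpm.items():
    multiple = words_per_minute / TYPING_WPM
    print(f"{model}: ~{words_per_minute:.0f} wpm, ~{multiple:.0f}x typing speed")
```

Under these assumptions every model clears the typing-speed bar by a wide margin, so the conclusion is not sensitive to the exact conversion factor.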
Methodology
See A3B_AND_CPU_OVERNIGHT_2026-05-05 for the full procedure.
Reproducible at git SHA ddbaaf46.
Results
| Cell | tok/s mean | tok/s p50 | tok/s p95 | duration p50 | calls |
|---|---|---|---|---|---|
| granite-4.1:8b-q4km | 13.5 | 15.2 | 15.7 | 6.4s | 14 |
| gemma-4:e4b-it-q4km | 21.8 | 22.9 | 23.6 | 8.2s | 14 |
| qwen3.5:9b-q4km | 14.0 | 14.4 | 14.6 | 35.3s | 14 |
Figure: tokens per second by cell (mean, p50, p95).
Cold start vs warm
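Cold-start measurements are the first call into a model after it loads from disk; warm calls are everything after. The warm/cold column is warm tok/s divided by cold tok/s, so it shows how much of the deployment’s wall-time cost is one-time vs steady-state: a value near 1× means the load penalty barely touches throughput.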
| Cell | cold n | cold tok/s | cold p50 | warm n | warm tok/s | warm p50 | warm/cold |
|---|---|---|---|---|---|---|---|
| predator:llamacpp:granite-4.1:8b-… | 3 | 11.9 | 6.1s | 9 | 14.0 | 6.6s | 1.18× |
| predator:llamacpp:gemma-4:e4b-it-… | 3 | 21.4 | 3.9s | 9 | 22.0 | 12.3s | 1.03× |
| predator:llamacpp:qwen3.5:9b-q4km | 3 | 13.3 | 35.7s | 9 | 14.3 | 35.3s | 1.08× |
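The warm/cold column can be reproduced from the two tok/s columns. The sketch below assumes the ratio is warm tok/s over cold tok/s, which matches the published figures. The cell labels are the full names from the Results table, and the function name is illustrative.

```python
# Cold and warm tok/s per cell, copied from the table above.
cold_warm = {
    "granite-4.1:8b-q4km": {"cold": 11.9, "warm": 14.0},
    "gemma-4:e4b-it-q4km": {"cold": 21.4, "warm": 22.0},
    "qwen3.5:9b-q4km": {"cold": 13.3, "warm": 14.3},
}


def warm_cold_ratio(cold_tok_s: float, warm_tok_s: float) -> float:
    """Warm throughput relative to cold; 1.0 means no cold-start penalty."""
    return warm_tok_s / cold_tok_s


ratios = {
    cell: round(warm_cold_ratio(v["cold"], v["warm"]), 2)
    for cell, v in cold_warm.items()
}

for cell, ratio in ratios.items():
    print(f"{cell}: {ratio:.2f}x")
```

Read this way, granite pays the largest cold-start penalty in throughput terms, while gemma is nearly flat.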
By prompt difficulty
Tokens per second by prompt class. hello is a trivial one-line prompt; P-MEDIUM and P-HARD are the deeper questions in the suite. Each cell lists tok/s, call duration, and call count. The shape of the gap tells you whether the model is bottlenecked on parsing the prompt or on generation.
| Cell | hello | P-MEDIUM | P-HARD |
|---|---|---|---|
| predator:llamacpp:granite-4.1:8b-… | 9.7 tok/s 0.9s · n=4 | 15.1 tok/s 6.4s · n=4 | 15.7 tok/s 18.8s · n=4 |
| predator:llamacpp:gemma-4:e4b-it-… | 19.2 tok/s 2.9s · n=4 | 22.8 tok/s 8.2s · n=4 | 23.5 tok/s 16.0s · n=4 |
| predator:llamacpp:qwen3.5:9b-q4km | 13.2 tok/s 4.6s · n=4 | 14.5 tok/s 35.3s · n=4 | 14.4 tok/s 1m10s · n=4 |
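One way to put a number on the "shape of the gap" is the ratio of hello throughput to P-HARD throughput for each cell. This is an illustrative metric, not one the report itself defines. The values come from the table above, with cell labels taken from the Results table.

```python
# tok/s per prompt class, copied from the table above.
by_prompt = {
    "granite-4.1:8b-q4km": {"hello": 9.7, "P-MEDIUM": 15.1, "P-HARD": 15.7},
    "gemma-4:e4b-it-q4km": {"hello": 19.2, "P-MEDIUM": 22.8, "P-HARD": 23.5},
    "qwen3.5:9b-q4km": {"hello": 13.2, "P-MEDIUM": 14.5, "P-HARD": 14.4},
}


def hello_fraction(row: dict) -> float:
    """Throughput on the trivial prompt as a fraction of P-HARD throughput."""
    return row["hello"] / row["P-HARD"]


fractions = {cell: round(hello_fraction(row), 2) for cell, row in by_prompt.items()}

for cell, fraction in fractions.items():
    print(f"{cell}: hello runs at {fraction:.0%} of P-HARD throughput")
```

A fraction well below 1 suggests that fixed per-call overhead weighs heavily on short replies. A fraction near 1 suggests throughput is dominated by steady-state generation regardless of prompt size. Either reading is an interpretation of the ratio, not something these numbers prove on their own.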
Per-call timeline
Interactive chart: every call placed during this run, in order, colored by phase. Width is proportional to the call’s share of the cell’s wall-time; each segment carries the prompt id and tok/s.
Raw data
Every run gets its JSONL, log, summary, and metadata published. Clone the archive; re-run it; tell us where we got it wrong.
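For anyone re-deriving the summary table from the JSONL, the sketch below shows one way to do it. The JSONL schema is not documented here, so the field names `cell` and `tok_s` are hypothetical placeholders. The nearest-rank p95 is also an assumption, since the report does not state which percentile method it uses. The sample records are made up to show the shape of the computation and are not benchmark data.

```python
import json
import math
from collections import defaultdict
from statistics import mean, median


def nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(jsonl_lines):
    """Group per-call tok/s by cell and compute mean, p50, p95, and call count."""
    per_cell = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # "cell" and "tok_s" are hypothetical field names; adjust to the real schema.
        per_cell[record["cell"]].append(float(record["tok_s"]))
    return {
        cell: {
            "mean": mean(speeds),
            "p50": median(speeds),
            "p95": nearest_rank(speeds, 95),
            "calls": len(speeds),
        }
        for cell, speeds in per_cell.items()
    }


# Made-up sample records, for illustration only.
sample = [
    '{"cell": "cell-a", "tok_s": 10.0}',
    '{"cell": "cell-a", "tok_s": 12.0}',
    '{"cell": "cell-a", "tok_s": 14.0}',
    '{"cell": "cell-b", "tok_s": 20.0}',
    '{"cell": "cell-b", "tok_s": 22.0}',
]

summary = summarize(sample)
for cell, stats in summary.items():
    print(cell, stats)
```

To run it against a real file, pass an open file handle instead of the sample list, for example `summarize(open(path))` with `path` pointing at the downloaded JSONL.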
Cite
Margetic, S. et al. (2026). Public benchmarks of the Weeyuga cluster. benchmarks.weeyuga.com/benchmarks/09d8fbde.html. Run id: 09d8fbde-0008-49bb-99da-03eeaca72be1. SHA ddbaaf46.