4 MAY 2026 · Predator · gaming laptop · GTX 1060 6 GB · 28 GB RAM · gemma/granite/qwen3.5 · chat

Predator trio bench

42 calls across 3 cells; ~16.5 tok/s mean; p50 9.6s

What Janie says

This is the clearest snapshot of what the bigger of the two consumer-GPU machines we tested actually does. Three model families ran on the same Predator hardware, all-GPU, at Q4_K_M quant. On the hard prompt, gemma-4-E4B-it reached 23.5 tokens per second, granite-4.1-8B reached 15.7, and qwen3.5-9B (no-think) reached 14.4.

This is not a frontier-class GPU running frontier-scale parameter counts. It is a six-year-old GTX 1060 running the smartest models we could fit at Q4 quant. The result is fast enough for assistant-style chat: all three models generate faster than typing speed, the three families land in roughly the same usable performance band, and there are no thirty-second pauses while you wait for a response. This is the "what works" tier on this hardware: the floor of useful, not the ceiling of impressive.

Methodology

See A3B_AND_CPU_OVERNIGHT_2026-05-05 for the full procedure. Reproducible at git SHA ddbaaf46.

Results

Cell	tok/s mean	tok/s p50	tok/s p95	duration p50	calls
granite-4.1:8b-q4km	13.5	15.2	15.7	6.4s	14
gemma-4:e4b-it-q4km	21.8	22.9	23.6	8.2s	14
qwen3.5:9b-q4km	14.0	14.4	14.6	35.3s	14

tokens per second — mean · p50 · p95

Cold start vs warm

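Cold-start measurements are the first call into a model after it loads from disk; warm calls are everything after. The warm/cold column is warm tok/s divided by cold tok/s. It shows how much of the deployment’s wall-time cost is one-time vs steady-state: a value near 1× means the load penalty barely touches throughput.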

Cell	cold n	cold tok/s	cold p50	warm n	warm tok/s	warm p50	warm/cold
predator:llamacpp:granite-4.1:8b-…	3	11.9	6.1s	9	14.0	6.6s	1.18×
predator:llamacpp:gemma-4:e4b-it-…	3	21.4	3.9s	9	22.0	12.3s	1.03×
predator:llamacpp:qwen3.5:9b-q4km	3	13.3	35.7s	9	14.3	35.3s	1.08×
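
The warm/cold column can be reproduced from the two tok/s columns. The sketch below is a minimal check, not the benchmark's own tooling: the dictionary keys are shorthand labels chosen here, and the numbers are copied from the table above.

```python
# Cold and warm throughput (tok/s) copied from the cold-start table.
# The keys are shorthand labels for this sketch, not the report's cell ids.
cells = {
    "granite": {"cold": 11.9, "warm": 14.0},
    "gemma": {"cold": 21.4, "warm": 22.0},
    "qwen": {"cold": 13.3, "warm": 14.3},
}


def warm_cold_ratio(cold_tps: float, warm_tps: float) -> float:
    """Return warm throughput divided by cold throughput."""
    return warm_tps / cold_tps


ratios = {
    name: round(warm_cold_ratio(v["cold"], v["warm"]), 2)
    for name, v in cells.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Rounded to two decimals, the results match the published column (1.18×, 1.03×, 1.08×), which suggests the ratio is computed on throughput rather than on the p50 durations.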

By prompt difficulty

Tokens per second by prompt class. hello is a trivial one-line prompt; P-MEDIUM and P-HARD are the deeper questions in the suite. The shape of the gap tells you whether the model is bottlenecked on parsing or on generation.

Cell	hello	P-MEDIUM	P-HARD
predator:llamacpp:granite-4.1:8b-…	9.7 tok/s 0.9s · n=4	15.1 tok/s 6.4s · n=4	15.7 tok/s 18.8s · n=4
predator:llamacpp:gemma-4:e4b-it-…	19.2 tok/s 2.9s · n=4	22.8 tok/s 8.2s · n=4	23.5 tok/s 16.0s · n=4
predator:llamacpp:qwen3.5:9b-q4km	13.2 tok/s 4.6s · n=4	14.5 tok/s 35.3s · n=4	14.4 tok/s 1m10s · n=4
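
One way to put a number on "the shape of the gap" is to divide the hello throughput by the P-HARD throughput for each cell. This is an illustrative sketch rather than part of the published analysis: the key names are shorthand chosen here, and reading a low ratio as "fixed per-call cost dominates short prompts" is an assumption, not something the run measures directly.

```python
# tok/s by prompt class, copied from the table above.
# The keys are shorthand labels for this sketch, not the report's cell ids.
by_prompt = {
    "granite": {"hello": 9.7, "hard": 15.7},
    "gemma": {"hello": 19.2, "hard": 23.5},
    "qwen": {"hello": 13.2, "hard": 14.4},
}


def hello_to_hard(tps: dict) -> float:
    """Ratio of trivial-prompt throughput to hard-prompt throughput."""
    return tps["hello"] / tps["hard"]


gaps = {name: round(hello_to_hard(tps), 2) for name, tps in by_prompt.items()}

# A ratio near 1.0 means throughput barely depends on prompt class;
# a lower ratio means short calls lose more to per-call overhead
# (an interpretation assumed here, not established by the data).
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: hello runs at {gap:.0%} of P-HARD throughput")
```

On these figures granite shows the widest gap (0.62), gemma sits in the middle (0.82), and qwen is nearly flat (0.92).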

Per-call timeline

Every call placed during this run, in order, colored by phase. Width is proportional to the call’s share of the cell’s wall-time.

Raw data

Every run gets its JSONL, log, summary, and metadata published. Clone the archive; re-run it; tell us where we got it wrong.
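
As a starting point for re-checking the summary tables, the sketch below groups per-call records by cell and computes call count, mean, p50, and p95. The JSONL schema is not shown in this report, so the `cell` and `tok_per_s` field names and the sample records are hypothetical, and the inclusive percentile method is an assumption that may differ from the one used to produce the published numbers.

```python
import json
import statistics
from collections import defaultdict

# Hypothetical sample records. The real field names may differ;
# adjust "cell" and "tok_per_s" to match the published JSONL.
SAMPLE_JSONL = """\
{"cell": "example-a", "tok_per_s": 10.0}
{"cell": "example-a", "tok_per_s": 12.0}
{"cell": "example-a", "tok_per_s": 14.0}
{"cell": "example-b", "tok_per_s": 20.0}
{"cell": "example-b", "tok_per_s": 22.0}
"""


def summarize(lines):
    """Group per-call throughput by cell and compute summary statistics."""
    by_cell = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        by_cell[record["cell"]].append(record["tok_per_s"])

    summary = {}
    for cell, values in by_cell.items():
        # statistics.quantiles needs at least two points on older Pythons.
        if len(values) >= 2:
            p95 = statistics.quantiles(values, n=100, method="inclusive")[94]
        else:
            p95 = values[0]
        summary[cell] = {
            "calls": len(values),
            "mean": statistics.fmean(values),
            "p50": statistics.median(values),
            "p95": p95,
        }
    return summary


summary = summarize(SAMPLE_JSONL.splitlines())
for cell, stats in summary.items():
    print(cell, stats)
```

To run it against a real file, pass an open file handle to `summarize` in place of the sample lines.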

Cite

Margetic, S. et al. (2026). benchmarks.weeyuga.com/benchmarks/09d8fbde.html
Public benchmarks of the Weeyuga cluster. Run id: 09d8fbde-0008-49bb-99da-03eeaca72be1. SHA ddbaaf46.

Related runs

A3B cross-machine — Predator side

Predator Qwen rerun matrix

predator-a3b-1 — qwen3 on predator