4 MAY 2026 · Predator · gaming laptop · GTX 1060 6 GB · 28 GB RAM · qwen3/qwen3.5 · chat

Predator Qwen rerun matrix

Three Qwen variants on Predator: 36 calls, ~10.9 tok/s mean, p50 24s. The single-machine flagship for the rerun campaign.

What Janie says

The Qwen family on Predator under three different inference shapes: the qwen3.5 9B budget=500 thinking config, the same model's no-think config, and the qwen3 14B dense model for contrast. Three variants, twelve prompts each, thirty-six calls. The headline: ten-point-nine tokens per second mean across the matrix, p50 twenty-four seconds per call.

The 14B dense model is the casualty: it runs at roughly one token per second on every prompt class it completed (it has no P-HARD result) and serves as the empirical floor of "what a six-gigabyte VRAM GPU can't do gracefully." The budget=500 thinking config and the no-think baseline come within striking distance of each other on throughput (14.2 vs 12.6 tok/s mean), and the reasoning-budget cap keeps the thinking config's wall-time bounded at 33.7s p50 against 7.2s without thinking. Same machine, same model across two of the three variants, different deployment knobs. This matrix is what tuning looks like in practice on one machine.

Methodology

See A3B_AND_CPU_OVERNIGHT_2026-05-05 for the full procedure. Reproducible at git SHA ddbaaf46.

Results

Cell	tok/s mean	tok/s p50	tok/s p95	duration p50	calls
qwen3.5:9b-q4km-think500	14.2	14.9	15.3	33.7s	14
qwen3.5:9b-q4km-nothink	12.6	14.4	14.9	7.2s	14
qwen3:14b-q4km	1.1	1.1	1.3	1m18s	8

tokens per second — mean · p50 · p95
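As a cross-check on the headline figure, the per-cell means can be folded into one call-weighted mean. The sketch below uses only numbers printed on this page. The helper name `weighted_mean` is ours, and so is the assumption that the headline is a call-weighted mean of per-call tok/s; the source does not say how the 10.9 was computed. Under that assumption, weighting by the calls column of the results table (14, 14, 8) gives about 10.7 tok/s, while weighting by the call counts in the cold/warm table (12, 12, 6) gives about 10.9, which matches the headline.

```python
def weighted_mean(pairs):
    """Call-weighted mean; pairs is a list of (mean_tok_s, n_calls)."""
    total_calls = sum(n for _, n in pairs)
    return sum(m * n for m, n in pairs) / total_calls


# Per-cell tok/s means from the results table.
CELL_MEANS = [14.2, 12.6, 1.1]
# Call counts from the results table ("calls" column).
RESULTS_CALLS = [14, 14, 8]
# Call counts from the cold/warm table (cold n + warm n per row).
COLD_WARM_CALLS = [12, 12, 6]

by_results = weighted_mean(list(zip(CELL_MEANS, RESULTS_CALLS)))
by_cold_warm = weighted_mean(list(zip(CELL_MEANS, COLD_WARM_CALLS)))

print(f"weighted by results-table calls:   {by_results:.2f} tok/s")
print(f"weighted by cold/warm-table calls: {by_cold_warm:.2f} tok/s")
```

The two weightings differ because the tables report different per-cell call counts; the published JSONL is the place to settle which calls were counted.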

Cold start vs warm

Cold-start measurements are the first call into a model after it loads from disk; warm calls are everything after. The warm/cold ratio (warm tok/s over cold tok/s) shows how much of the deployment’s wall-time cost is one-time vs steady-state.

Cell	cold n	cold tok/s	cold p50	warm n	warm tok/s	warm p50	warm/cold
predator:llamacpp:qwen3.5:9b-q4km…	3	12.9	35.9s	9	14.7	33.6s	1.14×
predator:llamacpp:qwen3.5:9b-q4km…	3	11.6	7.3s	9	12.9	7.1s	1.12×
predator:llamacpp:qwen3:14b-q4km	2	0.9	3m18s	4	1.1	59.4s	1.27×
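The warm/cold column can be recomputed from the table, assuming it is warm tok/s divided by cold tok/s (the page does not spell out the formula). A minimal sketch follows; the row labels are copied as truncated on the page, and the function name is illustrative. Because the inputs here are already rounded to one decimal, the recomputed ratios can differ from the published ones in the second decimal place.

```python
def warm_cold_ratio(cold_tok_s, warm_tok_s):
    """Throughput ratio: values above 1 mean warm calls are faster."""
    return warm_tok_s / cold_tok_s


# (row label as printed, cold tok/s, warm tok/s, published warm/cold)
ROWS = [
    ("row 1: predator:llamacpp:qwen3.5:9b-q4km…", 12.9, 14.7, 1.14),
    ("row 2: predator:llamacpp:qwen3.5:9b-q4km…", 11.6, 12.9, 1.12),
    ("row 3: predator:llamacpp:qwen3:14b-q4km", 0.9, 1.1, 1.27),
]

ratios = [warm_cold_ratio(cold, warm) for _, cold, warm, _ in ROWS]

for (label, _, _, published), ratio in zip(ROWS, ratios):
    print(f"{label}: recomputed {ratio:.2f}x, published {published:.2f}x")
```

The 14B row is the most sensitive to rounding: at roughly one token per second, a change of 0.05 tok/s in either input moves the ratio by several hundredths.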

By prompt difficulty

Tokens per second by prompt class. hello is a trivial one-line prompt; P-MEDIUM and P-HARD are the deeper questions in the suite. The shape of the gap tells you whether the model is bottlenecked on prompt processing or on generation.

Cell	hello	P-MEDIUM	P-HARD
predator:llamacpp:qwen3.5:9b-q4km…	12.7 tok/s 4.7s · n=4	15.0 tok/s 33.7s · n=4	15.1 tok/s 51.8s · n=4
predator:llamacpp:qwen3.5:9b-q4km…	8.5 tok/s 1.4s · n=4	14.3 tok/s 7.2s · n=4	14.9 tok/s 22.1s · n=4
predator:llamacpp:qwen3:14b-q4km	1.0 tok/s 59.4s · n=4	1.2 tok/s 4m14s · n=2	—
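One way to read the table is to turn each cell into a rough output size, multiplying tok/s by duration. The sketch below is an approximation under stated assumptions: it treats the printed tok/s and duration as describing the same typical call and ignores load time and prompt processing, neither of which the page confirms. `parse_duration` and `estimated_tokens` are hypothetical helpers, not part of the benchmark tooling.

```python
import re

# Matches durations as printed on this page, e.g. "33.7s" or "4m14s".
_DURATION = re.compile(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s")


def parse_duration(text):
    """Convert a duration string such as '33.7s' or '4m14s' to seconds."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def estimated_tokens(tok_s, duration):
    """Rough output size in tokens: throughput times wall-time."""
    return tok_s * parse_duration(duration)


# (table row, prompt class, tok/s, duration) copied from the table above.
CELLS = [
    (1, "hello", 12.7, "4.7s"),
    (1, "P-MEDIUM", 15.0, "33.7s"),
    (1, "P-HARD", 15.1, "51.8s"),
    (2, "hello", 8.5, "1.4s"),
    (2, "P-MEDIUM", 14.3, "7.2s"),
    (2, "P-HARD", 14.9, "22.1s"),
    (3, "hello", 1.0, "59.4s"),
    (3, "P-MEDIUM", 1.2, "4m14s"),
]

for row, prompt, tok_s, duration in CELLS:
    tokens = estimated_tokens(tok_s, duration)
    print(f"row {row} {prompt}: ~{tokens:.0f} tokens")
```

Read this way, the low tok/s on hello in the second row goes with a very short output (about a dozen tokens), where fixed per-call overhead would weigh more heavily than on the longer answers.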

Reasoning vs answer

Thinking models split their output into a hidden reasoning trace and a visible answer. The ratio shows how much of the budget the model spent thinking vs answering.

Cell	reasoning chars	answer chars	reasoning / answer
predator:llamacpp:qwen3.5:9b-q4km…	1457	434	3.36×
predator:llamacpp:qwen3:14b-q4km	504	201	2.51×
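The ratio column is reasoning characters divided by answer characters, which the figures in the table reproduce. The sketch below recomputes it and adds the share of total output spent on reasoning, a derived number that is not on the page. The function name is illustrative.

```python
def reasoning_split(reasoning_chars, answer_chars):
    """Return (reasoning/answer ratio, reasoning share of total output)."""
    ratio = reasoning_chars / answer_chars
    share = reasoning_chars / (reasoning_chars + answer_chars)
    return ratio, share


# (row label as printed, reasoning chars, answer chars)
ROWS = [
    ("predator:llamacpp:qwen3.5:9b-q4km…", 1457, 434),
    ("predator:llamacpp:qwen3:14b-q4km", 504, 201),
]

splits = [reasoning_split(r, a) for _, r, a in ROWS]

for (label, _, _), (ratio, share) in zip(ROWS, splits):
    print(f"{label}: {ratio:.2f}x, {share:.0%} of output is reasoning")
```

Both rows spend roughly three quarters of their characters on the reasoning trace, so the visible answer is the smaller part of what each call generates.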

Per-call timeline

Every call placed during this run, in order, colored by phase. Width is proportional to the call’s share of the cell’s wall-time.

Raw data

Every run gets its JSONL, log, summary, and metadata published. Clone the archive; re-run it; tell us where we got it wrong.
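For anyone re-deriving the summary table from the published JSONL, the aggregation could look like the sketch below. The record schema is an assumption: the field names `cell` and `tok_s` are hypothetical and should be replaced with whatever the archive actually uses. The sample records are synthetic placeholders, not measurements from this run, and the nearest-rank percentile is one common definition that may not match the one used to produce the published p50 and p95.

```python
import json
import math
from collections import defaultdict
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(jsonl_lines):
    """Group per-call tok/s by cell and report calls, mean, p50, p95."""
    by_cell = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        # "cell" and "tok_s" are hypothetical field names.
        by_cell[record["cell"]].append(record["tok_s"])
    return {
        cell: {
            "calls": len(values),
            "mean": round(mean(values), 1),
            "p50": percentile(values, 50),
            "p95": percentile(values, 95),
        }
        for cell, values in by_cell.items()
    }


# Synthetic records for illustration only; not data from this run.
SAMPLE = [
    '{"cell": "demo-cell", "tok_s": 10.0}',
    '{"cell": "demo-cell", "tok_s": 12.0}',
    '{"cell": "demo-cell", "tok_s": 14.0}',
    '{"cell": "demo-cell", "tok_s": 16.0}',
]

summary = summarize(SAMPLE)
print(summary)
```

With a real file, passing the open file handle to `summarize` works the same way, since it iterates line by line.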

Cite

Margetic, S. et al. (2026). benchmarks.weeyuga.com/benchmarks/fba9d9b1.html
Public benchmarks of the Weeyuga cluster. Run id: fba9d9b1-cc5d-40bc-9e21-beafbb72c65d. SHA ddbaaf46.