
4 MAY 2026 · Predator · gaming laptop · GTX 1060 6 GB · 28 GB RAM · qwen3 · chat

A3B cross-machine — Predator side

Predator (GTX 1060) running the same A3B IQ2-XXS at ~3.9 tok/s mean, p50 102s: the larger half of the pair.

What Janie says

Predator is the newer of the two consumer-GPU machines tested — an Acer Predator gaming laptop with a six-gigabyte GTX 1060 and twenty-eight gigabytes of system RAM. On the same Qwen3-30B-A3B model, the same llama.cpp engine, the same prompts as the Pavilion run sitting next to this one in the catalogue, it ran half as fast: four warm tokens per second on the hard prompt against Pavilion's eight. Better hardware, same model, worse number.

The cause is almost certainly an offload-config mistake on our end. --n-gpu-layers 99 lets llama.cpp distribute layers awkwardly across a six-gigabyte VRAM budget; an explicit partition (--n-gpu-layers 32 or whichever count fills VRAM without spilling) should let Predator pull ahead, as llama-bench's prompt-processing number predicts: 40 tok/s on Predator versus 6 on Pavilion is roughly a 6.7× compute-throughput advantage on the GPU side. The retune is the obvious next bench. We're publishing the un-retuned numbers now, with the methodology footnote naming the gap, because the un-retuned number is what most readers will see when they first try this hardware.
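As a rough starting point for that retune, the sketch below estimates how many layers might fit in VRAM before any measurement. It is a back-of-envelope aid, not our procedure: the function name `layers_that_fit`, the 10 GiB model size, the 1.5 GiB reserve for KV cache and scratch buffers, and the 48-layer count are all assumptions made for illustration, and it treats every layer as the same size. The real value has to come from trying a few counts and watching for spill.

```python
def layers_that_fit(vram_gib, reserve_gib, model_gib, n_layers):
    """Estimate how many equally sized layers fit in VRAM after a reserve.

    Hypothetical helper: assumes uniform layer size, which is only
    approximately true for a real GGUF file.
    """
    per_layer_gib = model_gib / n_layers
    usable_gib = max(vram_gib - reserve_gib, 0.0)
    return min(n_layers, int(usable_gib // per_layer_gib))


# Prompt-processing ratio quoted in the text: 40 tok/s vs 6 tok/s.
pp_ratio = 40 / 6

# Placeholder inputs, NOT measured on Predator: 6 GiB card, 1.5 GiB kept
# free for KV cache and buffers, 10 GiB of weights, 48 layers.
estimate = layers_that_fit(vram_gib=6.0, reserve_gib=1.5, model_gib=10.0, n_layers=48)

print(f"prompt-processing advantage: {pp_ratio:.1f}x")
print(f"first --n-gpu-layers value to try: {estimate}")
```

The estimate is only a place to start a sweep; the count that actually wins is whichever one gives the best measured tok/s without spilling out of VRAM.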

Methodology

See A3B_AND_CPU_OVERNIGHT_2026-05-05 for the full procedure. Reproducible at git SHA ddbaaf46.

Results

Cell | tok/s mean | tok/s p50 | tok/s p95 | duration p50 | calls
qwen3:30b-a3b-iq2m-think500 | 3.9 | 4.0 | 4.1 | 1m41s | 14

Chart: tokens per second for qwen3:30b-a3b-i… (mean 3.87 · p50 3.99 · p95 4.06 tok/s)

Cold start vs warm

Cold-start measurements are the first call into a model after it loads from disk; warm calls are everything after. The ratio shows how much of the deployment’s wall-time cost is one-time vs steady-state.

Cell | cold n | cold tok/s | cold p50 | warm n | warm tok/s | warm p50 | warm/cold
predator:llamacpp:qwen3:30b-a3b-i… | 3 | 3.5 | 1m37s | 9 | 4.0 | 1m46s | 1.16×
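The cold and warm columns can be recomputed from the per-call timeline published further down this page. The sketch below assumes the row values are a plain mean of per-call tok/s and a median of per-call durations; that matches the published figures but is our reading of them, not a copy of the harness code. The names `CALLS` and `summarize` are illustrative.

```python
from statistics import mean, median

# (phase, duration_s, tok_s) copied from the per-call timeline on this page.
CALLS = [
    ("cold", 26.0, 2.5), ("warm", 15.9, 4.0), ("warm", 15.8, 4.1), ("warm", 16.3, 3.9),
    ("cold", 97.1, 4.0), ("warm", 86.4, 4.1), ("warm", 106.7, 4.0), ("warm", 128.7, 4.0),
    ("cold", 205.5, 4.0), ("warm", 188.2, 4.0), ("warm", 218.8, 4.0), ("warm", 182.4, 4.0),
]


def summarize(phase):
    """Return call count, mean tok/s, and median duration for one phase."""
    rows = [call for call in CALLS if call[0] == phase]
    return {
        "n": len(rows),
        "tok_s": mean(row[2] for row in rows),
        "p50_s": median(row[1] for row in rows),
    }


cold = summarize("cold")
warm = summarize("warm")
ratio = warm["tok_s"] / cold["tok_s"]

print(cold)
print(warm)
print(f"warm/cold: {ratio:.2f}x")
```

Because the timeline values are already rounded to one decimal, the ratio computed this way lands slightly below the published 1.16×, which was presumably derived from unrounded measurements.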

By prompt difficulty

Tokens per second by prompt class. hello is a trivial one-line prompt; P-MEDIUM and P-HARD are the deeper questions in the suite. The shape of the gap tells you whether the model is bottlenecked on parsing or on generation.

Cell | hello | P-MEDIUM | P-HARD
predator:llamacpp:qwen3:30b-a3b-i… | 3.6 tok/s (16.1s · n=4) | 4.0 tok/s (1m41s · n=4) | 4.0 tok/s (3m16s · n=4)
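The per-prompt figures follow from grouping the same per-call records by prompt id. This is a minimal sketch under the assumption that each cell reports the mean tok/s and the median duration of its four calls; `CALLS` and `by_prompt` are hypothetical names, and the values are copied from the per-call timeline on this page.

```python
from collections import defaultdict
from statistics import mean, median

# (prompt, duration_s, tok_s) copied from the per-call timeline on this page.
CALLS = [
    ("hello", 26.0, 2.5), ("hello", 15.9, 4.0), ("hello", 15.8, 4.1), ("hello", 16.3, 3.9),
    ("P-MEDIUM", 97.1, 4.0), ("P-MEDIUM", 86.4, 4.1),
    ("P-MEDIUM", 106.7, 4.0), ("P-MEDIUM", 128.7, 4.0),
    ("P-HARD", 205.5, 4.0), ("P-HARD", 188.2, 4.0),
    ("P-HARD", 218.8, 4.0), ("P-HARD", 182.4, 4.0),
]


def by_prompt(calls):
    """Group calls by prompt id and summarize each group."""
    groups = defaultdict(list)
    for prompt, duration_s, tok_s in calls:
        groups[prompt].append((duration_s, tok_s))
    return {
        prompt: {
            "n": len(rows),
            "tok_s": mean(tok_s for _, tok_s in rows),
            "p50_s": median(duration_s for duration_s, _ in rows),
        }
        for prompt, rows in groups.items()
    }


summary = by_prompt(CALLS)
for prompt, stats in summary.items():
    print(f"{prompt}: {stats['tok_s']:.1f} tok/s, p50 {stats['p50_s']:.1f}s, n={stats['n']}")
```

The lower hello figure comes almost entirely from its single cold call at 2.5 tok/s; the three warm hello calls sit at 3.9 to 4.1 tok/s, in line with the longer prompts.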

Reasoning vs answer

Thinking models split their output into a hidden reasoning trace and a visible answer. The ratio shows how much of the budget the model spent thinking vs answering.

Cell | reasoning chars | answer chars | reasoning / answer
predator:llamacpp:qwen3:30b-a3b-i… | 1081 | 835 | 1.29×
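One way such a ratio could be derived is sketched below. It assumes the raw completion carries the reasoning trace inside `<think>…</think>` tags, as Qwen3 emits them, and that both sides are counted in characters after trimming whitespace. Whether our harness counts exactly this way is an assumption, and some servers return the reasoning in a separate field instead; `split_reasoning` and `reasoning_ratio` are illustrative names.

```python
import re

# Non-greedy match so several think blocks in one reply are handled separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Split a raw completion into (reasoning, answer) strings."""
    reasoning = "".join(THINK_RE.findall(text))
    answer = THINK_RE.sub("", text)
    return reasoning.strip(), answer.strip()


def reasoning_ratio(reasoning_chars, answer_chars):
    """Ratio of reasoning characters to answer characters."""
    return reasoning_chars / answer_chars


# Toy completion, invented for illustration only.
reasoning, answer = split_reasoning("<think>User said hi.</think>\nHello!")
print(len(reasoning), len(answer))

# Character counts from the table above.
print(f"{reasoning_ratio(1081, 835):.2f}x")
```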

Per-call timeline

Every call placed during this run, in order, colored by phase. Width is proportional to the call’s share of the cell’s wall-time. Hover any segment for the prompt id and tok/s.

predator:llamacpp:qwen3:3…
Prompt | Phase | Duration | tok/s
hello | cold | 26.0s | 2.5
hello | warm | 15.9s | 4.0
hello | warm | 15.8s | 4.1
hello | warm | 16.3s | 3.9
P-MEDIUM | cold | 97.1s | 4.0
P-MEDIUM | warm | 86.4s | 4.1
P-MEDIUM | warm | 106.7s | 4.0
P-MEDIUM | warm | 128.7s | 4.0
P-HARD | cold | 205.5s | 4.0
P-HARD | warm | 188.2s | 4.0
P-HARD | warm | 218.8s | 4.0
P-HARD | warm | 182.4s | 4.0

Raw data

Every run gets its JSONL, log, summary, and metadata published. Clone the archive; re-run it; tell us where we got it wrong.

Cite

Margetic, S. et al. (2026). benchmarks.weeyuga.com/benchmarks/5fb2913d.html
Public benchmarks of the Weeyuga cluster. Run id: 5fb2913d-6500-4ecf-9e97-d43f7dd61145. SHA ddbaaf46.
