
4 MAY 2026 · Pavilion · HP laptop · GTX 1050 4 GB · 16 GB RAM · i7-9750H · qwen3 · chat

A3B cross-machine — Pavilion side

Pavilion (GTX 1050) running Qwen3-30B-A3B IQ2-XXS at ~6 tok/s mean, with a median (p50) call duration of 58 s. The smaller half of the cross-machine pair.

What Janie says

Pavilion is a 2019 HP gaming laptop with a four-gigabyte GTX 1050. We pointed a thirty-billion-parameter mixture-of-experts model at it (Qwen3-30B-A3B at IQ2-XXS quant) — the file alone, at 9.65 GB, is more than twice the GPU's VRAM budget. The bet was that mixture-of-experts routing only activates ~3B of those 30B parameters per token, so the offload-to-CPU penalty would be small.
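The size argument above can be checked with a few lines of arithmetic. This is a rough sketch, not part of the benchmark harness: it assumes 1 GB = 1e9 bytes and takes "thirty-billion" and "~3B" as exactly 30e9 and 3e9 parameters, so the bits-per-weight figure is only approximate.

```python
# Figures quoted in the text; the unit conventions are assumptions.
FILE_GB = 9.65        # model file size
VRAM_GB = 4.0         # GTX 1050 VRAM
TOTAL_PARAMS = 30e9   # "thirty-billion-parameter", taken as exactly 30e9
ACTIVE_PARAMS = 3e9   # "~3B" active per token, taken as exactly 3e9


def oversubscription(file_gb: float, vram_gb: float) -> float:
    """How many times larger the model file is than the GPU's VRAM."""
    return file_gb / vram_gb


def bits_per_weight(file_gb: float, params: float) -> float:
    """Average stored bits per parameter, assuming 1 GB = 1e9 bytes."""
    return file_gb * 1e9 * 8 / params


def active_fraction(active: float, total: float) -> float:
    """Share of parameters the router touches for each token."""
    return active / total


print(f"file / VRAM     : {oversubscription(FILE_GB, VRAM_GB):.2f}x")
print(f"bits per weight : {bits_per_weight(FILE_GB, TOTAL_PARAMS):.2f}")
print(f"active per token: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.0%}")
```

The first figure is the "more than twice" in the text; the last is why the offload penalty was expected to be small.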

The bet paid off: warm throughput on the hard prompt is 8.0 tok/s, comfortably faster than typing speed. Twelve of forty-eight model layers ran on the GPU, the rest on CPU; the partition was clean because we set it explicitly with --n-gpu-layers 12. The 14B dense model on the bigger Predator GPU, by contrast, was unusable on this hardware (~1 tok/s, hangs above five minutes). Smaller-and-better-quant beat bigger-and-better-hardware. This is the "smartest model that runs at all" ceiling at this hardware tier, possible on a $400 used laptop most people would only think of as a games box.
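A back-of-envelope look at the 12-of-48 split, as a sketch only. It assumes every layer is the same size and ignores the embedding and output tensors, the KV cache, and compute buffers, none of which are reported on this page, so the real VRAM footprint will differ.

```python
# Figures quoted in the text.
FILE_GB = 9.65      # model file size
VRAM_GB = 4.0       # GTX 1050 VRAM
TOTAL_LAYERS = 48   # model layers
GPU_LAYERS = 12     # value passed to --n-gpu-layers


def gpu_resident_gb(file_gb: float, total_layers: int, gpu_layers: int) -> float:
    """Approximate size of the offloaded layers, assuming uniform layer size."""
    return file_gb * gpu_layers / total_layers


gpu_gb = gpu_resident_gb(FILE_GB, TOTAL_LAYERS, GPU_LAYERS)
cpu_gb = FILE_GB - gpu_gb      # weights left in system RAM
headroom_gb = VRAM_GB - gpu_gb  # VRAM left for cache and buffers

print(f"on GPU  : {gpu_gb:.2f} GB")
print(f"on CPU  : {cpu_gb:.2f} GB")
print(f"headroom: {headroom_gb:.2f} GB")
```

Under these assumptions about a quarter of the weights sit in VRAM, leaving some room for the cache and buffers the estimate leaves out.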

Methodology

See A3B_AND_CPU_OVERNIGHT_2026-05-05 for the full procedure. Reproducible at git SHA ddbaaf46.

Results

| Cell | tok/s mean | tok/s p50 | tok/s p95 | duration p50 | calls |
|---|---|---|---|---|---|
| qwen3:30b-a3b-iq2xxs-think500 | 5.9 | 6.7 | 8.2 | 57.8s | 14 |

[Chart: tokens per second for qwen3:30b-a3b-i… (mean 5.88 tok/s · p50 6.67 tok/s · p95 8.17 tok/s)]
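The headline statistics can be approximately recomputed from the twelve per-call tok/s values listed in the per-call timeline on this page. The sketch below uses a linear-interpolation percentile, which is an assumption; the page does not say which percentile method the harness uses. The p50 and p95 land close to, but not exactly on, the published 6.67 and 8.17, presumably because the timeline values are rounded to one decimal.

```python
from statistics import mean

# Per-call tok/s, in run order, as listed in the per-call timeline.
TOK_S = [0.3, 4.0, 3.8, 5.4, 3.5, 7.4, 8.2, 8.1, 5.9, 8.0, 8.0, 8.0]


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100] (assumed method)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


print(f"mean: {mean(TOK_S):.2f} tok/s")
print(f"p50 : {percentile(TOK_S, 50):.2f} tok/s")
print(f"p95 : {percentile(TOK_S, 95):.2f} tok/s")
```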

Cold start vs warm

Cold-start measurements are the first call into a model after it loads from disk; warm calls are everything after. The ratio shows how much of the deployment’s wall-time cost is one-time vs steady-state.

| Cell | cold n | cold tok/s | cold p50 | warm n | warm tok/s | warm p50 | warm/cold |
|---|---|---|---|---|---|---|---|
| pavilion:llamacpp:qwen3:30b-a3b-i… | 3 | 3.2 | 2m12s | 9 | 6.8 | 49.5s | 2.10× |
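The cold and warm figures can be rebuilt from the per-call timeline on this page. This is a minimal sketch, not the harness's own code; the tuple layout and the `summarize` helper are illustrative names, and the ratio is taken as warm mean tok/s over cold mean tok/s, which matches the published 2.10× to within rounding.

```python
from statistics import mean, median

# (phase, duration in seconds, tok/s) for each call, in run order.
CALLS = [
    ("cold", 226.8, 0.3), ("warm", 15.9, 4.0), ("warm", 17.0, 3.8), ("warm", 11.9, 5.4),
    ("cold", 132.3, 3.5), ("warm", 61.8, 7.4), ("warm", 49.5, 8.2), ("warm", 44.5, 8.1),
    ("cold", 125.7, 5.9), ("warm", 68.4, 8.0), ("warm", 56.9, 8.0), ("warm", 58.7, 8.0),
]


def summarize(calls, phase):
    """Call count, mean tok/s, and median duration for one phase."""
    rows = [c for c in calls if c[0] == phase]
    return {
        "n": len(rows),
        "tok_s": mean(r[2] for r in rows),
        "p50_s": median(r[1] for r in rows),
    }


cold = summarize(CALLS, "cold")
warm = summarize(CALLS, "warm")
ratio = warm["tok_s"] / cold["tok_s"]

print(f"cold: n={cold['n']} {cold['tok_s']:.1f} tok/s p50={cold['p50_s']:.1f}s")
print(f"warm: n={warm['n']} {warm['tok_s']:.1f} tok/s p50={warm['p50_s']:.1f}s")
print(f"warm/cold: {ratio:.2f}x")
```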

By prompt difficulty

Tokens per second by prompt class. hello is a trivial one-line prompt; P-MEDIUM and P-HARD are the deeper questions in the suite. The shape of the gap tells you whether the model is bottlenecked on parsing or on generation.

| Cell | hello | P-MEDIUM | P-HARD |
|---|---|---|---|
| pavilion:llamacpp:qwen3:30b-a3b-i… | 3.4 tok/s (16.5s · n=4) | 6.8 tok/s (55.6s · n=4) | 7.5 tok/s (1m3s · n=4) |

Reasoning vs answer

Thinking models split their output into a hidden reasoning trace and a visible answer. The ratio shows how much of the budget the model spent thinking vs answering.

| Cell | reasoning chars | answer chars | reasoning / answer |
|---|---|---|---|
| pavilion:llamacpp:qwen3:30b-a3b-i… | 975 | 587 | 1.66× |
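The ratio can also be read as a share of total output, which some readers find easier to interpret. A small sketch using the two character counts from the table; whether those counts are per-call averages or run totals is not stated here, but the ratio and share are the same either way.

```python
REASONING_CHARS = 975  # hidden reasoning trace
ANSWER_CHARS = 587     # visible answer

ratio = REASONING_CHARS / ANSWER_CHARS
reasoning_share = REASONING_CHARS / (REASONING_CHARS + ANSWER_CHARS)

print(f"reasoning / answer: {ratio:.2f}x")
print(f"reasoning share   : {reasoning_share:.1%}")
```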

Per-call timeline

Every call placed during this run, in order, for cell pavilion:llamacpp:qwen3:3….

| # | Prompt | Phase | Duration | tok/s |
|---|---|---|---|---|
| 1 | hello | cold | 226.8s | 0.3 |
| 2 | hello | warm | 15.9s | 4.0 |
| 3 | hello | warm | 17.0s | 3.8 |
| 4 | hello | warm | 11.9s | 5.4 |
| 5 | P-MEDIUM | cold | 132.3s | 3.5 |
| 6 | P-MEDIUM | warm | 61.8s | 7.4 |
| 7 | P-MEDIUM | warm | 49.5s | 8.2 |
| 8 | P-MEDIUM | warm | 44.5s | 8.1 |
| 9 | P-HARD | cold | 125.7s | 5.9 |
| 10 | P-HARD | warm | 68.4s | 8.0 |
| 11 | P-HARD | warm | 56.9s | 8.0 |
| 12 | P-HARD | warm | 58.7s | 8.0 |
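The per-call entries above, written in the page's `prompt · phase · duration · tok/s` form, are enough to rebuild the by-prompt table. The sketch below parses them with a regular expression and groups them by prompt class. The function names and record layout are illustrative, not the harness's; the published JSONL will have its own schema.

```python
import re
from collections import defaultdict
from statistics import mean, median

TIMELINE = """
hello · cold · 226.8s · 0.3 tok/s
hello · warm · 15.9s · 4.0 tok/s
hello · warm · 17.0s · 3.8 tok/s
hello · warm · 11.9s · 5.4 tok/s
P-MEDIUM · cold · 132.3s · 3.5 tok/s
P-MEDIUM · warm · 61.8s · 7.4 tok/s
P-MEDIUM · warm · 49.5s · 8.2 tok/s
P-MEDIUM · warm · 44.5s · 8.1 tok/s
P-HARD · cold · 125.7s · 5.9 tok/s
P-HARD · warm · 68.4s · 8.0 tok/s
P-HARD · warm · 56.9s · 8.0 tok/s
P-HARD · warm · 58.7s · 8.0 tok/s
"""

# One entry: prompt id, phase, duration in seconds, tokens per second.
ENTRY = re.compile(r"(\S+) · (cold|warm) · ([\d.]+)s · ([\d.]+) tok/s")


def parse_timeline(text):
    """Turn timeline text into a list of per-call records."""
    return [
        {"prompt": m[1], "phase": m[2], "duration_s": float(m[3]), "tok_s": float(m[4])}
        for m in ENTRY.finditer(text)
    ]


def by_prompt(calls):
    """Mean tok/s and median duration per prompt class (cold and warm together)."""
    groups = defaultdict(list)
    for call in calls:
        groups[call["prompt"]].append(call)
    return {
        prompt: {
            "n": len(rows),
            "tok_s": mean(r["tok_s"] for r in rows),
            "p50_s": median(r["duration_s"] for r in rows),
        }
        for prompt, rows in groups.items()
    }


calls = parse_timeline(TIMELINE)
summary = by_prompt(calls)
for prompt, row in summary.items():
    print(f"{prompt:9s} {row['tok_s']:.1f} tok/s  p50={row['p50_s']:.1f}s  n={row['n']}")
```

Each prompt class includes its one cold call, which is why hello averages well below its warm calls.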

Raw data

Every run gets its JSONL, log, summary, and metadata published. Clone the archive; re-run it; tell us where we got it wrong.

Cite

Margetic, S. et al. (2026). benchmarks.weeyuga.com/benchmarks/23066b38.html
Public benchmarks of the Weeyuga cluster. Run id: 23066b38-ea9c-4dd3-b2f5-32912a67fce4. SHA ddbaaf46.
