How fast can a personal AI setup actually be?

Lede

If you're putting together a personal AI setup from used gaming laptops, an M1 Mac, or a small CPU VPS, the question "how fast can it be?" has a measurable answer. We ran sixteen language models across four hardware configurations to find out. Some answers will save you time. Some will save you the cost of believing a benchmark someone else cherry-picked. Here are the numbers.

§1. Hardware tested

Four machines, none of them frontier hardware, all of them realistic for someone setting up a local-AI rig from gear they could plausibly already own or buy used.

Machine	CPU	GPU	RAM	Year	Notes
Pavilion	i7-9750H	GTX 1050 4 GB	16 GB	2019	HP gaming laptop, used market ~$400
Predator	i7 (laptop)	GTX 1060 6 GB	28 GB	~2018	Acer Predator gaming laptop
Mac M1	M1 (8-core)	M1 GPU (unified mem)	8 GB	2020	Apple Silicon, base configuration
vps-81	4 vCPU AMD EPYC	none (CPU only)	64 GB	n/a	Contabo CPU-LLM VPS, ~$25/mo rented

The most capable GPU in this set is a six-year-old GTX 1060 with 6 GB of VRAM. There are no A100s, no H100s, no rented Tensor cores — that's the point. Numbers measured on hardware a reader can actually buy are the only ones that help anyone plan.

The benchmarks below were measured through an HTTP-compatible inference router (the kind of routing layer a typical local-AI deployment runs against :/v1/chat/completions style endpoints), not against the model engine directly. So the timing numbers reflect the wall-clock cost a typical user experience would see, not the theoretical engine ceiling. Where we have prior engine-direct measurements on the same hardware (we do for some), §2 shows the gap.

§2. The benchmarks (Pavilion, through a routing layer)

The first measured baseline through a typical production user path — driver POSTs to an OpenAI-compatible router endpoint at :/v1/chat/completions, which runs sampling defaults + a thinking-bypass classifier and dispatches to the chosen model engine on Pavilion. Sixteen models loaded, the canonical five-question coding eval, methodology mirrored from the prior 2026-04-11/12 Ollama-direct runs so deltas mean something.

The full ledger: docs/BENCHMARKS/PAVILION_WEEYUGA_v1.md (per-model table, raw JSONL, methodology, audit notes).

The headline numbers from that ledger:

Family	5Q avg duration	tok/s	Errors	Format-OK	Marker-hit
qwen3.5 (8 variants, 0.8b → 35b-a3b)	22–32 s	30–41	0/40	3–4 of 5	0.72–0.85
qwen2.5-coder (small: 0.5b / 1.5b)	10–14 s	18–38	0/10	3 of 5	0.82–0.90
qwen2.5-coder:3b	70.5 s	3.2	0/5	3 of 5	0.90
qwen2.5-coder:14b	TIMEOUT (5/5)	—	5/5	0 of 5	0.00
qwen3:4b (thinking, no-budget)	263 s	1.65	0/5	0 of 5	0.00
qwen3:8b / 14b (thinking)	157–353 s on calls that finished	—	3/5–4/5	1–2 of 5	0.16–0.40

Eighty-one of ninety-six calls finished cleanly. The qwen3.5 family is the most usable tier on this hardware — eight variants, ranging from a 800-million-parameter model that answers the five-question eval in twenty-six seconds to a thirty-five-billion- parameter MoE model that answers the same eval in twenty-two — all on a six-year-old Pavilion with a four-gigabyte GTX 1050. The fastest single call in the run was 2.4 s — qwen3.5:0.8b on nginx_safe_reload, served through llama.cpp on port 11436.

The cap that held everywhere: num_predict=2048. Even the worst- case thinking model — qwen3:4b, which spent every token of its budget on <think> and produced no usable answer — bounded at about 263 seconds per call. No infinite loops.

The routing overhead question

The most direct measurement we have:

Lane	Model	5Q avg	tok/s
Pavilion / Ollama-direct, GPU	qwen2.5-coder:0.5b	7.99 s	62.65
Pavilion / through weeyuga	qwen2.5-coder:0.5b	10.4 s	37.6

About 30% wall-time overhead for the smallest model. That's the cost the routing layer adds — sampling defaults, a thinking- bypass classifier, the WebSocket fan-out a multi-device deployment relies on. Quality is unchanged (marker-hit 0.82 vs 0.80; format-OK 3/5 in both lanes). Routing is a latency tax, not an accuracy tax. The methodology-fix re-baseline at ad057f5b holds the same ~30% small-model number, so it's robust across two ledgers.

For mid-size and larger models the gap looks much wider — 5× slower on qwen2.5-coder:3b, 100% timeouts on qwen2.5-coder:14b. The part of this story that didn't resolve. Pavilion's Ollama config (MAX_LOADED_MODELS=3, NUM_PARALLEL=3) was tuned for normal multi-user load. The benchmark, by contrast, sweeps sixteen models in succession — which evicts and reloads three-resident-at-a-time, paying the cold-load cost on most calls. The intended follow-up was an engine-direct comparison run (llama.cpp vs Ollama on the same hardware, same four models, single-loaded both lanes). That comparison stalled on a Pavilion Ollama service-wrapper gap and is parked. So the 5× number stays true within the methodology that produced it, but it isn't pinned to "the routing layer adds 5× overhead" — that claim is unproven either way until the parked A/B is re-run.

The cleanest summary the data supports right now:

A 2019 HP gaming laptop with a four-gigabyte GTX 1050 served small-to-mid coding models through an OpenAI-compatible router in about ten seconds per five-question call, at thirty to forty tokens per second sustained. Going through the router adds roughly thirty percent over engine-direct on the smallest model, with no quality cost. Mid-size models on the same setup need methodology work to publish a defensible overhead number.

[Chart 1 — tokens-per-second per qwen3.5 variant, warm-state — Andre/Maja styling, blocked on chart spec lock.] [Chart 2 — cold-vs-warm latency delta; cold-load is dominant for 3b/14b models — blocked on chart spec lock.] [Chart 3 (optional) — routing overhead delta vs engine-direct, blocked on a future engine-direct comparison run.]

§3. What's good

Three concrete surprises from Mission 1.

The qwen3.5 family is the most usable tier on this hardware — and it runs the same way on a $200 used GPU as it did when we expected it to need a real one. Eight variants, ranging from 800 million parameters to a thirty-five-billion-parameter mixture-of-experts model, all answered the same five-question coding eval in 22–32 seconds per call at 30–41 tokens per second. The 35B-a3b-iq2s variant — which we expected to be the slow one — finished in 30 seconds. None of the eight produced an error. Marker-hit rates landed between 0.72 and 0.85, format-OK at 3–4 of 5. If you have a 4 GB GPU and want a default model tier to plan around, this is the tier these numbers point to.

The thinking-budget cap held under adversarial conditions. qwen3:4b is a thinking model that doesn't honor a budget — given 2048 tokens, it spends every one of them on <think> and produces no usable answer. We caught it doing exactly that. The harness's num_predict=2048 ceiling fired at 263 seconds per call, no infinite loops, the model simply ran out of budget. The cap is the backstop we needed before we could trust thinking models in production.

Quality didn't move when we changed the path. qwen2.5-coder:0.5b on Pavilion through the routing layer vs the same model on the same hardware running direct-to-Ollama — same model, same prompts, same sampling — landed 0.82 vs 0.80 marker-hit and 3-of-5 format- OK in both lanes. The thirty-percent wall-time overhead the routing layer adds is purely latency; it isn't taxing accuracy. That's the precondition for saying any routing layer is worth what it costs.

§4. What's not good

Per the brand: this section is the one most posts skip. We publish it. Three concrete failures from Mission 1.

qwen2.5-coder:14b regressed from "usable" to "won't finish." On the 2026-04-12 GPU lane through Ollama directly, it finished the five-question eval at about 55 seconds per call. Through a routing layer on the same hardware in Mission 1, every one of five calls hit the harness's six-minute hard wall. That's, by definition, a regression. The follow-up that would have separated "router owns it" from "engine config owns it" stalled on the same service-wrapper gap as the rest of §4's open-question; documented in the post-mortem. Either way, a model that was formerly the strongest local coding option on this hardware is not serving in this benchmark.

qwen3:4b burns its entire budget on <think> and never produces a markable answer. It scored zero on format and zero on marker-hit across all five questions. The model was added to the personality engine's candidate list for thinking work; the benchmark tells us it shouldn't be there. We're not routing it in production until either the thinking-budget mechanism gets it under control or the model gets cut from the lineup.

Mid-size models are 5× slower through the production path than they were running engine-direct one model at a time. qwen2.5-coder:3b: 13 seconds on the 2026-04-11 GPU lane (Ollama direct), 70.5 seconds through the routing layer in Mission 1. The leading hypothesis is that Ollama's OLLAMA_MAX_LOADED_MODELS=3 config evicts and re-loads models when a harness sweeps sixteen variants in succession — paying disk-to-VRAM cost on most calls. The prior single-model run had no swap pressure; this run had it constantly. The follow-up that would have isolated this — an engine-direct comparison on the same hardware, single-loaded both lanes — stalled on a service-wrapper gap and is documented in the post-mortem. So the 5× number is bounded by the methodology that produced it. Nothing here proves the routing layer adds 5× overhead, and nothing here disproves it.

The Mission 1 re-run with the methodology fix (ad057f5b, per-model blocks instead of all-models-then-all-prompts) holds the small-model 30% routing-overhead number across two ledgers, so that headline is robust. The mid-size question stays open.

We're publishing the open question before we have the answer. Honest is how trust gets built.

[Other limits — Mac M1 8 GB ceiling, Predator second-GPU node, CPU inference on vps-81, cross-node routing overhead — are real but not in Mission 1's scope; they land in §6.]

§5. What to expect (and what not to)

If your hardware is in the range tested above, here's what these performance numbers tell you you can plan for:

Assistant-style chat with under-1B-parameter models on a 4 GB GPU laptop, no surprises. Sub-30-second responses on a five-question coding eval, faster than typing speed for token output. This is the regime small models are useful in.
Multi-billion-parameter coding models (3B–9B class) work on consumer GPUs but are config-sensitive. Default Ollama config with multi-model loading will thrash; pin MAX_LOADED_MODELS=1 or use llama.cpp with explicit GPU-layer counts and the numbers improve a lot.
Background watcher agents on a gaming laptop are realistic. A 24/7 harness pinning one model in VRAM and serving prompts earns whatever-it-costs-to-run-the-laptop in throughput headroom.
Local-first chat on an M1 Mac with qwen3.5:0.8b-class models is comfortable. Unified memory makes 0.8B inference fast; 4B-class is tight but possible at the cost of available system RAM for everything else.

What these numbers will NOT support, and we say so:

Real-time voice-to-voice with a 70B-parameter model. Hardware isn't there. A cloud passthrough is the realistic path for that workload.
High-parallelism multi-tenant inference. These are single-user setups, not shared GPUs.
Sub-100ms response on a hard prompt from a single thinking- capable model. The path to fast easy-prompt response on this hardware is many small classifier calls in front of the heavyweight model — and that's a different post.

§6. What's next

Two-axis "what would move which numbers."

Hardware:

Adding a current-generation discrete GPU (RTX 4070 / 4080 class) to a setup like this would change the upper end of what runs locally — at the cost of a hardware purchase that disqualifies the "use what you have" framing.
A second consumer GPU added to a single-machine setup is the realistic incremental path: a Predator-class laptop alongside a Pavilion-class one approximately doubles the available VRAM budget for cold loads + parallel serving without buying new hardware.

Software / inference shape:

Chain-of-small-models routing. Many small classifier calls in front of a heavyweight model — sub-500 ms wall-time for easy prompts becomes plausible on consumer hardware. Targeted measurement coming in a follow-up post.
KV-cache reuse across the chain. The shared prompt-prefix approach buys speed if it's wired right.

§7. Methodology footnote

Every number in this post is from a reproducible benchmark harness. The harness spec lives at docs/BENCHMARKS/HARNESS.md. Two ledgers carry the data this post relies on:

Mission 1 v1 — PAVILION_WEEYUGA_v1.md — synthesized from ff1131ca-d021-4e06-8616-4b4cdb54e97e-reconstructed-from-log.jsonl. The original 16-model sweep through an OpenAI-compatible router on Pavilion. Driver git SHA: 9934892.
Mission 1 v3 — raw JSONL at ad057f5b-ed3f-4a95-a38e-361be310ffd6.jsonl. The methodology-fixed re-run on 2026-04-29 with per-model blocks (one model fully exercised before moving to the next, no across-model eviction during the sweep). Confirms the small-model 30% routing-overhead headline; re-baselines for any future engine A/B work.

Each model ran one Hello call (cold) and five canonical 5Q calls (warm-after-hello). Mission 1 is n=1 per cell; we'll add p50/p95 over a five-run repeat in a follow-up baseline. Methodology mirrors the prior 2026-04-11/12 Pavilion-Ollama- direct runs in ~/Documents/MyServers/instances/pavilion-windows-laptop/reports/ so deltas are real and not methodology artifacts. Every benchmark dispatch carries a test=true tag in telemetry so it can be excluded from production-load dashboards.

The git SHA at time of measurement, the exact env vars, the model sizes, the system load at run-time, and the wall-clock time of day are all stamped in the harness output. If you want to reproduce a number, you have everything you need.

The follow-up engine-direct comparison run that would have disambiguated §4's mid-size question (llama.cpp vs Ollama on the same Pavilion, same models, single-loaded both lanes) stalled on a Pavilion Ollama service-wrapper gap and is parked. The full attempt is documented at PAVILION_LLAMACPP_VS_OLLAMA_v0_INCOMPLETE.md.

One known limitation in the v1 baseline: the harness's primary JSONL ledger captured only 18 of 96 records before stopping flushing — Trinity reconstructed the missing 78 from the stdout log into the companion -reconstructed-from-log.jsonl (the canonical source for this report). A few per-call fields (prompt_tokens, response_chars, markers_hit list) aren't recoverable from stdout; harness v2 adds an os.fsync() after every record so this can't happen again. The v3 re-run captured all records cleanly.

If you spot a number that doesn't match what your hardware does on the same model with the same prompt, please tell us. Honest disconfirmation is more valuable than another credulous re-share.