How fast can a personal AI setup actually be?

Lede

If you're putting together a personal AI setup from used gaming laptops, an M1 Mac, or a small CPU VPS, the question "how fast can it be?" has a measurable answer. We ran sixteen language models across four hardware configurations to find out. Some answers will save you time. Some will save you the cost of believing a benchmark someone else cherry-picked. Here are the numbers.


§1. Hardware tested

Four machines, none of them frontier hardware, all of them realistic for someone setting up a local-AI rig from gear they could plausibly already own or buy used.

MachineCPUGPURAMYearNotes
Pavilioni7-9750HGTX 1050 4 GB16 GB2019HP gaming laptop, used market ~$400
Predatori7 (laptop)GTX 1060 6 GB28 GB~2018Acer Predator gaming laptop
Mac M1M1 (8-core)M1 GPU (unified mem)8 GB2020Apple Silicon, base configuration
vps-814 vCPU AMD EPYCnone (CPU only)64 GBn/aContabo CPU-LLM VPS, ~$25/mo rented

The most capable GPU in this set is a six-year-old GTX 1060 with 6 GB of VRAM. There are no A100s, no H100s, no rented Tensor cores — that's the point. Numbers measured on hardware a reader can actually buy are the only ones that help anyone plan.

The benchmarks below were measured through an HTTP-compatible inference router (the kind of routing layer a typical local-AI deployment runs against :/v1/chat/completions style endpoints), not against the model engine directly. So the timing numbers reflect the wall-clock cost a typical user experience would see, not the theoretical engine ceiling. Where we have prior engine-direct measurements on the same hardware (we do for some), §2 shows the gap.


§2. The benchmarks (Pavilion, through a routing layer)

The first measured baseline through a typical production user path — driver POSTs to an OpenAI-compatible router endpoint at :/v1/chat/completions, which runs sampling defaults + a thinking-bypass classifier and dispatches to the chosen model engine on Pavilion. Sixteen models loaded, the canonical five-question coding eval, methodology mirrored from the prior 2026-04-11/12 Ollama-direct runs so deltas mean something.

The full ledger: docs/BENCHMARKS/PAVILION_WEEYUGA_v1.md (per-model table, raw JSONL, methodology, audit notes).

The headline numbers from that ledger:

Family5Q avg durationtok/sErrorsFormat-OKMarker-hit
qwen3.5 (8 variants, 0.8b → 35b-a3b)22–32 s30–410/403–4 of 50.72–0.85
qwen2.5-coder (small: 0.5b / 1.5b)10–14 s18–380/103 of 50.82–0.90
qwen2.5-coder:3b70.5 s3.20/53 of 50.90
qwen2.5-coder:14bTIMEOUT (5/5)5/50 of 50.00
qwen3:4b (thinking, no-budget)263 s1.650/50 of 50.00
qwen3:8b / 14b (thinking)157–353 s on calls that finished3/5–4/51–2 of 50.16–0.40

Eighty-one of ninety-six calls finished cleanly. The qwen3.5 family is the most usable tier on this hardware — eight variants, ranging from a 800-million-parameter model that answers the five-question eval in twenty-six seconds to a thirty-five-billion- parameter MoE model that answers the same eval in twenty-two — all on a six-year-old Pavilion with a four-gigabyte GTX 1050. The fastest single call in the run was 2.4 s — qwen3.5:0.8b on nginx_safe_reload, served through llama.cpp on port 11436.

The cap that held everywhere: num_predict=2048. Even the worst- case thinking model — qwen3:4b, which spent every token of its budget on <think> and produced no usable answer — bounded at about 263 seconds per call. No infinite loops.

The routing overhead question

The most direct measurement we have:

LaneModel5Q avgtok/s
Pavilion / Ollama-direct, GPUqwen2.5-coder:0.5b7.99 s62.65
Pavilion / through weeyugaqwen2.5-coder:0.5b10.4 s37.6

About 30% wall-time overhead for the smallest model. That's the cost the routing layer adds — sampling defaults, a thinking- bypass classifier, the WebSocket fan-out a multi-device deployment relies on. Quality is unchanged (marker-hit 0.82 vs 0.80; format-OK 3/5 in both lanes). Routing is a latency tax, not an accuracy tax. The methodology-fix re-baseline at ad057f5b holds the same ~30% small-model number, so it's robust across two ledgers.

For mid-size and larger models the gap looks much wider — 5× slower on qwen2.5-coder:3b, 100% timeouts on qwen2.5-coder:14b. The part of this story that didn't resolve. Pavilion's Ollama config (MAX_LOADED_MODELS=3, NUM_PARALLEL=3) was tuned for normal multi-user load. The benchmark, by contrast, sweeps sixteen models in succession — which evicts and reloads three-resident-at-a-time, paying the cold-load cost on most calls. The intended follow-up was an engine-direct comparison run (llama.cpp vs Ollama on the same hardware, same four models, single-loaded both lanes). That comparison stalled on a Pavilion Ollama service-wrapper gap and is parked. So the 5× number stays true within the methodology that produced it, but it isn't pinned to "the routing layer adds 5× overhead" — that claim is unproven either way until the parked A/B is re-run.

The cleanest summary the data supports right now:

A 2019 HP gaming laptop with a four-gigabyte GTX 1050 served small-to-mid coding models through an OpenAI-compatible router in about ten seconds per five-question call, at thirty to forty tokens per second sustained. Going through the router adds roughly thirty percent over engine-direct on the smallest model, with no quality cost. Mid-size models on the same setup need methodology work to publish a defensible overhead number.

[Chart 1 — tokens-per-second per qwen3.5 variant, warm-state — Andre/Maja styling, blocked on chart spec lock.] [Chart 2 — cold-vs-warm latency delta; cold-load is dominant for 3b/14b models — blocked on chart spec lock.] [Chart 3 (optional) — routing overhead delta vs engine-direct, blocked on a future engine-direct comparison run.]


§3. What's good

Three concrete surprises from Mission 1.

The qwen3.5 family is the most usable tier on this hardware — and it runs the same way on a $200 used GPU as it did when we expected it to need a real one. Eight variants, ranging from 800 million parameters to a thirty-five-billion-parameter mixture-of-experts model, all answered the same five-question coding eval in 22–32 seconds per call at 30–41 tokens per second. The 35B-a3b-iq2s variant — which we expected to be the slow one — finished in 30 seconds. None of the eight produced an error. Marker-hit rates landed between 0.72 and 0.85, format-OK at 3–4 of 5. If you have a 4 GB GPU and want a default model tier to plan around, this is the tier these numbers point to.

The thinking-budget cap held under adversarial conditions. qwen3:4b is a thinking model that doesn't honor a budget — given 2048 tokens, it spends every one of them on <think> and produces no usable answer. We caught it doing exactly that. The harness's num_predict=2048 ceiling fired at 263 seconds per call, no infinite loops, the model simply ran out of budget. The cap is the backstop we needed before we could trust thinking models in production.

Quality didn't move when we changed the path. qwen2.5-coder:0.5b on Pavilion through the routing layer vs the same model on the same hardware running direct-to-Ollama — same model, same prompts, same sampling — landed 0.82 vs 0.80 marker-hit and 3-of-5 format- OK in both lanes. The thirty-percent wall-time overhead the routing layer adds is purely latency; it isn't taxing accuracy. That's the precondition for saying any routing layer is worth what it costs.


§4. What's not good

Per the brand: this section is the one most posts skip. We publish it. Three concrete failures from Mission 1.

qwen2.5-coder:14b regressed from "usable" to "won't finish." On the 2026-04-12 GPU lane through Ollama directly, it finished the five-question eval at about 55 seconds per call. Through a routing layer on the same hardware in Mission 1, every one of five calls hit the harness's six-minute hard wall. That's, by definition, a regression. The follow-up that would have separated "router owns it" from "engine config owns it" stalled on the same service-wrapper gap as the rest of §4's open-question; documented in the post-mortem. Either way, a model that was formerly the strongest local coding option on this hardware is not serving in this benchmark.

qwen3:4b burns its entire budget on <think> and never produces a markable answer. It scored zero on format and zero on marker-hit across all five questions. The model was added to the personality engine's candidate list for thinking work; the benchmark tells us it shouldn't be there. We're not routing it in production until either the thinking-budget mechanism gets it under control or the model gets cut from the lineup.

Mid-size models are 5× slower through the production path than they were running engine-direct one model at a time. qwen2.5-coder:3b: 13 seconds on the 2026-04-11 GPU lane (Ollama direct), 70.5 seconds through the routing layer in Mission 1. The leading hypothesis is that Ollama's OLLAMA_MAX_LOADED_MODELS=3 config evicts and re-loads models when a harness sweeps sixteen variants in succession — paying disk-to-VRAM cost on most calls. The prior single-model run had no swap pressure; this run had it constantly. The follow-up that would have isolated this — an engine-direct comparison on the same hardware, single-loaded both lanes — stalled on a service-wrapper gap and is documented in the post-mortem. So the 5× number is bounded by the methodology that produced it. Nothing here proves the routing layer adds 5× overhead, and nothing here disproves it.

The Mission 1 re-run with the methodology fix (ad057f5b, per-model blocks instead of all-models-then-all-prompts) holds the small-model 30% routing-overhead number across two ledgers, so that headline is robust. The mid-size question stays open.

We're publishing the open question before we have the answer. Honest is how trust gets built.

[Other limits — Mac M1 8 GB ceiling, Predator second-GPU node, CPU inference on vps-81, cross-node routing overhead — are real but not in Mission 1's scope; they land in §6.]


§5. What to expect (and what not to)

If your hardware is in the range tested above, here's what these performance numbers tell you you can plan for:

What these numbers will NOT support, and we say so:


§6. What's next

Two-axis "what would move which numbers."

Hardware:

Software / inference shape:


§7. Methodology footnote

Every number in this post is from a reproducible benchmark harness. The harness spec lives at docs/BENCHMARKS/HARNESS.md. Two ledgers carry the data this post relies on:

Each model ran one Hello call (cold) and five canonical 5Q calls (warm-after-hello). Mission 1 is n=1 per cell; we'll add p50/p95 over a five-run repeat in a follow-up baseline. Methodology mirrors the prior 2026-04-11/12 Pavilion-Ollama- direct runs in ~/Documents/MyServers/instances/pavilion-windows-laptop/reports/ so deltas are real and not methodology artifacts. Every benchmark dispatch carries a test=true tag in telemetry so it can be excluded from production-load dashboards.

The git SHA at time of measurement, the exact env vars, the model sizes, the system load at run-time, and the wall-clock time of day are all stamped in the harness output. If you want to reproduce a number, you have everything you need.

The follow-up engine-direct comparison run that would have disambiguated §4's mid-size question (llama.cpp vs Ollama on the same Pavilion, same models, single-loaded both lanes) stalled on a Pavilion Ollama service-wrapper gap and is parked. The full attempt is documented at PAVILION_LLAMACPP_VS_OLLAMA_v0_INCOMPLETE.md.

One known limitation in the v1 baseline: the harness's primary JSONL ledger captured only 18 of 96 records before stopping flushing — Trinity reconstructed the missing 78 from the stdout log into the companion -reconstructed-from-log.jsonl (the canonical source for this report). A few per-call fields (prompt_tokens, response_chars, markers_hit list) aren't recoverable from stdout; harness v2 adds an os.fsync() after every record so this can't happen again. The v3 re-run captured all records cleanly.

If you spot a number that doesn't match what your hardware does on the same model with the same prompt, please tell us. Honest disconfirmation is more valuable than another credulous re-share.