Archive · vps-81 historical telemetry · local-mac/2026-04-11-qwen3b-local-mac-vs-vps-report.html. Originally rendered 2026-04-11. Re-hosted from MyServers on 2026-05-06. Methodology and harness conventions may differ from what we use today; see /methodology.html for current standards.

Qwen2.5 Coder 3B: Mac vs VPS

This page compares the same qwen2.5-coder:3b benchmark stack on Slobodan's Apple M1 Mac and on vps50. The shared comparison uses the suites both sides definitely completed: the five-question small eval, the twenty-question Python suite, and the ten-question real-context suite.

Shared Suite Total: 914.5s (Mac) vs 2282.3s (VPS)

Mac versus VPS across the shared 5Q, 20Q, and 10Q suites.

Wall-Time Advantage: 2.50x

VPS wall time divided by Mac wall time across the shared benchmark stack (2282.3s / 914.5s).
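The shared total and the wall-time advantage can be re-derived from the three "Total wall time" rows in the tables further down this page. The following is a minimal Python sketch, not the original harness; the dictionary and function names are illustrative assumptions.

```python
# Per-suite total wall time in seconds, copied from the tables on this page.
# The names below are illustrative; the original harness is not shown here.
WALL_TIME_S = {
    "5Q": {"mac": 110.1, "vps": 336.2},
    "20Q": {"mac": 397.1, "vps": 1268.5},
    "10Q": {"mac": 407.3, "vps": 677.6},
}


def shared_total(host: str) -> float:
    """Sum wall time over the three shared suites for one host."""
    return round(sum(suite[host] for suite in WALL_TIME_S.values()), 1)


mac_total = shared_total("mac")
vps_total = shared_total("vps")
advantage = vps_total / mac_total  # above 1.0 means the Mac finished sooner

print(f"Mac {mac_total}s vs VPS {vps_total}s -> {advantage:.2f}x")
```

With the figures on this page the sums come out to 914.5s and 2282.3s, and the ratio rounds to 2.50x.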

5Q Throughput Advantage: 2.82x

Mac tokens-per-second divided by VPS tokens-per-second on the five-question packet.
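As a rough illustration of that definition, the same division can be applied to the throughput rows reported in the tables on this page. This is a sketch under stated assumptions: the 20Q and 10Q entries use the "Primary avg throughput" rows, and the names are hypothetical rather than taken from the benchmark harness.

```python
# Average throughput in tokens per second as (Mac, VPS), copied from this page.
# The 20Q and 10Q entries use the "Primary avg throughput" rows (an assumption
# about which row is the fairest comparison).
THROUGHPUT_TOK_S = {
    "5Q": (12.86, 4.56),
    "20Q primary": (14.20, 4.16),
    "10Q primary": (14.01, 7.91),
}


def throughput_advantage(mac_tok_s: float, vps_tok_s: float) -> float:
    """Mac tokens per second divided by VPS tokens per second."""
    if vps_tok_s <= 0:
        raise ValueError("VPS throughput must be positive")
    return mac_tok_s / vps_tok_s


advantages = {
    suite: throughput_advantage(mac, vps)
    for suite, (mac, vps) in THROUGHPUT_TOK_S.items()
}

for suite, value in advantages.items():
    print(f"{suite}: {value:.2f}x")
```

The 5Q entry reproduces the 2.82x headline figure; the other two rows give roughly 3.41x and 1.77x.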

Mac Hello Check: 5.8s

Model reply: "Of course! I'm here to help. What do you need assistance with today?"

Why The Speed Gap Changes By Suite

The ratio is not fixed because each packet stresses the model differently. The Mac wins biggest when the benchmark spends most of its time generating answer tokens, and the gap narrows when the benchmark spends more time digesting large prompt packets first.
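To make the variation concrete, the per-suite wall-time ratios can be recomputed from the "Total wall time" rows on this page. A minimal sketch, with illustrative names that are not part of the original harness:

```python
# Total wall time in seconds per suite as (Mac, VPS), copied from this page.
SUITES = {
    "5Q mixed packet": (110.1, 336.2),
    "20Q Python": (397.1, 1268.5),
    "10Q real-context": (407.3, 677.6),
}


def wall_time_ratio(mac_s: float, vps_s: float) -> float:
    """VPS wall time divided by Mac wall time; above 1.0 means the Mac was faster."""
    return vps_s / mac_s


ratios = {name: wall_time_ratio(*times) for name, times in SUITES.items()}

# Print suites from the largest Mac lead to the smallest.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:.2f}x")
```

With these figures the 20Q Python suite ranks first (about 3.19x), the 5Q packet second (about 3.05x), and the 10Q real-context suite last (about 1.66x).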

In the 20Q Python stack, the Mac cut combined model-eval time from 1189.8s on vps50 to 357.4s locally. That packet is dominated by many medium-length code answers and summary follow-ups, so raw decode throughput matters most, and the Mac opens its largest wall-time lead of the three suites here (397.1s versus 1268.5s, about 3.2x).

In the 10Q real-context stack, both sides had to chew through much larger repo-shaped prompts before answering. The Mac still finished faster, but the prompt burden stayed high on both sides: about 10,074 prompt tokens locally versus 10,026 on vps50. That pulls the overall ratio back down because the suite is less purely generation-bound.

The 5Q packet is the noisiest of the three. It is only five mixed-format prompts, so one oddball answer or formatting miss moves the average more than it does in the longer Python stacks.
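The sample-size effect can be shown with simple arithmetic: one slow answer shifts an average by its extra time divided by the number of questions. The sketch below uses hypothetical timings (a 20s baseline and one answer that runs 30s long), chosen only for illustration and not taken from the benchmark logs.

```python
from statistics import mean


def mean_shift(baseline_s: float, n_questions: int, outlier_extra_s: float) -> float:
    """Change in the average when exactly one question takes outlier_extra_s longer."""
    typical = [baseline_s] * n_questions
    with_outlier = typical[:-1] + [baseline_s + outlier_extra_s]
    return mean(with_outlier) - mean(typical)


# Hypothetical numbers: a 20 s baseline and a single answer that runs 30 s long.
shift_5q = mean_shift(20.0, 5, 30.0)
shift_20q = mean_shift(20.0, 20, 30.0)

print(f"5 questions: +{shift_5q:.1f}s, 20 questions: +{shift_20q:.1f}s")
```

Under these assumed numbers the same outlier moves the five-question average by 6.0s but the twenty-question average by only 1.5s.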

Five-Question Packet

Shell, ops, planning, and SSH triage tasks.

| Metric | Mac M1 | VPS50 |
| --- | --- | --- |
| Total wall time | 110.1s | 336.2s |
| Average question time | 22.0s | 67.2s |
| Average throughput | 12.86 tok/s | 4.56 tok/s |
| Average marker hit | 90% | 90% |
| Format passes | 3/5 | 3/5 |
| Strict passes | 3/5 | 3/5 |

Python 20Q

General Python implementation and debugging tasks.

| Metric | Mac M1 | VPS50 |
| --- | --- | --- |
| Total wall time | 397.1s | 1268.5s |
| Primary avg duration | 13.3s | 38.1s |
| Follow-up avg duration | 6.5s | 25.3s |
| Primary avg throughput | 14.20 tok/s | 4.16 tok/s |
| Follow-up avg throughput | 14.91 tok/s | 3.91 tok/s |
| Primary avg marker hit | 85% | 89% |
| Follow-up avg marker hit | 0% | 0% |
| Usable primary answers | 20/20 | 20/20 |
| Usable follow-up answers | 20/20 | 20/20 |

Real-Context 10Q

Repo-shaped multi-file tasks closer to real production prompts.

| Metric | Mac M1 | VPS50 |
| --- | --- | --- |
| Total wall time | 407.3s | 677.6s |
| Primary avg duration | 31.1s | 54.2s |
| Follow-up avg duration | 9.6s | 13.5s |
| Primary avg throughput | 14.01 tok/s | 7.91 tok/s |
| Follow-up avg throughput | 14.34 tok/s | 8.98 tok/s |
| Primary avg marker hit | 78% | 73% |
| Follow-up avg marker hit | 0% | 0% |
| Usable primary answers | 10/10 | 10/10 |
| Usable follow-up answers | 10/10 | 10/10 |
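As a sanity check on the 20Q and 10Q tables, the reported totals can be compared with what the per-question averages imply. The sketch below assumes that each question costs one primary call plus one follow-up call and that nothing else contributes to wall time; that model is an assumption made for illustration, not something this page states, and the names are hypothetical.

```python
# (questions, primary avg s, follow-up avg s, reported total wall s), from this page.
ROWS = {
    "20Q mac": (20, 13.3, 6.5, 397.1),
    "20Q vps": (20, 38.1, 25.3, 1268.5),
    "10Q mac": (10, 31.1, 9.6, 407.3),
    "10Q vps": (10, 54.2, 13.5, 677.6),
}


def reconstructed_total(n: int, primary_avg_s: float, followup_avg_s: float) -> float:
    """Assumed model: each question costs one primary plus one follow-up call."""
    return n * (primary_avg_s + followup_avg_s)


gaps = {}
for label, (n, primary, followup, reported) in ROWS.items():
    estimate = reconstructed_total(n, primary, followup)
    gaps[label] = reported - estimate
    print(f"{label}: estimated {estimate:.1f}s, reported {reported:.1f}s")
```

Under that assumption every estimate lands within roughly a second of the reported total, which is no more than rounding the averages to 0.1s could explain.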

Drill-Down Reports

Every report below is archived separately so the results stay stable and shareable.