Archive · Pavilion windows-laptop · 2026-04-12-windows-gpu-qwen14b-benchmark.html. Originally rendered 2026-04-12. Re-hosted from MyServers on 2026-05-06. Methodology and harness conventions may differ from what we use today; see /methodology.html for current standards.

Qwen2.5 Coder 14B · Windows GPU

GTX 1050 offload enabled for the hello check plus the five-question, twenty-question, and ten-question suites.

Model: qwen2.5-coder:14b
Disk size: 8572 MB
Context: 32K advertised
Server: Pavilion Windows · GTX 1050

Run Coverage

This report includes every finished artifact. It lists which planned phases completed and which produced no files.

Hello check: Complete
Exact prompt succeeded and returned a clean greeting.

5-question suite: Complete
All five tasks finished and produced scored artifacts.

20-question Python suite: Complete
All twenty primary and follow-up prompts finished.

10-question real-context suite: Complete
All ten real-context tasks finished and were pulled into the report.

Context-edge suite: Missing
No output files were produced for the long-context append lane.
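
The Complete/Missing labels above could be derived by checking whether each phase left any files behind. The following is a minimal sketch, assuming a hypothetical layout with one directory per phase; the phase names and the `artifacts` path are illustrative and are not taken from the actual harness.

```python
from pathlib import Path

# Hypothetical phase directory names; the real harness may use different ones.
PHASES = ["hello", "five_question", "twenty_question", "real_context", "context_edge"]


def phase_status(artifact_root: Path, phases=PHASES) -> dict[str, str]:
    """Mark a phase Complete if its directory holds at least one file, else Missing."""
    status = {}
    for phase in phases:
        phase_dir = artifact_root / phase
        has_files = phase_dir.is_dir() and any(p.is_file() for p in phase_dir.rglob("*"))
        status[phase] = "Complete" if has_files else "Missing"
    return status


# Usage with a hypothetical root; a directory that does not exist reports Missing.
for phase, state in phase_status(Path("artifacts")).items():
    print(f"{phase}: {state}")
```

Treating an empty or absent directory as Missing matches how this page reports the context-edge lane: the phase was planned but produced no files.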

Headline Metrics

These cards surface the highest-signal latency, throughput, and formatting stats for this run so you can compare qwen2.5-coder:14b against other hosts.

Hello check: 17.1s
Exact prompt: hi can you help me?

Small eval average: 54.9s
90% marker coverage across the five tasks.

20-question primary average: 33.2s
3.15 tok/s

20-question follow-up average: 41.4s
3.06 tok/s

20-question primary usable: 20/20
Every primary prompt returned a non-empty answer.

20-question follow-up format: 14/20
Follow-ups passed the format check far more often than primaries, which passed 0/20.

10-question real-context average: 138.3s
2.88 tok/s

10-question real-context usable: 10/10
Every primary prompt returned a usable answer.

Hello Check

This is the exact smoke-test prompt that kicked off the benchmark lane.

Of course! How may I assist you today?
Latency: 17.1s
Prompt: hi can you help me?

5-Question Small Eval Latency

These are the short shell, Python, ops, planning, and debugging tasks. Lower is better.

Disk Guard Script (shell): 34.5s
IPv4 Validator (python): 64.8s
Nginx Safe Reload (ops): 15.9s
YAML Validator Plan (planning): 62.3s
SSH Lockout Triage (debugging): 97.0s

20-Question Python Latency

Primary-prompt durations for the full Python suite, sorted from slowest to fastest.

SQLite Store (sqlite): 68.1s
JSON Validation (validation): 56.5s
HTTP Retry (http): 53.1s
Logging Setup (logging): 47.8s
Config Dataclass (config): 44.7s
CSV Summary (analysis): 43.5s
CLI Arguments (cli): 35.3s
Regex Log Parser (parsing): 33.6s
File Scanner (file_io): 32.9s
Async Fetch (async): 32.3s
Package Layout (package): 27.5s
Pytest Fixture (tests): 26.8s
Pathlib Cleaner (filesystem): 26.3s
Thread Pool (concurrency): 24.5s
Refactor Split (refactor): 24.1s
CSV Parser (parsing): 19.9s
FastAPI Handler (web): 18.6s
Typed Dataclass (typing): 18.5s
Pydantic Model (validation): 15.9s
Debug Stacktrace (debugging): 13.8s

20-Question Python Marker Coverage

Higher is better. This shows how much of each prompt's requested structure qwen2.5-coder:14b actually hit on the first try.

File Scanner (file_io): 100%
CLI Arguments (cli): 100%
Typed Dataclass (typing): 100%
Pytest Fixture (tests): 100%
Async Fetch (async): 100%
JSON Validation (validation): 100%
SQLite Store (sqlite): 100%
FastAPI Handler (web): 100%
Thread Pool (concurrency): 100%
Package Layout (package): 100%
Pathlib Cleaner (filesystem): 100%
CSV Parser (parsing): 75%
HTTP Retry (http): 75%
Config Dataclass (config): 75%
Logging Setup (logging): 75%
CSV Summary (analysis): 75%
Pydantic Model (validation): 75%
Regex Log Parser (parsing): 75%
Debug Stacktrace (debugging): 50%
Refactor Split (refactor): 50%
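
This page does not define how marker coverage is scored. The 50%, 75%, and 100% values are consistent with the share of required markers found in the first response (for example 2 of 4 or 3 of 4), so the sketch below assumes that definition. The function name, the substring matching, and the example markers are all hypothetical and are not taken from the real suite.

```python
def marker_coverage(response: str, markers: list[str]) -> float:
    """Return the share of required markers found in the response (0.0 to 1.0)."""
    if not markers:
        return 1.0  # Nothing was required, so nothing was missed.
    text = response.lower()
    hits = sum(1 for marker in markers if marker.lower() in text)
    return hits / len(markers)


# Hypothetical markers for a CSV-parsing prompt; illustrative only.
markers = ["import csv", "def ", "with open", "if __name__"]
response = (
    "import csv\n\n"
    "def load(path):\n"
    "    with open(path) as fh:\n"
    "        return list(csv.reader(fh))\n"
)

print(f"{marker_coverage(response, markers):.0%}")  # prints 75%: 3 of 4 markers present
```

A case-insensitive substring check is the simplest possible matcher; a real harness might use regular expressions or structural checks instead.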

5-Question Suite Details

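Quick read: great on planning and SSH triage, shakier on exact shell formatting (Disk Guard Script missed the format check) and on the IPv4 implementation details, although IPv4 Validator still hit 100% of its markers and passed the format check.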

| Task | Category | Primary | Follow-up | Primary tok/s | Marker hit | Format OK | Usable |
|---|---|---|---|---|---|---|---|
| Disk Guard Script | shell | 34.5s | n/a | 3.30 tok/s | 75% | No | Yes |
| IPv4 Validator | python | 64.8s | n/a | 3.06 tok/s | 100% | Yes | Yes |
| Nginx Safe Reload | ops | 15.9s | n/a | 3.03 tok/s | 75% | Yes | Yes |
| YAML Validator Plan | planning | 62.3s | n/a | 3.02 tok/s | 100% | Yes | Yes |
| SSH Lockout Triage | debugging | 97.0s | n/a | 2.98 tok/s | 100% | Yes | Yes |
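
The two small-eval headline figures (54.9s average and 90% marker coverage) can be reproduced from the rows above. This is a short arithmetic check, assuming both headline numbers are plain unweighted means over the five tasks; the variable names are illustrative.

```python
from statistics import mean

# (task, primary latency in seconds, marker hit in percent), copied from the table above.
rows = [
    ("Disk Guard Script", 34.5, 75),
    ("IPv4 Validator", 64.8, 100),
    ("Nginx Safe Reload", 15.9, 75),
    ("YAML Validator Plan", 62.3, 100),
    ("SSH Lockout Triage", 97.0, 100),
]

avg_latency = mean(latency for _, latency, _ in rows)
avg_marker = mean(hit for _, _, hit in rows)

print(f"Small eval average: {avg_latency:.1f}s")  # prints 54.9s
print(f"Marker coverage: {avg_marker:.0f}%")      # prints 90%
```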

20-Question Python Suite Details

This is the practical coding workload, so each row shows how the model handled the primary and follow-up formats.

| Task | Category | Primary | Follow-up | Primary tok/s | Marker hit | Format OK | Usable |
|---|---|---|---|---|---|---|---|
| CSV Parser | parsing | 19.9s | 56.7s | 3.25 tok/s | 75% | No | Yes |
| File Scanner | file_io | 32.9s | 34.8s | 3.03 tok/s | 100% | No | Yes |
| CLI Arguments | cli | 35.3s | 39.0s | 3.18 tok/s | 100% | No | Yes |
| Typed Dataclass | typing | 18.5s | 45.9s | 3.19 tok/s | 100% | No | Yes |
| Pytest Fixture | tests | 26.8s | 35.7s | 3.26 tok/s | 100% | No | Yes |
| Async Fetch | async | 32.3s | 45.9s | 3.30 tok/s | 100% | No | Yes |
| HTTP Retry | http | 53.1s | 49.6s | 3.06 tok/s | 75% | No | Yes |
| JSON Validation | validation | 56.5s | 39.5s | 3.12 tok/s | 100% | No | Yes |
| SQLite Store | sqlite | 68.1s | 54.3s | 3.06 tok/s | 100% | No | Yes |
| FastAPI Handler | web | 18.6s | 35.9s | 3.20 tok/s | 100% | No | Yes |
| Config Dataclass | config | 44.7s | 50.4s | 3.12 tok/s | 75% | No | Yes |
| Logging Setup | logging | 47.8s | 36.2s | 3.02 tok/s | 75% | No | Yes |
| Thread Pool | concurrency | 24.5s | 34.3s | 3.15 tok/s | 100% | No | Yes |
| Package Layout | package | 27.5s | 41.3s | 3.19 tok/s | 100% | No | Yes |
| Debug Stacktrace | debugging | 13.8s | 21.1s | 3.26 tok/s | 50% | No | Yes |
| Refactor Split | refactor | 24.1s | 41.9s | 3.12 tok/s | 50% | No | Yes |
| CSV Summary | analysis | 43.5s | 46.0s | 3.13 tok/s | 75% | No | Yes |
| Pathlib Cleaner | filesystem | 26.3s | 40.7s | 3.05 tok/s | 100% | No | Yes |
| Pydantic Model | validation | 15.9s | 24.3s | 3.19 tok/s | 75% | No | Yes |
| Regex Log Parser | parsing | 33.6s | 54.8s | 3.03 tok/s | 75% | No | Yes |

20-Question Python Category Summary

This groups the finished 20-question suite by task family so you can see where the model stayed sharp and where it softened.

| Category | Tasks | Avg primary | Avg primary tok/s | Avg primary hit | Avg follow-up hit |
|---|---|---|---|---|---|
| analysis | 1/1 | 43.5s | 3.13 tok/s | 75% | 0% |
| async | 1/1 | 32.3s | 3.30 tok/s | 100% | 0% |
| cli | 1/1 | 35.3s | 3.18 tok/s | 100% | 0% |
| concurrency | 1/1 | 24.5s | 3.15 tok/s | 100% | 100% |
| config | 1/1 | 44.7s | 3.12 tok/s | 75% | 0% |
| debugging | 1/1 | 13.8s | 3.26 tok/s | 50% | 100% |
| file_io | 1/1 | 32.9s | 3.03 tok/s | 100% | 100% |
| filesystem | 1/1 | 26.3s | 3.05 tok/s | 100% | 100% |
| http | 1/1 | 53.1s | 3.06 tok/s | 75% | 0% |
| logging | 1/1 | 47.8s | 3.02 tok/s | 75% | 100% |
| package | 1/1 | 27.5s | 3.19 tok/s | 100% | 100% |
| parsing | 2/2 | 26.8s | 3.14 tok/s | 75% | 100% |
| refactor | 1/1 | 24.1s | 3.12 tok/s | 50% | 100% |
| sqlite | 1/1 | 68.1s | 3.06 tok/s | 100% | 0% |
| tests | 1/1 | 26.8s | 3.26 tok/s | 100% | 100% |
| typing | 1/1 | 18.5s | 3.19 tok/s | 100% | 100% |
| validation | 2/2 | 36.2s | 3.16 tok/s | 88% | 100% |
| web | 1/1 | 18.6s | 3.20 tok/s | 100% | 100% |
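
The category rows above can be rebuilt by grouping the 20-question results by task family and averaging each group. The sketch below assumes unweighted means per category; the row data is copied from the 20-question details on this page, while the variable names are illustrative. Printed values should agree with the summary up to rounding of exact halves (for example 26.75s for parsing).

```python
from collections import defaultdict
from statistics import mean

# (task, category, primary seconds, primary marker hit %) from the 20-question details.
rows = [
    ("CSV Parser", "parsing", 19.9, 75),
    ("File Scanner", "file_io", 32.9, 100),
    ("CLI Arguments", "cli", 35.3, 100),
    ("Typed Dataclass", "typing", 18.5, 100),
    ("Pytest Fixture", "tests", 26.8, 100),
    ("Async Fetch", "async", 32.3, 100),
    ("HTTP Retry", "http", 53.1, 75),
    ("JSON Validation", "validation", 56.5, 100),
    ("SQLite Store", "sqlite", 68.1, 100),
    ("FastAPI Handler", "web", 18.6, 100),
    ("Config Dataclass", "config", 44.7, 75),
    ("Logging Setup", "logging", 47.8, 75),
    ("Thread Pool", "concurrency", 24.5, 100),
    ("Package Layout", "package", 27.5, 100),
    ("Debug Stacktrace", "debugging", 13.8, 50),
    ("Refactor Split", "refactor", 24.1, 50),
    ("CSV Summary", "analysis", 43.5, 75),
    ("Pathlib Cleaner", "filesystem", 26.3, 100),
    ("Pydantic Model", "validation", 15.9, 75),
    ("Regex Log Parser", "parsing", 33.6, 75),
]

# Collect (seconds, hit) pairs per category.
by_category = defaultdict(list)
for _, category, seconds, hit in rows:
    by_category[category].append((seconds, hit))

# category -> (task count, average primary seconds, average primary marker hit)
summary = {
    category: (len(vals), mean(s for s, _ in vals), mean(h for _, h in vals))
    for category, vals in sorted(by_category.items())
}

for category, (count, avg_seconds, avg_hit) in summary.items():
    print(f"{category:<12} {count} task(s)  {avg_seconds:5.1f}s  {avg_hit:.0f}%")
```

Only parsing and validation contain two tasks, so they are the only rows where the average differs from a single task's figures.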

10-Question Real-Context Latency

These are the heavier repository-shaped prompts. Lower is better.

Feature Flag Lifecycle Test (tests): 222.6s
User Preferences Contract Test (tests): 208.7s
Lane Config Patch Plan (planning): 163.7s
Announcements State Sync Review (review): 142.3s
API Token Audit Regression Test (tests): 121.9s
Board Snapshot Regression Test (tests): 118.6s
Task Bulk Job Debug Packet (debugging): 112.9s
Ingest Log Triage (cross_repo_debugging): 111.7s
Auth Redirect Triage (debugging): 92.8s
Orchestration Timeline Forensics (forensics): 88.2s

10-Question Real-Context Marker Coverage

Higher is better. This shows how much of the requested triage, test, planning, and review structure landed on the first response.

Lane Config Patch Plan (planning): 100%
Announcements State Sync Review (review): 100%
Task Bulk Job Debug Packet (debugging): 100%
Orchestration Timeline Forensics (forensics): 100%
Auth Redirect Triage (debugging): 88%
Board Snapshot Regression Test (tests): 86%
API Token Audit Regression Test (tests): 86%
Feature Flag Lifecycle Test (tests): 86%
User Preferences Contract Test (tests): 86%
Ingest Log Triage (cross_repo_debugging): 83%

10-Question Real-Context Details

This is the more realistic repo-style suite. It is the best signal for whether the model stays reliable once prompts stop looking like toy exercises.

| Task | Category | Primary | Follow-up | Primary tok/s | Marker hit | Format OK | Usable |
|---|---|---|---|---|---|---|---|
| Auth Redirect Triage | debugging | 92.8s | 34.4s | 2.96 tok/s | 88% | No | Yes |
| Board Snapshot Regression Test | tests | 118.6s | 51.3s | 2.91 tok/s | 86% | No | Yes |
| Lane Config Patch Plan | planning | 163.7s | 52.8s | 2.88 tok/s | 100% | No | Yes |
| API Token Audit Regression Test | tests | 121.9s | 41.1s | 2.87 tok/s | 86% | No | Yes |
| Announcements State Sync Review | review | 142.3s | 46.6s | 2.89 tok/s | 100% | No | Yes |
| Feature Flag Lifecycle Test | tests | 222.6s | 49.7s | 2.86 tok/s | 86% | No | Yes |
| Task Bulk Job Debug Packet | debugging | 112.9s | 52.2s | 2.87 tok/s | 100% | No | Yes |
| User Preferences Contract Test | tests | 208.7s | 44.6s | 2.83 tok/s | 86% | No | Yes |
| Orchestration Timeline Forensics | forensics | 88.2s | 47.0s | 2.89 tok/s | 100% | No | Yes |
| Ingest Log Triage | cross_repo_debugging | 111.7s | 38.6s | 2.88 tok/s | 83% | No | Yes |
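
To put the real-context numbers next to the 20-question Python suite, the headline averages reported on this page can be compared directly. This is a small arithmetic sketch using only those reported figures; the dictionary keys and variable names are illustrative.

```python
# Headline averages as reported on this page.
python_suite = {"avg_seconds": 33.2, "tok_per_s": 3.15}
real_context = {"avg_seconds": 138.3, "tok_per_s": 2.88}

# How many times longer the average real-context primary took.
latency_ratio = real_context["avg_seconds"] / python_suite["avg_seconds"]

# Relative loss in generation throughput on the heavier prompts.
throughput_drop = 1 - real_context["tok_per_s"] / python_suite["tok_per_s"]

print(f"Real-context primaries took {latency_ratio:.1f}x as long on average")  # 4.2x
print(f"Throughput fell by {throughput_drop:.1%}")  # 8.6%
```

The latency gap is much larger than the throughput gap, which suggests most of the extra time is not explained by slower token generation alone. The page does not break latency down further, so the cause is not established here.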

What This Means

qwen2.5-coder:14b is not meant to replace the largest workers, but the telemetry shows it completed every lane that produced artifacts. On the 20-question Python suite it averaged 33.2s per primary prompt at about 3.15 tok/s and returned 20/20 usable primaries. Follow-up prompts remained the safety net for formatting: 0/20 primaries passed the format check versus 14/20 follow-ups.

Launcher note: the original full-lane launcher reached the ten-question handoff but did not leave artifacts for that phase. This report is built from the artifacts that actually exist. The long-context edge lane is still missing.