Qwen2.5 Coder 14B · Windows CPU

CPU-only lane covering the Pavilion hello check plus the five-question, twenty-question, and ten-question suites. GTX 1050 offload stayed disabled for this pass.

Model: qwen2.5-coder:14b
Disk size: 8572 MB
Context: 32K advertised
Server: Pavilion Windows · CPU

Run Coverage

This report includes every finished artifact. The cards below show which planned phases completed and which produced no files.

Hello check Complete

Exact prompt succeeded and returned a clean greeting.

5-question suite Complete

All five tasks finished and produced scored artifacts.

20-question Python suite Complete

All twenty primary and follow-up prompts finished.

10-question real-context suite Complete

All ten real-context tasks finished and were pulled into the report.

Context-edge suite Missing

No output files were produced for the long-context append lane.

Headline Metrics

These cards surface the highest-signal latency, throughput, and formatting stats for this run so you can compare qwen2.5-coder:14b against other hosts.

Hello check 16.2s

Exact prompt: hi can you help me?

Small eval average 80.2s

90% marker coverage

20-question primary average 33.2s

3.11 tok/s

20-question follow-up average 40.3s

3.05 tok/s

20-question primary usable 20/20

Every primary prompt returned a non-empty answer.

20-question follow-up format 13/20

Follow-ups matched the requested format far more often than primaries, which passed 0/20.

10-question real-context average 141.1s

2.88 tok/s

10-question real-context usable 10/10

All ten primary prompts returned a usable answer.

Hello Check

This is the exact smoke-test prompt that kicked off the benchmark lane.

Of course! How may I assist you today?

Latency: 16.2s
Prompt: hi can you help me?
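
A latency figure like the one above can be captured by timing a single round trip. The sketch below is only an illustration: the report does not describe how the harness calls the server, so `send` is a hypothetical callable and `fake_send` is a stand-in that returns the recorded greeting without contacting any model.

```python
import time
from typing import Callable, Tuple


def timed_call(send: Callable[[str], str], prompt: str) -> Tuple[str, float]:
    """Send one prompt and return (reply, wall-clock seconds)."""
    start = time.perf_counter()
    reply = send(prompt)
    elapsed = time.perf_counter() - start
    return reply, elapsed


def fake_send(prompt: str) -> str:
    # Hypothetical stand-in for the real model call; no server is contacted.
    return "Of course! How may I assist you today?"


reply, seconds = timed_call(fake_send, "hi can you help me?")
print(f"Latency: {seconds:.1f}s")
print(f"Reply: {reply}")
```

Swapping `fake_send` for a function that actually posts the prompt to the Pavilion host would give a wall-clock number comparable to the 16.2s reported here.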

5-Question Small Eval Latency

These are the short shell, Python, ops, planning, and debugging tasks. Lower is better.

Disk Guard Script shell

33.9s

IPv4 Validator python

66.9s

Nginx Safe Reload ops

14.9s

YAML Validator Plan planning

217.5s

SSH Lockout Triage debugging

67.5s

20-Question Python Latency

Primary-prompt durations for the full Python suite, sorted from slowest to fastest.

SQLite Store sqlite

68.2s

HTTP Retry http

52.6s

JSON Validation validation

51.2s

Logging Setup logging

49.3s

Config Dataclass config

44.5s

CSV Summary analysis

43.7s

CLI Arguments cli

36.2s

Async Fetch async

34.1s

File Scanner file_io

32.7s

Regex Log Parser parsing

32.5s

Pytest Fixture tests

28.2s

Package Layout package

28.1s

Pathlib Cleaner filesystem

25.9s

Thread Pool concurrency

24.6s

Refactor Split refactor

24.6s

CSV Parser parsing

20.0s

Typed Dataclass typing

18.9s

FastAPI Handler web

18.8s

Pydantic Model validation

16.0s

Debug Stacktrace debugging

14.0s

20-Question Python Marker Coverage

Higher is better. This shows how much of each prompt's requested structure qwen2.5-coder:14b actually hit on the first try.

File Scanner file_io

100%

CLI Arguments cli

100%

Typed Dataclass typing

100%

Pytest Fixture tests

100%

Async Fetch async

100%

JSON Validation validation

100%

SQLite Store sqlite

100%

FastAPI Handler web

100%

Thread Pool concurrency

100%

Package Layout package

100%

Pathlib Cleaner filesystem

100%

CSV Parser parsing

75%

HTTP Retry http

75%

Config Dataclass config

75%

Logging Setup logging

75%

CSV Summary analysis

75%

Pydantic Model validation

75%

Regex Log Parser parsing

75%

Debug Stacktrace debugging

50%

Refactor Split refactor

50%
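
The report does not spell out how marker coverage is scored. One plausible reading, sketched here purely as an assumption, is the fraction of required substrings ("markers") that appear in the first response. The function name `marker_coverage`, the marker list, and the sample response are all hypothetical.

```python
from typing import Iterable


def marker_coverage(response: str, markers: Iterable[str]) -> float:
    """Fraction of markers found in the response (case-insensitive)."""
    marker_list = list(markers)
    if not marker_list:
        return 0.0  # nothing requested, so nothing to score
    lowered = response.lower()
    hits = sum(1 for marker in marker_list if marker.lower() in lowered)
    return hits / len(marker_list)


# Hypothetical markers for a CSV-style prompt; not taken from the real suite.
required = ["import csv", "def ", "with open", "return"]
sample_response = (
    "import csv\n\n"
    "def load(path):\n"
    "    with open(path) as fh:\n"
    "        rows = list(csv.reader(fh))\n"
    "    print(rows)\n"
)

score = marker_coverage(sample_response, required)
print(f"Marker coverage: {score:.0%}")
```

With four markers per prompt this scheme can only produce steps of 25%, which would be consistent with the 50%, 75%, and 100% values listed above, though the report does not state how many markers each prompt carries.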

5-Question Suite Details

Quick read: great on planning and SSH triage, shakier on exact shell formatting and the IPv4 implementation details.

Task	Category	Primary	Follow-up	Primary tok/s	Marker hit	Format OK	Usable
Disk Guard Script	shell	33.9s	n/a	3.35 tok/s	75%	No	Yes
IPv4 Validator	python	66.9s	n/a	2.99 tok/s	100%	Yes	Yes
Nginx Safe Reload	ops	14.9s	n/a	3.26 tok/s	75%	Yes	Yes
YAML Validator Plan	planning	217.5s	n/a	2.93 tok/s	100%	Yes	Yes
SSH Lockout Triage	debugging	67.5s	n/a	2.97 tok/s	100%	Yes	Yes
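
Throughput is nearly flat across these five tasks, so the spread in latency mostly reflects how long each answer was. A rough way to see this is to multiply latency by throughput. This is a sketch under one assumption the report does not confirm: that the tok/s figure counts generated tokens over the full primary duration.

```python
# (task, primary latency in seconds, primary tok/s) copied from the table above.
rows = [
    ("Disk Guard Script", 33.9, 3.35),
    ("IPv4 Validator", 66.9, 2.99),
    ("Nginx Safe Reload", 14.9, 3.26),
    ("YAML Validator Plan", 217.5, 2.93),
    ("SSH Lockout Triage", 67.5, 2.97),
]


def estimated_tokens(seconds: float, tok_per_s: float) -> float:
    """Approximate output length, assuming tok/s spans the whole duration."""
    return seconds * tok_per_s


estimates = {name: estimated_tokens(secs, rate) for name, secs, rate in rows}

# Print longest estimated answer first.
for name, tokens in sorted(estimates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22}{tokens:7.0f} tokens (estimated)")
```

Under that assumption the YAML Validator Plan answer comes out at about 637 tokens, several times longer than the Nginx answer, which would account for its 217.5s latency despite similar throughput.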

20-Question Python Suite Details

This is the practical coding workload, so each row shows how the model handled the primary and follow-up formats.

Task	Category	Primary	Follow-up	Primary tok/s	Marker hit	Format OK	Usable
CSV Parser	parsing	20.0s	53.0s	3.23 tok/s	75%	No	Yes
File Scanner	file_io	32.7s	41.3s	3.09 tok/s	100%	No	Yes
CLI Arguments	cli	36.2s	48.9s	3.12 tok/s	100%	No	Yes
Typed Dataclass	typing	18.9s	29.9s	3.14 tok/s	100%	No	Yes
Pytest Fixture	tests	28.2s	38.5s	3.06 tok/s	100%	No	Yes
Async Fetch	async	34.1s	28.4s	3.13 tok/s	100%	No	Yes
HTTP Retry	http	52.6s	44.6s	3.05 tok/s	75%	No	Yes
JSON Validation	validation	51.2s	46.7s	3.10 tok/s	100%	No	Yes
SQLite Store	sqlite	68.2s	54.6s	3.04 tok/s	100%	No	Yes
FastAPI Handler	web	18.8s	25.4s	3.15 tok/s	100%	No	Yes
Config Dataclass	config	44.5s	32.8s	2.97 tok/s	75%	No	Yes
Logging Setup	logging	49.3s	56.9s	3.06 tok/s	75%	No	Yes
Thread Pool	concurrency	24.6s	26.9s	3.14 tok/s	100%	No	Yes
Package Layout	package	28.1s	40.5s	3.12 tok/s	100%	No	Yes
Debug Stacktrace	debugging	14.0s	21.2s	3.19 tok/s	50%	No	Yes
Refactor Split	refactor	24.6s	45.8s	3.06 tok/s	50%	No	Yes
CSV Summary	analysis	43.7s	36.9s	3.11 tok/s	75%	No	Yes
Pathlib Cleaner	filesystem	25.9s	55.4s	3.12 tok/s	100%	No	Yes
Pydantic Model	validation	16.0s	26.1s	3.18 tok/s	75%	No	Yes
Regex Log Parser	parsing	32.5s	51.3s	3.12 tok/s	75%	No	Yes
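
The headline averages for this suite can be rechecked directly from the rows above. The sketch below copies the primary latency, follow-up latency, and primary tok/s columns in table order and takes plain arithmetic means. Whether the report's own generator averages the same way is an assumption.

```python
from statistics import mean

# Columns copied from the table above, in row order.
primary_s = [20.0, 32.7, 36.2, 18.9, 28.2, 34.1, 52.6, 51.2, 68.2, 18.8,
             44.5, 49.3, 24.6, 28.1, 14.0, 24.6, 43.7, 25.9, 16.0, 32.5]
follow_up_s = [53.0, 41.3, 48.9, 29.9, 38.5, 28.4, 44.6, 46.7, 54.6, 25.4,
               32.8, 56.9, 26.9, 40.5, 21.2, 45.8, 36.9, 55.4, 26.1, 51.3]
primary_tok_s = [3.23, 3.09, 3.12, 3.14, 3.06, 3.13, 3.05, 3.10, 3.04, 3.15,
                 2.97, 3.06, 3.14, 3.12, 3.19, 3.06, 3.11, 3.12, 3.18, 3.12]

avg_primary = mean(primary_s)
avg_follow_up = mean(follow_up_s)
avg_tok_s = mean(primary_tok_s)

print(f"Primary average:   {avg_primary:.1f}s")
print(f"Follow-up average: {avg_follow_up:.1f}s")
print(f"Primary tok/s:     {avg_tok_s:.2f}")
```

The three means agree with the headline cards (33.2s, 40.3s, and 3.11 tok/s) to within rounding.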

20-Question Python Category Summary

This groups the finished 20-question suite by task family so you can see where the model stayed sharp and where it softened.

Category	Tasks	Avg primary	Avg primary tok/s	Avg primary hit	Avg follow-up hit
analysis	1/1	43.7s	3.11 tok/s	75%	0%
async	1/1	34.1s	3.13 tok/s	100%	0%
cli	1/1	36.2s	3.12 tok/s	100%	0%
concurrency	1/1	24.6s	3.14 tok/s	100%	100%
config	1/1	44.5s	2.97 tok/s	75%	100%
debugging	1/1	14.0s	3.19 tok/s	50%	100%
file_io	1/1	32.7s	3.09 tok/s	100%	100%
filesystem	1/1	25.9s	3.12 tok/s	100%	100%
http	1/1	52.6s	3.05 tok/s	75%	0%
logging	1/1	49.3s	3.06 tok/s	75%	0%
package	1/1	28.1s	3.12 tok/s	100%	100%
parsing	2/2	26.3s	3.17 tok/s	75%	100%
refactor	1/1	24.6s	3.06 tok/s	50%	100%
sqlite	1/1	68.2s	3.04 tok/s	100%	0%
tests	1/1	28.2s	3.06 tok/s	100%	100%
typing	1/1	18.9s	3.14 tok/s	100%	100%
validation	2/2	33.6s	3.14 tok/s	88%	100%
web	1/1	18.8s	3.15 tok/s	100%	100%
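
A per-category roll-up like this one amounts to a group-by followed by a mean. The sketch below shows the idea on a subset of rows from the 20-question table, covering the two categories that contain more than one task plus one single-task category. The field names are illustrative, not the report generator's.

```python
from collections import defaultdict
from statistics import mean

# (category, primary seconds, primary tok/s, primary marker hit %)
rows = [
    ("parsing", 20.0, 3.23, 75),      # CSV Parser
    ("parsing", 32.5, 3.12, 75),      # Regex Log Parser
    ("validation", 51.2, 3.10, 100),  # JSON Validation
    ("validation", 16.0, 3.18, 75),   # Pydantic Model
    ("web", 18.8, 3.15, 100),         # FastAPI Handler
]

# Bucket the rows by category.
grouped = defaultdict(list)
for category, seconds, tok_s, hit in rows:
    grouped[category].append((seconds, tok_s, hit))

# Average each metric within its category.
summary = {
    category: {
        "tasks": len(items),
        "avg_primary_s": mean(item[0] for item in items),
        "avg_tok_s": mean(item[1] for item in items),
        "avg_hit_pct": mean(item[2] for item in items),
    }
    for category, items in grouped.items()
}

for category in sorted(summary):
    stats = summary[category]
    print(category, stats["tasks"], f"{stats['avg_primary_s']:.2f}s",
          f"{stats['avg_tok_s']:.3f} tok/s", f"{stats['avg_hit_pct']:.1f}%")
```

The unrounded parsing average is 26.25s and the unrounded validation hit rate is 87.5%, which the table shows as 26.3s and 88%.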

10-Question Real-Context Latency

These are the heavier repository-shaped prompts. Lower is better.

User Preferences Contract Test tests

199.2s

API Token Audit Regression Test tests

194.6s

Feature Flag Lifecycle Test tests

182.1s

Announcements State Sync Review review

179.8s

Lane Config Patch Plan planning

160.5s

Board Snapshot Regression Test tests

111.5s

Task Bulk Job Debug Packet debugging

106.2s

Ingest Log Triage cross_repo_debugging

98.2s

Orchestration Timeline Forensics forensics

89.8s

Auth Redirect Triage debugging

88.6s

10-Question Real-Context Marker Coverage

Higher is better. This shows how much of the requested triage, test, planning, and review structure landed on the first response.

Orchestration Timeline Forensics forensics

100%

Auth Redirect Triage debugging

88%

Board Snapshot Regression Test tests

86%

API Token Audit Regression Test tests

86%

Feature Flag Lifecycle Test tests

86%

User Preferences Contract Test tests

86%

Lane Config Patch Plan planning

83%

Announcements State Sync Review review

83%

Ingest Log Triage cross_repo_debugging

83%

Task Bulk Job Debug Packet debugging

67%

10-Question Real-Context Details

This is the more realistic repo-style suite. It is the best signal for whether the model stays reliable once prompts stop looking like toy exercises.

Task	Category	Primary	Follow-up	Primary tok/s	Marker hit	Format OK	Usable
Auth Redirect Triage	debugging	88.6s	37.9s	2.95 tok/s	88%	No	Yes
Board Snapshot Regression Test	tests	111.5s	41.1s	2.89 tok/s	86%	No	Yes
Lane Config Patch Plan	planning	160.5s	52.8s	2.86 tok/s	83%	No	Yes
API Token Audit Regression Test	tests	194.6s	39.8s	2.85 tok/s	86%	No	Yes
Announcements State Sync Review	review	179.8s	34.3s	2.86 tok/s	83%	No	Yes
Feature Flag Lifecycle Test	tests	182.1s	50.3s	2.87 tok/s	86%	No	Yes
Task Bulk Job Debug Packet	debugging	106.2s	51.3s	2.88 tok/s	67%	No	Yes
User Preferences Contract Test	tests	199.2s	34.0s	2.85 tok/s	86%	No	Yes
Orchestration Timeline Forensics	forensics	89.8s	43.9s	2.87 tok/s	100%	No	Yes
Ingest Log Triage	cross_repo_debugging	98.2s	37.8s	2.89 tok/s	83%	No	Yes

What This Means

qwen2.5-coder:14b is not meant to replace the largest workers, but the telemetry shows it completed every suite that produced artifacts. On the 20-question Python suite it averaged 33.2s per primary prompt at about 3.11 tok/s and landed 20/20 usable primaries. Follow-up formatting remained the safety net: 0/20 primary format passes versus 13/20 on the follow-ups.

Launcher note: the original full-lane launcher reached the ten-question handoff but did not leave artifacts for that phase. This report is built from the artifacts that actually exist. The long-context edge lane is still missing.