Exact prompt succeeded and returned a clean greeting.
GTX 1050 offload enabled for the hello check plus the five-question, twenty-question, and ten-question suites.
All finished artifacts are in this report. The page calls out exactly which planned phases completed and which ones have not yet produced any files.
Exact prompt succeeded and returned a clean greeting.
All five tasks finished and produced scored artifacts.
All twenty primary and follow-up prompts finished.
All ten real-context tasks finished and were pulled into the report.
No output files were produced for the long-context append lane.
These cards surface the highest-signal latency, throughput, and formatting stats for this run so you can compare qwen2.5-coder:14b against other hosts.
Exact prompt: hi can you help me?
90% marker coverage
3.15 tok/s
3.06 tok/s
Every primary prompt returned a non-empty answer.
Follow-ups were much more obedient than primaries.
2.88 tok/s
Primary prompts that returned a usable answer.
This is the exact smoke-test prompt that kicked off the benchmark lane.
These are the short shell, ops, planning, and debugging tasks. Lower is better.
Primary-prompt durations for the full Python suite, sorted from slowest to fastest.
Higher is better. This shows how much of each prompt's requested structure qwen2.5-coder:14b actually hit on the first try.
Quick read: great on planning and SSH triage, shakier on exact shell formatting and the IPv4 implementation details.
| Task | Category | Primary | Follow-up | Primary tok/s | Marker hit | Format OK | Usable |
|---|---|---|---|---|---|---|---|
| Disk Guard Script | shell | 34.5s | n/a | 3.30 tok/s | 75% | No | Yes |
| IPv4 Validator | python | 64.8s | n/a | 3.06 tok/s | 100% | Yes | Yes |
| Nginx Safe Reload | ops | 15.9s | n/a | 3.03 tok/s | 75% | Yes | Yes |
| YAML Validator Plan | planning | 62.3s | n/a | 3.02 tok/s | 100% | Yes | Yes |
| SSH Lockout Triage | debugging | 97.0s | n/a | 2.98 tok/s | 100% | Yes | Yes |
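The suite-level figures for these five tasks can be recomputed from the table. The sketch below is illustrative only: the row data is copied from the table above, while the `summarize` helper and its field names are hypothetical and not part of the benchmark tooling. The mean marker hit works out to 90%, which matches the coverage figure on the summary cards if that card refers to this suite.

```python
from statistics import mean

# (task, primary seconds, primary tok/s, marker hit %) copied from the table above.
FIVE_TASK_ROWS = [
    ("Disk Guard Script", 34.5, 3.30, 75),
    ("IPv4 Validator", 64.8, 3.06, 100),
    ("Nginx Safe Reload", 15.9, 3.03, 75),
    ("YAML Validator Plan", 62.3, 3.02, 100),
    ("SSH Lockout Triage", 97.0, 2.98, 100),
]


def summarize(rows):
    """Hypothetical helper: plain (unweighted) averages over the rows."""
    return {
        "avg_primary_s": round(mean(r[1] for r in rows), 1),
        "avg_marker_hit_pct": round(mean(r[3] for r in rows)),
        "slowest_task": max(rows, key=lambda r: r[1])[0],
    }


summary = summarize(FIVE_TASK_ROWS)
print(summary)
```

These are unweighted means. A token-weighted throughput average would come out slightly different, which may explain small gaps between per-row tok/s values and the headline cards.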
This is the practical coding workload, so each row shows how the model handled the primary and follow-up formats.
| Task | Category | Primary | Follow-up | Primary tok/s | Marker hit | Format OK | Usable |
|---|---|---|---|---|---|---|---|
| CSV Parser | parsing | 19.9s | 56.7s | 3.25 tok/s | 75% | No | Yes |
| File Scanner | file_io | 32.9s | 34.8s | 3.03 tok/s | 100% | No | Yes |
| CLI Arguments | cli | 35.3s | 39.0s | 3.18 tok/s | 100% | No | Yes |
| Typed Dataclass | typing | 18.5s | 45.9s | 3.19 tok/s | 100% | No | Yes |
| Pytest Fixture | tests | 26.8s | 35.7s | 3.26 tok/s | 100% | No | Yes |
| Async Fetch | async | 32.3s | 45.9s | 3.30 tok/s | 100% | No | Yes |
| HTTP Retry | http | 53.1s | 49.6s | 3.06 tok/s | 75% | No | Yes |
| JSON Validation | validation | 56.5s | 39.5s | 3.12 tok/s | 100% | No | Yes |
| SQLite Store | sqlite | 68.1s | 54.3s | 3.06 tok/s | 100% | No | Yes |
| FastAPI Handler | web | 18.6s | 35.9s | 3.20 tok/s | 100% | No | Yes |
| Config Dataclass | config | 44.7s | 50.4s | 3.12 tok/s | 75% | No | Yes |
| Logging Setup | logging | 47.8s | 36.2s | 3.02 tok/s | 75% | No | Yes |
| Thread Pool | concurrency | 24.5s | 34.3s | 3.15 tok/s | 100% | No | Yes |
| Package Layout | package | 27.5s | 41.3s | 3.19 tok/s | 100% | No | Yes |
| Debug Stacktrace | debugging | 13.8s | 21.1s | 3.26 tok/s | 50% | No | Yes |
| Refactor Split | refactor | 24.1s | 41.9s | 3.12 tok/s | 50% | No | Yes |
| CSV Summary | analysis | 43.5s | 46.0s | 3.13 tok/s | 75% | No | Yes |
| Pathlib Cleaner | filesystem | 26.3s | 40.7s | 3.05 tok/s | 100% | No | Yes |
| Pydantic Model | validation | 15.9s | 24.3s | 3.19 tok/s | 75% | No | Yes |
| Regex Log Parser | parsing | 33.6s | 54.8s | 3.03 tok/s | 75% | No | Yes |
This groups the finished 20-question suite by task family so you can see where the model stayed sharp and where it softened.
| Category | Tasks | Avg primary | Avg primary tok/s | Avg primary hit | Avg follow-up hit |
|---|---|---|---|---|---|
| analysis | 1/1 | 43.5s | 3.13 tok/s | 75% | 0% |
| async | 1/1 | 32.3s | 3.30 tok/s | 100% | 0% |
| cli | 1/1 | 35.3s | 3.18 tok/s | 100% | 0% |
| concurrency | 1/1 | 24.5s | 3.15 tok/s | 100% | 100% |
| config | 1/1 | 44.7s | 3.12 tok/s | 75% | 0% |
| debugging | 1/1 | 13.8s | 3.26 tok/s | 50% | 100% |
| file_io | 1/1 | 32.9s | 3.03 tok/s | 100% | 100% |
| filesystem | 1/1 | 26.3s | 3.05 tok/s | 100% | 100% |
| http | 1/1 | 53.1s | 3.06 tok/s | 75% | 0% |
| logging | 1/1 | 47.8s | 3.02 tok/s | 75% | 100% |
| package | 1/1 | 27.5s | 3.19 tok/s | 100% | 100% |
| parsing | 2/2 | 26.8s | 3.14 tok/s | 75% | 100% |
| refactor | 1/1 | 24.1s | 3.12 tok/s | 50% | 100% |
| sqlite | 1/1 | 68.1s | 3.06 tok/s | 100% | 0% |
| tests | 1/1 | 26.8s | 3.26 tok/s | 100% | 100% |
| typing | 1/1 | 18.5s | 3.19 tok/s | 100% | 100% |
| validation | 2/2 | 36.2s | 3.16 tok/s | 88% | 100% |
| web | 1/1 | 18.6s | 3.20 tok/s | 100% | 100% |
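The category rollup above can be approximated by grouping the per-task rows and averaging each metric. This is a minimal sketch, not the report generator's actual code: the `group_by_category` helper and its field names are hypothetical, and only a handful of rows from the 20-question table are included for brevity.

```python
from collections import defaultdict
from statistics import mean

# (task, category, primary seconds, primary tok/s, marker hit %)
# A subset of rows copied from the 20-question table.
ROWS = [
    ("CSV Parser", "parsing", 19.9, 3.25, 75),
    ("Regex Log Parser", "parsing", 33.6, 3.03, 75),
    ("JSON Validation", "validation", 56.5, 3.12, 100),
    ("Pydantic Model", "validation", 15.9, 3.19, 75),
    ("SQLite Store", "sqlite", 68.1, 3.06, 100),
]


def group_by_category(rows):
    """Hypothetical helper: unweighted per-category averages."""
    buckets = defaultdict(list)
    for _task, category, seconds, tok_s, hit in rows:
        buckets[category].append((seconds, tok_s, hit))
    rollup = {}
    for category, items in sorted(buckets.items()):
        rollup[category] = {
            "tasks": len(items),
            "avg_primary_s": mean(i[0] for i in items),
            "avg_tok_s": mean(i[1] for i in items),
            "avg_hit_pct": mean(i[2] for i in items),
        }
    return rollup


rollup = group_by_category(ROWS)
for category, stats in rollup.items():
    print(category, stats)
```

The unrounded values for `parsing` (26.75s, 3.14 tok/s) and `validation` (36.2s, 87.5% hit) agree with the table once rounded for display.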
These are the heavier repository-shaped prompts. Lower is better.
Higher is better. This shows how much of the requested triage, test, planning, and review structure landed on the first response.
This is the more realistic repo-style suite. It is the best signal for whether the model stays reliable once prompts stop looking like toy exercises.
| Task | Category | Primary | Follow-up | Primary tok/s | Marker hit | Format OK | Usable |
|---|---|---|---|---|---|---|---|
|  |  | 92.8s | 34.4s | 2.96 tok/s | 88% | No | Yes |
| Board Snapshot Regression Test | tests | 118.6s | 51.3s | 2.91 tok/s | 86% | No | Yes |
| Lane Config Patch Plan | planning | 163.7s | 52.8s | 2.88 tok/s | 100% | No | Yes |
| API Token Audit Regression Test | tests | 121.9s | 41.1s | 2.87 tok/s | 86% | No | Yes |
| Announcements State Sync Review | review | 142.3s | 46.6s | 2.89 tok/s | 100% | No | Yes |
| Feature Flag Lifecycle Test | tests | 222.6s | 49.7s | 2.86 tok/s | 86% | No | Yes |
| Task Bulk Job Debug Packet | debugging | 112.9s | 52.2s | 2.87 tok/s | 100% | No | Yes |
| User Preferences Contract Test | tests | 208.7s | 44.6s | 2.83 tok/s | 86% | No | Yes |
| Orchestration Timeline Forensics | forensics | 88.2s | 47.0s | 2.89 tok/s | 100% | No | Yes |
|  |  | 111.7s | 38.6s | 2.88 tok/s | 83% | No | Yes |
qwen2.5-coder:14b is not meant to replace the largest workers, but the telemetry shows it handled this stack end to end. It averaged 33.2s per primary prompt, pushed about 3.15 tok/s, and landed 20/20 usable primaries. Follow-up formatting remained the safety net: 0/20 primary format passes versus 14/20 on the follow-ups.
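The headline numbers in this summary can be cross-checked against the 20-question tables. The sketch below assumes plain unweighted means. It also assumes that a 100% follow-up hit in the category table corresponds to a follow-up format pass. That is an inference from the numbers lining up, not something the report states. All variable names are hypothetical.

```python
from statistics import mean

# Primary durations (seconds) and throughput (tok/s), in table order.
PRIMARY_SECONDS = [19.9, 32.9, 35.3, 18.5, 26.8, 32.3, 53.1, 56.5, 68.1, 18.6,
                   44.7, 47.8, 24.5, 27.5, 13.8, 24.1, 43.5, 26.3, 15.9, 33.6]
PRIMARY_TOK_S = [3.25, 3.03, 3.18, 3.19, 3.26, 3.30, 3.06, 3.12, 3.06, 3.20,
                 3.12, 3.02, 3.15, 3.19, 3.26, 3.12, 3.13, 3.05, 3.19, 3.03]

# category -> (task count, avg follow-up hit %), from the category table.
FOLLOW_UP_BY_CATEGORY = {
    "analysis": (1, 0), "async": (1, 0), "cli": (1, 0),
    "concurrency": (1, 100), "config": (1, 0), "debugging": (1, 100),
    "file_io": (1, 100), "filesystem": (1, 100), "http": (1, 0),
    "logging": (1, 100), "package": (1, 100), "parsing": (2, 100),
    "refactor": (1, 100), "sqlite": (1, 0), "tests": (1, 100),
    "typing": (1, 100), "validation": (2, 100), "web": (1, 100),
}

avg_primary_s = round(mean(PRIMARY_SECONDS), 1)
avg_tok_s = round(mean(PRIMARY_TOK_S), 2)
# Assumption: a category at 100% means every task in it passed on follow-up.
follow_up_passes = sum(n for n, hit in FOLLOW_UP_BY_CATEGORY.values() if hit == 100)
total_tasks = sum(n for n, _hit in FOLLOW_UP_BY_CATEGORY.values())

print(f"{avg_primary_s}s avg, {avg_tok_s} tok/s, "
      f"{follow_up_passes}/{total_tasks} follow-up passes")
```

Under those assumptions the recomputed values agree with the 33.2s, 3.15 tok/s, and 14/20 figures quoted above.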