Pavilion Windows · qwen0.5b (GTX 1050 GPU)

This page captures the same qwen0.5b benchmark stack on the Pavilion Windows laptop with GTX 1050 offload enabled.

Model: qwen2.5-coder:0.5b
Disk size: 379 MB
Context: 4K eval context
Server: Pavilion Windows / Ollama GPU

Run Coverage

All finished artifacts are in this report. The page calls out exactly which planned phases completed and which ones still did not produce any files.

Phase	Status	Notes
Hello check	Complete	Exact prompt succeeded and returned a clean greeting.
5-question suite	Complete	All five tasks finished and produced scored artifacts.
20-question Python suite	Complete	All twenty primary and follow-up prompts finished.
10-question real-context suite	Complete	All ten real-context tasks finished and were pulled into the report.
Context-edge suite	Missing	No output files were produced for the long-context append lane.

Headline Metrics

The tiny model stayed surprisingly quick. The strongest signal is not raw quality but how much useful structure it maintained while averaging 3.3s per primary prompt on the bigger Python suite.

Metric	Value	Detail
Hello check	7.8s	Exact prompt: hi can you help me?
Small eval average	8.0s	80% marker coverage
20-question primary average	3.3s	62.85 tok/s
20-question follow-up average	1.1s	63.49 tok/s
20-question primary usable	20/20	Every primary prompt returned a non-empty answer.
20-question follow-up format	10/20	Follow-ups were much more obedient than primaries.
10-question real-context average	5.0s	61.62 tok/s
10-question real-context usable	10/10	Primary prompts that returned a usable answer.

Hello Check

This is the exact smoke-test prompt that kicked off the benchmark lane.

Prompt: hi can you help me?
Response: Of course! How can I assist you today?
Latency: 7.8s
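A smoke test like this can be reproduced with a few lines of standard-library Python. The sketch below is not the launcher used for this report; it assumes Ollama is reachable on its default local port and uses the non-streaming `/api/generate` endpoint. The helper names (`build_payload`, `tokens_per_second`, `timed_generate`) are hypothetical. The report does not say how latency was measured, so wall-clock time here may include model load on a cold start.

```python
import json
import time
import urllib.request
from typing import Tuple

# Assumption: Ollama is listening on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming generate request body."""
    return {"model": model, "prompt": prompt, "stream": False}


def tokens_per_second(reply: dict) -> float:
    """Generation throughput from eval_count and eval_duration (nanoseconds)."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


def timed_generate(model: str, prompt: str, url: str = OLLAMA_URL) -> Tuple[dict, float]:
    """Send one prompt and return (parsed reply, wall-clock seconds)."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read())
    return reply, time.perf_counter() - start
```

With a server running, `timed_generate("qwen2.5-coder:0.5b", "hi can you help me?")` returns the reply dictionary and the elapsed seconds; the greeting text is in `reply["response"]`, and `tokens_per_second(reply)` gives the generation rate.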

5-Question Small Eval Latency

These are the short shell, ops, planning, and debugging tasks. Lower is better.

Task	Category	Primary latency
Disk Guard Script	shell	7.2s
IPv4 Validator	python	9.0s
Nginx Safe Reload	ops	2.5s
YAML Validator Plan	planning	14.2s
SSH Lockout Triage	debugging	7.1s

20-Question Python Latency

Primary-prompt durations for the full Python suite, sorted from slowest to fastest.

Task	Category	Primary latency
Pytest Fixture	tests	5.7s
CSV Parser	parsing	5.0s
Config Dataclass	config	4.9s
Logging Setup	logging	4.8s
HTTP Retry	http	4.5s
JSON Validation	validation	4.5s
Regex Log Parser	parsing	3.7s
File Scanner	file_io	3.6s
Refactor Split	refactor	3.3s
SQLite Store	sqlite	3.3s
Pathlib Cleaner	filesystem	3.2s
Thread Pool	concurrency	2.9s
CLI Arguments	cli	2.7s
Package Layout	package	2.4s
Debug Stacktrace	debugging	2.3s
Async Fetch	async	2.0s
Pydantic Model	validation	1.8s
Typed Dataclass	typing	1.7s
CSV Summary	analysis	1.6s
FastAPI Handler	web	1.5s

20-Question Python Marker Coverage

Higher is better. This shows how much of each prompt's requested structure and ingredients the tiny model actually hit on the first try.

Task	Category	Marker coverage
CLI Arguments	cli	100%
Pytest Fixture	tests	100%
Async Fetch	async	100%
HTTP Retry	http	100%
FastAPI Handler	web	100%
Config Dataclass	config	100%
Thread Pool	concurrency	100%
Debug Stacktrace	debugging	100%
Regex Log Parser	parsing	100%
CSV Parser	parsing	75%
File Scanner	file_io	75%
Typed Dataclass	typing	75%
JSON Validation	validation	75%
SQLite Store	sqlite	75%
Logging Setup	logging	75%
Package Layout	package	75%
Refactor Split	refactor	75%
CSV Summary	analysis	75%
Pydantic Model	validation	75%
Pathlib Cleaner	filesystem	50%
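The report does not show how marker coverage is scored. One plausible reading is that each prompt carries a list of expected markers and the score is the fraction found in the first response. The sketch below illustrates that reading only; the function name, the case-insensitive substring match, and the marker list are all assumptions, not the suite's real definitions.

```python
def marker_coverage(response: str, markers: list) -> float:
    """Fraction of required markers found in the response (case-insensitive)."""
    if not markers:
        return 0.0
    text = response.lower()
    hits = sum(1 for marker in markers if marker.lower() in text)
    return hits / len(markers)


# Hypothetical markers for a pathlib cleanup prompt; not the suite's real list.
markers = ["pathlib", "def ", "unlink", "dry_run"]
response = (
    "from pathlib import Path\n\n"
    "def clean(root):\n"
    "    for p in Path(root).glob('*.tmp'):\n"
    "        print(p)\n"
)

print(f"{marker_coverage(response, markers):.0%}")  # prints 50%
```

Under this reading, a 50% score means the answer had the right shape but skipped half of the requested ingredients, here the actual deletion call and the dry-run switch.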

5-Question Suite Details

Quick read: great on planning and SSH triage, shakier on exact shell formatting and the IPv4 implementation details.

Task	Category	Primary	Follow-up	Primary tok/s	Marker hit	Format OK	Usable
Disk Guard Script	shell	7.2s	n/a	62.66 tok/s	75%	No	Yes
IPv4 Validator	python	9.0s	n/a	62.80 tok/s	75%	No	Yes
Nginx Safe Reload	ops	2.5s	n/a	63.32 tok/s	50%	Yes	Yes
YAML Validator Plan	planning	14.2s	n/a	61.96 tok/s	100%	Yes	Yes
SSH Lockout Triage	debugging	7.1s	n/a	62.53 tok/s	100%	Yes	Yes
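The headline figures for this suite (8.0s average, 80% marker coverage) follow directly from the five rows above. A minimal check, with the values copied from the table and variable names chosen only for illustration:

```python
from statistics import mean

# Primary latency (s) and marker hit (%) copied from the 5-question table.
small_eval = {
    "Disk Guard Script": (7.2, 75),
    "IPv4 Validator": (9.0, 75),
    "Nginx Safe Reload": (2.5, 50),
    "YAML Validator Plan": (14.2, 100),
    "SSH Lockout Triage": (7.1, 100),
}

avg_latency = mean(latency for latency, _ in small_eval.values())
avg_marker = mean(hit for _, hit in small_eval.values())
slowest = max(small_eval, key=lambda task: small_eval[task][0])

print(f"average latency: {avg_latency:.1f}s")      # average latency: 8.0s
print(f"average marker hit: {avg_marker:.0f}%")    # average marker hit: 80%
print(f"slowest task: {slowest}")                  # slowest task: YAML Validator Plan
```

The planning task dominates the average: without it, the remaining four tasks average 6.45s.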

20-Question Python Suite Details

This is the practical coding workload. The tiny model answered every primary and every follow-up, but no primary answer passed the format check. Formatting only started to pass on the follow-up turn, and then for 10 of the 20 tasks.

Task	Category	Primary	Follow-up	Primary tok/s	Marker hit	Format OK	Usable
CSV Parser	parsing	5.0s	1.1s	62.87 tok/s	75%	No	Yes
File Scanner	file_io	3.6s	0.8s	62.74 tok/s	75%	No	Yes
CLI Arguments	cli	2.7s	1.1s	62.61 tok/s	100%	No	Yes
Typed Dataclass	typing	1.7s	0.8s	63.24 tok/s	75%	No	Yes
Pytest Fixture	tests	5.7s	1.1s	62.61 tok/s	100%	No	Yes
Async Fetch	async	2.0s	1.1s	63.10 tok/s	100%	No	Yes
HTTP Retry	http	4.5s	1.2s	62.75 tok/s	100%	No	Yes
JSON Validation	validation	4.5s	1.1s	63.01 tok/s	75%	No	Yes
SQLite Store	sqlite	3.3s	0.9s	62.96 tok/s	75%	No	Yes
FastAPI Handler	web	1.5s	2.1s	63.15 tok/s	100%	No	Yes
Config Dataclass	config	4.9s	0.9s	62.49 tok/s	100%	No	Yes
Logging Setup	logging	4.8s	0.7s	61.66 tok/s	75%	No	Yes
Thread Pool	concurrency	2.9s	1.4s	62.07 tok/s	100%	No	Yes
Package Layout	package	2.4s	1.0s	63.05 tok/s	75%	No	Yes
Debug Stacktrace	debugging	2.3s	1.4s	63.44 tok/s	100%	No	Yes
Refactor Split	refactor	3.3s	0.6s	62.61 tok/s	75%	No	Yes
CSV Summary	analysis	1.6s	0.8s	63.30 tok/s	75%	No	Yes
Pathlib Cleaner	filesystem	3.2s	1.6s	62.86 tok/s	50%	No	Yes
Pydantic Model	validation	1.8s	0.9s	63.96 tok/s	75%	No	Yes
Regex Log Parser	parsing	3.7s	0.9s	62.46 tok/s	100%	No	Yes
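Throughput barely moves across these rows (61.66 to 63.96 tok/s), so the spread in latency mostly reflects how long each answer was. A rough way to see that is to multiply duration by rate. This is an estimate only: it assumes the reported duration is almost entirely generation time, which the report does not confirm, so treat the results as upper bounds on output length.

```python
# (primary seconds, primary tok/s) for three rows of the 20-question table.
rows = {
    "Pytest Fixture": (5.7, 62.61),
    "CSV Parser": (5.0, 62.87),
    "FastAPI Handler": (1.5, 63.15),
}


def estimated_tokens(seconds: float, tok_per_s: float) -> int:
    """Approximate output length, assuming the whole duration was generation."""
    return round(seconds * tok_per_s)


for task, (seconds, rate) in rows.items():
    print(f"{task}: ~{estimated_tokens(seconds, rate)} tokens")
# Pytest Fixture: ~357 tokens
# CSV Parser: ~314 tokens
# FastAPI Handler: ~95 tokens
```

On this estimate the slowest primary produced a bit under four times as many tokens as the fastest, which matches the ratio of their latencies.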

20-Question Python Category Summary

This groups the finished 20-question suite by task family so you can see where the tiny model stayed sharp and where it softened.

Category	Tasks	Avg primary	Avg primary tok/s	Avg primary hit	Avg follow-up hit
analysis	1/1	1.6s	63.30 tok/s	75%	100%
async	1/1	2.0s	63.10 tok/s	100%	0%
cli	1/1	2.7s	62.61 tok/s	100%	0%
concurrency	1/1	2.9s	62.07 tok/s	100%	0%
config	1/1	4.9s	62.49 tok/s	100%	100%
debugging	1/1	2.3s	63.44 tok/s	100%	0%
file_io	1/1	3.6s	62.74 tok/s	75%	100%
filesystem	1/1	3.2s	62.86 tok/s	50%	0%
http	1/1	4.5s	62.75 tok/s	100%	0%
logging	1/1	4.8s	61.66 tok/s	75%	100%
package	1/1	2.4s	63.05 tok/s	75%	100%
parsing	2/2	4.4s	62.66 tok/s	88%	50%
refactor	1/1	3.3s	62.61 tok/s	75%	100%
sqlite	1/1	3.3s	62.96 tok/s	75%	100%
tests	1/1	5.7s	62.61 tok/s	100%	0%
typing	1/1	1.7s	63.24 tok/s	75%	100%
validation	2/2	3.1s	63.48 tok/s	75%	50%
web	1/1	1.5s	63.15 tok/s	100%	33%
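Only parsing and validation contain more than one task, so those are the only rows where the summary actually averages anything. The grouping can be sketched as follows; the `summarize` helper is hypothetical and the inputs are the one-decimal values from the details table, so the results may differ in the last digit from a summary computed on unrounded durations.

```python
from collections import defaultdict
from statistics import mean

# (task, category, primary seconds, marker hit %) for the two-task categories.
rows = [
    ("CSV Parser", "parsing", 5.0, 75),
    ("Regex Log Parser", "parsing", 3.7, 100),
    ("JSON Validation", "validation", 4.5, 75),
    ("Pydantic Model", "validation", 1.8, 75),
]


def summarize(rows):
    """Group rows by category and average latency and marker hit."""
    grouped = defaultdict(list)
    for _task, category, seconds, hit in rows:
        grouped[category].append((seconds, hit))
    return {
        category: {
            "tasks": len(values),
            "avg_seconds": mean(s for s, _ in values),
            "avg_hit": mean(h for _, h in values),
        }
        for category, values in grouped.items()
    }


for category, stats in summarize(rows).items():
    print(category, stats["tasks"], f"{stats['avg_seconds']:.2f}s", f"{stats['avg_hit']:.1f}%")
```

From the rounded inputs this gives 4.35s and 87.5% for parsing and 3.15s and 75.0% for validation, in line with the 4.4s / 88% and 3.1s / 75% shown in the summary.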

10-Question Real-Context Latency

These are the heavier repository-shaped prompts. Lower is better.

Task	Category	Primary latency
API Token Audit Regression Test	tests	7.2s
User Preferences Contract Test	tests	6.7s
Feature Flag Lifecycle Test	tests	6.7s
Board Snapshot Regression Test	tests	6.5s
Announcements State Sync Review	review	5.9s
Task Bulk Job Debug Packet	debugging	5.1s
Lane Config Patch Plan	planning	4.1s
Auth Redirect Triage	debugging	3.4s
Orchestration Timeline Forensics	forensics	2.6s
Ingest Log Triage	cross_repo_debugging	1.7s

10-Question Real-Context Marker Coverage

Higher is better. This shows how much of the requested triage, test, planning, and review structure landed on the first response.

Task	Category	Marker coverage
Announcements State Sync Review	review	100%
API Token Audit Regression Test	tests	71%
User Preferences Contract Test	tests	71%
Lane Config Patch Plan	planning	67%
Orchestration Timeline Forensics	forensics	67%
Board Snapshot Regression Test	tests	57%
Feature Flag Lifecycle Test	tests	57%
Ingest Log Triage	cross_repo_debugging	50%
Task Bulk Job Debug Packet	debugging	33%
Auth Redirect Triage	debugging	25%

10-Question Real-Context Details

This is the more realistic repo-style suite. It is the best signal for whether the tiny model can stay useful once the prompts stop looking like toy exercises.

Task	Category	Primary	Follow-up	Primary tok/s	Marker hit	Format OK	Usable
Auth Redirect Triage	debugging	3.4s	1.5s	62.00 tok/s	25%	No	Yes
Board Snapshot Regression Test	tests	6.5s	1.1s	61.55 tok/s	57%	No	Yes
Lane Config Patch Plan	planning	4.1s	0.8s	61.63 tok/s	67%	No	Yes
API Token Audit Regression Test	tests	7.2s	1.4s	61.35 tok/s	71%	No	Yes
Announcements State Sync Review	review	5.9s	1.0s	61.81 tok/s	100%	No	Yes
Feature Flag Lifecycle Test	tests	6.7s	1.9s	61.63 tok/s	57%	No	Yes
Task Bulk Job Debug Packet	debugging	5.1s	1.6s	61.60 tok/s	33%	No	Yes
User Preferences Contract Test	tests	6.7s	2.4s	61.47 tok/s	71%	No	Yes
Orchestration Timeline Forensics	forensics	2.6s	1.1s	61.48 tok/s	67%	No	Yes
Ingest Log Triage	cross_repo_debugging	1.7s	1.2s	61.63 tok/s	50%	No	Yes

What This Means

This tiny coder is not a replacement for your larger workers, but it is very much a real utility model. It answered every task in the twenty-question Python suite, stayed around 3.3s on average for primaries, and pushed around 62.85 tok/s. The weak spot is first-pass obedience: its primary-format pass count was 0/20, while the follow-up format pass count jumped to 10/20.

Launcher note: the original full-lane launcher reached the ten-question handoff but did not leave artifacts for that phase. This report is built from the artifacts that actually exist. The long-context edge lane is still missing.