Archive · vps-81 historical telemetry · qwen0_5b/2026-04-10-qwen0_5b-benchmark-report-with-10q.html. Originally rendered 2026-04-10. Re-hosted from MyServers on 2026-05-06. Methodology and harness conventions may differ from what we use today; see /methodology.html for current standards.

Qwen2.5 Coder 0.5B

This page summarizes the tiny-model benchmark lane that actually finished on vps50: the exact hello check, the five-question small-model suite, the twenty-question Python suite, and the ten-question real-context suite. The long-context edge run did not produce artifacts, so it is marked clearly as missing instead of being silently ignored.

Model: qwen2.5-coder:0.5b
Disk size: 397 MB
Context: 32K advertised
Server: vps50 / Ollama CPU

Run Coverage

Every artifact that finished is included in this report. Each planned phase is listed here as either complete or missing, so phases that produced no files are not silently dropped.

Hello check: Complete. Exact prompt succeeded and returned a clean greeting.
5-question suite: Complete. All five tasks finished and produced scored artifacts.
20-question Python suite: Complete. All twenty primary and follow-up prompts finished.
10-question real-context suite: Complete. All ten real-context tasks finished and were pulled into the report.
Context-edge suite: Missing. No output files were produced for the long-context append lane.

Headline Metrics

The tiny model stayed surprisingly quick. The strongest signal is not raw quality but how much useful structure it maintained while averaging under ten seconds per primary prompt on the larger Python suite.

Hello check: 4.0s (exact prompt: hi can you help me?)
Small eval average: 23.3s, 80% marker coverage
20-question primary average: 9.4s, 20.88 tok/s
20-question follow-up average: 3.6s, 22.08 tok/s
20-question primary usable: 20/20. Every primary prompt returned a non-empty answer.
20-question follow-up format: 12/20. Follow-ups were much more obedient than primaries.
10-question real-context average: 20.4s, 17.21 tok/s
10-question real-context usable: 10/10. Primary prompts that returned a usable answer.

Hello Check

This is the exact smoke-test prompt that kicked off the benchmark lane.

Of course! How can I assist you today?
Latency: 4.0s
Prompt: hi can you help me?
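
The report does not show the harness that produced these latency and tok/s figures. As a rough sketch only, assuming the server exposes Ollama's HTTP API on its default local port, a single prompt could be timed as shown here. The helper names `tokens_per_second` and `time_prompt`, and the URL, are illustrative assumptions and not the original harness.

```python
import json
import time
import urllib.request

# Assumed default Ollama endpoint; the real harness may have used something else.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's eval_count and eval_duration (nanoseconds) into tok/s."""
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


def time_prompt(prompt: str, model: str = "qwen2.5-coder:0.5b") -> dict:
    """Send one non-streaming request and return wall-clock latency, tok/s, and text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    started = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - started
    return {
        "latency_s": round(elapsed, 1),
        "tok_per_s": round(
            tokens_per_second(body.get("eval_count", 0), body.get("eval_duration", 0)), 2
        ),
        "text": body.get("response", ""),
    }
```

Calling `time_prompt("hi can you help me?")` requires a running Ollama server with the model pulled. The wall-clock latency measured this way includes HTTP overhead and any model load time, so it will not necessarily match the figures in this report.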

5-Question Small Eval Latency

These are the short shell, ops, planning, and debugging tasks. Lower is better.

Disk Guard Script (shell): 31.7s
IPv4 Validator (python): 41.2s
Nginx Safe Reload (ops): 4.6s
YAML Validator Plan (planning): 23.0s
SSH Lockout Triage (debugging): 16.1s

20-Question Python Latency

Primary-prompt durations for the full Python suite, sorted from slowest to fastest.

Refactor Split (refactor): 31.3s
Pytest Fixture (tests): 19.1s
Regex Log Parser (parsing): 14.7s
JSON Validation (validation): 13.3s
File Scanner (file_io): 12.9s
Pathlib Cleaner (filesystem): 11.7s
Debug Stacktrace (debugging): 11.5s
Pydantic Model (validation): 9.4s
Thread Pool (concurrency): 6.8s
HTTP Retry (http): 6.8s
SQLite Store (sqlite): 6.1s
CSV Parser (parsing): 6.1s
Config Dataclass (config): 5.9s
Package Layout (package): 5.9s
Async Fetch (async): 5.9s
Logging Setup (logging): 5.1s
CLI Arguments (cli): 4.9s
CSV Summary (analysis): 4.1s
FastAPI Handler (web): 3.4s
Typed Dataclass (typing): 3.2s

20-Question Python Marker Coverage

Higher is better. This shows how much of each prompt's requested structure and ingredients the tiny model actually hit on the first try.

CSV Parser (parsing): 100%
CLI Arguments (cli): 100%
Pytest Fixture (tests): 100%
Async Fetch (async): 100%
HTTP Retry (http): 100%
FastAPI Handler (web): 100%
Thread Pool (concurrency): 100%
Regex Log Parser (parsing): 100%
File Scanner (file_io): 75%
Typed Dataclass (typing): 75%
JSON Validation (validation): 75%
SQLite Store (sqlite): 75%
Logging Setup (logging): 75%
Package Layout (package): 75%
Debug Stacktrace (debugging): 75%
Refactor Split (refactor): 75%
CSV Summary (analysis): 75%
Pydantic Model (validation): 75%
Pathlib Cleaner (filesystem): 50%
Config Dataclass (config): 25%
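
The report does not define how marker coverage is scored. One plausible reading, offered here purely as an assumption, is the share of expected substrings ("markers") that appear in the first response. The function name and the example markers in this sketch are hypothetical and are not taken from the real suite.

```python
def marker_coverage(response: str, markers: list[str]) -> float:
    """Return the percentage of expected markers found in the response (case-insensitive)."""
    if not markers:
        return 0.0
    text = response.lower()
    hits = sum(1 for marker in markers if marker.lower() in text)
    return 100.0 * hits / len(markers)


# Hypothetical markers for a dataclass-style prompt; invented for illustration.
example_markers = ["@dataclass", "from dataclasses import", "def load", "os.environ"]
example_response = "from dataclasses import dataclass\n\nclass Config:\n    pass\n"

# Only the import line matches, so one of four markers is hit.
print(f"{marker_coverage(example_response, example_markers):.0f}%")  # prints 25%
```

Under this reading, four markers per prompt would only ever yield multiples of 25%, which matches the granularity seen in the Python suite. Scores such as 43%, 57%, and 71% in the real-context suite would fit prompts with seven markers, though the report does not confirm either count.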

5-Question Suite Details

Quick read: great on planning and SSH triage, shakier on exact shell formatting and the IPv4 implementation details.

Task | Category | Primary | Follow-up | Primary tok/s | Marker hit | Format OK | Usable
Disk Guard Script | shell | 31.7s | n/a | 12.98 tok/s | 75% | No | Yes
IPv4 Validator | python | 41.2s | n/a | 13.53 tok/s | 75% | No | Yes
Nginx Safe Reload | ops | 4.6s | n/a | 27.37 tok/s | 50% | Yes | Yes
YAML Validator Plan | planning | 23.0s | n/a | 26.07 tok/s | 100% | Yes | Yes
SSH Lockout Triage | debugging | 16.1s | n/a | 23.40 tok/s | 100% | Yes | Yes

20-Question Python Suite Details

This is the practical coding workload. The tiny model answered every primary and every follow-up prompt, but it rarely obeyed the requested format on the first pass: no primary passed the format check, while 12 of 20 follow-ups did.

Task | Category | Primary | Follow-up | Primary tok/s | Marker hit | Format OK | Usable
CSV Parser | parsing | 6.1s | 4.0s | 21.02 tok/s | 100% | No | Yes
File Scanner | file_io | 12.9s | 2.1s | 22.55 tok/s | 75% | No | Yes
CLI Arguments | cli | 4.9s | 5.5s | 27.47 tok/s | 100% | No | Yes
Typed Dataclass | typing | 3.2s | 5.5s | 28.24 tok/s | 75% | No | Yes
Pytest Fixture | tests | 19.1s | 2.9s | 20.29 tok/s | 100% | No | Yes
Async Fetch | async | 5.9s | 2.6s | 20.76 tok/s | 100% | No | Yes
HTTP Retry | http | 6.8s | 3.4s | 22.54 tok/s | 100% | No | Yes
JSON Validation | validation | 13.3s | 3.4s | 26.87 tok/s | 75% | No | Yes
SQLite Store | sqlite | 6.1s | 1.3s | 29.16 tok/s | 75% | No | Yes
FastAPI Handler | web | 3.4s | 2.8s | 25.86 tok/s | 100% | No | Yes
Config Dataclass | config | 5.9s | 2.2s | 15.00 tok/s | 25% | No | Yes
Logging Setup | logging | 5.1s | 2.4s | 25.17 tok/s | 75% | No | Yes
Thread Pool | concurrency | 6.8s | 3.9s | 27.99 tok/s | 100% | No | Yes
Package Layout | package | 5.9s | 2.1s | 19.16 tok/s | 75% | No | Yes
Debug Stacktrace | debugging | 11.5s | 9.8s | 10.20 tok/s | 75% | No | Yes
Refactor Split | refactor | 31.3s | 1.6s | 6.34 tok/s | 75% | No | Yes
CSV Summary | analysis | 4.1s | 4.8s | 24.25 tok/s | 75% | No | Yes
Pathlib Cleaner | filesystem | 11.7s | 6.6s | 19.93 tok/s | 50% | No | Yes
Pydantic Model | validation | 9.4s | 2.8s | 10.74 tok/s | 75% | No | Yes
Regex Log Parser | parsing | 14.7s | 2.5s | 14.01 tok/s | 100% | No | Yes
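
The headline averages for this suite can be recomputed from the rows in this table. The sketch below assumes the headline figures are plain unweighted means over the twenty tasks, which is an assumption about the harness; the variable names are illustrative.

```python
from statistics import mean

# (task, primary seconds, follow-up seconds, primary tok/s), copied from the table.
ROWS = [
    ("CSV Parser", 6.1, 4.0, 21.02),
    ("File Scanner", 12.9, 2.1, 22.55),
    ("CLI Arguments", 4.9, 5.5, 27.47),
    ("Typed Dataclass", 3.2, 5.5, 28.24),
    ("Pytest Fixture", 19.1, 2.9, 20.29),
    ("Async Fetch", 5.9, 2.6, 20.76),
    ("HTTP Retry", 6.8, 3.4, 22.54),
    ("JSON Validation", 13.3, 3.4, 26.87),
    ("SQLite Store", 6.1, 1.3, 29.16),
    ("FastAPI Handler", 3.4, 2.8, 25.86),
    ("Config Dataclass", 5.9, 2.2, 15.00),
    ("Logging Setup", 5.1, 2.4, 25.17),
    ("Thread Pool", 6.8, 3.9, 27.99),
    ("Package Layout", 5.9, 2.1, 19.16),
    ("Debug Stacktrace", 11.5, 9.8, 10.20),
    ("Refactor Split", 31.3, 1.6, 6.34),
    ("CSV Summary", 4.1, 4.8, 24.25),
    ("Pathlib Cleaner", 11.7, 6.6, 19.93),
    ("Pydantic Model", 9.4, 2.8, 10.74),
    ("Regex Log Parser", 14.7, 2.5, 14.01),
]

primary_avg = mean(row[1] for row in ROWS)
followup_avg = mean(row[2] for row in ROWS)
tok_avg = mean(row[3] for row in ROWS)

# Prints: primary 9.4s, follow-up 3.6s, 20.88 tok/s
print(f"primary {primary_avg:.1f}s, follow-up {followup_avg:.1f}s, {tok_avg:.2f} tok/s")
```

The recomputed values agree with the reported 9.4s, 3.6s, and 20.88 tok/s, which suggests the tok/s headline is a per-task mean and not total tokens divided by total time.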

20-Question Python Category Summary

This groups the finished 20-question suite by task family so you can see where the tiny model stayed sharp and where it softened.

Category | Tasks | Avg primary | Avg primary tok/s | Avg primary hit | Avg follow-up hit
analysis | 1/1 | 4.1s | 24.25 tok/s | 75% | 100%
async | 1/1 | 5.9s | 20.76 tok/s | 100% | 100%
cli | 1/1 | 4.9s | 27.47 tok/s | 100% | 0%
concurrency | 1/1 | 6.8s | 27.99 tok/s | 100% | 100%
config | 1/1 | 5.9s | 15.00 tok/s | 25% | 100%
debugging | 1/1 | 11.5s | 10.20 tok/s | 75% | 100%
file_io | 1/1 | 12.9s | 22.55 tok/s | 75% | 100%
filesystem | 1/1 | 11.7s | 19.93 tok/s | 50% | 0%
http | 1/1 | 6.8s | 22.54 tok/s | 100% | 0%
logging | 1/1 | 5.1s | 25.17 tok/s | 75% | 100%
package | 1/1 | 5.9s | 19.16 tok/s | 75% | 100%
parsing | 2/2 | 10.4s | 17.52 tok/s | 100% | 0%
refactor | 1/1 | 31.3s | 6.34 tok/s | 75% | 100%
sqlite | 1/1 | 6.1s | 29.16 tok/s | 75% | 100%
tests | 1/1 | 19.1s | 20.29 tok/s | 100% | 0%
typing | 1/1 | 3.2s | 28.24 tok/s | 75% | 100%
validation | 2/2 | 11.3s | 18.80 tok/s | 75% | 50%
web | 1/1 | 3.4s | 25.86 tok/s | 100% | 0%
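
The "Avg follow-up hit" column appears to track the follow-up format pass rate: weighting each category's percentage by its task count recovers the 12/20 headline figure. This is a short cross-check under that reading, which the report itself does not state.

```python
# (category, task count, follow-up rate in percent), copied from the table.
CATEGORIES = [
    ("analysis", 1, 100), ("async", 1, 100), ("cli", 1, 0), ("concurrency", 1, 100),
    ("config", 1, 100), ("debugging", 1, 100), ("file_io", 1, 100), ("filesystem", 1, 0),
    ("http", 1, 0), ("logging", 1, 100), ("package", 1, 100), ("parsing", 2, 0),
    ("refactor", 1, 100), ("sqlite", 1, 100), ("tests", 1, 0), ("typing", 1, 100),
    ("validation", 2, 50), ("web", 1, 0),
]

total_tasks = sum(count for _, count, _ in CATEGORIES)
# Weight each category's rate by how many tasks it contains.
passed = sum(count * rate / 100 for _, count, rate in CATEGORIES)
failing = [name for name, _, rate in CATEGORIES if rate == 0]

print(f"{passed:.0f}/{total_tasks} follow-ups passed")  # prints 12/20 follow-ups passed
print("categories with no passing follow-up:", ", ".join(failing))
```

The zero-rate categories (cli, filesystem, http, parsing, tests, web) account for seven of the eight failing follow-ups; the eighth is one of the two validation tasks.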

10-Question Real-Context Latency

These are the heavier repository-shaped prompts. Lower is better.

Task Bulk Job Debug Packet (debugging): 33.8s
Announcements State Sync Review (review): 31.4s
API Token Audit Regression Test (tests): 24.1s
Feature Flag Lifecycle Test (tests): 24.0s
Lane Config Patch Plan (planning): 22.5s
User Preferences Contract Test (tests): 20.0s
Board Snapshot Regression Test (tests): 19.8s
Auth Redirect Triage (debugging): 13.6s
Orchestration Timeline Forensics (forensics): 11.0s
Ingest Log Triage (cross_repo_debugging): 4.4s

10-Question Real-Context Marker Coverage

Higher is better. This shows how much of the requested triage, test, planning, and review structure landed on the first response.

Announcements State Sync Review (review): 100%
Orchestration Timeline Forensics (forensics): 100%
User Preferences Contract Test (tests): 71%
Lane Config Patch Plan (planning): 67%
API Token Audit Regression Test (tests): 57%
Feature Flag Lifecycle Test (tests): 57%
Auth Redirect Triage (debugging): 50%
Ingest Log Triage (cross_repo_debugging): 50%
Board Snapshot Regression Test (tests): 43%
Task Bulk Job Debug Packet (debugging): 33%

10-Question Real-Context Details

This is the more realistic repo-style suite. It is the best signal for whether the tiny model can stay useful once the prompts stop looking like toy exercises.

Task | Category | Primary | Follow-up | Primary tok/s | Marker hit | Format OK | Usable
Auth Redirect Triage | debugging | 13.6s | 3.2s | 17.27 tok/s | 50% | No | Yes
Board Snapshot Regression Test | tests | 19.8s | 6.6s | 19.95 tok/s | 43% | No | Yes
Lane Config Patch Plan | planning | 22.5s | 7.7s | 8.07 tok/s | 67% | No | Yes
API Token Audit Regression Test | tests | 24.1s | 3.1s | 19.20 tok/s | 57% | No | Yes
Announcements State Sync Review | review | 31.4s | 5.7s | 16.47 tok/s | 100% | No | Yes
Feature Flag Lifecycle Test | tests | 24.0s | 4.4s | 17.36 tok/s | 57% | No | Yes
Task Bulk Job Debug Packet | debugging | 33.8s | 4.2s | 11.80 tok/s | 33% | No | Yes
User Preferences Contract Test | tests | 20.0s | 2.4s | 20.00 tok/s | 71% | No | Yes
Orchestration Timeline Forensics | forensics | 11.0s | 2.5s | 16.53 tok/s | 100% | No | Yes
Ingest Log Triage | cross_repo_debugging | 4.4s | 2.6s | 25.46 tok/s | 50% | No | Yes
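
The real-context headline figures can be recomputed from the per-task rows in the same way. This sketch again assumes unweighted means; the average marker hit it prints is derived here for convenience and is not a figure the report publishes.

```python
from statistics import mean

# (task, primary seconds, primary tok/s, marker hit %), copied from the table.
REAL_CONTEXT = [
    ("Auth Redirect Triage", 13.6, 17.27, 50),
    ("Board Snapshot Regression Test", 19.8, 19.95, 43),
    ("Lane Config Patch Plan", 22.5, 8.07, 67),
    ("API Token Audit Regression Test", 24.1, 19.20, 57),
    ("Announcements State Sync Review", 31.4, 16.47, 100),
    ("Feature Flag Lifecycle Test", 24.0, 17.36, 57),
    ("Task Bulk Job Debug Packet", 33.8, 11.80, 33),
    ("User Preferences Contract Test", 20.0, 20.00, 71),
    ("Orchestration Timeline Forensics", 11.0, 16.53, 100),
    ("Ingest Log Triage", 4.4, 25.46, 50),
]

avg_seconds = mean(row[1] for row in REAL_CONTEXT)
avg_tok = mean(row[2] for row in REAL_CONTEXT)
avg_hit = mean(row[3] for row in REAL_CONTEXT)

# Prints: 20.46s, 17.21 tok/s, 62.8% marker hit
print(f"{avg_seconds:.2f}s, {avg_tok:.2f} tok/s, {avg_hit:.1f}% marker hit")
```

The tok/s mean matches the reported 17.21. The latency mean from the displayed values is 20.46s, slightly above the 20.4s headline; the gap is plausibly due to the per-task durations being rounded for display, though the report does not say how the headline was rounded.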

What This Means

This tiny coder is not a replacement for your larger workers, but it is very much a real utility model. It answered every task in the twenty-question Python suite, stayed around 9.4s on average for primaries, and pushed around 20.88 tok/s. The weak spot is first-pass obedience: its primary-format pass count was 0/20, while the follow-up format pass count jumped to 12/20.

Launcher note: the original full-lane launcher reached the ten-question handoff but did not leave artifacts for that phase. This report is built from the artifacts that actually exist, which include the ten-question real-context results shown above. The long-context edge lane is still missing.