This page summarizes the tiny-model benchmark lane that actually finished on vps50: the exact hello check, the five-question small-model suite, the twenty-question Python suite, and the ten-question real-context run. The long-context edge run did not produce artifacts, so it is marked clearly as missing instead of being silently ignored.
All finished artifacts are in this report. The page calls out exactly which planned phases completed and which ones still did not produce any files.
Exact prompt succeeded and returned a clean greeting.
All five tasks finished and produced scored artifacts.
All twenty primary and follow-up prompts finished.
All ten real-context tasks finished and were pulled into the report.
No output files were produced for the long-context append lane.
The tiny model stayed surprisingly quick. The strongest signal is not raw quality but how much useful structure it maintained while averaging under ten seconds per primary prompt on the bigger Python suite.
Exact prompt: hi can you help me?
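80% average marker coverage across the five-question suite
20.88 tok/s average primary throughput on the twenty-question Python suite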
22.08 tok/s
Every primary prompt returned a non-empty answer.
Follow-ups were much more obedient than primaries.
17.21 tok/s average primary throughput on the ten-question real-context suite
Primary prompts that returned a usable answer.
This is the exact smoke-test prompt that kicked off the benchmark lane.
These are the short shell, ops, planning, and debugging tasks. Lower is better.
Primary-prompt durations for the full Python suite, sorted from slowest to fastest.
Higher is better. This shows how much of each prompt's requested structure and ingredients the tiny model actually hit on the first try.
Quick read: great on planning and SSH triage, shakier on exact shell formatting and the IPv4 implementation details.
| Task | Primary | Follow-up | Primary tok/s | Marker hit | Format OK | Usable |
|---|---|---|---|---|---|---|
| Disk Guard Script (shell) | 31.7s | n/a | 12.98 tok/s | 75% | No | Yes |
| IPv4 Validator (python) | 41.2s | n/a | 13.53 tok/s | 75% | No | Yes |
| Nginx Safe Reload (ops) | 4.6s | n/a | 27.37 tok/s | 50% | Yes | Yes |
| YAML Validator Plan (planning) | 23.0s | n/a | 26.07 tok/s | 100% | Yes | Yes |
| SSH Lockout Triage (debugging) | 16.1s | n/a | 23.40 tok/s | 100% | Yes | Yes |
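
The headline figures for this suite can be re-derived from the rows above. The following is a minimal sketch, assuming the table values are transcribed by hand; `SMALL_SUITE` and `summarize` are hypothetical names used for illustration, not part of the benchmark tooling.

```python
from statistics import mean

# Rows transcribed from the five-question table:
# (task, primary seconds, primary tok/s, marker hit %, format OK)
SMALL_SUITE = [
    ("Disk Guard Script", 31.7, 12.98, 75, False),
    ("IPv4 Validator", 41.2, 13.53, 75, False),
    ("Nginx Safe Reload", 4.6, 27.37, 50, True),
    ("YAML Validator Plan", 23.0, 26.07, 100, True),
    ("SSH Lockout Triage", 16.1, 23.40, 100, True),
]


def summarize(rows):
    """Aggregate marker coverage, format passes, and the slowest task."""
    return {
        "avg_marker_hit": mean(r[3] for r in rows),
        "format_ok": sum(1 for r in rows if r[4]),
        "slowest": max(rows, key=lambda r: r[1])[0],
    }


summary = summarize(SMALL_SUITE)
print(summary)
```

Under these assumptions the average marker hit comes out to 80%, with three of five primaries passing the format check and the IPv4 task as the slowest.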
This is the practical coding workload. The tiny model answered every primary and every follow-up, but it needed the follow-up turn to obey formatting much more consistently.
| Task | Primary | Follow-up | Primary tok/s | Marker hit | Format OK | Usable |
|---|---|---|---|---|---|---|
| CSV Parser (parsing) | 6.1s | 4.0s | 21.02 tok/s | 100% | No | Yes |
| File Scanner (file_io) | 12.9s | 2.1s | 22.55 tok/s | 75% | No | Yes |
| CLI Arguments (cli) | 4.9s | 5.5s | 27.47 tok/s | 100% | No | Yes |
| Typed Dataclass (typing) | 3.2s | 5.5s | 28.24 tok/s | 75% | No | Yes |
| Pytest Fixture (tests) | 19.1s | 2.9s | 20.29 tok/s | 100% | No | Yes |
| Async Fetch (async) | 5.9s | 2.6s | 20.76 tok/s | 100% | No | Yes |
| HTTP Retry (http) | 6.8s | 3.4s | 22.54 tok/s | 100% | No | Yes |
| JSON Validation (validation) | 13.3s | 3.4s | 26.87 tok/s | 75% | No | Yes |
| SQLite Store (sqlite) | 6.1s | 1.3s | 29.16 tok/s | 75% | No | Yes |
| FastAPI Handler (web) | 3.4s | 2.8s | 25.86 tok/s | 100% | No | Yes |
| Config Dataclass (config) | 5.9s | 2.2s | 15.00 tok/s | 25% | No | Yes |
| Logging Setup (logging) | 5.1s | 2.4s | 25.17 tok/s | 75% | No | Yes |
| Thread Pool (concurrency) | 6.8s | 3.9s | 27.99 tok/s | 100% | No | Yes |
| Package Layout (package) | 5.9s | 2.1s | 19.16 tok/s | 75% | No | Yes |
| Debug Stacktrace (debugging) | 11.5s | 9.8s | 10.20 tok/s | 75% | No | Yes |
| Refactor Split (refactor) | 31.3s | 1.6s | 6.34 tok/s | 75% | No | Yes |
| CSV Summary (analysis) | 4.1s | 4.8s | 24.25 tok/s | 75% | No | Yes |
| Pathlib Cleaner (filesystem) | 11.7s | 6.6s | 19.93 tok/s | 50% | No | Yes |
| Pydantic Model (validation) | 9.4s | 2.8s | 10.74 tok/s | 75% | No | Yes |
| Regex Log Parser (parsing) | 14.7s | 2.5s | 14.01 tok/s | 100% | No | Yes |
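
The suite-level averages quoted elsewhere on this page (about 9.4s per primary and 20.88 tok/s) can be checked against the rows above. This is a rough sketch that assumes a plain unweighted mean over the twenty tasks; `PYTHON_SUITE` is a hypothetical hand-transcribed list, not an artifact from the run.

```python
from statistics import mean

# (primary seconds, follow-up seconds, primary tok/s) per task, in table order
PYTHON_SUITE = [
    (6.1, 4.0, 21.02), (12.9, 2.1, 22.55), (4.9, 5.5, 27.47), (3.2, 5.5, 28.24),
    (19.1, 2.9, 20.29), (5.9, 2.6, 20.76), (6.8, 3.4, 22.54), (13.3, 3.4, 26.87),
    (6.1, 1.3, 29.16), (3.4, 2.8, 25.86), (5.9, 2.2, 15.00), (5.1, 2.4, 25.17),
    (6.8, 3.9, 27.99), (5.9, 2.1, 19.16), (11.5, 9.8, 10.20), (31.3, 1.6, 6.34),
    (4.1, 4.8, 24.25), (11.7, 6.6, 19.93), (9.4, 2.8, 10.74), (14.7, 2.5, 14.01),
]

# Unweighted means across all twenty tasks
avg_primary_s = mean(row[0] for row in PYTHON_SUITE)
avg_followup_s = mean(row[1] for row in PYTHON_SUITE)
avg_primary_tok_s = mean(row[2] for row in PYTHON_SUITE)

print(f"avg primary:   {avg_primary_s:.1f}s")
print(f"avg follow-up: {avg_followup_s:.2f}s")
print(f"avg tok/s:     {avg_primary_tok_s:.2f}")
```

An unweighted mean treats the 31.3s refactor task the same as the 3.2s dataclass task. A token-weighted throughput figure would differ, but the table does not include token counts, so it is not attempted here.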
This groups the finished 20-question suite by task family so you can see where the tiny model stayed sharp and where it softened.
| Category | Tasks | Avg primary | Avg primary tok/s | Avg primary hit | Avg follow-up hit |
|---|---|---|---|---|---|
| analysis | 1/1 | 4.1s | 24.25 tok/s | 75% | 100% |
| async | 1/1 | 5.9s | 20.76 tok/s | 100% | 100% |
| cli | 1/1 | 4.9s | 27.47 tok/s | 100% | 0% |
| concurrency | 1/1 | 6.8s | 27.99 tok/s | 100% | 100% |
| config | 1/1 | 5.9s | 15.00 tok/s | 25% | 100% |
| debugging | 1/1 | 11.5s | 10.20 tok/s | 75% | 100% |
| file_io | 1/1 | 12.9s | 22.55 tok/s | 75% | 100% |
| filesystem | 1/1 | 11.7s | 19.93 tok/s | 50% | 0% |
| http | 1/1 | 6.8s | 22.54 tok/s | 100% | 0% |
| logging | 1/1 | 5.1s | 25.17 tok/s | 75% | 100% |
| package | 1/1 | 5.9s | 19.16 tok/s | 75% | 100% |
| parsing | 2/2 | 10.4s | 17.52 tok/s | 100% | 0% |
| refactor | 1/1 | 31.3s | 6.34 tok/s | 75% | 100% |
| sqlite | 1/1 | 6.1s | 29.16 tok/s | 75% | 100% |
| tests | 1/1 | 19.1s | 20.29 tok/s | 100% | 0% |
| typing | 1/1 | 3.2s | 28.24 tok/s | 75% | 100% |
| validation | 2/2 | 11.3s | 18.80 tok/s | 75% | 50% |
| web | 1/1 | 3.4s | 25.86 tok/s | 100% | 0% |
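
The per-category follow-up column can be rolled back up into a suite-wide pass count. The sketch below assumes each task's follow-up result is a simple pass or fail, so a category's percentage times its task count gives the number of passes. That interpretation is an assumption, and `CATEGORY_FOLLOWUP` is a hypothetical transcription of the table.

```python
# (category, task count, avg follow-up hit %) transcribed from the category table
CATEGORY_FOLLOWUP = [
    ("analysis", 1, 100), ("async", 1, 100), ("cli", 1, 0), ("concurrency", 1, 100),
    ("config", 1, 100), ("debugging", 1, 100), ("file_io", 1, 100), ("filesystem", 1, 0),
    ("http", 1, 0), ("logging", 1, 100), ("package", 1, 100), ("parsing", 2, 0),
    ("refactor", 1, 100), ("sqlite", 1, 100), ("tests", 1, 0), ("typing", 1, 100),
    ("validation", 2, 50), ("web", 1, 0),
]

# Weight each category's percentage by its task count to recover pass totals
total_tasks = sum(count for _, count, _ in CATEGORY_FOLLOWUP)
followup_passes = sum(count * pct / 100 for _, count, pct in CATEGORY_FOLLOWUP)

print(f"follow-up passes: {followup_passes:.0f}/{total_tasks}")
```

Read this way, the category rows add up to 12 passes across 20 tasks.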
These are the heavier repository-shaped prompts. Lower is better.
Higher is better. This shows how much of the requested triage, test, planning, and review structure landed on the first response.
This is the more realistic repo-style suite. It is the best signal for whether the tiny model can stay useful once the prompts stop looking like toy exercises.
| Task | Primary | Follow-up | Primary tok/s | Marker hit | Format OK | Usable |
|---|---|---|---|---|---|---|
| | 13.6s | 3.2s | 17.27 tok/s | 50% | No | Yes |
| Board Snapshot Regression Test (tests) | 19.8s | 6.6s | 19.95 tok/s | 43% | No | Yes |
| Lane Config Patch Plan (planning) | 22.5s | 7.7s | 8.07 tok/s | 67% | No | Yes |
| API Token Audit Regression Test (tests) | 24.1s | 3.1s | 19.20 tok/s | 57% | No | Yes |
| Announcements State Sync Review (review) | 31.4s | 5.7s | 16.47 tok/s | 100% | No | Yes |
| Feature Flag Lifecycle Test (tests) | 24.0s | 4.4s | 17.36 tok/s | 57% | No | Yes |
| Task Bulk Job Debug Packet (debugging) | 33.8s | 4.2s | 11.80 tok/s | 33% | No | Yes |
| User Preferences Contract Test (tests) | 20.0s | 2.4s | 20.00 tok/s | 71% | No | Yes |
| Orchestration Timeline Forensics (forensics) | 11.0s | 2.5s | 16.53 tok/s | 100% | No | Yes |
| | 4.4s | 2.6s | 25.46 tok/s | 50% | No | Yes |
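
The report does not describe how the marker hit column is scored. One plausible reading is the share of required markers that appear in the response, which would explain values such as 43% (3 of 7) and 67% (2 of 3). The sketch below illustrates that reading only. The `marker_hit` function, the marker list, and the sample response are all hypothetical and are not taken from the benchmark harness.

```python
def marker_hit(response: str, markers: list[str]) -> float:
    """Return the percentage of required markers found in the response.

    Matching is a case-insensitive substring check, which is an assumption;
    the real scorer may use stricter rules.
    """
    if not markers:
        return 0.0
    text = response.lower()
    found = sum(1 for marker in markers if marker.lower() in text)
    return 100 * found / len(markers)


# Hypothetical markers for a test-writing prompt
markers = ["pytest", "fixture", "assert", "snapshot", "tmp_path", "parametrize", "monkeypatch"]
response = "Use a pytest fixture and assert on the result."

score = marker_hit(response, markers)
print(f"{score:.0f}%")  # 3 of 7 markers found -> 43%
```

A substring check like this is cheap but generous: it rewards mentioning a term even when the surrounding code is wrong, which may be one reason the Usable and Format OK columns are tracked separately.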
This tiny coder is not a replacement for your larger workers, but it is very much a real utility model. It answered every task in the twenty-question Python suite, stayed around 9.4s on average for primaries, and pushed around 20.88 tok/s. The weak spot is first-pass obedience: its primary-format pass count was 0/20, while the follow-up format pass count jumped to 12/20.