Python Model Report
This mini page turns the overnight suite into a quick visual read: which models were fastest, which ones improved after the cold first prompt, and which Python task types each model handled best.
Headline Winners
These are the standout models for the metrics that matter most in the overnight run.
Qwen2.5 Coder 1.5B
11.4 s
Small model with the quickest average first-answer pass.
Qwen2.5 Coder 1.5B
7.5 s
Quickest summary-request average once the model already had context.
Qwen2.5 Coder 1.5B
88%
Highest average hit-rate on the full Python task requirements.
Qwen2.5 Coder 1.5B
0%
Strongest performance on the follow-up summary prompts.
Qwen2.5 Coder 1.5B
-4.6 s
Cold-start latency delta on the primary prompts. The negative value means the first, cold request was actually faster than the warmed-up average.
Qwen2.5 Coder 1.5B
-1.20 tok/s
Cold-start throughput delta on the primary prompts. The negative value means token generation slowed down after the first primary prompt.
Primary Prompt Latency
Average time to answer the main Python task prompt. Lower is better.
Follow-up Prompt Latency
Average time to answer the follow-up summary request. Lower is better.
Primary Prompt Throughput
Average tokens per second while answering the main Python tasks. Higher is better.
Follow-up Prompt Throughput
Average tokens per second on the summary requests. Higher is better.
Primary Quality
Average marker coverage on the full Python task prompts. Higher is better.
Follow-up Quality
Average marker coverage on the follow-up summary prompts. Higher is better.
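The report does not spell out how marker coverage is computed. As a minimal sketch, assuming each task ships with a list of required marker strings and a marker counts as hit when it appears (case-insensitively) in the model's answer, the score could be calculated as below. The function name `marker_coverage` and the sample markers are hypothetical, not taken from the suite.

```python
def marker_coverage(answer: str, markers: list[str]) -> float:
    """Return the fraction of required markers found in the answer (0.0 to 1.0).

    Assumption: a "marker" is a plain substring the answer must contain.
    The real suite may use a different matching rule.
    """
    if not markers:
        raise ValueError("at least one marker is required")
    text = answer.lower()
    hits = sum(1 for marker in markers if marker.lower() in text)
    return hits / len(markers)


# Hypothetical CLI task: four required markers, two of them present.
sample_answer = (
    "import argparse\n"
    "parser = argparse.ArgumentParser()\n"
    "parser.add_argument('--path')\n"
)
sample_markers = ["argparse", "add_argument", "parse_args", "__main__"]
coverage = marker_coverage(sample_answer, sample_markers)
print(f"{coverage:.0%}")  # 50%
```

Averaging this value over all prompts of one kind would give a per-model figure comparable to the quality percentages in this report.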
Cold-Start Latency Delta
Positive values mean the first primary prompt was slower than the warmed-up average. Negative values mean the first prompt was actually faster.
Cold-Start Throughput Delta
Positive values mean token generation sped up after the first primary prompt. Negative values mean throughput dropped on later prompts.
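The sign conventions of the two cold-start deltas are easy to mix up, so here is a small sketch of how they could be derived from per-prompt measurements. It assumes the "warmed-up average" is the mean over every primary prompt after the first one; the report does not confirm this, and the function names are hypothetical.

```python
from statistics import mean


def cold_start_latency_delta(latencies_s: list[float]) -> float:
    """First-prompt latency minus the mean of the later prompts.

    Positive: the cold first prompt was slower than the warmed-up average.
    Negative: the first prompt was actually faster.
    """
    first, *later = latencies_s
    if not later:
        raise ValueError("need at least two measurements")
    return first - mean(later)


def cold_start_throughput_delta(tok_per_s: list[float]) -> float:
    """Mean throughput of the later prompts minus first-prompt throughput.

    Positive: token generation sped up after the first prompt.
    Negative: throughput dropped on later prompts.
    """
    first, *later = tok_per_s
    if not later:
        raise ValueError("need at least two measurements")
    return mean(later) - first


# Made-up measurements, for illustration only.
latency_delta = cold_start_latency_delta([10.0, 6.0, 8.0])         # 3.0 s slower when cold
throughput_delta = cold_start_throughput_delta([15.0, 17.0, 19.0])  # 3.0 tok/s faster when warm
```

Note that the two deltas are subtracted in opposite directions, so a positive value signals a warm-up benefit in both cases.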
Model Overview
This is the quick scan: speed, throughput, primary quality, usable answers, and the cold-start latency gap for each model.
| Model | Primary Avg | Follow-up Avg | Primary tok/s | Follow-up tok/s | Primary Quality | Usable | Cold-Start Latency Delta |
|---|---|---|---|---|---|---|---|
| Qwen2.5 Coder 1.5B Small | 11.4 s | 7.5 s | 16.89 tok/s | 16.56 tok/s | 88% | 20/20 | -4.6 s |
Python Task-Type Heatmap
Values show primary-prompt marker coverage by task type. Green means the model hit more of the requested Python requirements for that category.
| Model | analysis | async | cli | concurrency | config | debugging | file_io | filesystem | http | logging | package | parsing | refactor | sqlite | tests | typing | validation | web |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5 Coder 1.5B Small | 75% | 100% | 100% | 100% | 50% | 100% | 100% | 75% | 75% | 75% | 100% | 88% | 75% | 100% | 100% | 100% | 75% | 100% |