Archive · vps-81 historical telemetry · qwen0_5b/2026-04-10-qwen0_5b-python-task-suite-v1-20q-mini.html. Originally rendered 2026-04-10. Re-hosted from MyServers on 2026-05-06. Methodology and harness conventions may differ from what we use today; see /methodology.html for current standards.
Overnight Telemetry

Python Model Report

This mini page turns the overnight suite into a quick visual read: which models were fastest, which ones improved after the cold first prompt, and which Python task types each model handled best.

Host: vmi3206382
Finished: 2026-04-10T12:46:26Z
Source: 2026-04-10-qwen0_5b-python-task-suite-v1-20q.json
Avg primary latency: 9.4 s
Avg follow-up throughput: 22.08 tok/s
Run: 2026-04-10T12:42:06Z (overnight Python suite start time)
Total Wall Time: 0.07 h (end-to-end runtime across all models)
Models: 1 (large and small models included)
Questions: 20 (primary prompts per model, plus follow-up summaries)
Avg Primary Quality: 81% (average marker coverage across the suite)
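
The 0.07 h wall time can be cross-checked against the start and finish timestamps reported above. This is a minimal sketch, assuming both values are UTC ISO-8601 strings as shown; the function name `wall_time_hours` is illustrative and not part of the harness.

```python
from datetime import datetime


def wall_time_hours(start_iso: str, end_iso: str) -> float:
    """Return the elapsed hours between two ISO-8601 UTC timestamps."""
    # Swap the trailing "Z" for an explicit offset so that
    # datetime.fromisoformat also accepts it on Python versions before 3.11.
    start = datetime.fromisoformat(start_iso.replace("Z", "+00:00"))
    end = datetime.fromisoformat(end_iso.replace("Z", "+00:00"))
    return (end - start).total_seconds() / 3600


hours = wall_time_hours("2026-04-10T12:42:06Z", "2026-04-10T12:46:26Z")
print(f"{hours:.2f} h")  # 260 s elapsed, prints "0.07 h"
```

The same 260 s is also what 20 primaries at 9.4 s plus 20 follow-ups at 3.6 s add up to (188 s + 72 s), although those averages are rounded.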

Headline Winners

These are the standout models for the metrics that matter most in the overnight run.

Fastest Primary

Qwen2.5 Coder 0.5B

9.4 s

Large model with the quickest average first-answer pass.

Fastest Follow-up

Qwen2.5 Coder 0.5B

3.6 s

Quickest summary-request average once the model already had context.

Best Primary Quality

Qwen2.5 Coder 0.5B

81%

Highest average hit-rate on the full Python task requirements.

Best Follow-up Quality

Qwen2.5 Coder 0.5B

60%

Strongest performance on the follow-up summary prompts.

Biggest Warmup Gain

Qwen2.5 Coder 0.5B

-3.5 s

How much faster the later primaries got after the first cold request. A negative value means the later primaries were slower than the first one.

Biggest Warmup Throughput Gain

Qwen2.5 Coder 0.5B

-0.15 tok/s

How much token generation speed improved after the first primary prompt. A negative value means throughput dropped on the later prompts.

Primary Prompt Latency

Average time to answer the main Python task prompt. Lower is better.

Qwen2.5 Coder 0.5B Large
9.4 s

Follow-up Prompt Latency

Average time to answer the follow-up summary request. Lower is better.

Qwen2.5 Coder 0.5B Large
3.6 s

Primary Prompt Throughput

Average tokens per second while answering the main Python tasks. Higher is better.

Qwen2.5 Coder 0.5B Large
20.88 tok/s
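
The page does not say how the "average tokens per second" is aggregated. The sketch below shows two common options, under the assumption that the harness records a token count and a duration per prompt; the function names and the sample values are hypothetical.

```python
from statistics import mean


def mean_of_rates(samples):
    """Average the per-prompt tok/s values, weighting each prompt equally."""
    return mean(tokens / seconds for tokens, seconds in samples)


def pooled_rate(samples):
    """Divide total tokens by total seconds, so long answers weigh more."""
    total_tokens = sum(tokens for tokens, _ in samples)
    total_seconds = sum(seconds for _, seconds in samples)
    return total_tokens / total_seconds


# Hypothetical (tokens, seconds) pairs, not values from this run.
samples = [(100, 5.0), (300, 12.0)]
print(mean_of_rates(samples))  # (20 + 25) / 2 = 22.5
print(pooled_rate(samples))    # 400 / 17, roughly 23.53
```

The two numbers diverge whenever answer lengths vary, so a comparison between runs is only meaningful if both use the same aggregation.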

Follow-up Prompt Throughput

Average tokens per second on the summary requests. Higher is better.

Qwen2.5 Coder 0.5B Large
22.08 tok/s

Primary Quality

Average marker coverage on the full Python task prompts. Higher is better.

Qwen2.5 Coder 0.5B Large
81%
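
"Marker coverage" is not defined on this page. One plausible reading is the share of expected keywords or constructs that appear in the model's answer. The sketch below assumes simple case-insensitive substring matching; the function name, the answer, and the marker list are all hypothetical.

```python
def marker_coverage(answer: str, markers: list[str]) -> float:
    """Return the fraction of expected markers found in the answer."""
    if not markers:
        return 0.0
    text = answer.lower()
    # Count each marker once, ignoring case (an assumption of this sketch).
    hits = sum(1 for marker in markers if marker.lower() in text)
    return hits / len(markers)


# Hypothetical answer and markers, not taken from the suite.
answer = "import asyncio\n\nasync def main():\n    await asyncio.sleep(1)\n"
markers = ["asyncio", "async def", "await", "gather"]
print(f"{marker_coverage(answer, markers):.0%}")  # 3 of 4 markers found, prints "75%"
```

Substring matching rewards mentioning a construct, not using it correctly, so a score of this kind is a rough proxy for quality.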

Follow-up Quality

Average marker coverage on the follow-up summary prompts. Higher is better.

Qwen2.5 Coder 0.5B Large
60%

Cold-Start Latency Delta

Positive values mean the first primary prompt was slower than the warmed-up average. Negative values mean the first prompt was actually faster.

Legend: later prompts got faster / later prompts got slower
Qwen2.5 Coder 0.5B Large
-3.5 s Later prompts got slower
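
The sign convention described above can be sketched as follows, assuming the "warmed-up average" is the mean of every primary prompt after the first. The page does not confirm that definition, and the function name and sample latencies are hypothetical.

```python
from statistics import mean


def cold_start_latency_delta(latencies_s):
    """First-prompt latency minus the mean of the later prompts, in seconds.

    Positive: the first (cold) prompt was slower than the warmed-up average.
    Negative: the first prompt was faster, so later prompts got slower.
    """
    if len(latencies_s) < 2:
        raise ValueError("need the first prompt plus at least one later prompt")
    first, later = latencies_s[0], latencies_s[1:]
    return first - mean(later)


# Hypothetical latencies in seconds, not values from this run.
print(cold_start_latency_delta([6.0, 9.0, 10.0]))  # 6.0 - 9.5 = -3.5
```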

Cold-Start Throughput Delta

Positive values mean token generation sped up after the first primary prompt. Negative values mean throughput dropped on later prompts.

Legend: later prompts got faster / later prompts got slower
Qwen2.5 Coder 0.5B Large
-0.15 tok/s Later prompts got slower
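
For throughput the subtraction runs the other way, so that a positive value still means the later prompts were faster. This is a minimal sketch under the assumption that the delta compares the mean of the later prompts with the first one; the function names, the "No change" label, and the sample rates are hypothetical.

```python
from statistics import mean


def cold_start_throughput_delta(rates_tok_s):
    """Mean tok/s of the later prompts minus the first prompt's tok/s."""
    if len(rates_tok_s) < 2:
        raise ValueError("need the first prompt plus at least one later prompt")
    return mean(rates_tok_s[1:]) - rates_tok_s[0]


def warmup_label(delta):
    """Map a delta to the chart legend; the zero case is this sketch's own choice."""
    if delta > 0:
        return "Later prompts got faster"
    if delta < 0:
        return "Later prompts got slower"
    return "No change"


# Hypothetical per-prompt throughput in tok/s, not values from this run.
delta = cold_start_throughput_delta([21.0, 20.5, 21.0])
print(delta, warmup_label(delta))  # -0.25 Later prompts got slower
```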

Model Overview

This is the quick scan: speed, throughput, primary quality, usable answers, and the cold-start latency gap for each model.

Model: Qwen2.5 Coder 0.5B (Large)
Primary Avg: 9.4 s
Follow-up Avg: 3.6 s
Primary tok/s: 20.88 tok/s
Follow-up tok/s: 22.08 tok/s
Primary Quality: 81%
Usable: 20/20
Cold Gain: -3.5 s

Python Task-Type Heatmap

Values show primary-prompt marker coverage by task type. Green means the model hit more of the requested Python requirements for that category.

Model: Qwen2.5 Coder 0.5B (Large)
analysis: 75%
async: 100%
cli: 100%
concurrency: 100%
config: 25%
debugging: 75%
file_io: 75%
filesystem: 50%
http: 100%
logging: 75%
package: 75%
parsing: 100%
refactor: 75%
sqlite: 75%
tests: 100%
typing: 75%
validation: 75%
web: 100%
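
A per-category view like this one can be derived by grouping per-question coverage scores by task type and averaging each group. The sketch below assumes each question carries a single task-type tag; the function name and the sample results are hypothetical and are not the suite's raw data.

```python
from collections import defaultdict
from statistics import mean


def coverage_by_task_type(results):
    """Average marker coverage per task type from (task_type, coverage) pairs."""
    grouped = defaultdict(list)
    for task_type, coverage in results:
        grouped[task_type].append(coverage)
    # Sort by task type so the columns come out in a stable, alphabetical order.
    return {task_type: mean(values) for task_type, values in sorted(grouped.items())}


# Hypothetical per-question results, not the suite's raw data.
results = [("async", 1.0), ("config", 0.25), ("config", 0.75), ("web", 1.0)]
for task_type, coverage in coverage_by_task_type(results).items():
    print(f"{task_type}: {coverage:.0%}")
```

With 20 questions spread over 18 task types, most categories would rest on a single prompt, so individual cells are best read as anecdotes and not as stable estimates.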