Archive · vps-81 historical telemetry · python-overnight/2026-04-06-python-task-suite-mini.html. Originally rendered 2026-04-06. Re-hosted from MyServers on 2026-05-06. Methodology and harness conventions may differ from what we use today; see /methodology.html for current standards.
Overnight Telemetry

Python Model Report

This mini page turns the overnight suite into a quick visual read: which models were fastest, which ones improved after the cold first prompt, and which Python task types each model handled best.

Host: vmi3206382
Finished: 2026-04-06T07:57:51Z
Source: 2026-04-06-python-task-suite.json
Avg primary latency: 83.0 s
Avg follow-up throughput: 2.89 tok/s

Run: 2026-04-05T23:17:56Z (overnight Python suite start time)
Total Wall Time: 8.67 h (end-to-end runtime across all models)
Models: 11 (large and small models included)
Questions: 20 (primary prompts per model, plus follow-up summaries)
Avg Primary Quality: 85% (average marker coverage across the suite)
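The total wall time can be checked against the start and finish timestamps reported above. The following is a minimal sketch; the helper name `wall_time_hours` is hypothetical and not part of the original harness.

```python
from datetime import datetime, timezone

# Timestamp layout used by the report header (UTC, trailing "Z").
FMT = "%Y-%m-%dT%H:%M:%SZ"


def wall_time_hours(start: str, end: str) -> float:
    """Return the elapsed time between two UTC timestamps, in hours."""
    started = datetime.strptime(start, FMT).replace(tzinfo=timezone.utc)
    finished = datetime.strptime(end, FMT).replace(tzinfo=timezone.utc)
    return (finished - started).total_seconds() / 3600


# Start and finish times as listed in the report header.
hours = wall_time_hours("2026-04-05T23:17:56Z", "2026-04-06T07:57:51Z")
print(f"{hours:.2f} h")
```

The run spans 8 h 39 min 55 s, which rounds to the 8.67 h shown in the summary.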

Headline Winners

These are the standout models for the metrics that matter most in the overnight run.

Fastest Primary

Phi-3 Mini

26.9 s

Small model with the quickest average first-answer pass.

Fastest Follow-up

Llama 3.2 3B

15.3 s

Quickest summary-request average once the model already had context.

Best Primary Quality

Phind 34 16k

92%

Highest average hit-rate on the full Python task requirements.

Best Follow-up Quality

Qwen32 Coder 32k

100%

Perfect marker coverage on the follow-up summary prompts, a score shared with six other models.

Biggest Warmup Gain

Qwen32 Coder 32k

82.1 s

How much faster the later primary prompts were, on average, than the first cold request.

Biggest Warmup Throughput Gain

Qwen2.5 Coder 3B

1.91 tok/s

How much token generation speed improved after the first primary prompt.

Primary Prompt Latency

Average time to answer the main Python task prompt. Lower is better.

Phi-3 Mini (Small): 26.9 s
Llama 3.2 3B (Small): 27.7 s
Qwen2.5 3B (Small): 33.3 s
Qwen2.5 Coder 3B (Small): 38.1 s
Qwen2.5 Coder 1.5B (Small): 42.0 s
Qwen14 Coder 32k (Large): 57.2 s
Qwen14 General 32k (Large): 76.8 s
Qwen32 Coder 32k (Large): 105.9 s
CodeLlama 34 16k (Large): 119.4 s
Codestral 32k (Large): 147.0 s
Phind 34 16k (Large): 238.5 s
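As a consistency check, the per-model primary latencies listed above can be averaged and compared with the "Avg primary latency" figure in the page header. This is a minimal sketch; the dictionary name is illustrative, and it assumes the header value is an unweighted mean over the 11 models.

```python
from statistics import mean

# Average primary-prompt latency per model, in seconds, as listed above.
primary_latency_s = {
    "Phi-3 Mini": 26.9,
    "Llama 3.2 3B": 27.7,
    "Qwen2.5 3B": 33.3,
    "Qwen2.5 Coder 3B": 38.1,
    "Qwen2.5 Coder 1.5B": 42.0,
    "Qwen14 Coder 32k": 57.2,
    "Qwen14 General 32k": 76.8,
    "Qwen32 Coder 32k": 105.9,
    "CodeLlama 34 16k": 119.4,
    "Codestral 32k": 147.0,
    "Phind 34 16k": 238.5,
}

# Unweighted mean across models (assumed to match the header's definition).
suite_avg_s = mean(primary_latency_s.values())
print(f"{suite_avg_s:.1f} s across {len(primary_latency_s)} models")
```

The result rounds to 83.0 s, matching the header.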

Follow-up Prompt Latency

Average time to answer the follow-up summary request. Lower is better.

Llama 3.2 3B (Small): 15.3 s
Phi-3 Mini (Small): 18.6 s
Qwen2.5 Coder 3B (Small): 25.3 s
Qwen2.5 Coder 1.5B (Small): 29.7 s
Qwen2.5 3B (Small): 35.9 s
Qwen14 General 32k (Large): 48.0 s
Qwen14 Coder 32k (Large): 60.1 s
Phind 34 16k (Large): 95.9 s
CodeLlama 34 16k (Large): 98.2 s
Qwen32 Coder 32k (Large): 107.7 s
Codestral 32k (Large): 111.8 s

Primary Prompt Throughput

Average tokens per second while answering the main Python tasks. Higher is better.

Phi-3 Mini (Small): 6.93 tok/s
Llama 3.2 3B (Small): 6.36 tok/s
Qwen2.5 Coder 1.5B (Small): 4.25 tok/s
Qwen2.5 Coder 3B (Small): 4.16 tok/s
Qwen2.5 3B (Small): 3.35 tok/s
Qwen14 Coder 32k (Large): 1.82 tok/s
Qwen14 General 32k (Large): 1.70 tok/s
Codestral 32k (Large): 1.54 tok/s
CodeLlama 34 16k (Large): 1.25 tok/s
Phind 34 16k (Large): 1.14 tok/s
Qwen32 Coder 32k (Large): 0.83 tok/s

Follow-up Prompt Throughput

Average tokens per second on the summary requests. Higher is better.

Llama 3.2 3B (Small): 6.53 tok/s
Phi-3 Mini (Small): 6.07 tok/s
Qwen2.5 Coder 3B (Small): 3.91 tok/s
Qwen2.5 Coder 1.5B (Small): 3.87 tok/s
Qwen2.5 3B (Small): 3.45 tok/s
Qwen14 Coder 32k (Large): 1.83 tok/s
Qwen14 General 32k (Large): 1.62 tok/s
Codestral 32k (Large): 1.43 tok/s
CodeLlama 34 16k (Large): 1.22 tok/s
Phind 34 16k (Large): 1.09 tok/s
Qwen32 Coder 32k (Large): 0.80 tok/s

Primary Quality

Average marker coverage on the full Python task prompts. Higher is better.

Phind 34 16k (Large): 92%
Codestral 32k (Large): 89%
Qwen2.5 Coder 3B (Small): 89%
Qwen2.5 Coder 1.5B (Small): 89%
Qwen14 Coder 32k (Large): 86%
Llama 3.2 3B (Small): 86%
Qwen32 Coder 32k (Large): 82%
CodeLlama 34 16k (Large): 82%
Qwen14 General 32k (Large): 82%
Qwen2.5 3B (Small): 80%
Phi-3 Mini (Small): 78%
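The page does not spell out how "marker coverage" is scored. One plausible reading is the share of expected marker strings that appear in a model's answer. The sketch below illustrates that reading only; the function name, the case-insensitive substring matching, and the sample markers are all assumptions, not the harness's actual implementation.

```python
def marker_coverage(answer: str, markers: list[str]) -> float:
    """Return the fraction of expected markers found in the answer.

    Assumption: a marker counts as hit when it occurs as a
    case-insensitive substring of the answer text.
    """
    if not markers:
        raise ValueError("at least one marker is required")
    text = answer.lower()
    hits = sum(1 for marker in markers if marker.lower() in text)
    return hits / len(markers)


# Hypothetical answer and marker list for a CLI-style task.
answer = (
    "import argparse\n\n"
    "def main():\n"
    "    parser = argparse.ArgumentParser()\n"
)
markers = ["argparse", "def main", "add_argument", "__name__"]

score = marker_coverage(answer, markers)
print(f"{score:.0%}")
```

Here two of the four hypothetical markers are present, so the answer scores 50%. Under this reading, the quarter-step and eighth-step percentages seen elsewhere on the page would correspond to tasks with four or eight markers, though the source does not confirm that.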

Follow-up Quality

Average marker coverage on the follow-up summary prompts. Higher is better.

Qwen32 Coder 32k (Large): 100%
Qwen14 Coder 32k (Large): 100%
Codestral 32k (Large): 100%
Phind 34 16k (Large): 100%
Qwen14 General 32k (Large): 100%
Qwen2.5 3B (Small): 100%
Llama 3.2 3B (Small): 100%
Phi-3 Mini (Small): 93%
CodeLlama 34 16k (Large): 25%
Qwen2.5 Coder 3B (Small): 0%
Qwen2.5 Coder 1.5B (Small): 0%

Cold-Start Latency Delta

Positive values mean the first primary prompt was slower than the warmed-up average. Negative values mean the first prompt was actually faster.

Qwen32 Coder 32k (Large): 82.1 s (later prompts got faster)
Qwen14 General 32k (Large): 62.6 s (later prompts got faster)
Qwen14 Coder 32k (Large): 36.8 s (later prompts got faster)
Qwen2.5 Coder 3B (Small): 29.7 s (later prompts got faster)
Phi-3 Mini (Small): 24.2 s (later prompts got faster)
Codestral 32k (Large): 7.1 s (later prompts got faster)
CodeLlama 34 16k (Large): 2.0 s (later prompts got faster)
Llama 3.2 3B (Small): -0.3 s (later prompts got slower)
Qwen2.5 3B (Small): -1.4 s (later prompts got slower)
Qwen2.5 Coder 1.5B (Small): -6.7 s (later prompts got slower)
Phind 34 16k (Large): -47.8 s (later prompts got slower)
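The sign convention for the latency delta can be sketched as the first prompt's latency minus the mean of the later prompts. The function name and sample latencies below are hypothetical, and whether the harness excludes the first prompt from the warmed-up average is an assumption.

```python
from statistics import mean


def cold_start_latency_delta(latencies_s: list[float]) -> float:
    """First-prompt latency minus the mean latency of all later prompts.

    Positive: the cold first prompt was slower (later prompts got faster).
    Negative: the first prompt was faster than the warmed-up average.
    """
    if len(latencies_s) < 2:
        raise ValueError("need the first prompt plus at least one later prompt")
    return latencies_s[0] - mean(latencies_s[1:])


# Hypothetical per-prompt latencies in seconds; not taken from the run.
example_latencies_s = [120.0, 40.0, 36.0, 38.0]
delta_s = cold_start_latency_delta(example_latencies_s)
print(f"{delta_s:+.1f} s")
```

With these made-up numbers the later prompts average 38.0 s, so the delta is +82.0 s, which would be reported as "later prompts got faster".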

Cold-Start Throughput Delta

Positive values mean token generation sped up after the first primary prompt. Negative values mean throughput dropped on later prompts.

Qwen2.5 Coder 3B (Small): 1.91 tok/s (later prompts got faster)
Qwen14 Coder 32k (Large): 0.51 tok/s (later prompts got faster)
Qwen2.5 Coder 1.5B (Small): 0.39 tok/s (later prompts got faster)
Qwen14 General 32k (Large): 0.28 tok/s (later prompts got faster)
Qwen2.5 3B (Small): 0.21 tok/s (later prompts got faster)
Qwen32 Coder 32k (Large): 0.08 tok/s (later prompts got faster)
Codestral 32k (Large): 0.07 tok/s (later prompts got faster)
CodeLlama 34 16k (Large): -0.19 tok/s (later prompts got slower)
Phind 34 16k (Large): -0.21 tok/s (later prompts got slower)
Phi-3 Mini (Small): -0.44 tok/s (later prompts got slower)
Llama 3.2 3B (Small): -0.47 tok/s (later prompts got slower)
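The throughput delta runs in the opposite direction from the latency delta: it is the warmed-up rate minus the first prompt's rate, so a positive value again means improvement. A minimal sketch follows, assuming each prompt is recorded as a (tokens generated, seconds elapsed) pair and that the warmed-up figure is the mean of per-prompt rates. Both are assumptions, and the sample values are hypothetical.

```python
from statistics import mean


def cold_start_throughput_delta(samples: list[tuple[int, float]]) -> float:
    """Mean tok/s of later prompts minus tok/s of the first prompt.

    Each sample is (tokens_generated, seconds_elapsed).
    Positive: generation sped up after the first prompt.
    """
    if len(samples) < 2:
        raise ValueError("need the first prompt plus at least one later prompt")
    rates = [tokens / seconds for tokens, seconds in samples]
    return mean(rates[1:]) - rates[0]


# Hypothetical samples: 2.0 tok/s cold, then 3.0 tok/s twice.
example_samples = [(100, 50.0), (120, 40.0), (150, 50.0)]
throughput_delta = cold_start_throughput_delta(example_samples)
print(f"{throughput_delta:+.2f} tok/s")
```

A harness could instead divide total later tokens by total later seconds. The two definitions agree only when per-prompt rates are equal, which may help explain why a model can show a latency gain but a throughput loss in the lists above.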

Model Overview

This is the quick scan: speed, throughput, primary quality, usable answers, and the cold-start latency gap for each model.

Model | Primary Avg | Follow-up Avg | Primary tok/s | Follow-up tok/s | Primary Quality | Usable | Cold Gain
Qwen32 Coder 32k (Large) | 105.9 s | 107.7 s | 0.83 tok/s | 0.80 tok/s | 82% | 20/20 | 82.1 s
Qwen14 Coder 32k (Large) | 57.2 s | 60.1 s | 1.82 tok/s | 1.83 tok/s | 86% | 20/20 | 36.8 s
Codestral 32k (Large) | 147.0 s | 111.8 s | 1.54 tok/s | 1.43 tok/s | 89% | 20/20 | 7.1 s
CodeLlama 34 16k (Large) | 119.4 s | 98.2 s | 1.25 tok/s | 1.22 tok/s | 82% | 20/20 | 2.0 s
Phind 34 16k (Large) | 238.5 s | 95.9 s | 1.14 tok/s | 1.09 tok/s | 92% | 20/20 | -47.8 s
Qwen14 General 32k (Large) | 76.8 s | 48.0 s | 1.70 tok/s | 1.62 tok/s | 82% | 20/20 | 62.6 s
Qwen2.5 Coder 3B (Small) | 38.1 s | 25.3 s | 4.16 tok/s | 3.91 tok/s | 89% | 20/20 | 29.7 s
Qwen2.5 Coder 1.5B (Small) | 42.0 s | 29.7 s | 4.25 tok/s | 3.87 tok/s | 89% | 20/20 | -6.7 s
Qwen2.5 3B (Small) | 33.3 s | 35.9 s | 3.35 tok/s | 3.45 tok/s | 80% | 20/20 | -1.4 s
Llama 3.2 3B (Small) | 27.7 s | 15.3 s | 6.36 tok/s | 6.53 tok/s | 86% | 20/20 | -0.3 s
Phi-3 Mini (Small) | 26.9 s | 18.6 s | 6.93 tok/s | 6.07 tok/s | 78% | 20/20 | 24.2 s
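Several of the headline winners can be re-derived directly from the overview rows. The sketch below copies four columns from the table into a hypothetical `Row` type and picks the minimum or maximum per metric; the type and field names are illustrative only.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    primary_s: float        # Primary Avg, seconds
    followup_s: float       # Follow-up Avg, seconds
    primary_quality: int    # Primary Quality, percent
    cold_gain_s: float      # Cold Gain, seconds


# Values copied from the Model Overview table.
ROWS = [
    Row("Qwen32 Coder 32k", 105.9, 107.7, 82, 82.1),
    Row("Qwen14 Coder 32k", 57.2, 60.1, 86, 36.8),
    Row("Codestral 32k", 147.0, 111.8, 89, 7.1),
    Row("CodeLlama 34 16k", 119.4, 98.2, 82, 2.0),
    Row("Phind 34 16k", 238.5, 95.9, 92, -47.8),
    Row("Qwen14 General 32k", 76.8, 48.0, 82, 62.6),
    Row("Qwen2.5 Coder 3B", 38.1, 25.3, 89, 29.7),
    Row("Qwen2.5 Coder 1.5B", 42.0, 29.7, 89, -6.7),
    Row("Qwen2.5 3B", 33.3, 35.9, 80, -1.4),
    Row("Llama 3.2 3B", 27.7, 15.3, 86, -0.3),
    Row("Phi-3 Mini", 26.9, 18.6, 78, 24.2),
]

# Lower is better for latency; higher is better for quality and warmup gain.
winners = {
    "Fastest Primary": min(ROWS, key=lambda r: r.primary_s).model,
    "Fastest Follow-up": min(ROWS, key=lambda r: r.followup_s).model,
    "Best Primary Quality": max(ROWS, key=lambda r: r.primary_quality).model,
    "Biggest Warmup Gain": max(ROWS, key=lambda r: r.cold_gain_s).model,
}

for metric, model in winners.items():
    print(f"{metric}: {model}")
```

The four results agree with the corresponding Headline Winners cards.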

Python Task-Type Heatmap

Values show primary-prompt marker coverage by task type. Higher values (shaded greener on the rendered page) mean the model hit more of the requested Python requirements for that category.

Model | analysis | async | cli | concurrency | config | debugging | file_io | filesystem | http | logging | package | parsing | refactor | sqlite | tests | typing | validation | web
Qwen32 Coder 32k (Large) | 75% | 100% | 100% | 100% | 50% | 50% | 75% | 100% | 100% | 50% | 75% | 75% | 50% | 100% | 100% | 100% | 88% | 100%
Qwen14 Coder 32k (Large) | 75% | 100% | 100% | 100% | 50% | 75% | 100% | 100% | 75% | 50% | 100% | 75% | 75% | 100% | 100% | 100% | 88% | 100%
Codestral 32k (Large) | 75% | 100% | 100% | 100% | 50% | 75% | 75% | 100% | 100% | 75% | 100% | 75% | 75% | 100% | 100% | 100% | 100% | 100%
CodeLlama 34 16k (Large) | 75% | 75% | 100% | 75% | 100% | 50% | 100% | 50% | 100% | 75% | 100% | 62% | 75% | 75% | 100% | 100% | 88% | 100%
Phind 34 16k (Large) | 100% | 100% | 100% | 75% | 100% | 75% | 75% | 100% | 100% | 75% | 100% | 100% | 100% | 75% | 100% | 100% | 88% | 100%
Qwen14 General 32k (Large) | 75% | 100% | 100% | 75% | 50% | 75% | 75% | 75% | 100% | 75% | 75% | 75% | 50% | 100% | 100% | 100% | 88% | 100%
Qwen2.5 Coder 3B (Small) | 75% | 100% | 100% | 100% | 50% | 100% | 100% | 75% | 100% | 50% | 100% | 88% | 100% | 100% | 100% | 100% | 75% | 100%
Qwen2.5 Coder 1.5B (Small) | 75% | 100% | 100% | 100% | 50% | 100% | 100% | 50% | 100% | 75% | 100% | 88% | 75% | 100% | 100% | 100% | 88% | 100%
Qwen2.5 3B (Small) | 50% | 100% | 100% | 50% | 50% | 50% | 75% | 75% | 100% | 50% | 75% | 88% | 75% | 100% | 100% | 100% | 88% | 100%
Llama 3.2 3B (Small) | 75% | 100% | 100% | 100% | 100% | 75% | 100% | 75% | 100% | 75% | 75% | 75% | 50% | 100% | 100% | 100% | 75% | 100%
Phi-3 Mini (Small) | 50% | 100% | 100% | 25% | 50% | 50% | 100% | 100% | 75% | 75% | 75% | 75% | 75% | 75% | 100% | 100% | 75% | 100%
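One way to read the heatmap is by column: averaging each task type across all 11 models shows which categories were hardest overall. The sketch below does that with the values copied from the heatmap. The unweighted per-model mean is an assumption; the original page does not publish per-category averages.

```python
from statistics import mean

TASK_TYPES = [
    "analysis", "async", "cli", "concurrency", "config", "debugging",
    "file_io", "filesystem", "http", "logging", "package", "parsing",
    "refactor", "sqlite", "tests", "typing", "validation", "web",
]

# Primary-prompt marker coverage (percent) per model, in TASK_TYPES order.
HEATMAP = {
    "Qwen32 Coder 32k": [75, 100, 100, 100, 50, 50, 75, 100, 100, 50, 75, 75, 50, 100, 100, 100, 88, 100],
    "Qwen14 Coder 32k": [75, 100, 100, 100, 50, 75, 100, 100, 75, 50, 100, 75, 75, 100, 100, 100, 88, 100],
    "Codestral 32k": [75, 100, 100, 100, 50, 75, 75, 100, 100, 75, 100, 75, 75, 100, 100, 100, 100, 100],
    "CodeLlama 34 16k": [75, 75, 100, 75, 100, 50, 100, 50, 100, 75, 100, 62, 75, 75, 100, 100, 88, 100],
    "Phind 34 16k": [100, 100, 100, 75, 100, 75, 75, 100, 100, 75, 100, 100, 100, 75, 100, 100, 88, 100],
    "Qwen14 General 32k": [75, 100, 100, 75, 50, 75, 75, 75, 100, 75, 75, 75, 50, 100, 100, 100, 88, 100],
    "Qwen2.5 Coder 3B": [75, 100, 100, 100, 50, 100, 100, 75, 100, 50, 100, 88, 100, 100, 100, 100, 75, 100],
    "Qwen2.5 Coder 1.5B": [75, 100, 100, 100, 50, 100, 100, 50, 100, 75, 100, 88, 75, 100, 100, 100, 88, 100],
    "Qwen2.5 3B": [50, 100, 100, 50, 50, 50, 75, 75, 100, 50, 75, 88, 75, 100, 100, 100, 88, 100],
    "Llama 3.2 3B": [75, 100, 100, 100, 100, 75, 100, 75, 100, 75, 75, 75, 50, 100, 100, 100, 75, 100],
    "Phi-3 Mini": [50, 100, 100, 25, 50, 50, 100, 100, 75, 75, 75, 75, 75, 75, 100, 100, 75, 100],
}

# Unweighted mean coverage per task type across all models.
by_task = {
    task: mean(row[i] for row in HEATMAP.values())
    for i, task in enumerate(TASK_TYPES)
}

hardest = min(by_task, key=by_task.get)
print(f"Lowest average coverage: {hardest} ({by_task[hardest]:.1f}%)")
```

On these figures the `config` category has the lowest average coverage (about 63.6%), followed by `logging`, while `cli`, `tests`, `typing`, and `web` sit at 100% for every model.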