Archive · vps-81 historical telemetry · python-overnight/2026-04-09-python-task-suite-mini.html. Originally rendered 2026-04-09. Re-hosted from MyServers on 2026-05-06. Methodology and harness conventions may differ from what we use today; see /methodology.html for current standards.
Overnight Telemetry

Python Model Report

This mini page turns the overnight suite into a quick visual read: which models were fastest, which ones improved after the cold first prompt, and which Python task types each model handled best.

Host: vmi3206382
Finished: 2026-04-10T01:43:55Z
Source: 2026-04-09-python-task-suite.json
Avg primary latency: 205.4 s
Avg follow-up throughput: 4.40 tok/s
Run: 2026-04-09T17:27:52Z (overnight Python suite start time)
Total Wall Time: 8.27 h (end-to-end runtime across all models)
Models: 11 (large and small models included)
Questions: 10 (primary prompts per model, plus follow-up summaries)
Avg Primary Quality: 79% (average marker coverage across the suite)
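
The total wall time can be cross-checked against the run start and finish timestamps listed above. The following is a minimal sketch, not the harness's own code; the function name `wall_time_hours` is an assumption made for illustration.

```python
from datetime import datetime


def wall_time_hours(start_iso: str, finish_iso: str) -> float:
    """Return the elapsed hours between two ISO 8601 UTC timestamps."""
    # Swap the trailing "Z" for an explicit offset so that fromisoformat
    # also accepts the string on Python versions older than 3.11.
    start = datetime.fromisoformat(start_iso.replace("Z", "+00:00"))
    finish = datetime.fromisoformat(finish_iso.replace("Z", "+00:00"))
    return (finish - start).total_seconds() / 3600


hours = wall_time_hours("2026-04-09T17:27:52Z", "2026-04-10T01:43:55Z")
print(f"{hours:.2f} h")  # 8.27 h
```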

Headline Winners

These are the standout models for the metrics that matter most in the overnight run.

Fastest Primary

Qwen2.5 Coder 1.5B

45.4 s

Small model with the lowest average primary-prompt latency.

Fastest Follow-up

Qwen2.5 Coder 1.5B

10.4 s

Quickest summary-request average once the model already had context.

Best Primary Quality

Qwen32 Coder 32k

93%

Highest average marker coverage on the primary Python task prompts.

Best Follow-up Quality

Qwen32 Coder 32k

100%

Full marker coverage on the follow-up summary prompts, tied with five other models.

Biggest Warmup Gain

CodeLlama 34 16k

74.2 s

Largest latency drop between the cold first primary prompt and the warmed-up average.

Biggest Warmup Throughput Gain

Llama 3.2 3B

1.25 tok/s

How much token generation speed improved after the first primary prompt.
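
Each headline card is a minimum or maximum taken over the per-model averages. The sketch below shows that selection on three rows copied from the Model Overview table; the dictionary layout and key names are assumptions for illustration, not the harness's data format.

```python
# Per-model averages copied from the Model Overview table (three models only).
# The dictionary layout and key names are illustrative assumptions.
models = {
    "Qwen32 Coder 32k": {"primary_s": 418.9, "quality_pct": 93},
    "Qwen14 General 32k": {"primary_s": 156.6, "quality_pct": 85},
    "Qwen2.5 Coder 1.5B": {"primary_s": 45.4, "quality_pct": 74},
}

# Lower latency is better, higher quality is better.
fastest = min(models, key=lambda name: models[name]["primary_s"])
best_quality = max(models, key=lambda name: models[name]["quality_pct"])

print(fastest)       # Qwen2.5 Coder 1.5B
print(best_quality)  # Qwen32 Coder 32k
```

Note that `min` and `max` return the first entry they encounter on a tie, so a metric where several models share the top score (such as follow-up quality) needs an explicit tie-breaking rule.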

Primary Prompt Latency

Average time to answer the main Python task prompt. Lower is better.

Qwen2.5 Coder 1.5B (Small): 45.4 s
Qwen2.5 Coder 3B (Small): 54.2 s
Qwen2.5 3B (Small): 62.9 s
Llama 3.2 3B (Small): 65.2 s
Phi-3 Mini (Small): 111.4 s
Qwen14 General 32k (Large): 156.6 s
Qwen14 Coder 32k (Large): 231.0 s
Codestral 32k (Large): 292.5 s
Phind 34 16k (Large): 405.9 s
CodeLlama 34 16k (Large): 415.3 s
Qwen32 Coder 32k (Large): 418.9 s

Follow-up Prompt Latency

Average time to answer the follow-up summary request. Lower is better.

Qwen2.5 Coder 1.5B (Small): 10.4 s
Qwen2.5 Coder 3B (Small): 13.5 s
Qwen2.5 3B (Small): 20.5 s
Llama 3.2 3B (Small): 20.5 s
Phi-3 Mini (Small): 35.8 s
Qwen14 General 32k (Large): 40.5 s
Qwen14 Coder 32k (Large): 76.0 s
Phind 34 16k (Large): 81.2 s
Qwen32 Coder 32k (Large): 101.9 s
CodeLlama 34 16k (Large): 137.4 s
Codestral 32k (Large): 179.0 s

Primary Prompt Throughput

Average tokens per second while answering the main Python tasks. Higher is better.

Qwen2.5 Coder 1.5B (Small): 10.61 tok/s
Qwen2.5 Coder 3B (Small): 7.91 tok/s
Qwen2.5 3B (Small): 7.25 tok/s
Llama 3.2 3B (Small): 6.72 tok/s
Phi-3 Mini (Small): 6.16 tok/s
Qwen14 General 32k (Large): 2.59 tok/s
Qwen14 Coder 32k (Large): 1.93 tok/s
Codestral 32k (Large): 1.59 tok/s
Phind 34 16k (Large): 1.41 tok/s
CodeLlama 34 16k (Large): 1.38 tok/s
Qwen32 Coder 32k (Large): 0.97 tok/s
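
Throughput here is generated tokens divided by elapsed seconds. The page does not say whether the per-model average is a mean of per-prompt rates or total tokens over total time, so the sketch below computes both to show that they can differ. The sample numbers are hypothetical, not measurements from this run.

```python
from statistics import mean


def tokens_per_second(token_count: int, elapsed_s: float) -> float:
    """Return the generation rate for a single prompt."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return token_count / elapsed_s


# Hypothetical (tokens, seconds) pairs for three prompts; not measured data.
samples = [(400, 200.0), (300, 100.0), (500, 250.0)]

rates = [tokens_per_second(tokens, seconds) for tokens, seconds in samples]
mean_of_rates = mean(rates)  # every prompt weighted equally
pooled = sum(t for t, _ in samples) / sum(s for _, s in samples)  # weighted by time

print(f"{mean_of_rates:.2f} tok/s")  # 2.33 tok/s
print(f"{pooled:.2f} tok/s")         # 2.18 tok/s
```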

Follow-up Prompt Throughput

Average tokens per second on the summary requests. Higher is better.

Qwen2.5 Coder 1.5B (Small): 11.64 tok/s
Qwen2.5 Coder 3B (Small): 8.98 tok/s
Qwen2.5 3B (Small): 6.43 tok/s
Llama 3.2 3B (Small): 6.35 tok/s
Phi-3 Mini (Small): 5.68 tok/s
Qwen14 General 32k (Large): 2.22 tok/s
Qwen14 Coder 32k (Large): 1.79 tok/s
Codestral 32k (Large): 1.53 tok/s
Phind 34 16k (Large): 1.42 tok/s
CodeLlama 34 16k (Large): 1.39 tok/s
Qwen32 Coder 32k (Large): 0.96 tok/s

Primary Quality

Average marker coverage on the full Python task prompts. Higher is better.

Qwen32 Coder 32k (Large): 93%
Qwen14 Coder 32k (Large): 86%
Phind 34 16k (Large): 86%
Codestral 32k (Large): 85%
Qwen2.5 3B (Small): 85%
Qwen14 General 32k (Large): 85%
Llama 3.2 3B (Small): 83%
Qwen2.5 Coder 1.5B (Small): 74%
Qwen2.5 Coder 3B (Small): 73%
Phi-3 Mini (Small): 65%
CodeLlama 34 16k (Large): 58%
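
Marker coverage can be read as the share of expected markers that show up in a model's answer. The page does not describe how the harness matches markers, so the sketch below assumes case-insensitive substring matching; the function name, answer text, and marker list are all hypothetical.

```python
def marker_coverage(answer: str, markers: list[str]) -> float:
    """Return the fraction of expected markers found in the answer.

    Assumes case-insensitive substring matching, which may differ from
    the matching rule the actual harness uses.
    """
    if not markers:
        raise ValueError("markers must not be empty")
    text = answer.lower()
    hits = sum(1 for marker in markers if marker.lower() in text)
    return hits / len(markers)


# Hypothetical answer and marker list, for illustration only.
answer = "Add a pytest fixture and assert that the call raises ValueError."
markers = ["pytest", "fixture", "ValueError", "parametrize"]

coverage = marker_coverage(answer, markers)
print(f"{coverage:.0%}")  # 75%
```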

Follow-up Quality

Average marker coverage on the follow-up summary prompts. Higher is better.

Qwen32 Coder 32k (Large): 100%
Codestral 32k (Large): 100%
Phind 34 16k (Large): 100%
Qwen14 General 32k (Large): 100%
Qwen2.5 3B (Small): 100%
Llama 3.2 3B (Small): 100%
Qwen14 Coder 32k (Large): 50%
Phi-3 Mini (Small): 7%
CodeLlama 34 16k (Large): 0%
Qwen2.5 Coder 3B (Small): 0%
Qwen2.5 Coder 1.5B (Small): 0%

Cold-Start Latency Delta

Positive values mean the first primary prompt was slower than the warmed-up average. Negative values mean the first prompt was actually faster.

Legend: positive = later prompts got faster; negative = later prompts got slower.
CodeLlama 34 16k (Large): 74.2 s, later prompts got faster
Qwen32 Coder 32k (Large): 49.9 s, later prompts got faster
Qwen14 General 32k (Large): 27.5 s, later prompts got faster
Phi-3 Mini (Small): 19.9 s, later prompts got faster
Qwen2.5 Coder 3B (Small): -2.2 s, later prompts got slower
Qwen14 Coder 32k (Large): -13.3 s, later prompts got slower
Llama 3.2 3B (Small): -14.9 s, later prompts got slower
Qwen2.5 Coder 1.5B (Small): -23.3 s, later prompts got slower
Codestral 32k (Large): -26.6 s, later prompts got slower
Qwen2.5 3B (Small): -26.7 s, later prompts got slower
Phind 34 16k (Large): -62.1 s, later prompts got slower
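
The latency delta described above can be sketched as the first prompt's latency minus the mean of the remaining prompts. This assumes the "warmed-up average" excludes the first prompt, which the page does not state explicitly; the function name and sample latencies are hypothetical.

```python
from statistics import mean


def cold_start_latency_delta(latencies_s: list[float]) -> float:
    """Return first-prompt latency minus the mean latency of later prompts.

    Assumes the warmed-up average excludes the first (cold) prompt.
    A positive result means the later prompts were faster.
    """
    if len(latencies_s) < 2:
        raise ValueError("need the cold prompt plus at least one warm prompt")
    first, *later = latencies_s
    return first - mean(later)


# Hypothetical primary latencies in seconds; the first entry is the cold request.
latencies = [120.0, 90.0, 100.0, 110.0]
delta = cold_start_latency_delta(latencies)
print(f"{delta:+.1f} s")  # +20.0 s
```

For a throughput delta with the same sign convention (positive means later prompts were faster), the subtraction would run the other way: warm mean minus the first prompt's rate.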

Cold-Start Throughput Delta

Positive values mean token generation sped up after the first primary prompt. Negative values mean throughput dropped on later prompts.

Legend: positive = later prompts got faster; negative = later prompts got slower.
Llama 3.2 3B (Small): 1.25 tok/s, later prompts got faster
Phi-3 Mini (Small): 0.91 tok/s, later prompts got faster
Qwen2.5 Coder 3B (Small): 0.82 tok/s, later prompts got faster
Qwen14 General 32k (Large): 0.41 tok/s, later prompts got faster
Qwen14 Coder 32k (Large): 0.12 tok/s, later prompts got faster
CodeLlama 34 16k (Large): 0.10 tok/s, later prompts got faster
Codestral 32k (Large): 0.10 tok/s, later prompts got faster
Qwen32 Coder 32k (Large): 0.08 tok/s, later prompts got faster
Phind 34 16k (Large): -0.04 tok/s, later prompts got slower
Qwen2.5 3B (Small): -1.84 tok/s, later prompts got slower
Qwen2.5 Coder 1.5B (Small): -3.23 tok/s, later prompts got slower

Model Overview

This is the quick scan: speed, throughput, primary quality, usable answers, and the cold-start latency gap for each model.

| Model | Size | Primary Avg | Follow-up Avg | Primary tok/s | Follow-up tok/s | Primary Quality | Usable | Cold Gain |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen32 Coder 32k | Large | 418.9 s | 101.9 s | 0.97 tok/s | 0.96 tok/s | 93% | 10/10 | 49.9 s |
| Qwen14 Coder 32k | Large | 231.0 s | 76.0 s | 1.93 tok/s | 1.79 tok/s | 86% | 10/10 | -13.3 s |
| Codestral 32k | Large | 292.5 s | 179.0 s | 1.59 tok/s | 1.53 tok/s | 85% | 10/10 | -26.6 s |
| CodeLlama 34 16k | Large | 415.3 s | 137.4 s | 1.38 tok/s | 1.39 tok/s | 58% | 10/10 | 74.2 s |
| Phind 34 16k | Large | 405.9 s | 81.2 s | 1.41 tok/s | 1.42 tok/s | 86% | 10/10 | -62.1 s |
| Qwen14 General 32k | Large | 156.6 s | 40.5 s | 2.59 tok/s | 2.22 tok/s | 85% | 10/10 | 27.5 s |
| Qwen2.5 Coder 3B | Small | 54.2 s | 13.5 s | 7.91 tok/s | 8.98 tok/s | 73% | 10/10 | -2.2 s |
| Qwen2.5 Coder 1.5B | Small | 45.4 s | 10.4 s | 10.61 tok/s | 11.64 tok/s | 74% | 10/10 | -23.3 s |
| Qwen2.5 3B | Small | 62.9 s | 20.5 s | 7.25 tok/s | 6.43 tok/s | 85% | 10/10 | -26.7 s |
| Llama 3.2 3B | Small | 65.2 s | 20.5 s | 6.72 tok/s | 6.35 tok/s | 83% | 10/10 | -14.9 s |
| Phi-3 Mini | Small | 111.4 s | 35.8 s | 6.16 tok/s | 5.68 tok/s | 65% | 10/10 | 19.9 s |
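
The suite-level figures in the page header can be reproduced from the table above as plain means over the 11 models. This is a quick consistency check rather than the harness's own aggregation; the variable names are illustrative.

```python
from statistics import mean

# Primary Avg (s) and Primary Quality (%) per model, copied from the table above.
primary_avg_s = [418.9, 231.0, 292.5, 415.3, 405.9, 156.6, 54.2, 45.4, 62.9, 65.2, 111.4]
primary_quality_pct = [93, 86, 85, 58, 86, 85, 73, 74, 85, 83, 65]

print(f"{mean(primary_avg_s):.1f} s")       # 205.4 s
print(f"{mean(primary_quality_pct):.0f}%")  # 79%
```

Both results match the header's "Avg primary latency" and "Avg Primary Quality" values, which suggests those figures weight every model equally.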

Python Task-Type Heatmap

Values show primary-prompt marker coverage by task type. In the rendered heatmap, greener cells mean the model hit more of the requested Python requirements for that category.

| Model | Size | cross_repo_debugging | debugging | forensics | planning | review | tests |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen32 Coder 32k | Large | 100% | 100% | 100% | 100% | 100% | 82% |
| Qwen14 Coder 32k | Large | 83% | 92% | 100% | 83% | 100% | 79% |
| Codestral 32k | Large | 100% | 92% | 100% | 67% | 100% | 75% |
| CodeLlama 34 16k | Large | 67% | 44% | 100% | 50% | 17% | 64% |
| Phind 34 16k | Large | 83% | 92% | 100% | 67% | 100% | 82% |
| Qwen14 General 32k | Large | 83% | 100% | 100% | 100% | 50% | 79% |
| Qwen2.5 Coder 3B | Small | 83% | 75% | 100% | 50% | 100% | 61% |
| Qwen2.5 Coder 1.5B | Small | 100% | 58% | 67% | 67% | 100% | 71% |
| Qwen2.5 3B | Small | 83% | 92% | 100% | 100% | 83% | 75% |
| Llama 3.2 3B | Small | 83% | 67% | 100% | 100% | 100% | 79% |
| Phi-3 Mini | Small | 100% | 62% | 50% | 67% | 50% | 64% |
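
A per-task-type view like this one can be built by grouping each question's coverage by its task type and averaging within each group. The sketch below assumes results arrive as `(task_type, coverage)` pairs; that record shape, the function name, and the sample values are hypothetical and not taken from the run's JSON.

```python
from collections import defaultdict
from statistics import mean


def coverage_by_task_type(results):
    """Average coverage per task type from (task_type, coverage) pairs."""
    grouped = defaultdict(list)
    for task_type, coverage in results:
        grouped[task_type].append(coverage)
    return {task_type: mean(values) for task_type, values in grouped.items()}


# Hypothetical per-question coverage for one model; not taken from the run.
results = [("debugging", 1.0), ("debugging", 0.5), ("tests", 0.75), ("review", 1.0)]

for task_type, value in coverage_by_task_type(results).items():
    print(f"{task_type}: {value:.0%}")
```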