Overnight Telemetry

Python Model Report

This mini page turns the overnight suite into a quick visual read: which models were fastest, which ones improved after the cold first prompt, and which Python task types each model handled best.

Host: vmi3206382
Finished: 2026-04-10T01:43:55Z
Source: 2026-04-09-python-task-suite.json
Avg primary latency: 205.4 s
Avg follow-up throughput: 4.40 tok/s

Run: 2026-04-09T17:27:52Z (overnight Python suite start time)

Total Wall Time: 8.27 h (end-to-end runtime across all models)

Models: 11 (large and small models included)

Questions: 10 (primary prompts per model, plus follow-up summaries)

Avg Primary Quality: 79% (average marker coverage across the suite)
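The total wall time can be cross-checked against the run's start and finish timestamps. The sketch below is one way to do that; the function name `wall_time_hours` is an illustrative assumption, not something taken from the suite itself.

```python
from datetime import datetime


def wall_time_hours(start_iso: str, end_iso: str) -> float:
    """Return elapsed hours between two ISO-8601 UTC timestamps ending in 'Z'."""
    # Older Python versions reject a trailing 'Z' in fromisoformat(),
    # so it is replaced with an explicit UTC offset first.
    start = datetime.fromisoformat(start_iso.replace("Z", "+00:00"))
    end = datetime.fromisoformat(end_iso.replace("Z", "+00:00"))
    return (end - start).total_seconds() / 3600


# Start and finish timestamps as reported on this page.
hours = wall_time_hours("2026-04-09T17:27:52Z", "2026-04-10T01:43:55Z")
print(f"{hours:.2f} h")
```

The result rounds to the 8.27 h shown in the summary.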

Headline Winners

These are the standout models for the metrics that matter most in the overnight run.

Fastest Primary

Qwen2.5 Coder 1.5B

45.4 s

Small model with the quickest average first-answer pass.

Fastest Follow-up

Qwen2.5 Coder 1.5B

10.4 s

Quickest summary-request average once the model already had context.

Best Primary Quality

Qwen32 Coder 32k

93%

Highest average hit-rate on the full Python task requirements.

Best Follow-up Quality

Qwen32 Coder 32k

100%

Perfect marker coverage on the follow-up summary prompts, a score shared with five other models.

Biggest Warmup Gain

CodeLlama 34 16k

74.2 s

How much faster the later primary prompts ran compared with the first, cold request.

Biggest Warmup Throughput Gain

Llama 3.2 3B

1.25 tok/s

How much token generation speed improved after the first primary prompt.

Primary Prompt Latency

Average time to answer the main Python task prompt. Lower is better.

Qwen2.5 Coder 1.5B Small

45.4 s

Qwen2.5 Coder 3B Small

54.2 s

Qwen2.5 3B Small

62.9 s

Llama 3.2 3B Small

65.2 s

Phi-3 Mini Small

111.4 s

Qwen14 General 32k Large

156.6 s

Qwen14 Coder 32k Large

231.0 s

Codestral 32k Large

292.5 s

Phind 34 16k Large

405.9 s

CodeLlama 34 16k Large

415.3 s

Qwen32 Coder 32k Large

418.9 s

Follow-up Prompt Latency

Average time to answer the follow-up summary request. Lower is better.

Qwen2.5 Coder 1.5B Small

10.4 s

Qwen2.5 Coder 3B Small

13.5 s

Qwen2.5 3B Small

20.5 s

Llama 3.2 3B Small

20.5 s

Phi-3 Mini Small

35.8 s

Qwen14 General 32k Large

40.5 s

Qwen14 Coder 32k Large

76.0 s

Phind 34 16k Large

81.2 s

Qwen32 Coder 32k Large

101.9 s

CodeLlama 34 16k Large

137.4 s

Codestral 32k Large

179.0 s

Primary Prompt Throughput

Average tokens per second while answering the main Python tasks. Higher is better.

Qwen2.5 Coder 1.5B Small

10.61 tok/s

Qwen2.5 Coder 3B Small

7.91 tok/s

Qwen2.5 3B Small

7.25 tok/s

Llama 3.2 3B Small

6.72 tok/s

Phi-3 Mini Small

6.16 tok/s

Qwen14 General 32k Large

2.59 tok/s

Qwen14 Coder 32k Large

1.93 tok/s

Codestral 32k Large

1.59 tok/s

Phind 34 16k Large

1.41 tok/s

CodeLlama 34 16k Large

1.38 tok/s

Qwen32 Coder 32k Large

0.97 tok/s

Follow-up Prompt Throughput

Average tokens per second on the summary requests. Higher is better.

Qwen2.5 Coder 1.5B Small

11.64 tok/s

Qwen2.5 Coder 3B Small

8.98 tok/s

Qwen2.5 3B Small

6.43 tok/s

Llama 3.2 3B Small

6.35 tok/s

Phi-3 Mini Small

5.68 tok/s

Qwen14 General 32k Large

2.22 tok/s

Qwen14 Coder 32k Large

1.79 tok/s

Codestral 32k Large

1.53 tok/s

Phind 34 16k Large

1.42 tok/s

CodeLlama 34 16k Large

1.39 tok/s

Qwen32 Coder 32k Large

0.96 tok/s

Primary Quality

Average marker coverage on the full Python task prompts. Higher is better.

Qwen32 Coder 32k Large

93%

Qwen14 Coder 32k Large

86%

Phind 34 16k Large

86%

Codestral 32k Large

85%

Qwen2.5 3B Small

85%

Qwen14 General 32k Large

85%

Llama 3.2 3B Small

83%

Qwen2.5 Coder 1.5B Small

74%

Qwen2.5 Coder 3B Small

73%

Phi-3 Mini Small

65%

CodeLlama 34 16k Large

58%
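The page does not spell out how marker coverage is scored. As a rough sketch, assuming each task defines a list of expected marker strings and a response is scored by the fraction of markers it contains, the calculation could look like this. The function name, the case-insensitive substring match, and the sample markers are all assumptions made for illustration.

```python
def marker_coverage(response: str, markers: list[str]) -> float:
    """Fraction of expected markers found in the response (hypothetical scoring rule)."""
    if not markers:
        return 0.0
    text = response.lower()
    # Count each marker once if it appears anywhere in the response.
    hits = sum(1 for marker in markers if marker.lower() in text)
    return hits / len(markers)


# Made-up response and markers, for illustration only.
answer = "Use pytest fixtures and add a regression test for the off-by-one bug."
expected = ["pytest", "fixture", "regression", "type hints"]
print(f"{marker_coverage(answer, expected):.0%}")
```

Under this assumption, a per-model quality figure would be the mean of these per-prompt fractions.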

Follow-up Quality

Average marker coverage on the follow-up summary prompts. Higher is better.

Qwen32 Coder 32k Large

100%

Codestral 32k Large

100%

Phind 34 16k Large

100%

Qwen14 General 32k Large

100%

Qwen2.5 3B Small

100%

Llama 3.2 3B Small

100%

Qwen14 Coder 32k Large

50%

Phi-3 Mini Small

CodeLlama 34 16k Large

Qwen2.5 Coder 3B Small

Qwen2.5 Coder 1.5B Small

Cold-Start Latency Delta

Positive values mean the first primary prompt was slower than the warmed-up average. Negative values mean the first prompt was actually faster.

Legend: positive = later prompts got faster; negative = later prompts got slower.

CodeLlama 34 16k Large

74.2 s Later prompts got faster

Qwen32 Coder 32k Large

49.9 s Later prompts got faster

Qwen14 General 32k Large

27.5 s Later prompts got faster

Phi-3 Mini Small

19.9 s Later prompts got faster

Qwen2.5 Coder 3B Small

-2.2 s Later prompts got slower

Qwen14 Coder 32k Large

-13.3 s Later prompts got slower

Llama 3.2 3B Small

-14.9 s Later prompts got slower

Qwen2.5 Coder 1.5B Small

-23.3 s Later prompts got slower

Codestral 32k Large

-26.6 s Later prompts got slower

Qwen2.5 3B Small

-26.7 s Later prompts got slower

Phind 34 16k Large

-62.1 s Later prompts got slower
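The latency delta described above can be sketched as the first primary latency minus the mean of the remaining primaries. Treating the "warmed-up average" as the mean of prompts 2 through N is an assumption here, and the sample latencies are invented for the example.

```python
from statistics import mean


def cold_start_latency_delta(latencies_s: list[float]) -> float:
    """First-prompt latency minus the mean of all later prompts, in seconds.

    Positive: the cold first prompt was slower than the later ones.
    Negative: the first prompt was faster than the later ones.
    """
    if len(latencies_s) < 2:
        raise ValueError("need the first prompt plus at least one later prompt")
    first, later = latencies_s[0], latencies_s[1:]
    return first - mean(later)


# Hypothetical per-prompt latencies for one model, in run order.
sample = [120.0, 90.0, 100.0, 110.0]
print(f"{cold_start_latency_delta(sample):+.1f} s")
```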

Cold-Start Throughput Delta

Positive values mean token generation sped up after the first primary prompt. Negative values mean throughput dropped on later prompts.

Legend: positive = later prompts got faster; negative = later prompts got slower.

Llama 3.2 3B Small

1.25 tok/s Later prompts got faster

Phi-3 Mini Small

0.91 tok/s Later prompts got faster

Qwen2.5 Coder 3B Small

0.82 tok/s Later prompts got faster

Qwen14 General 32k Large

0.41 tok/s Later prompts got faster

Qwen14 Coder 32k Large

0.12 tok/s Later prompts got faster

CodeLlama 34 16k Large

0.10 tok/s Later prompts got faster

Codestral 32k Large

0.10 tok/s Later prompts got faster

Qwen32 Coder 32k Large

0.08 tok/s Later prompts got faster

Phind 34 16k Large

-0.04 tok/s Later prompts got slower

Qwen2.5 3B Small

-1.84 tok/s Later prompts got slower

Qwen2.5 Coder 1.5B Small

-3.23 tok/s Later prompts got slower
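The throughput delta flips the sign convention of the latency delta: later prompts minus the first. A minimal sketch follows, assuming each prompt records a generated-token count and a duration. The `(tokens, seconds)` layout and the sample figures are assumptions, not the schema of the source JSON.

```python
from statistics import mean


def tokens_per_second(tokens: int, seconds: float) -> float:
    """Generation rate for one prompt; returns 0.0 for a non-positive duration."""
    return tokens / seconds if seconds > 0 else 0.0


def cold_start_throughput_delta(runs: list[tuple[int, float]]) -> float:
    """Mean tok/s of later prompts minus tok/s of the first prompt.

    Positive: generation sped up after the first prompt.
    Negative: throughput dropped on later prompts.
    """
    if len(runs) < 2:
        raise ValueError("need the first prompt plus at least one later prompt")
    rates = [tokens_per_second(tokens, seconds) for tokens, seconds in runs]
    return mean(rates[1:]) - rates[0]


# Hypothetical (tokens, seconds) pairs for one model, in run order.
sample_runs = [(400, 100.0), (450, 90.0), (550, 110.0)]
print(f"{cold_start_throughput_delta(sample_runs):+.2f} tok/s")
```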

Model Overview

This is the quick scan: speed, throughput, primary quality, usable answers, and the cold-start latency gap for each model.

Model (size class)	Primary Avg	Follow-up Avg	Primary tok/s	Follow-up tok/s	Primary Quality	Usable Answers	Cold-Start Latency Delta
Qwen32 Coder 32k Large	418.9 s	101.9 s	0.97 tok/s	0.96 tok/s	93%	10/10	49.9 s
Qwen14 Coder 32k Large	231.0 s	76.0 s	1.93 tok/s	1.79 tok/s	86%	10/10	-13.3 s
Codestral 32k Large	292.5 s	179.0 s	1.59 tok/s	1.53 tok/s	85%	10/10	-26.6 s
CodeLlama 34 16k Large	415.3 s	137.4 s	1.38 tok/s	1.39 tok/s	58%	10/10	74.2 s
Phind 34 16k Large	405.9 s	81.2 s	1.41 tok/s	1.42 tok/s	86%	10/10	-62.1 s
Qwen14 General 32k Large	156.6 s	40.5 s	2.59 tok/s	2.22 tok/s	85%	10/10	27.5 s
Qwen2.5 Coder 3B Small	54.2 s	13.5 s	7.91 tok/s	8.98 tok/s	73%	10/10	-2.2 s
Qwen2.5 Coder 1.5B Small	45.4 s	10.4 s	10.61 tok/s	11.64 tok/s	74%	10/10	-23.3 s
Qwen2.5 3B Small	62.9 s	20.5 s	7.25 tok/s	6.43 tok/s	85%	10/10	-26.7 s
Llama 3.2 3B Small	65.2 s	20.5 s	6.72 tok/s	6.35 tok/s	83%	10/10	-14.9 s
Phi-3 Mini Small	111.4 s	35.8 s	6.16 tok/s	5.68 tok/s	65%	10/10	19.9 s
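The suite-wide averages in the page header appear to be plain means of the per-model figures in this table. The sketch below recomputes them from the table values; the tuple layout and variable names are chosen only for this example.

```python
from statistics import mean

# (model, primary avg latency in s, follow-up tok/s, primary quality in %),
# copied from the Model Overview table.
ROWS = [
    ("Qwen32 Coder 32k", 418.9, 0.96, 93),
    ("Qwen14 Coder 32k", 231.0, 1.79, 86),
    ("Codestral 32k", 292.5, 1.53, 85),
    ("CodeLlama 34 16k", 415.3, 1.39, 58),
    ("Phind 34 16k", 405.9, 1.42, 86),
    ("Qwen14 General 32k", 156.6, 2.22, 85),
    ("Qwen2.5 Coder 3B", 54.2, 8.98, 73),
    ("Qwen2.5 Coder 1.5B", 45.4, 11.64, 74),
    ("Qwen2.5 3B", 62.9, 6.43, 85),
    ("Llama 3.2 3B", 65.2, 6.35, 83),
    ("Phi-3 Mini", 111.4, 5.68, 65),
]

# Unweighted means across the 11 models.
avg_latency = mean([row[1] for row in ROWS])
avg_followup_tps = mean([row[2] for row in ROWS])
avg_quality = mean([row[3] for row in ROWS])

print(f"Avg primary latency: {avg_latency:.1f} s")
print(f"Avg follow-up throughput: {avg_followup_tps:.2f} tok/s")
print(f"Avg primary quality: {avg_quality:.0f}%")
```

These reproduce the 205.4 s, 4.40 tok/s, and 79% reported at the top of the page.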

Python Task-Type Heatmap

Values show primary-prompt marker coverage by task type. Higher percentages (shaded green in the original heatmap) mean the model hit more of the requested Python requirements for that category.

Model	cross_repo_debugging	debugging	forensics	planning	review	tests
Qwen32 Coder 32k Large	100%	100%	100%	100%	100%	82%
Qwen14 Coder 32k Large	83%	92%	100%	83%	100%	79%
Codestral 32k Large	100%	92%	100%	67%	100%	75%
CodeLlama 34 16k Large	67%	44%	100%	50%	17%	64%
Phind 34 16k Large	83%	92%	100%	67%	100%	82%
Qwen14 General 32k Large	83%	100%	100%	100%	50%	79%
Qwen2.5 Coder 3B Small	83%	75%	100%	50%	100%	61%
Qwen2.5 Coder 1.5B Small	100%	58%	67%	67%	100%	71%
Qwen2.5 3B Small	83%	92%	100%	100%	83%	75%
Llama 3.2 3B Small	83%	67%	100%	100%	100%	79%
Phi-3 Mini Small	100%	62%	50%	67%	50%	64%
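To read the heatmap programmatically, one option is to list each model's best and weakest task types and to average each column across models. The sketch below uses the table values as-is; the helper name `extremes` and the unweighted column mean are choices made for this example.

```python
from statistics import mean

TASK_TYPES = ["cross_repo_debugging", "debugging", "forensics",
              "planning", "review", "tests"]

# Primary-prompt marker coverage (%) per model, in TASK_TYPES order,
# copied from the heatmap table.
COVERAGE = {
    "Qwen32 Coder 32k": [100, 100, 100, 100, 100, 82],
    "Qwen14 Coder 32k": [83, 92, 100, 83, 100, 79],
    "Codestral 32k": [100, 92, 100, 67, 100, 75],
    "CodeLlama 34 16k": [67, 44, 100, 50, 17, 64],
    "Phind 34 16k": [83, 92, 100, 67, 100, 82],
    "Qwen14 General 32k": [83, 100, 100, 100, 50, 79],
    "Qwen2.5 Coder 3B": [83, 75, 100, 50, 100, 61],
    "Qwen2.5 Coder 1.5B": [100, 58, 67, 67, 100, 71],
    "Qwen2.5 3B": [83, 92, 100, 100, 83, 75],
    "Llama 3.2 3B": [83, 67, 100, 100, 100, 79],
    "Phi-3 Mini": [100, 62, 50, 67, 50, 64],
}


def extremes(scores: list[int]) -> tuple[list[str], list[str]]:
    """Return (best, weakest) task types for one model, keeping ties."""
    high, low = max(scores), min(scores)
    best = [t for t, s in zip(TASK_TYPES, scores) if s == high]
    weakest = [t for t, s in zip(TASK_TYPES, scores) if s == low]
    return best, weakest


# Unweighted mean coverage per task type across all models.
task_means = {
    task: mean(row[i] for row in COVERAGE.values())
    for i, task in enumerate(TASK_TYPES)
}
hardest = min(task_means, key=task_means.get)
easiest = max(task_means, key=task_means.get)

for model, scores in COVERAGE.items():
    best, weakest = extremes(scores)
    print(f"{model}: best={best} weakest={weakest}")
print(f"Lowest average coverage: {hardest}; highest: {easiest}")
```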