Overnight Telemetry

Python Model Report

This mini page condenses the overnight suite into a quick visual read: which models were fastest, which improved after the cold first prompt, and which Python task types each model handled best.

Host: vmi3206382
Finished: n/a
Source: 2026-04-13-qwen3_5_9b-vps50-python-task-suite-v1-20q.json
Avg primary latency: n/a
Avg follow-up throughput: n/a

Run: 2026-04-13T05:47:27Z (overnight Python suite start time)
Total Wall Time: 0.00 h (end-to-end runtime across all models)
Models: 0 (large and small models included)
Questions: 20 (primary prompts per model, plus follow-up summaries)
Avg Primary Quality: n/a (average marker coverage across the suite)

Headline Winners

These are the standout models for the metrics that matter most in the overnight run.

Primary Prompt Latency

Average time to answer the main Python task prompt. Lower is better.

No rows were available for this chart.

Follow-up Prompt Latency

Average time to answer the follow-up summary request. Lower is better.

No rows were available for this chart.

Primary Prompt Throughput

Average tokens per second while answering the main Python tasks. Higher is better.

No rows were available for this chart.
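As a rough illustration of how an average tokens-per-second figure like this could be derived, the sketch below averages per-prompt throughput. The function names and the `(completion_tokens, elapsed_seconds)` row shape are assumptions for illustration; the report does not state how the suite records or aggregates these values.

```python
from __future__ import annotations

from statistics import mean


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Throughput for one prompt; returns 0.0 when no elapsed time was recorded."""
    if elapsed_seconds <= 0:
        return 0.0
    return completion_tokens / elapsed_seconds


def average_throughput(rows: list[tuple[int, float]]) -> float | None:
    """Mean of per-prompt throughput (hypothetical aggregation).

    Returns None when there are no rows, which a page could render as "n/a".
    """
    rates = [tokens_per_second(tokens, seconds) for tokens, seconds in rows]
    return mean(rates) if rates else None


if __name__ == "__main__":
    # Two made-up prompts: 100 tokens in 2 s and 300 tokens in 3 s.
    print(average_throughput([(100, 2.0), (300, 3.0)]))
    print(average_throughput([]))
```

Averaging per-prompt rates weights every prompt equally. Dividing total tokens by total time would instead weight longer answers more heavily, so the two approaches can give different numbers.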

Follow-up Prompt Throughput

Average tokens per second on the summary requests. Higher is better.

No rows were available for this chart.

Primary Quality

Average marker coverage on the full Python task prompts. Higher is better.

No rows were available for this chart.
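The report does not define "marker coverage". One plausible reading, sketched here purely as an assumption, is the fraction of expected marker strings that appear in a model's answer. The function name, the case-insensitive substring match, and the sample markers are all hypothetical.

```python
from __future__ import annotations


def marker_coverage(response: str, markers: list[str]) -> float | None:
    """Fraction of expected markers found in the response (assumed definition).

    Matching is a case-insensitive substring check. Returns None when no
    markers are defined, so the caller can show "n/a" instead of a score.
    """
    if not markers:
        return None
    text = response.lower()
    hits = sum(1 for marker in markers if marker.lower() in text)
    return hits / len(markers)


if __name__ == "__main__":
    answer = "def add(a, b):\n    return a + b"
    # Hypothetical markers for a "write a function" task.
    print(marker_coverage(answer, ["def ", "return", "yield"]))
```

A substring check like this measures whether expected elements are mentioned, not whether the generated code is correct, so it is best read as a coarse quality proxy.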

Follow-up Quality

Average marker coverage on the follow-up summary prompts. Higher is better.

No rows were available for this chart.

Cold-Start Latency Delta

Positive values mean the first primary prompt was slower than the warmed-up average. Negative values mean the first prompt was actually faster.

No delta rows were available for this chart.
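The sign convention described above can be sketched as the first prompt's latency minus the mean latency of the remaining prompts. Treating the "warmed-up average" as the mean of every prompt after the first is an assumption, as is the function name; the report does not say exactly which prompts the average covers.

```python
from __future__ import annotations

from statistics import mean


def cold_start_latency_delta(latencies_s: list[float]) -> float | None:
    """First-prompt latency minus the mean of later prompts, in seconds.

    Positive: the cold first prompt was slower than the warmed-up average.
    Negative: the first prompt was faster.
    Returns None when fewer than two prompts are available.
    """
    if len(latencies_s) < 2:
        return None
    first, warmed = latencies_s[0], latencies_s[1:]
    return first - mean(warmed)


if __name__ == "__main__":
    # Made-up latencies: a slow first prompt followed by two faster ones.
    print(cold_start_latency_delta([9.0, 4.0, 6.0]))
```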

Cold-Start Throughput Delta

Positive values mean token generation sped up after the first primary prompt. Negative values mean throughput dropped on later prompts.

No delta rows were available for this chart.
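For throughput the sign is flipped relative to latency: a positive value means later prompts were faster. A minimal sketch, assuming the delta is the mean tokens per second of all prompts after the first minus the first prompt's tokens per second (the function name and this exact definition are illustrative, not taken from the suite):

```python
from __future__ import annotations

from statistics import mean


def cold_start_throughput_delta(tokens_per_s: list[float]) -> float | None:
    """Mean throughput of later prompts minus first-prompt throughput.

    Positive: generation sped up after the first prompt.
    Negative: throughput dropped on later prompts.
    Returns None when fewer than two prompts are available.
    """
    if len(tokens_per_s) < 2:
        return None
    first, warmed = tokens_per_s[0], tokens_per_s[1:]
    return mean(warmed) - first


if __name__ == "__main__":
    # Made-up rates: 10 tok/s on the cold prompt, then 14 and 16 tok/s.
    print(cold_start_throughput_delta([10.0, 14.0, 16.0]))
```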