Archive · vps-81 historical telemetry · prompt-handoff-codegen/prompt-handoff-codegen-2026-04-20_13-08-27/report.html. Originally rendered 2026-04-20. Re-hosted from MyServers on 2026-05-06. Methodology and harness conventions may differ from what we use today; see /methodology.html for current standards.
Prompt Handoff Benchmark

Direct vs cascade codegen, with actual takeaways

This report compares asking each local model directly against asking a smaller model first and then handing its draft to a larger refiner. It is focused on the Python script benchmark in this run and pulls out the quality, speed, and failure patterns that the raw JSON actually supports.
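The two paths being compared can be sketched roughly as follows. This is a minimal illustration and not the harness that produced this report: `generate` is a hypothetical callable standing in for whatever client talks to the local models, and the wording of the refiner prompt is an assumption.

```python
import time
from typing import Callable, Tuple

# Hypothetical signature: generate(model_name, prompt) -> generated text.
Generate = Callable[[str, str], str]


def run_direct(generate: Generate, model: str, prompt: str) -> Tuple[str, float]:
    """Ask one model directly; return its output and wall time in seconds."""
    start = time.perf_counter()
    output = generate(model, prompt)
    return output, time.perf_counter() - start


def run_cascade(
    generate: Generate, small: str, large: str, prompt: str
) -> Tuple[str, float]:
    """Draft with the smaller model, then hand the draft to the larger refiner."""
    start = time.perf_counter()
    draft = generate(small, prompt)
    # Assumed handoff format: the original task followed by the draft to refine.
    refine_prompt = (
        f"{prompt}\n\n"
        "A smaller model produced the draft below. Fix and complete it.\n\n"
        f"{draft}"
    )
    refined = generate(large, refine_prompt)
    # Pipeline wall time covers both calls, which is why a cascade can only
    # win on speed if the refiner call becomes much shorter than a direct call.
    return refined, time.perf_counter() - start
```

Passing `generate` in as a parameter keeps the sketch independent of any particular model server and makes it easy to exercise with a stub.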

Started: 2026-04-20T13:08:27Z
Finished: 2026-04-20T13:10:53Z
Best direct: qwen2.5-coder:1.5b (50)
Best cascade: qwen3:0.6b -> qwen2.5-coder:1.5b (40)
Best Direct
qwen2.5-coder:1.5b

Score 50

Models
2

Direct baselines in this run

Pairs
1

Smaller -> larger handoff comparisons

Cascade Quality Wins
0/1

Pairs that beat direct larger-model quality

Cascade Speed Wins
0/1

Pairs faster than direct larger-model prompts

Avg Latency Tax
45.6s

Extra wall time added by the cascade pipeline

Findings

What this run actually says

These are conclusions inferred from the run output, not generic benchmark filler.

Direct quality leader

qwen2.5-coder:1.5b led this run with a score of 50 out of 100, at 49.5s of wall time.

Speed frontier

qwen2.5-coder:1.5b was also the quickest direct model at 49.5s, ahead of qwen3:0.6b at 50.9s, so the top direct score came with no speed tradeoff.

Scale payoff

Across the direct ladder, quality moved from 0 at 0.6B to 50 at 1.5B, which is a +50 pt change.

Cascade verdict

No cascade beat its direct larger-model baseline in this run. The only pair, Qwen3 0.6b -> Qwen2.5-C 1.5b, scored 40 against 50 for the direct refiner, a -10 pt gap.

Latency tax

The cascade pipeline added 45.6s over asking the larger refiner directly (95.1s versus 49.5s). With only one pair in this run, the average and the worst penalty are the same figure.

Small-model salvage

The only pair, Qwen3 0.6b -> Qwen2.5-C 1.5b, finished +40 pts above the smaller model's own direct score of 0.

Area Chart

Direct quality by model size

Higher is better. This shows whether scaling the direct model improved the functional score on the script benchmark.

[Chart: direct score (y-axis, 0 to 50) against model size (x-axis: 0.6B, 1.5B)]
Area Chart

Direct latency by model size

Lower is better. This captures the first-pass wall time for the same benchmark prompt with no smaller-model draft attached.

[Chart: direct wall time (y-axis, 0.0s to 50.9s) against model size (x-axis: 0.6B, 1.5B)]
Area Chart

Cascade score vs direct refiner score

This is the real quality comparison for each pair: did the larger model improve when it inherited the smaller draft, or did it get dragged down?

[Chart: cascade score and direct refiner score (y-axis, 0 to 50) for the 0.6B -> 1.5B pair]
Area Chart

Cascade pipeline wall vs direct refiner wall

This shows whether the two-step path saved time or just added latency on top of the direct larger-model call.

[Chart: cascade pipeline wall and direct refiner wall (y-axis, 0.0s to 95.1s) for the 0.6B -> 1.5B pair]
Failure Hotspots

What kept breaking

These are the checks that failed most often across direct candidates and cascade refinements.

Check | Failures | Share | Examples
build_report_expected_metrics | 2 | 20.0% | qwen2.5-coder:1.5b, Qwen3 0.6b -> Qwen2.5-C 1.5b
cli_output_file | 2 | 20.0% | qwen2.5-coder:1.5b, Qwen3 0.6b -> Qwen2.5-C 1.5b
cli_prints_json_with_top | 2 | 20.0% | qwen2.5-coder:1.5b, Qwen3 0.6b -> Qwen2.5-C 1.5b
merge_overlaps_expected_windows | 2 | 20.0% | qwen2.5-coder:1.5b, Qwen3 0.6b -> Qwen2.5-C 1.5b
parse_events_rejects_invalid_window | 1 | 10.0% | Qwen3 0.6b -> Qwen2.5-C 1.5b
python_compile | 1 | 10.0% | qwen3:0.6b
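The Share column appears to be each check's failures divided by all recorded failures (2 of 10 gives 20.0%). The report does not state that definition, so treat it as an assumption. A minimal sketch of that tally, with the per-candidate failure lists transcribed from the Examples column above and a hypothetical helper name:

```python
from collections import Counter

# Failed checks per candidate, transcribed from the Examples column above.
failed_checks = {
    "qwen3:0.6b": ["python_compile"],
    "qwen2.5-coder:1.5b": [
        "build_report_expected_metrics",
        "cli_output_file",
        "cli_prints_json_with_top",
        "merge_overlaps_expected_windows",
    ],
    "Qwen3 0.6b -> Qwen2.5-C 1.5b": [
        "build_report_expected_metrics",
        "cli_output_file",
        "cli_prints_json_with_top",
        "merge_overlaps_expected_windows",
        "parse_events_rejects_invalid_window",
    ],
}


def failure_hotspots(failed_by_candidate):
    """Return (check, failures, share_percent) rows, most frequent first."""
    counts = Counter(
        check for checks in failed_by_candidate.values() for check in checks
    )
    total = sum(counts.values())
    rows = []
    # Sort by failure count (descending), then by check name for stable output.
    for check, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        # Assumed definition: share of all recorded failures, in percent.
        share = 100.0 * n / total if total else 0.0
        rows.append((check, n, round(share, 1)))
    return rows


hotspots = failure_hotspots(failed_checks)
for check, n, share in hotspots:
    print(f"{check}: {n} failures, {share:.1f}%")
```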
Direct Models

Direct baseline breakdown

This is the clean read on what each model did when it saw the original prompt with no smaller-model draft attached.

Model | Size (B) | Score | Call Wall | Throughput | Eval Wall | Top Failed Checks | Candidate
qwen3:0.6b | 0.6 | 0 | 50.9s | 47.13 tok/s | 0.12s | python_compile | deployment_window_analyzer.py
qwen2.5-coder:1.5b | 1.5 | 50 | 49.5s | 22.12 tok/s | 0.18s | merge_overlaps_expected_windows, build_report_expected_metrics, cli_prints_json_with_top | deployment_window_analyzer.py
Handoff Pairs

Cascade breakdown

These rows show whether the smaller draft helped, hurt, or just slowed down the larger refiner.

Pair | Small Direct | Large Direct | Cascade | Δ vs Large | Δ vs Small | Pipeline Wall | Large Direct Wall | Δ Wall | Faster? | Top Failed Checks | Candidate
Qwen3 0.6b -> Qwen2.5-C 1.5b | 0 | 50 | 40 | -10 pts | +40 pts | 95.1s | 49.5s | +45.6s | no | parse_events_rejects_invalid_window, merge_overlaps_expected_windows, build_report_expected_metrics | deployment_window_analyzer.py
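The derived columns in this row, and the summary cards at the top of the report, follow from simple differences. The sketch below reproduces that arithmetic. The `PairResult` class and `summarize` helper are hypothetical names for illustration, and counting a "win" as a strict improvement is an assumption based on the card captions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairResult:
    small_direct: int           # smaller model's direct score
    large_direct: int           # larger model's direct score
    cascade: int                # score after the small -> large handoff
    pipeline_wall_s: float      # wall time of the whole cascade
    large_direct_wall_s: float  # wall time of the larger model alone

    @property
    def delta_vs_large(self) -> int:
        return self.cascade - self.large_direct

    @property
    def delta_vs_small(self) -> int:
        return self.cascade - self.small_direct

    @property
    def delta_wall_s(self) -> float:
        # Rounded to one decimal to match the precision shown in the report.
        return round(self.pipeline_wall_s - self.large_direct_wall_s, 1)

    @property
    def faster(self) -> bool:
        return self.pipeline_wall_s < self.large_direct_wall_s


def summarize(pairs: List[PairResult]) -> dict:
    """Aggregate the per-pair deltas into the headline counters."""
    n = len(pairs)
    return {
        "pairs": n,
        # Assumed: a win requires strictly beating the direct larger model.
        "quality_wins": sum(p.delta_vs_large > 0 for p in pairs),
        "speed_wins": sum(p.faster for p in pairs),
        "avg_latency_tax_s": round(sum(p.delta_wall_s for p in pairs) / n, 1)
        if n
        else 0.0,
    }


# Values transcribed from the row above.
pair = PairResult(
    small_direct=0,
    large_direct=50,
    cascade=40,
    pipeline_wall_s=95.1,
    large_direct_wall_s=49.5,
)
summary = summarize([pair])
```

With the single pair from this run, `summary` reports 0 quality wins, 0 speed wins, and a 45.6s latency tax, matching the cards.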