Prompt Handoff Benchmark

Direct vs cascade codegen, with actual takeaways

This report compares asking each local model directly against asking a smaller model first and then handing its draft to a larger refiner. It is focused on the Python script benchmark in this run and pulls out the quality, speed, and failure patterns that the raw JSON actually supports.
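The two paths being compared can be sketched roughly as follows. This is a minimal illustration and not the benchmark's actual harness: `generate`, `direct`, `cascade`, and the wording of the refine prompt are all hypothetical, and the model call is injected as a plain callable so that no particular runtime API is assumed.

```python
import time
from typing import Callable, Tuple

# Hypothetical model call: (model_name, prompt) -> generated text.
Generate = Callable[[str, str], str]


def direct(generate: Generate, model: str, task: str) -> Tuple[str, float]:
    """Ask one model for the script; return (candidate, wall seconds)."""
    start = time.perf_counter()
    candidate = generate(model, task)
    return candidate, time.perf_counter() - start


def cascade(generate: Generate, small: str, large: str, task: str) -> Tuple[str, float]:
    """Draft with the smaller model, then hand the draft to the larger refiner."""
    start = time.perf_counter()
    draft = generate(small, task)
    # Assumed handoff wording; the real refine prompt is not shown in this report.
    refine_prompt = (
        f"{task}\n\nDraft from a smaller model:\n{draft}\n\nFix and complete it."
    )
    candidate = generate(large, refine_prompt)
    return candidate, time.perf_counter() - start
```

In this sketch the cascade wall time covers both calls, so the two-step path only comes out ahead if the draft improves the refiner's output or shortens its generation.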

Started: 2026-04-20T13:08:27Z
Finished: 2026-04-20T13:10:53Z
Best direct: qwen2.5-coder:1.5b (50)
Best cascade: qwen3:0.6b -> qwen2.5-coder:1.5b (40)

Best Direct

qwen2.5-coder:1.5b

Score 50

Models

2

Direct baselines in this run

Pairs

1

Smaller -> larger handoff comparisons

Cascade Quality Wins

0/1

Pairs that beat direct larger-model quality

Cascade Speed Wins

0/1

Pairs faster than direct larger-model prompts

Avg Latency Tax

45.6s

Extra wall time added by the cascade pipeline

Findings

What this run actually says

These are conclusions inferred from the run output, not generic benchmark filler.

Direct quality leader

qwen2.5-coder:1.5b led this run with a score of 50 out of 100, at 49.5s of wall time.

Speed frontier

qwen2.5-coder:1.5b was the quickest direct model at 49.5s, against 50.9s for qwen3:0.6b. It is also the quality leader, so there is no speed-versus-quality trade-off among the direct models in this run.

Scale payoff

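Across the direct ladder, quality moved from 0 at 0.6B to 50 at 1.5B, a +50 pt change. The 0.6B candidate failed `python_compile`, and the two models come from different families (qwen3 and qwen2.5-coder), so this run does not isolate the effect of size alone.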

Cascade verdict

No cascade beat its direct larger-model baseline in this run. The only pair, Qwen3 0.6b -> Qwen2.5-C 1.5b, scored 40 against a direct score of 50, which is -10 pts.

Latency tax

The cascade pipeline took 95.1s against 49.5s for asking the larger refiner directly, adding 45.6s. With a single pair in this run, the average and worst-case penalty are the same figure.

Small-model salvage

Qwen3 0.6b -> Qwen2.5-C 1.5b scored 40 where the smaller model alone scored 0, a gain of +40 pts over the smaller direct score. It is the only pair in this run, so it is also the largest gain.

Area chart: Direct quality by model size (series: Direct score)

Higher is better. This shows whether scaling the direct model improved the functional score on the script benchmark.

Area chart: Direct latency by model size (series: Direct wall time)

Lower is better. This captures the first-pass wall time for the same benchmark prompt with no smaller-model draft attached.

Area chart: Cascade score vs direct refiner score (series: Cascade score, Direct refiner score)

This is the real quality comparison for each pair: did the larger model improve when it inherited the smaller draft, or did it get dragged down?

Area chart: Cascade pipeline wall vs direct refiner wall (series: Cascade pipeline wall, Direct refiner wall)

This shows whether the two-step path saved time or just added latency on top of the direct larger-model call.

Failure Hotspots

What kept breaking

These are the checks that failed most often across direct candidates and cascade refinements.

Check	Failures	Share	Examples
`build_report_expected_metrics`	2	20.0%	qwen2.5-coder:1.5b, Qwen3 0.6b -> Qwen2.5-C 1.5b
`cli_output_file`	2	20.0%	qwen2.5-coder:1.5b, Qwen3 0.6b -> Qwen2.5-C 1.5b
`cli_prints_json_with_top`	2	20.0%	qwen2.5-coder:1.5b, Qwen3 0.6b -> Qwen2.5-C 1.5b
`merge_overlaps_expected_windows`	2	20.0%	qwen2.5-coder:1.5b, Qwen3 0.6b -> Qwen2.5-C 1.5b
`parse_events_rejects_invalid_window`	1	10.0%	Qwen3 0.6b -> Qwen2.5-C 1.5b
`python_compile`	1	10.0%	qwen3:0.6b
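The Share column is consistent with dividing each check's failure count by the total number of failures across all candidates (10 in this run). A minimal sketch of that tally follows. The denominator is inferred from the numbers above and not confirmed by the report, `failure_shares` is a hypothetical helper, and the per-candidate lists are read off the Examples column.

```python
from collections import Counter

# Failed checks per candidate, reconstructed from the Examples column above.
failed_by_candidate = {
    "qwen3:0.6b": ["python_compile"],
    "qwen2.5-coder:1.5b": [
        "build_report_expected_metrics",
        "cli_output_file",
        "cli_prints_json_with_top",
        "merge_overlaps_expected_windows",
    ],
    "Qwen3 0.6b -> Qwen2.5-C 1.5b": [
        "build_report_expected_metrics",
        "cli_output_file",
        "cli_prints_json_with_top",
        "merge_overlaps_expected_windows",
        "parse_events_rejects_invalid_window",
    ],
}


def failure_shares(failed):
    """Map each check to (failure count, percent of all failures)."""
    counts = Counter(check for checks in failed.values() for check in checks)
    total = sum(counts.values())
    # Assumption: share = failures for this check / failures across all checks.
    return {check: (n, 100.0 * n / total) for check, n in counts.most_common()}


shares = failure_shares(failed_by_candidate)
for check, (n, pct) in shares.items():
    print(f"{check}\t{n}\t{pct:.1f}%")
```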

Direct Models

Direct baseline breakdown

This is the clean read on what each model did when it saw the original prompt with no smaller-model draft attached.

Model	Size (B)	Score	Call Wall	Throughput	Eval Wall	Top Failed Checks	Candidate
qwen3:0.6b	0.6	0	50.9s	47.13 tok/s	0.12s	python_compile	deployment_window_analyzer.py
qwen2.5-coder:1.5b	1.5	50	49.5s	22.12 tok/s	0.18s	merge_overlaps_expected_windows, build_report_expected_metrics, cli_prints_json_with_top	deployment_window_analyzer.py

Handoff Pairs

Cascade breakdown

These rows show whether the smaller draft helped, hurt, or just slowed down the larger refiner.

Pair	Small Direct	Large Direct	Cascade	Δ vs Large	Δ vs Small	Pipeline Wall	Large Direct Wall	Δ Wall	Faster?	Top Failed Checks	Candidate
Qwen3 0.6b -> Qwen2.5-C 1.5b	0	50	40	-10 pts	+40 pts	95.1s	49.5s	+45.6s	no	parse_events_rejects_invalid_window, merge_overlaps_expected_windows, build_report_expected_metrics	deployment_window_analyzer.py
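
The delta columns and the headline win counts follow from simple arithmetic on each row. The sketch below recomputes them for the one pair in this run. `PairResult` and its field names are hypothetical and do not reflect the schema of the raw JSON.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """One handoff row; field names are illustrative, not the report's schema."""
    small_direct: int
    large_direct: int
    cascade: int
    pipeline_wall: float      # seconds
    large_direct_wall: float  # seconds

    @property
    def delta_vs_large(self) -> int:
        return self.cascade - self.large_direct

    @property
    def delta_vs_small(self) -> int:
        return self.cascade - self.small_direct

    @property
    def delta_wall(self) -> float:
        # Rounded to one decimal to match the precision shown in the table.
        return round(self.pipeline_wall - self.large_direct_wall, 1)

    @property
    def faster(self) -> bool:
        return self.pipeline_wall < self.large_direct_wall


pairs = [PairResult(0, 50, 40, 95.1, 49.5)]  # values from the row above

quality_wins = sum(p.delta_vs_large > 0 for p in pairs)
speed_wins = sum(p.faster for p in pairs)
avg_latency_tax = round(sum(p.delta_wall for p in pairs) / len(pairs), 1)

print(f"Cascade quality wins: {quality_wins}/{len(pairs)}")
print(f"Cascade speed wins: {speed_wins}/{len(pairs)}")
print(f"Avg latency tax: {avg_latency_tax}s")
```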