Score 50
Direct vs cascade codegen, with actual takeaways
This report compares asking each local model directly against asking a smaller model first and then handing its draft to a larger refiner. It is focused on the Python script benchmark in this run and pulls out the quality, speed, and failure patterns that the raw JSON actually supports.
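The two paths being compared can be sketched as follows. This is a minimal illustration, not the harness that produced this run: `generate` is a hypothetical callable standing in for whatever local inference call is used, and the wording of the refinement prompt is an assumption.

```python
import time
from typing import Callable, Tuple

# Hypothetical signature: generate(model_name, prompt) -> generated text.
Generate = Callable[[str, str], str]


def run_direct(generate: Generate, model: str, task: str) -> Tuple[str, float]:
    """Ask one model directly; return (candidate, wall seconds)."""
    start = time.perf_counter()
    candidate = generate(model, task)
    return candidate, time.perf_counter() - start


def run_cascade(
    generate: Generate, small: str, large: str, task: str
) -> Tuple[str, float]:
    """Draft with the smaller model, then hand the draft to the larger refiner."""
    start = time.perf_counter()
    draft = generate(small, task)
    # Assumed refinement prompt; the real wording is not shown in this report.
    refine_prompt = (
        f"{task}\n\nHere is a draft solution. Fix and improve it:\n\n{draft}"
    )
    candidate = generate(large, refine_prompt)
    # Pipeline wall covers both calls, so it includes the smaller model's time.
    return candidate, time.perf_counter() - start
```

Because the cascade wall time in this sketch spans both calls, the two-step path can only come out faster if the refiner's call shrinks by more than the draft call costs.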
Direct baselines in this run: 2
Smaller -> larger handoff comparisons: 1
Pairs that beat direct larger-model quality: 0
Pairs faster than direct larger-model prompts: 0
Extra wall time added by the cascade pipeline: +45.6s on average
What this run actually says
These are conclusions inferred from the run output, not generic benchmark filler.
Direct quality leader
qwen2.5-coder:1.5b led this run with a score of 50 out of 100, at 49.5s of wall time.
Speed frontier
qwen2.5-coder:1.5b was the quickest direct model at 49.5s, narrowly ahead of qwen3:0.6b at 50.9s. It was also the top-scoring direct model, so speed and quality did not trade off in this run.
Scale payoff
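Across the direct ladder, quality moved from 0 at 0.6B to 50 at 1.5B, a +50 pt change. The two models come from different families (Qwen3 and Qwen2.5-Coder), and the 0.6B candidate failed python_compile outright, so the gain cannot be attributed to size alone.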
Cascade verdict
No cascade beat its direct larger-model baseline in this run. The closest result was Qwen3 0.6b -> Qwen2.5-C 1.5b at -10 pts versus direct.
Latency tax
With only one cascade pair in this run, the average and worst-case penalties are the same: the pipeline took 45.6s longer than asking the larger refiner directly (95.1s vs. 49.5s).
Small-model salvage
The biggest gain over the smaller model's own draft was Qwen3 0.6b -> Qwen2.5-C 1.5b with +40 pts relative to the smaller direct score.
Direct quality by model size
Higher is better. This shows whether scaling the direct model improved the functional score on the script benchmark.
Direct latency by model size
Lower is better. This captures the first-pass wall time for the same benchmark prompt with no smaller-model draft attached.
Cascade score vs direct refiner score
This is the real quality comparison for each pair: did the larger model improve when it inherited the smaller draft, or did it get dragged down?
Cascade pipeline wall vs direct refiner wall
This shows whether the two-step path saved time or just added latency on top of the direct larger-model call.
What kept breaking
These are the checks that failed most often across direct candidates and cascade refinements.
| Check | Failures | Share | Examples |
|---|---|---|---|
| build_report_expected_metrics | 2 | 20.0% | qwen2.5-coder:1.5b, Qwen3 0.6b -> Qwen2.5-C 1.5b |
| cli_output_file | 2 | 20.0% | qwen2.5-coder:1.5b, Qwen3 0.6b -> Qwen2.5-C 1.5b |
| cli_prints_json_with_top | 2 | 20.0% | qwen2.5-coder:1.5b, Qwen3 0.6b -> Qwen2.5-C 1.5b |
| merge_overlaps_expected_windows | 2 | 20.0% | qwen2.5-coder:1.5b, Qwen3 0.6b -> Qwen2.5-C 1.5b |
| parse_events_rejects_invalid_window | 1 | 10.0% | Qwen3 0.6b -> Qwen2.5-C 1.5b |
| python_compile | 1 | 10.0% | qwen3:0.6b |
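The Share column appears to be each check's failure count divided by the total number of recorded failures (10 in this table), not by the number of candidates. A short sketch that reproduces those figures is below. The `failed` mapping is hand-copied from the Examples column and is only an assumption about how the raw JSON is organized.

```python
from collections import Counter

# Failed checks per candidate, transcribed from the Examples column above.
failed = {
    "qwen3:0.6b": ["python_compile"],
    "qwen2.5-coder:1.5b": [
        "build_report_expected_metrics",
        "cli_output_file",
        "cli_prints_json_with_top",
        "merge_overlaps_expected_windows",
    ],
    "Qwen3 0.6b -> Qwen2.5-C 1.5b": [
        "build_report_expected_metrics",
        "cli_output_file",
        "cli_prints_json_with_top",
        "merge_overlaps_expected_windows",
        "parse_events_rejects_invalid_window",
    ],
}

# Count how many candidates failed each check.
counts = Counter(check for checks in failed.values() for check in checks)
total_failures = sum(counts.values())

# Most frequent failures first, ties broken alphabetically.
for check, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{check}: {n} failures, {n / total_failures:.1%} of all failures")
```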
Direct baseline breakdown
This is the clean read on what each model did when it saw the original prompt with no smaller-model draft attached.
| Model | Size (B) | Score | Call Wall | Throughput | Eval Wall | Top Failed Checks | Candidate |
|---|---|---|---|---|---|---|---|
| qwen3:0.6b | 0.6 | 0 | 50.9s | 47.13 tok/s | 0.12s | python_compile | deployment_window_analyzer.py |
| qwen2.5-coder:1.5b | 1.5 | 50 | 49.5s | 22.12 tok/s | 0.18s | merge_overlaps_expected_windows, build_report_expected_metrics, cli_prints_json_with_top | deployment_window_analyzer.py |
Cascade breakdown
These rows show whether the smaller draft helped, hurt, or just slowed down the larger refiner.
| Pair | Small Direct | Large Direct | Cascade | Δ vs Large | Δ vs Small | Pipeline Wall | Large Direct Wall | Δ Wall | Faster? | Top Failed Checks | Candidate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3 0.6b -> Qwen2.5-C 1.5b | 0 | 50 | 40 | -10 pts | +40 pts | 95.1s | 49.5s | +45.6s | no | parse_events_rejects_invalid_window, merge_overlaps_expected_windows, build_report_expected_metrics | deployment_window_analyzer.py |
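The delta columns in this table follow directly from the scores and wall times. A minimal sketch of that arithmetic, using the row above as input; the `CascadeRow` class and its field names are illustrative and not taken from the run's JSON schema.

```python
from dataclasses import dataclass


@dataclass
class CascadeRow:
    pair: str
    small_score: int
    large_score: int
    cascade_score: int
    pipeline_wall_s: float
    large_wall_s: float

    @property
    def delta_vs_large(self) -> int:
        # Negative means the inherited draft dragged the refiner down.
        return self.cascade_score - self.large_score

    @property
    def delta_vs_small(self) -> int:
        # Positive means the refiner salvaged the smaller model's draft.
        return self.cascade_score - self.small_score

    @property
    def delta_wall_s(self) -> float:
        # Rounded to match the one-decimal precision used in the table.
        return round(self.pipeline_wall_s - self.large_wall_s, 1)

    @property
    def faster(self) -> bool:
        return self.pipeline_wall_s < self.large_wall_s


row = CascadeRow("Qwen3 0.6b -> Qwen2.5-C 1.5b", 0, 50, 40, 95.1, 49.5)
print(row.delta_vs_large, row.delta_vs_small, row.delta_wall_s, row.faster)
```

For this row the cascade is both lower-scoring and slower than the direct refiner call, which is the combination the Cascade verdict and Latency tax summaries describe.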