Dead end · 2026-05-06 · 6 min read · Pavilion + Predator

Gemma 4 MTP on Pascal — a dead end. Use Vulkan instead.

Don’t burn time chasing a 3× MTP speedup for Gemma 4 on Pascal-class GPUs. Switch to Vulkan instead — that’s where the real Gemma 4 win lives.

What we wanted

A 3× tokens-per-second speedup for Gemma 4 E4B serving on Predator (GTX 1060 6 GB) and Pavilion (GTX 1050 4 GB) via Multi-Token Prediction — speculative decoding with a small drafter model proposing multiple tokens that a larger target verifies in a single forward pass.

What worked instead

Switching llama.cpp from the CUDA backend to the Vulkan backend on Pascal raises generation throughput by 27% to 47%. Flash Attention is broken on compute capability 6.1 in llama.cpp’s CUDA backend (issue #7055), and the Vulkan backend avoids that code path entirely.

Machine	Model	Backend	tok/s	Note
Predator GTX 1060	Gemma 4 E4B Q4_K_M	CUDA	25.34	baseline
Predator GTX 1060	Gemma 4 E4B Q4_K_M	Vulkan	32.15	+27%
Predator GTX 1060	Gemma 4 E2B Q4_K_M	Vulkan	56.90	2.2× the E4B CUDA baseline
Pavilion GTX 1050	Gemma 4 E2B Q4_K_M	CUDA	23.44	partial baseline
Pavilion GTX 1050	Gemma 4 E2B Q4_K_M	Vulkan	34.46	+47%
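The Note column can be re-derived from the tok/s column. A minimal sketch, using only the figures in the table; the dictionary layout and the function name are ours, purely for illustration:

```python
# Throughput figures (tok/s) copied from the table above.
RESULTS = {
    ("Predator", "E4B", "CUDA"): 25.34,
    ("Predator", "E4B", "Vulkan"): 32.15,
    ("Predator", "E2B", "Vulkan"): 56.90,
    ("Pavilion", "E2B", "CUDA"): 23.44,
    ("Pavilion", "E2B", "Vulkan"): 34.46,
}


def percent_gain(new: float, old: float) -> float:
    """Relative throughput change of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Same model, same machine, CUDA -> Vulkan.
predator_gain = percent_gain(
    RESULTS[("Predator", "E4B", "Vulkan")], RESULTS[("Predator", "E4B", "CUDA")]
)
pavilion_gain = percent_gain(
    RESULTS[("Pavilion", "E2B", "Vulkan")], RESULTS[("Pavilion", "E2B", "CUDA")]
)

# Different model: E2B on Vulkan compared against the E4B CUDA baseline.
e2b_ratio = RESULTS[("Predator", "E2B", "Vulkan")] / RESULTS[("Predator", "E4B", "CUDA")]

print(f"Predator E4B, Vulkan vs CUDA: {predator_gain:+.0f}%")
print(f"Pavilion E2B, Vulkan vs CUDA: {pavilion_gain:+.0f}%")
print(f"Predator E2B Vulkan vs E4B CUDA: {e2b_ratio:.1f}x")
```

Note that the 2.2× row compares a smaller model on Vulkan against a larger model on CUDA, so it mixes the backend win with the model-size win.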

That’s the real deployment recipe on Pascal: Vulkan + Q4_K_M, no MTP. Everything below is what we tried on the way to learning that.

Six MTP attempts that didn’t pencil out

1. The official Google drafter doesn’t load in llama.cpp

Google ships an official MTP drafter for Gemma 3 intended to predict 4 tokens ahead. We tried converting it to GGUF and pointing llama.cpp at it via `--model-draft`. Build 8951 errors out on the architecture, and build 9041 errors out as well. The drafter format llama.cpp expects isn’t the format Google publishes.

2. Generic `--model-draft` with a smaller Gemma yields 0–13% slowdown

The fallback was to use a smaller model as the drafter for a bigger Gemma. We tested Gemma-2-2B drafting Gemma-3-9B on Predator, and smollm2-360M drafting Gemma-3-4B on Pavilion. Across configurations, output throughput came in 0–13% slower than the same target model serving alone. The drafter’s overhead outweighed the tokens saved by accepted drafts at every configuration we tried.
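The trade-off between drafter overhead and accepted tokens can be sketched with the standard speculative-decoding estimate. This is a simplified model, not something we measured: it assumes each drafted token is accepted independently with probability `alpha`, that the drafter costs a fixed fraction `draft_cost` of one target forward pass, and that verifying the whole draft costs exactly one target pass. The example values of `alpha` and `draft_cost` are illustrative assumptions.

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-and-verify cycle.

    alpha: per-token acceptance probability (assumed independent per token).
    k:     number of tokens the drafter proposes per cycle.
    """
    if alpha >= 1.0:
        return float(k + 1)
    # Geometric series 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speculative_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding.

    draft_cost: cost of one drafter step, in units of one target forward pass.
    Assumes the verification pass costs the same as a single decode step.
    """
    cycle_cost = k * draft_cost + 1.0  # k drafter steps plus one target pass
    return expected_tokens_per_cycle(alpha, k) / cycle_cost


if __name__ == "__main__":
    # Illustrative values only: a mediocre accept rate and a drafter that is
    # not much cheaper than the target.
    for alpha in (0.5, 0.6, 0.8):
        s = speculative_speedup(alpha=alpha, k=4, draft_cost=0.3)
        print(f"alpha={alpha:.1f}  estimated speedup={s:.2f}x")
```

Under this model the estimate falls below 1.0 whenever `k * draft_cost + 1` exceeds the expected tokens per cycle, which is the shape of result we saw: a drafter that is too expensive or too often rejected makes serving slower, not faster.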

3. CPU-side drafting tanks throughput

Next we tried running the drafter on the CPU and the target on the GPU, in case GPU contention was the bottleneck. CPU drafting on a Ryzen-class chip drops the system to single-digit tok/s: the CPU forward pass for even a 360M-parameter drafter is slower than the GPU’s entire combined speculate-and-verify cycle on the target.

4. Pascal lacks tensor cores so the speedup math doesn’t pencil out

Even when MTP works on newer hardware, the speedup is roughly proportional to how much faster the target’s parallel verification is than its serial token generation. On Ampere / Hopper / Ada, that ratio is dominated by tensor-core throughput on the verification batch: the target can verify N speculative tokens in roughly the time of one forward pass. Pascal has no tensor cores, so the verification batch on a GTX 1060 isn’t meaningfully faster per token than serial generation. The architectural assumption MTP relies on is missing.
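To make the dependence on verification cost concrete, here is a toy model. The `parallel_efficiency` knob is hypothetical and not a measured quantity: 1.0 means a batch of drafted tokens verifies in the time of one decode step, and 0.0 means verification costs as much as generating each token serially. Drafter cost is ignored to isolate the verification effect.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per cycle with k drafted tokens and accept rate alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def verify_cost(k: int, parallel_efficiency: float) -> float:
    """Cost of verifying k drafted tokens, in units of one serial decode step.

    parallel_efficiency = 1.0 -> 1 step (verification batch is effectively free)
    parallel_efficiency = 0.0 -> k + 1 steps (no benefit from batching)
    """
    return 1.0 + (1.0 - parallel_efficiency) * k


def speedup(alpha: float, k: int, parallel_efficiency: float) -> float:
    """Estimated speedup with a free drafter (hypothetical best case)."""
    return expected_tokens(alpha, k) / verify_cost(k, parallel_efficiency)


if __name__ == "__main__":
    for eff in (1.0, 0.5, 0.0):
        print(f"efficiency={eff:.1f}  speedup={speedup(0.7, 4, eff):.2f}x")
```

With `parallel_efficiency` at 0.0 the estimate stays at or below 1.0 for any accept rate, because a cycle can never emit more than `k + 1` tokens. That is the sense in which cheap batched verification is a precondition for MTP rather than a bonus.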

5. HuggingFace Transformers `generate(num_assistant_tokens=...)` errors out

Tested with HF Transformers 5.8.0 against google/gemma-3-9b-it on Predator. The num_assistant_tokens kwarg fails with cryptic errors at generation time across multiple drafter pairings (Gemma-2-2B, Gemma-3-1B-it, the official MTP drafter). HF’s speculative-decoding implementation has tight assumptions about drafter-target architecture compatibility that the documented Gemma 3 MTP drafter doesn’t satisfy.

6. PyTorch / bitsandbytes ecosystem gates above SM_61

Pascal is SM_61. PyTorch 2.4+ cu126 wheels still include SM_61 kernels for base operations, but the speculative-decoding accelerator stacks (vLLM, SGLang, flash-attn, DFlash) all require SM_70 or SM_80 at minimum, with no fallback. Even if we could rewrite Gemma 4 inference outside llama.cpp to do MTP cleanly, the optimization libraries that would make it worth doing have abandoned Pascal.
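A quick way to see which side of those gates a card falls on is to compare compute-capability tuples. A minimal sketch: the `GATES` table only encodes the two thresholds named above and does not claim which library uses which one, and `local_capability` is a helper of ours that relies on PyTorch's `torch.cuda.get_device_capability` when PyTorch is installed.

```python
from typing import Optional, Tuple

Capability = Tuple[int, int]  # (major, minor), e.g. (6, 1) for SM_61

PASCAL_SM_61: Capability = (6, 1)

# The two minimums mentioned in the text, as (major, minor) tuples.
GATES = {"SM_70": (7, 0), "SM_80": (8, 0)}


def passes_gate(device: Capability, minimum: Capability) -> bool:
    """True if the device meets the minimum compute capability.

    Tuples compare element-wise: major version first, then minor.
    """
    return device >= minimum


def local_capability() -> Optional[Capability]:
    """Compute capability of CUDA device 0, or None if it cannot be queried."""
    try:
        import torch  # optional dependency
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return (major, minor)


if __name__ == "__main__":
    device = local_capability() or PASCAL_SM_61
    for name, minimum in GATES.items():
        verdict = "ok" if passes_gate(device, minimum) else "blocked"
        print(f"{name}: {verdict} for capability {device}")
```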

What to do instead

On Pascal-class GPUs (GTX 1050 / 1060 / 1070 / 1080, Tesla P40), the deployment pattern that actually delivers throughput for Gemma 4 E2B / E4B is:

llama.cpp with the Vulkan backend (build 8951 or later). Avoids the broken-flash-attention path on Pascal CUDA.
Q4_K_M quantization for the model weights. Q5_K_M and above hit VRAM ceilings on 4 GB cards; Q4_K_M is the sweet spot for Pascal’s memory budget.
No MTP, no speculative decoding. The 27–47% Vulkan win is the ceiling on this hardware; chasing more is a multi-day dead end.
Use Gemma 4 E2B for 4 GB cards, E4B for 6 GB cards. The throughput numbers above show E2B at about 57 tok/s on a 2016-era GTX 1060 and about 34 tok/s on a 4 GB GTX 1050, both fast enough for assistant chat.
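The recipe above can be sketched as a small helper that assembles the build and benchmark commands. It assumes a llama.cpp source checkout, CMake, and an installed Vulkan SDK; `GGML_VULKAN=ON` is the CMake option llama.cpp uses to enable the Vulkan backend, and `llama-bench` accepts `-m` and `-ngl`. The model filename below is a hypothetical placeholder, not the name of a published file.

```python
import shlex
from typing import List


def vulkan_build_commands(build_dir: str = "build") -> List[List[str]]:
    """CMake commands to configure and build llama.cpp with the Vulkan backend."""
    return [
        ["cmake", "-B", build_dir, "-DGGML_VULKAN=ON"],
        ["cmake", "--build", build_dir, "--config", "Release"],
    ]


def bench_command(model_path: str, build_dir: str = "build",
                  gpu_layers: int = 99) -> List[str]:
    """llama-bench invocation that offloads all layers to the GPU.

    Assumes the default CMake layout, where binaries land in <build_dir>/bin.
    """
    return [f"{build_dir}/bin/llama-bench", "-m", model_path,
            "-ngl", str(gpu_layers)]


if __name__ == "__main__":
    # Hypothetical model path; substitute your own Q4_K_M GGUF file.
    model = "models/gemma-4-E2B-Q4_K_M.gguf"
    for cmd in vulkan_build_commands() + [bench_command(model)]:
        # Print rather than execute, so the commands can be reviewed first.
        print(shlex.join(cmd))
```

Printing the commands instead of running them keeps the sketch side-effect free; pass each list to `subprocess.run(cmd, check=True)` from the llama.cpp checkout if you want to execute them.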

If you’re building on this

These are six independently confirmed dead ends, documented so you can skip them. If you find a Pascal MTP path that does pencil out, we’d love to know. The reference data is in our public archive: re-run it and tell us where we got it wrong.
