Gemma 4 MTP on Pascal — a dead end. Use Vulkan instead.
Don’t burn time chasing a 3× MTP speedup for Gemma 4 on Pascal-class GPUs. Switch to Vulkan instead — that’s where the real Gemma 4 win lives.
What we wanted
A 3× tokens-per-second speedup for Gemma 4 E4B serving on Predator (GTX 1060 6 GB) and Pavilion (GTX 1050 4 GB) via Multi-Token Prediction — speculative decoding with a small drafter model proposing multiple tokens that a larger target verifies in a single forward pass.
What worked instead
Switching llama.cpp from the CUDA backend to the Vulkan backend on Pascal nets +27% to +47% generation throughput, because Flash Attention is broken on compute capability 6.1 in llama.cpp’s CUDA backend (issue #7055) and Vulkan goes around it.
| Machine | Model | Backend | tok/s | Note |
|---|---|---|---|---|
| Predator GTX 1060 | Gemma 4 E4B Q4_K_M | CUDA | 25.34 | baseline |
| Predator GTX 1060 | Gemma 4 E4B Q4_K_M | Vulkan | 32.15 | +27% |
| Predator GTX 1060 | Gemma 4 E2B Q4_K_M | Vulkan | 56.90 | 2.2× baseline |
| Pavilion GTX 1050 | Gemma 4 E2B Q4_K_M | CUDA | 23.44 | partial baseline |
| Pavilion GTX 1050 | Gemma 4 E2B Q4_K_M | Vulkan | 34.46 | +47% |
That’s the real deployment recipe on Pascal: Vulkan + Q4_K_M, no MTP. Everything below is what we tried on the way to learning that.
Six MTP attempts that didn’t pencil
1. The official Google drafter doesn’t load in llama.cpp
Google ships an
official MTP drafter for Gemma 3
intended to predict 4 tokens ahead. We tried converting it to
GGUF and pointing llama.cpp at it via --model-draft.
Build 8951 errors out on the architecture; build 9041 errors out
too. The drafter format llama.cpp expects isn’t the format
Google publishes.
2. Generic --model-draft with a smaller Gemma yields 0–13% slowdown
Falling back to "use a smaller Gemma as the drafter for a bigger Gemma." Tested Gemma-2-2B drafting Gemma-3-9B on Predator, smollm2-360M drafting Gemma-3-4B on Pavilion. Across configurations, output throughput came in 0–13% slower than the same target model serving alone. The drafter overhead exceeded the accept-rate-times-tokens-saved math at every config we tried.
3. CPU-side drafting tanks throughput
The "what if the drafter ran on CPU and the target on GPU?" instinct, in case the GPU contention was the bottleneck. CPU drafting on a Ryzen-class chip drops the system to single-digit tok/s — the CPU forward pass for even a 360M-parameter drafter is slower than the GPU’s entire combined speculate-and-verify cycle on the target.
4. Pascal lacks tensor cores so the speedup math doesn’t pencil
Even when MTP works on newer hardware, the speedup is roughly proportional to how much faster the target’s parallel verification is than its serial token generation. On Ampere / Hopper / Ada, that ratio is dominated by tensor-core throughput on the verification batch — the target can verify N speculative tokens in ≈1 forward pass time. Pascal has no tensor cores; the verification batch on a GTX 1060 isn’t meaningfully faster per-token than serial generation. The architectural assumption MTP relies on is missing.
5. HuggingFace Transformers generate(num_assistant_tokens=...) errors out
Tested with HF Transformers 5.8.0 against
google/gemma-3-9b-it on Predator. The
num_assistant_tokens kwarg fails with cryptic
errors at generation time across multiple drafter pairings
(Gemma-2-2B, Gemma-3-1B-it, the official MTP drafter). HF’s
speculative-decoding implementation has tight assumptions about
drafter-target architecture compatibility that the documented
Gemma 3 MTP drafter doesn’t satisfy.
6. PyTorch / bitsandbytes ecosystem gates above SM_61
Pascal is SM_61. PyTorch 2.4+ cu126 wheels still include SM_61 kernels for base operations, but the speculative-decoding accelerator stacks (vLLM, SGLang, flash-attn, DFlash) all gate above SM_70 or SM_80 with no fallback. Even if we could rewrite Gemma 4 inference outside llama.cpp to do MTP cleanly, the optimization libraries that would make it worth doing have abandoned Pascal.
What to do instead
On Pascal-class GPUs (GTX 1050 / 1060 / 1070 / 1080, Tesla P40), the deployment pattern that actually delivers throughput for Gemma 4 E2B / E4B is:
- llama.cpp with the Vulkan backend (build 8951 or later). Avoids the broken-flash-attention path on Pascal CUDA.
- Q4_K_M quantization for the model weights. Q5_K_M and above hit VRAM ceilings on 4 GB cards; Q4_K_M is the sweet spot for Pascal’s memory budget.
- No MTP, no speculative decoding. The 27–47% Vulkan win is the ceiling on this hardware; chasing more is a multi-day dead end.
- Use Gemma 4 E2B for 4 GB cards, E4B for 6 GB cards. The throughput numbers above show E2B at 56 tok/s on a six-year-old GTX 1060 and 34 tok/s on a 4 GB GTX 1050 — both fast enough for assistant chat.
If you’re building on this
These are six independently-confirmed dead ends so you can skip them. If you find a Pascal MTP path that does pencil, we’d love to know — the reference data is in our public archive; re-run, prove us wrong, tell us where we got it wrong.
Source research doc · All findings →