Cursor is trying to make MoE models a bit less wasteful when they generate text. Its new method, called “warp decode,” is built around a simple idea—stop spending so much time moving data around, and let the GPU focus on actually producing results.
This becomes important during decoding, where models generate one token at a time. With so little work per step, the usual tricks like batching don't really help; throughput is limited instead by how fast weights can be streamed from memory, and systems often get bogged down in internal steps that don't directly contribute to the final output.
Normally, MoE models send each token through a few selected “experts,” and the whole pipeline is designed around those experts. Tokens are grouped, processed, reshuffled, and then combined again. It works fine at scale, but during decoding, it starts to feel inefficient—too many steps just to organise data before any real computation happens.
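The expert-centric pipeline can be sketched in a few lines of numpy. This is a toy illustration, not Cursor's implementation; the sizes, expert ids, and gate weights below are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16        # toy dimensions
n_experts, top_k = 4, 2      # route each token to 2 of 4 experts

# Per-expert weights: down-projection from d_ff back to d_model.
W_down = rng.standard_normal((n_experts, d_ff, d_model)).astype(np.float32)

# One decoded token's hidden activations, one row per chosen expert.
h = rng.standard_normal((top_k, d_ff)).astype(np.float32)
chosen = np.array([1, 3])                  # router's top-k expert ids
gates = np.array([0.6, 0.4], np.float32)   # router combine weights

# Conventional pipeline: run each chosen expert separately into its
# own partial-output buffer, then reshuffle and combine at the end.
partials = np.stack([h[i] @ W_down[e] for i, e in enumerate(chosen)])
out = (gates[:, None] * partials).sum(axis=0)
print(out.shape)  # (8,)
```

The intermediate `partials` buffer is exactly the kind of in-between storage the article says decoding can do without.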
Warp decode takes a cleaner approach. Rather than centring everything around experts, it focuses on the output itself. Each GPU warp—a small group of threads—handles one output value from start to finish. It gathers the needed weights, combines inputs from the chosen experts, and writes the result straight away.
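The output-centric idea can be mimicked on the CPU: the function below plays the role of one warp, gathering the relevant weight column from each chosen expert and producing one final output value in a single pass. Again a hedged toy sketch in numpy with invented sizes, not the actual GPU kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 8, 16, 4, 2
W_down = rng.standard_normal((n_experts, d_ff, d_model)).astype(np.float32)
h = rng.standard_normal((top_k, d_ff)).astype(np.float32)
chosen = np.array([1, 3])                  # router's top-k expert ids
gates = np.array([0.6, 0.4], np.float32)   # router combine weights

def warp_compute(j):
    """What one warp does for output element j: gather the weight
    column for each chosen expert, reduce, and write the final value
    directly -- no per-expert output buffer in between."""
    acc = np.float32(0.0)
    for i, e in enumerate(chosen):
        acc += gates[i] * np.dot(h[i], W_down[e, :, j])  # fused gather + dot
    return acc

out = np.array([warp_compute(j) for j in range(d_model)])

# Same result as the expert-centric pipeline, minus the intermediate buffers.
ref = sum(gates[i] * (h[i] @ W_down[e]) for i, e in enumerate(chosen))
print(np.allclose(out, ref))  # True
```

On a real GPU each `warp_compute(j)` would be 32 threads cooperating on the dot products, but the data flow is the point: one owner per output value, from weight gather to final write.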
By doing this, the process cuts out a lot of the in-between work. There’s no need for extra buffers or constant coordination inside the GPU. Everything becomes more direct, which helps reduce memory traffic and keeps the hardware busier with actual computation.
In testing on NVIDIA’s B200 GPUs, Cursor saw strong memory bandwidth numbers, getting close to the limits of the hardware. The gap that remains doesn’t seem to come from poor implementation, but from the way MoE models naturally access memory in unpredictable patterns.
There's also a subtle numerical benefit. Because each output value is accumulated in one place, it can be kept in higher precision from start to finish, avoiding the small rounding errors that build up when intermediate results are repeatedly rounded as they pass through the buffers of a more complex pipeline.
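The effect of accumulation precision is easy to demonstrate in isolation. This toy comparison (unrelated to Cursor's code) sums the same half-precision values two ways: rounding the running total to float16 after every add, versus accumulating in float32 and rounding once at the end:

```python
import numpy as np

rng = np.random.default_rng(0)
vals = rng.standard_normal(10_000).astype(np.float16)

# Naive: keep the running sum in half precision; every add rounds.
acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)

# Keep the running sum in float32; round to low precision only once.
acc32 = np.float32(0.0)
for v in vals:
    acc32 += np.float32(v)

exact = vals.astype(np.float64).sum()
print(abs(acc16 - exact), abs(acc32 - exact))  # float16 error is far larger
```

The repeated mid-stream rounding in the first loop is analogous to writing partial expert outputs to a low-precision buffer and re-reading them; accumulating in one high-precision register sidesteps it.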
Cursor isn’t pitching this as a replacement for existing methods. The traditional approach still works better when dealing with large batches or prefill stages. But for decoding-heavy workloads—the kind most users care about—warp decode looks like a practical way to squeeze more performance out of each GPU while keeping things a bit simpler under the hood.