The DeepSeek V3.2 Breakthrough Simplified

DeepSeek's new DeepSeek-V3.2-Exp paper presents a very clever method to speed up attention calculations.

DeepSeek Sparse Attention

Here is a basic outline of the new DeepSeek Sparse Attention (DSA) module used in their new model.

Submodule 1 - Lightning Indexer

The lightning indexer uses a small attention-like computation to create an attention mask (a specification of which tokens should "pay attention" to each other). Each row of the mask contains only the $k$ largest interactions between a given token's query and all previous tokens' keys.
Time Complexity: $O(n^2)$ (same class as attention)
Note: The speedup compared to standard attention (also $O(n^2)$) comes from using fewer attention heads and fewer dimensions for each key and query vector.
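
To make this concrete, here is a minimal PyTorch sketch of an indexer-style top-$k$ selector. The single head, the `Wq`/`Wk` projections, and the plain dot-product scoring are simplifying assumptions for illustration, not the paper's exact indexer (which uses a handful of small heads and its own scoring function).

```python
import torch

def lightning_indexer(x, Wq, Wk, k):
    """Toy sketch of a lightning-indexer-style top-k selector.

    x:  (n, d_model) token representations
    Wq: (d_model, d_index) small projection for indexer queries
    Wk: (d_model, d_index) small projection for indexer keys
    k:  number of past tokens each query may attend to

    Returns an (n, k) tensor of selected key indices per query.
    (Hypothetical single-head variant for readability.)
    """
    n = x.shape[0]
    q   = x @ Wq               # (n, d_index) cheap, low-dimensional queries
    key = x @ Wk               # (n, d_index) cheap, low-dimensional keys
    scores = q @ key.T         # (n, n) still quadratic, but tiny per entry

    # Causal mask: a token may only index tokens at or before its position.
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))

    # Keep only the k largest interactions per query row.
    topk = torch.topk(scores, k=min(k, n), dim=-1)
    return topk.indices        # (n, k) indices fed to the sparse attention step
```

For very long contexts the $(n, n)$ score matrix would normally be computed in blocks rather than materialized at once; the sketch keeps it dense for readability.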

Submodule 2 - Multi-head Latent Attention (MLA)

A large multi-head latent attention layer computes the output of the block, with one exception: for each query, it computes only the $k$ interactions selected by the indexer's attention mask, a process known as sparse attention.
Time Complexity: $O(kn)$ where $n$ is the number of tokens and $k$ is the number of interactions kept per query.
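
A sketch of the sparse step, reusing the indices from the indexer above: each query gathers only its selected keys and values, so the score matrix has shape $(n, k)$ instead of $(n, n)$. A single head is shown for brevity; this is not the actual MLA implementation, just the gather-then-attend pattern.

```python
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, topk_idx):
    """Toy sketch of attention restricted to indexer-selected keys.

    q, k, v:  (n, d) full-size query/key/value tensors (one head shown;
              the real block is a large multi-head latent attention layer)
    topk_idx: (n, top_k) per-query key indices from the lightning indexer

    Each query attends only to its top_k selected keys, so the cost is
    O(n * top_k) rather than O(n^2).
    """
    n, d = q.shape
    # Gather the selected keys/values for every query: (n, top_k, d).
    k_sel = k[topk_idx]
    v_sel = v[topk_idx]

    # Scores against only the selected keys: (n, top_k).
    scores = torch.einsum("nd,ntd->nt", q, k_sel) / d**0.5

    # Re-apply causality: drop any selected index that lies in the future
    # (can happen for early rows where fewer than top_k past tokens exist).
    pos = torch.arange(n).unsqueeze(1)          # (n, 1)
    scores = scores.masked_fill(topk_idx > pos, float("-inf"))

    weights = F.softmax(scores, dim=-1)         # (n, top_k)
    return torch.einsum("nt,ntd->nd", weights, v_sel)
```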

Why Does This Matter?

Although DeepSeek Sparse Attention technically has the same quadratic time complexity as standard attention, it reduces the time spent in the bottleneck by using a much smaller submodule to score interactions, then reusing that information to restrict the larger submodule to only the most important ones.
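
To see why the cheap quadratic pass plus the sparse main pass wins, here is a back-of-the-envelope comparison of score-computation costs. The context length, $k$, head counts, and dimensions below are illustrative assumptions, not the paper's actual configuration.

```python
# Back-of-the-envelope comparison of per-layer score computations,
# with illustrative (not paper-exact) sizes.
n        = 128_000   # context length in tokens (assumed)
k        = 2_048     # selected interactions per query (assumed)
h_full,  d_full  = 128, 128   # heads / head dim for the main attention (assumed)
h_index, d_index = 4,   32    # heads / head dim for the indexer (assumed)

dense_cost   = n * n * h_full  * d_full    # every query against every key
indexer_cost = n * n * h_index * d_index   # quadratic, but far cheaper per entry
sparse_cost  = n * k * h_full  * d_full    # main attention over selected keys only

print(f"dense   : {dense_cost:.3e}")
print(f"indexer : {indexer_cost:.3e}")
print(f"sparse  : {sparse_cost:.3e}")
print(f"speedup : {dense_cost / (indexer_cost + sparse_cost):.1f}x")
```

With these made-up numbers, the indexer's quadratic term is tiny next to dense attention, and the overall score computation drops by well over an order of magnitude.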
This kind of reused computation in attention layers reminds me of techniques such as YOCO (sharing a KV cache across layers) and Multi-Query Attention (sharing a KV cache across attention heads). What makes DeepSeek's approach different is that it is reusing information about which entries of the attention matrix are more important.