Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

by needhelp
attention-mechanism
linear-attention
transformer
llm
long-context
deep-learning

The Attention Bottleneck

The standard softmax attention mechanism at the heart of the Transformer has a fundamental problem: quadratic complexity. For a sequence of length N, attention computes N×N pairwise interactions, so doubling the context length roughly quadruples the attention cost, and long documents quickly exhaust memory and compute.

This is why models like GPT-5 and Claude Opus 4.7 have practical context limits — and why everyone is racing to find alternatives.

graph LR
    A[Input Sequence<br/>N tokens] --> B["Softmax Attention<br/>O(N²) memory"]
    B --> C[KV Cache<br/>unbounded growth]
    C --> D[Decoding Bottleneck]

    A2[Input Sequence<br/>N tokens] --> B2["Linear Attention<br/>O(N) memory"]
    B2 --> C2[Fixed-Size State<br/>constant memory]
    C2 --> D2[Efficient Decoding]

    style A fill:#ff6b6b,color:#fff
    style D fill:#ff6b6b,color:#fff
    style A2 fill:#51cf66,color:#fff
    style D2 fill:#51cf66,color:#fff

Linear attention is the leading contender. Instead of storing a full N×N attention matrix, it compresses history into a fixed-size recurrent state — like carrying a single notebook instead of a library. The sequence mixing cost drops from O(N²) to O(N), and decoding uses constant memory.
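To make the "single notebook" picture concrete, here is a minimal sketch of plain (ungated, unnormalized) linear attention decoding in dependency-free Python. The function name `linear_attention_decode` and the toy dimensions are assumptions for illustration; real implementations use tensor libraries and typically add feature maps or normalization that are omitted here.

```python
def linear_attention_decode(queries, keys, values):
    """Run unnormalized linear attention one token at a time.

    The whole history lives in a d_k x d_v state matrix S, so memory
    does not grow with the number of tokens processed.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # fixed-size recurrent state
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Write: S += k v^T (outer product of the current key and value).
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        # Read: o = S^T q, a d_v-dimensional output for this token.
        outputs.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs, S


# Toy usage: three tokens with d_k = d_v = 2 (hypothetical values).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, state = linear_attention_decode(queries, keys, values)
```

The state stays a 2×2 matrix no matter how many tokens are fed in. Note that the third token reuses the first key, so its output `[6.0, 8.0]` is the sum of both values written under that key: a purely additive write never removes anything.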

The Core Problem: Tying Erase and Write

But linear attention introduces a subtler problem: how do you edit a compressed memory?

Think of the recurrent state as a whiteboard. Each new token needs to:

  1. Erase outdated information associated with the current key
  2. Write new associations into the state

Previous models — Gated DeltaNet and Kimi Delta Attention (KDA) — use a single scalar gate to control both operations. This is like using one knob to adjust both water temperature and pressure in a shower: it works, but you cannot optimize each independently.

The critical insight of the paper: erasing old content (on the key side) and committing new content (on the value side) are fundamentally different operations that should not share a controller.
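The one-knob coupling is easy to see in code. Below is a minimal sketch of a tied-gate delta-rule step in the style of Gated DeltaNet, written from the commonly cited form of the rule rather than from any reference implementation. The function name `tied_gate_step`, the d_k × d_v state layout, and the toy numbers are assumptions.

```python
def tied_gate_step(S, k, v, alpha, beta):
    """One delta-rule update where a single scalar beta sets both erase and write.

    S is a d_k x d_v state, k a key (length d_k), v a value (length d_v).
    alpha is a scalar decay in [0, 1]; beta is the shared gate in [0, 1].
    """
    d_k, d_v = len(k), len(v)
    decayed = [[alpha * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # Value the decayed state currently returns for key k: old = S^T k.
    old = [sum(k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    # Erase beta * old and write beta * v along the same key direction.
    return [[decayed[i][j] + beta * k[i] * (v[j] - old[j]) for j in range(d_v)]
            for i in range(d_k)]


S = [[1.0, 2.0], [0.0, 0.0]]   # key [1, 0] currently maps to value [1, 2]
k, v = [1.0, 0.0], [5.0, 6.0]
full = tied_gate_step(S, k, v, alpha=1.0, beta=1.0)   # old value fully replaced
half = tied_gate_step(S, k, v, alpha=1.0, beta=0.5)   # half erased, half written
```

With `beta=1.0` the stored value becomes `[5.0, 6.0]`; with `beta=0.5` it lands exactly halfway at `[3.0, 4.0]`. No setting of `beta` clears the old value completely while committing only a small fraction of the new one.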

Gated DeltaNet-2: The Solution

Researchers from NVIDIA (Ali Hatamizadeh, Yejin Choi, Jan Kautz) introduced Gated DeltaNet-2, which separates the erase and write pathways with two independent channel-wise gates:

| Component | Symbol | Role |
|---|---|---|
| Erase gate | b_t | Controls how much old content to remove (key-side) |
| Write gate | w_t | Controls how much new content to commit (value-side) |
| Channel-wise decay | inherited from KDA | Adaptive per-channel forgetting rate |

flowchart TD
    subgraph Previous["Previous Approaches"]
        X1[Input Token] --> G1[Single Scalar Gate]
        G1 --> E1[Erase Old Content]
        G1 --> W1[Write New Content]
        E1 -.->|"tied control"| W1
    end

    subgraph GD2["Gated DeltaNet-2"]
        X2[Input Token] --> EG[Erase Gate b_t<br/>channel-wise]
        X2 --> WG[Write Gate w_t<br/>channel-wise]
        EG --> E2[Erase Old Content<br/>key-side]
        WG --> W2[Write New Content<br/>value-side]
        E2 --> S[Updated State]
        W2 --> S
    end

    style Previous fill:#ffe0e0
    style GD2 fill:#e0ffe0

This separation means the model can decide to keep old associations intact while aggressively writing new ones, or thoroughly erase stale context while only lightly updating — something impossible under the scalar-gate regime.
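As a rough illustration of what decoupling buys, the sketch below gives the erase and write paths their own channel-wise gates. This post does not reproduce the paper's exact update rule, so the form used here is an assumption chosen only to match the description above: the erase gate `b` scales the key side, the write gate `w` scales the value side, and `alpha` is a per-key-channel decay. The function name `decoupled_step` is hypothetical.

```python
def decoupled_step(S, k, v, alpha, b, w):
    """One update with separate channel-wise erase (b) and write (w) gates.

    Assumed form (not taken from the paper):
        S_new = (I - (b * k) k^T) diag(alpha) S + k (w * v)^T
    S is d_k x d_v; k, alpha, b have length d_k; v, w have length d_v.
    """
    d_k, d_v = len(k), len(v)
    # Channel-wise decay: key channel i forgets at its own rate alpha[i].
    decayed = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # Value the decayed state currently returns for key k.
    old = [sum(k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    # Erase is gated per key channel, write is gated per value channel.
    return [[decayed[i][j] - b[i] * k[i] * old[j] + k[i] * w[j] * v[j]
             for j in range(d_v)] for i in range(d_k)]


S = [[1.0, 2.0], [0.0, 0.0]]   # key [1, 0] currently maps to value [1, 2]
k, v = [1.0, 0.0], [5.0, 6.0]
no_decay = [1.0, 1.0]

# Keep the old association intact while writing the new value at full strength.
keep_and_write = decoupled_step(S, k, v, no_decay, b=[0.0, 0.0], w=[1.0, 1.0])
# Erase the old association completely while writing only lightly.
erase_light_write = decoupled_step(S, k, v, no_decay, b=[1.0, 1.0], w=[0.25, 0.25])
# Setting both gates to the same scalar recovers the tied behaviour.
collapsed = decoupled_step(S, k, v, no_decay, b=[0.5, 0.5], w=[0.5, 0.5])
```

Under this assumed form, `keep_and_write` stores `[6.0, 8.0]` for the key (old plus new), `erase_light_write` stores `[1.25, 1.5]` (old value gone, a quarter of the new one committed), and `collapsed` stores `[3.0, 4.0]`, the same halfway point a single scalar gate of 0.5 would give.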

Generalization Hierarchy

Gated DeltaNet-2 generalizes the prior art:

  • KDA = Gated DeltaNet-2 when both b_t and w_t collapse to the same scalar
  • Gated DeltaNet = KDA when the channel-wise decay also collapses to a scalar
  • DeltaNet = the original, without gating

This means Gated DeltaNet-2 can express any behavior of its predecessors while adding capabilities they fundamentally lack.
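Written out under one assumed parameterization, the hierarchy looks as follows. The first line is a guess consistent with the description in this post, not the paper's own equation; the remaining lines follow the commonly cited forms of KDA, Gated DeltaNet, and DeltaNet, with state $S_t \in \mathbb{R}^{d_k \times d_v}$, channel-wise decay $\alpha_t$, scalar decay $a_t$, scalar gate $\beta_t$, and $\odot$ for the elementwise product.

```latex
\begin{align*}
\text{Gated DeltaNet-2 (assumed form):}\quad
  S_t &= \bigl(I - (b_t \odot k_t)\,k_t^\top\bigr)\,\mathrm{Diag}(\alpha_t)\,S_{t-1}
         + k_t\,(w_t \odot v_t)^\top \\
\text{KDA } (b_t = \beta_t \mathbf{1},\; w_t = \beta_t \mathbf{1}):\quad
  S_t &= \bigl(I - \beta_t\,k_t k_t^\top\bigr)\,\mathrm{Diag}(\alpha_t)\,S_{t-1}
         + \beta_t\,k_t v_t^\top \\
\text{Gated DeltaNet } (\alpha_t = a_t \mathbf{1}):\quad
  S_t &= a_t\,\bigl(I - \beta_t\,k_t k_t^\top\bigr)\,S_{t-1}
         + \beta_t\,k_t v_t^\top \\
\text{DeltaNet } (a_t = 1):\quad
  S_t &= \bigl(I - \beta_t\,k_t k_t^\top\bigr)\,S_{t-1}
         + \beta_t\,k_t v_t^\top
\end{align*}
```

Each line is obtained from the one above it by tying or removing a degree of freedom, which is what lets the most general model reproduce every predecessor.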

Technical Innovations

Beyond the architectural contribution, the paper introduces three key technical advances for practical training:

1. Chunkwise WY Algorithm

Training on long sequences requires chunking for parallelism. The team derived a chunkwise formulation that absorbs channel-wise decay into asymmetric erase factors, enabling efficient parallel training without losing the channel-wise dynamics.
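The general chunking idea can be sketched on a much simpler model. The code below is plain linear attention, not the paper's chunkwise WY algorithm: it omits the delta-rule erase, the channel-wise decay, and the WY representation entirely. It only shows the split that chunkwise methods rely on, where tokens inside a chunk are handled with attention-style scores and everything before the chunk is read from a carried state. All function names and toy inputs are assumptions.

```python
def sequential_outputs(Q, K, V):
    """Reference: update the state and read it once per token."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


def chunkwise_outputs(Q, K, V, chunk_size):
    """Same outputs, but the state is only updated once per chunk."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # state carried across chunks
    out = []
    for start in range(0, len(Q), chunk_size):
        q_c, k_c, v_c = (X[start:start + chunk_size] for X in (Q, K, V))
        for t, q in enumerate(q_c):
            # Inter-chunk part: read everything before this chunk from S.
            o = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
            # Intra-chunk part: causal scores against keys inside the chunk.
            for s in range(t + 1):
                score = sum(q[i] * k_c[s][i] for i in range(d_k))
                for j in range(d_v):
                    o[j] += score * v_c[s][j]
            out.append(o)
        # Fold the finished chunk into the state in one pass.
        for k, v in zip(k_c, v_c):
            for i in range(d_k):
                for j in range(d_v):
                    S[i][j] += k[i] * v[j]
    return out


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = sequential_outputs(Q, K, V)
chunked = chunkwise_outputs(Q, K, V, chunk_size=2)
```

In a real kernel the intra-chunk loop becomes a masked matrix multiply, which is where the parallelism comes from. The difficulty the paper addresses is keeping this split valid once erase terms and per-channel decay enter the recurrence.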

2. Gate-Aware Backward Pass

Standard backpropagation through gating mechanisms can be numerically unstable. The gate-aware backward pass preserves gradient flow through the independent erase and write gates, enabling stable training at scale.

3. Fast-Weight Update View

The update rule is reformulated as a fast-weight system, revealing connections to Hebbian learning and meta-learning that were obscured in prior DeltaNet formulations.
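One well-known instance of the fast-weight reading, from the earlier DeltaNet literature rather than from this paper's specific formulation, is that the ungated delta rule equals one gradient-descent step on a recall loss, with the state acting as the weights of a tiny linear layer. The sketch below checks this numerically; the function names, the finite-difference gradient, and the toy values are assumptions for illustration.

```python
def recall_loss(S, k, v):
    """0.5 * ||S^T k - v||^2: how badly the state recalls v when queried with k."""
    d_k, d_v = len(k), len(v)
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return 0.5 * sum((pred[j] - v[j]) ** 2 for j in range(d_v))


def sgd_step(S, k, v, lr, eps=1e-4):
    """One gradient-descent step on recall_loss, using central differences."""
    d_k, d_v = len(k), len(v)
    new_S = [row[:] for row in S]
    for i in range(d_k):
        for j in range(d_v):
            plus = [row[:] for row in S]
            minus = [row[:] for row in S]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad = (recall_loss(plus, k, v) - recall_loss(minus, k, v)) / (2 * eps)
            new_S[i][j] -= lr * grad
    return new_S


def delta_rule_step(S, k, v, beta):
    """Ungated delta rule: S + beta * k (v - S^T k)^T."""
    d_k, d_v = len(k), len(v)
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return [[S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
            for i in range(d_k)]


S = [[0.5, -1.0], [2.0, 0.25]]
k, v = [0.6, 0.8], [1.0, -2.0]
by_sgd = sgd_step(S, k, v, lr=0.3)
by_delta = delta_rule_step(S, k, v, beta=0.3)
# Largest elementwise difference between the two updated states.
max_gap = max(abs(a - b) for ra, rb in zip(by_sgd, by_delta) for a, b in zip(ra, rb))
```

The two updates agree up to floating-point error, with the gate `beta` playing the role of a learning rate. Seen this way, separate erase and write gates amount to giving the model finer control over how each online update is applied.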

Experimental Results

At the 1.3B-parameter scale, with models trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 was evaluated against:

  • Mamba-2
  • Gated DeltaNet
  • Kimi Delta Attention (KDA)
  • Mamba-3 variants

Language Modeling & Reasoning

| Benchmark | Mamba-2 | Gated DeltaNet | KDA | Mamba-3 | GDN-2 |
|---|---|---|---|---|---|
| Language Modeling PPL | baseline | improved | improved | improved | best |
| Commonsense Reasoning | baseline | competitive | competitive | competitive | best |
| Multi-key Retrieval | weak | moderate | moderate | moderate | strongest |

The Killer Benchmark: RULER Needle-in-a-Haystack

This is where Gated DeltaNet-2 truly shines. The RULER benchmark tests a model’s ability to find specific information buried in extremely long contexts — like finding a single needle in a haystack the size of a football field.

Gated DeltaNet-2 achieves the strongest overall results on these long-context retrieval tasks, with particularly dramatic improvements on the evaluated multi-key retrieval setting — where the model must find and associate multiple scattered facts.

xychart-beta
    title "Long-Context Retrieval Performance (RULER)"
    x-axis ["Mamba-2", "Gated DeltaNet", "KDA", "Mamba-3", "GDN-2"]
    y-axis "Accuracy (%)" 0 --> 100
    bar [62, 71, 74, 69, 88]

Chart: Illustrative comparison based on reported RULER benchmark trends; the bar heights are not measured values. Gated DeltaNet-2 shows a significant jump over all baselines.

Why This Matters

The implications extend beyond academic benchmarks:

  1. LLM Inference Costs: Linear attention with O(1) decoding memory means cheaper API calls for long conversations and document processing
  2. Retrieval-Augmented Generation: Better multi-key retrieval directly improves RAG systems that need to synthesize information from multiple document sections
  3. On-Device AI: Fixed-size state enables running capable models on memory-constrained devices
  4. Scientific Literature Processing: Models can effectively process entire papers, patents, or legal documents without summarization tricks
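The inference-cost and on-device points come down to simple arithmetic. The sketch below compares the memory of a growing KV cache with that of a fixed recurrent state. The model shape (24 layers, 16 heads, head dimension 128), the 2-byte values, and the absence of cache-saving tricks such as grouped-query attention are all assumptions chosen for illustration, not figures from the paper.

```python
def kv_cache_bytes(n_tokens, n_layers, n_heads, head_dim, bytes_per_value=2):
    # Keys and values (factor 2) are stored for every token, layer, and head.
    return 2 * n_tokens * n_layers * n_heads * head_dim * bytes_per_value


def recurrent_state_bytes(n_layers, n_heads, d_k, d_v, bytes_per_value=2):
    # One d_k x d_v matrix per head and layer, independent of sequence length.
    return n_layers * n_heads * d_k * d_v * bytes_per_value


config = dict(n_layers=24, n_heads=16)  # hypothetical model shape
state = recurrent_state_bytes(d_k=128, d_v=128, **config)
for n in (4_096, 131_072):
    kv = kv_cache_bytes(n, head_dim=128, **config)
    print(f"{n:>7} tokens: KV cache {kv / 2**20:.0f} MiB, state {state / 2**20:.0f} MiB")
```

For this hypothetical shape the KV cache needs 768 MiB at 4,096 tokens and 24 GiB at 131,072 tokens, while the recurrent state stays at 12 MiB throughout.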

Code and Reproducibility

The implementation is open source on GitHub at NVlabs/GatedDeltaNet-2, which has already garnered 12,300+ stars. The repository includes pre-trained checkpoints, training scripts, and evaluation harness code.

Paper: arXiv:2605.22791

Looking Forward

The era of softmax attention’s dominance may be drawing to a close. As linear attention architectures mature — with innovations like independent erase/write gating, channel-wise decay, and chunkwise training — we are approaching the point where the O(N²) tax on Transformers is no longer a necessary cost of doing business.

Gated DeltaNet-2 shows that careful architectural design, rather than brute-force scaling, can unlock dramatic improvements in how efficiently LLMs process long contexts. The next challenge: scaling these architectures to the 70B+ parameter range while maintaining their efficiency advantage.
