The Plasticity Crisis in Continual Deep Learning
Article 2 of 12 | [CL] | Anchor papers: Dohare et al. Nature 2024 · Lewandowski et al. Spectral Reg · Lewandowski et al. Fourier · Tang et al. C-CHAIN | Series: Continual Intelligence
Article 1: The Benchmark Gap in Continual RL: From Continual World to SPIRAL
In August 2024, a paper in Nature demonstrated empirically something the deep learning community had long suspected but never formally established: the longer you train a deep network on sequential tasks, the worse it gets at learning new ones. Not “worse at remembering old things” — worse at learning new things. The capacity for learning itself decays. Here is what that means and what, if anything, can be done about it.
§1 — What Plasticity Loss Is (and Isn’t)
Plasticity loss is not catastrophic forgetting. This distinction is not semantic — it determines what intervention is needed, and confusing the two leads researchers to apply the wrong solution.
Catastrophic forgetting is about the past. When a network trained on task A then trains on task B, the weights that encoded task A’s solution are overwritten. Ask the network about task A afterward and performance has dropped. The mechanism is weight interference: gradient updates for task B move the parameter vector away from the region that solved task A.
Plasticity loss is about the future. The network’s ability to learn new tasks — tasks it has never seen — degrades as a function of how many previous tasks it has been trained on. The problem is not what was overwritten. The problem is that the optimization landscape has flattened. Gradient steps become smaller, less informative, and less diverse. The network’s effective learning rate falls even if the actual learning rate is held constant.
A network suffering catastrophic forgetting still learns new things quickly — it just forgets old things. A network suffering plasticity loss has trouble learning anything new. It plateaus earlier, at lower performance, on each successive task. The performance ceiling falls.
Three proximate mechanisms cause this (each explored in §3):
Dead neurons: ReLU units whose inputs are consistently negative stop firing and stop receiving gradient. They become permanently inert.
Rank collapse: The effective dimensionality of the weight matrix shrinks. The network loses the representational diversity needed to fit new tasks.
NTK rank collapse: The neural tangent kernel — the matrix describing how outputs change under gradient steps — loses rank. Gradient updates become degenerate: small, co-linear, and directionally redundant.
These three mechanisms are correlated but distinct. They manifest at different timescales, affect different network components, and require different interventions. Conflating them is the most common error in the plasticity literature.
[← A10: “Existing benchmarks rarely run agents long enough to reveal plasticity loss — task sequences of 10–50 are standard; the phenomenon manifests at task sequences of 500 or more.”]
Key Takeaway: Plasticity loss is not forgetting. Forgetting is about the past; plasticity loss is about the future. A network can remember everything it learned and still lose the ability to learn new things. The interventions for each are fundamentally different.
§2 — The Nature Paper: What It Demonstrated
Dohare et al. (2024, Nature) is a landmark result for three reasons, and all three are worth stating explicitly.
First, it is the first rigorous empirical demonstration of plasticity loss at scale. Prior work showed that some networks sometimes had trouble learning new tasks after extended training, but the relationship between training duration and learning capacity had not been systematically characterized across architectures, optimizers, and activation functions. Dohare et al. did exactly this, and the result is unambiguous.
Second, it appeared in Nature — one of the world’s highest-profile general science journals. AI papers rarely appear there. This publication choice was a signal: the editors judged that plasticity loss is a finding of sufficient generality and importance to merit attention from the entire scientific community, not just the machine learning subcommunity. The field should register this signal.
Third, it produced the key practical negative result: gradient descent alone cannot maintain plasticity. This is not a claim about specific architectures or learning rates. It is a claim about the fundamental optimization algorithm that underlies all of deep learning. Stochastic gradient descent, Adam, RMSProp, Adagrad — none of them prevent plasticity loss when applied to sequential task training. This is not a hyperparameter problem.
The experimental setup: ImageNet binary classification tasks, presented sequentially. Each “task” is a binary decision: does this image belong to one of two categories? The categories change from task to task, creating a sequence of 2,000 binary classification problems. The measure: accuracy on each new binary classification task after training on all previous tasks.
The results: accuracy drops from approximately 89% on early tasks to approximately 77% by the 2,000th sequential task. The 77% figure is significant: it is the accuracy of a linear (shallow) network on the same task distribution (Dohare et al., arXiv:2306.13812). The deep network, after 2,000 sequential tasks, learns no better than a network with no hidden layers. It has lost the capacity for deep feature learning. The collapse is monotonic — the more tasks, the lower the performance ceiling — and it shows no sign of stabilizing.
Dohare et al. tested whether standard regularization and normalization techniques prevent this. Layer normalization: does not prevent it. Dropout: does not prevent it. L2 weight regularization: reduces the rate of collapse but does not stop it. Batch normalization: does not prevent it. Architectural depth and width variations: change the rate but not the qualitative result.
The only intervention that worked: continual backpropagation. The technique is conceptually simple: after each learning step, randomly reinitialize a small fraction (ρ, a tunable replacement rate) of the least-used neurons. The criterion for “least-used” is the neuron’s contribution to the network’s output — neurons whose activation statistics have become very small are reset to their initialization values.
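A minimal sketch of this reset step for a single fully connected layer — assuming a PyTorch nn.Linear and an externally maintained utility score per unit; the paper’s utility measure and update schedule are more refined than this:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def continual_backprop_reset(layer: nn.Linear, utility: torch.Tensor, rho: float = 1e-4):
    """Reinitialize the fraction rho of least-used units in `layer`.

    `utility` holds one score per output unit (e.g. a running mean of
    |activation|); the lowest-utility units are reset to fresh initial
    values. A simplified sketch of the continual backpropagation idea.
    """
    n_reset = int(rho * layer.out_features)
    if n_reset == 0:
        # The full algorithm keeps fractional accumulators so that a tiny rho
        # still resets units over time; this sketch simply skips the step.
        return torch.empty(0, dtype=torch.long)
    reset_idx = torch.argsort(utility)[:n_reset]           # least-used units
    fresh = torch.empty(n_reset, layer.in_features, device=layer.weight.device)
    nn.init.kaiming_uniform_(fresh)                         # fresh incoming weights
    layer.weight[reset_idx] = fresh
    if layer.bias is not None:
        layer.bias[reset_idx] = 0.0
    # In the full algorithm the *outgoing* weights of reset units (in the next
    # layer) are zeroed so the reset does not disturb the network's current output.
    return reset_idx
```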
Why does random reinit work when careful gradient optimization does not? Because plasticity loss is, at its root, a diversity loss problem. The network’s optimization landscape collapses into a low-dimensional manifold. Gradient descent, by design, moves along the gradient — it cannot spontaneously move perpendicular to the gradient to explore new regions of the loss landscape. Random reinit does exactly this: it injects stochasticity that is orthogonal to the current gradient direction, repopulating the optimization landscape with fresh starting points.
The punchline of the Nature paper: “Methods based on gradient descent are not enough — sustained deep learning requires a random, non-gradient component to maintain variability and plasticity.” This sentence has profound implications for every pipeline that fine-tunes models on sequential data streams. All of them.
Figure 1 — Dead Neuron Accumulation Across 2,000 Sequential Tasks (Dohare et al., Nature 2024)
% DEAD
NEURONS
│
80% ┤· · · · · · · · · · · · · · · · · · · · · · · · · · ·▓▓▓▓▓
│ ▓▓▓▓▓▓▓▓
│· · · · · · · · · · · · · · · · ▓▓▓▓▓▓▓▓▓▓▓▓
60% ┤ ▓▓▓▓▓▓▓▓▓▓ · · · · Baseline (ReLU)
│ ▓▓▓▓▓▓▓▓▓▓▓▓ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ L2 Reg + perturb
│ ▓▓▓▓▓▓▓▓▓▓ ─ ─ ─ ─ ─ ─ ─ ─ ─ Layer Norm
40% ┤ ▓ ─ ─ ─ ─ ─ ─ ═════ Dropout
│ ════════════════════════════════════════════
20% ┤ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ═════ Cont. Backprop
│═════════════════════════════════════════════════════════════
0% ┤═══════════════════════════════════════════════════════════
└────────────┬──────────────┬──────────────┬──────────────┬────
0 500 1,000 2,000
Sequential Tasks
│◄─────────────────────────────────────────►│
~Task 2000: Baseline reaches shallow-network
equivalence. Accuracy collapses to ~77% —
identical to a zero-hidden-layer network.
Figure 1: Dead neuron accumulation across 2,000 sequential tasks (Dohare et al., 2024). The baseline ReLU network reaches shallow-network-equivalent performance by task 2000. Every gradient-based method (L2, Layer Norm, Dropout) slows the collapse but cannot stop it. Continual backpropagation — random reinit of ρ fraction of least-used units — is the only method that maintains near-zero dead neurons indefinitely.
The surprising framing: prior deep learning intuition held that gradient descent with enough data produces good learning. The Nature result reverses this for sequential settings. More sequential tasks → more entrenched optimization landscape → lower learning ceiling. The problem is not data quantity but data ordering. A network trained on one million images simultaneously learns well. The same network trained on one million images sequentially — 1,000 tasks of 1,000 images each — loses most of its capacity to learn. Sequence is the problem, and standard optimization has no answer for it.
Key Takeaway: Gradient descent cannot maintain plasticity. The Nature paper demonstrated this empirically at scale across architectures, optimizers, and regularization techniques. A non-gradient component — random diversity injection — is necessary. This is the most important negative result in continual learning since catastrophic forgetting was named.
§3 — The Mechanisms: Three Ways Networks Stop Learning
There are three distinct mechanisms of plasticity loss. Conflating them leads researchers to apply the wrong intervention. Each mechanism has a specific fingerprint, a specific cause, and a specific remedy.
Mechanism 1: Dead Neurons
A ReLU (rectified linear unit) activation produces zero output whenever its input is negative. When a neuron’s input is consistently negative across the training data distribution, the neuron fires zero, contributes zero to the forward pass, and receives zero gradient from the backward pass. It is permanently dead. Its weights never update. It contributes nothing to the network’s representational capacity.
In standard single-task training, dead neurons are not catastrophic — there are enough live neurons to carry the task. But in sequential training, dead neurons accumulate. Each new task pushes the weight distribution in a direction that deactivates some neurons for the current task. Across thousands of tasks, the cumulative fraction of dead neurons grows monotonically until the network’s effective capacity is far below its nominal capacity.
Abbas et al. (2023) were among the first to characterize this at scale in a continual reinforcement learning (CRL) setting, using Atari games as sequential tasks. They documented what they called the “sparse activation footprint” problem: as training proceeds across games, the proportion of active neurons decreases, gradients become sparser, and learning slows. Their proposed fix: CReLU (concatenated ReLU) activation — computing both max(0, x) and max(0, -x) for each unit, ensuring that some gradient always flows. CReLU is effective for simple architectures. It does not scale cleanly to residual networks, where the skip connections alter the gradient flow dynamics.
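Because CReLU is defined directly by the max(0, x) / max(0, −x) pair above, it is a one-line module; a sketch (note that the output width doubles, so the following layer must be sized accordingly):

```python
import torch
import torch.nn as nn

class CReLU(nn.Module):
    """Concatenated ReLU: outputs [relu(x), relu(-x)].

    Every unit always has a nonzero-gradient path through one of the two
    halves, so no unit can go fully dead. Output width is 2x the input width.
    """
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([torch.relu(x), torch.relu(-x)], dim=-1)
```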
A more recent and more powerful diagnostic comes from Liu et al. (2025), who introduced GraMa (Gradient Magnitude metric). GraMa’s key insight: activation statistics are an insufficient diagnostic for neuron health in complex architectures. The τ-dormant metric (Sokar et al., 2023), which identifies dead neurons by their activation magnitude, works correctly in simple MLP architectures where activation magnitude correlates with learning contribution. In residual networks, it fails: skip connections mean that neurons with near-zero activation magnitude can still have significant gradient flow, and neurons with high activation magnitude can be contributing near-zero learning signal if their gradients are pointing in degenerate directions.
GraMa measures gradient magnitude directly, not activation magnitude. In comparative experiments on residual networks and diffusion models, GraMa identifies significantly more learning-impaired neurons than τ-dormant — demonstrating that the activation-based diagnostic was systematically underestimating the dead neuron problem in exactly the architectures that the field is most interested in deploying.
ReGraMa (gradient-guided reset, Liu et al., 2025) extends GraMa into an intervention: reset neurons identified as learning-impaired by their gradient statistics rather than their activation statistics. In experiments on residual networks and diffusion models, ReGraMa outperforms τ-dormant-guided reset on both metrics.
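A sketch of the two diagnostics side by side for a single linear layer — the activation-magnitude signal that τ-dormant-style metrics use and the gradient-magnitude signal that GraMa-style metrics use. The quantities and normalization here are illustrative, not the papers’ exact definitions:

```python
import torch
import torch.nn as nn

def neuron_health(model: nn.Module, layer: nn.Linear, x: torch.Tensor,
                  y: torch.Tensor, loss_fn) -> dict:
    """Per-unit health scores for one linear layer inside `model`.

    Returns an activation-based signal (tau-dormant-style: mean |output|)
    and a gradient-based signal (GraMa-style: incoming weight-gradient norm).
    """
    acts = {}
    hook = layer.register_forward_hook(lambda m, inp, out: acts.update(out=out.detach()))
    model.zero_grad()
    loss = loss_fn(model(x), y)      # one forward/backward pass on a probe batch
    loss.backward()
    hook.remove()

    act_score = acts["out"].abs().mean(dim=0)     # per-unit mean |activation|
    grad_score = layer.weight.grad.norm(dim=1)    # per-unit gradient magnitude
    # Scale each signal by its layer mean so units are flagged relative to
    # their own layer rather than by an absolute threshold.
    return {
        "activation": act_score / (act_score.mean() + 1e-8),
        "gradient": grad_score / (grad_score.mean() + 1e-8),
    }
```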
Figure 2 — Gradient vs. Activation Statistics: Why the Diagnostic Matters (Liu et al. — GraMa / ReGraMa, 2025)
══════════════════════════════════════════════════════════════════════════
SIMPLE MLP ARCHITECTURE DEEP RESIDUAL NETWORK
══════════════════════════════════════════════════════════════════════════
% Inactive neurons by layer: % Inactive neurons by layer:
τ-dormant (activation) τ-dormant (activation)
50% ┤ 50% ┤
40% ┤ ▓ ▓ 40% ┤ ▓
30% ┤ ▓ ▓ ▓ 30% ┤ ▓ ▓ ▓
20% ┤ ▓ ▓ ▓ ▓ 20% ┤ ▓ ▓ ▓ ▓
0% └──L1─L2─L3─L4── 0% └──L1─L2─L3─L4──
GraMa (gradient magnitude) GraMa (gradient magnitude)
50% ┤ 50% ┤ ▓▓ ▓▓
40% ┤ ▓ ▓ 40% ┤ ▓▓ ▓▓ ▓▓ ▓▓
30% ┤ ▓ ▓ ▓ 30% ┤ ▓▓ ▓▓ ▓▓ ▓▓
20% ┤ ▓ ▓ ▓ ▓ 20% ┤ ▓▓ ▓▓ ▓▓ ▓▓
0% └──L1─L2─L3─L4── 0% └──L1─L2─L3─L4──
──────────────────────────────────────────────────────────────────────────
RESULT: τ-dormant ≈ GraMa RESULT: GraMa reveals 2–3×
Metrics agree in shallow nets. more inactive neurons than
τ-dormant in deep residual nets.
Skip connections create high-
activation / zero-gradient units
that τ-dormant misses entirely.
══════════════════════════════════════════════════════════════════════════
Figure 2: In simple MLP architectures, activation metrics and gradient metrics agree. In deep residual networks — the architectures used in practice — activation metrics (τ-dormant) dramatically underestimate neuron inactivity. GraMa (gradient magnitude) is the correct diagnostic signal. Using τ-dormant on a ResNet or diffusion model understates the plasticity problem by a factor of 2–3×.
Mechanism 2: Rank Collapse
The effective rank of a weight matrix measures how many linearly independent directions that matrix can represent. A weight matrix of size 1000×1000 could, in principle, represent 1000 independent directions. In practice, after sequential training, the effective rank collapses: the weight updates tend to align along a shrinking subspace, and the matrix represents far fewer independent directions than its nominal dimensionality suggests.
Why does this happen? Each gradient update is computed for the current task and moves the weight matrix in a direction that reduces loss on current data. Across many tasks, these updates share structure — they all converge on directions that are broadly useful — while the directions specific to individual tasks get crowded out. The weight matrix becomes a compressed summary of the most frequent gradients, losing the low-probability directions that new, novel tasks would require.
Lewandowski et al. addressed rank collapse directly with spectral regularization. The key insight: the singular value decomposition of a weight matrix reveals its effective dimensionality. At initialization, weight matrices have a roughly flat singular value spectrum — many singular values of similar magnitude, indicating high representational diversity. After sequential training, the spectrum becomes “spiky” — a few large singular values dominate, the rest shrink to near-zero. This spiky spectrum is the signature of rank collapse.
Spectral regularization maintains the singular value spectrum near its initialization distribution. The specific constraint: keep the maximum singular value close to 1, its value at initialization. This prevents the dominant directions from growing while keeping the minor directions from decaying. In experiments across CIFAR-10, CIFAR-100, and TinyImageNet sequential learning, spectral regularization maintains gradient diversity across tasks and substantially reduces performance degradation compared to L2 regularization — and it is less sensitive to hyperparameter choices, making it more robust in practice.
The theoretical grounding: singular values at initialization are the key trainability signal. A matrix with singular values near 1 has gradients that flow freely through it; a matrix with one very large singular value and many near-zero singular values has gradients that collapse onto a single dimension. Spectral regularization is directly targeting the structure responsible for trainability, not a proxy for it.
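A minimal sketch of a regularizer in this spirit: penalize each weight matrix’s largest singular value for drifting away from 1. A full SVD is used for clarity (power iteration is the cheaper choice in practice), and the exact penalty in Lewandowski et al. may differ:

```python
import torch

def spectral_penalty(model: torch.nn.Module, target: float = 1.0) -> torch.Tensor:
    """Penalize drift of each weight matrix's largest singular value from `target`.

    Keeping sigma_max near its value at initialization (~1) discourages the
    'spiky' spectrum that is the signature of rank collapse.
    """
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if p.dim() == 2:                                  # weight matrices only
            sigma_max = torch.linalg.svdvals(p)[0]        # singular values, descending
            penalty = penalty + (sigma_max - target) ** 2
    return penalty

# Used as: loss = task_loss + lam * spectral_penalty(model)
```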
Mechanism 3: NTK Rank Collapse
The neural tangent kernel (NTK) describes how a neural network’s outputs change in response to gradient updates. More precisely, the NTK matrix K is defined such that K_{ij} = ∇_θ f(x_i)^T ∇_θ f(x_j), where f is the network output and θ are the parameters. The rank of K determines how many independent directions the gradient can push the network’s outputs — high rank means diverse, informative updates; low rank means degenerate, co-linear updates.
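For small probe batches, the empirical NTK can be computed directly from that definition; a sketch, assuming a scalar-output model in which every parameter participates in the forward pass (diagnostic use only — the cost is quadratic in batch size):

```python
import torch

def empirical_ntk(model: torch.nn.Module, xs: torch.Tensor) -> torch.Tensor:
    """Empirical NTK on a probe batch: K[i, j] = grad_theta f(x_i) . grad_theta f(x_j)."""
    params = [p for p in model.parameters() if p.requires_grad]
    rows = []
    for x in xs:
        out = model(x.unsqueeze(0)).squeeze()             # scalar output f(x_i)
        grads = torch.autograd.grad(out, params)          # gradient w.r.t. all parameters
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    G = torch.stack(rows)                                 # [n, n_params]
    return G @ G.T                                        # [n, n] tangent kernel

def effective_rank(K: torch.Tensor, tol: float = 1e-6) -> int:
    """Count of eigenvalues above tol * the largest eigenvalue."""
    eig = torch.linalg.eigvalsh(K)
    return int((eig > tol * eig.max()).sum())
```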
NTK rank collapse is categorically different from weight rank collapse. Weight rank collapse affects the network’s representations — what it can express. NTK rank collapse affects the network’s dynamics — how it learns. A network can have high weight rank (diverse representations) and simultaneously have low NTK rank (degenerate learning dynamics). The two mechanisms require different interventions.
Tang, Obando-Ceron et al. (2025) formalized the connection between NTK rank collapse and a phenomenon they call churn: the tendency of a network’s outputs to change unpredictably for inputs not in the current training batch. When the NTK has collapsed, gradient updates for current-batch inputs have outsized effects on out-of-batch inputs — the network is essentially interpolating poorly across the input space, producing unstable predictions for anything it is not actively training on.
The connection is: plasticity loss = NTK rank collapse → churn amplification. As the NTK loses rank, updates become more degenerate, predictions for out-of-batch inputs become more volatile, and the network’s effective learning rate falls because its gradient directions no longer span the loss landscape.
Their intervention, C-CHAIN (Continual CHange-Aware INterference reduction), reduces output variability for out-of-batch data by explicitly regularizing the change in outputs across training steps. This prevents the churn amplification cycle and, by maintaining NTK rank, maintains effective learning rate across tasks.
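The exact formulation is in Tang et al.; as a generic illustration of the underlying idea, an auxiliary penalty on how much the outputs for a fixed held-out reference batch move between consecutive updates might look like this (the weighting and bookkeeping are assumptions, not the authors’ method):

```python
import torch

def churn_penalty(model, ref_x: torch.Tensor, ref_out_prev: torch.Tensor) -> torch.Tensor:
    """Penalty on output change for a reference batch that is NOT in the
    current training batch. `ref_out_prev` is the reference output recorded
    (without gradients) after the previous update step.
    """
    return (model(ref_x) - ref_out_prev).pow(2).mean()

# Training-loop usage (sketch):
#   loss = task_loss + beta * churn_penalty(model, ref_x, ref_out_prev)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   with torch.no_grad():
#       ref_out_prev = model(ref_x)     # refresh the reference after the step
```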
C-CHAIN was tested across four benchmark suites: OpenAI Gym Control tasks, ProcGen procedurally generated games, DMControl continuous control, and MinAtar (a miniaturized Atari suite). Across all four, C-CHAIN maintained higher sample efficiency across the task sequence compared to standard PPO and SAC baselines, with the advantage growing over the course of training — consistent with the mechanism: the intervention prevents NTK collapse from accumulating.
Figure 3 — Three Mechanisms of Plasticity Loss and Their Interventions
═══════════════════════════════════════════════════════════════════════════
MECHANISM 1 MECHANISM 2 MECHANISM 3
Dead Neurons Rank Collapse NTK Degeneration
─────────────────────────────────────────────────────────────────────────
ReLU: max(0, x) Singular value spectrum: Gradient directions:
input x < 0 → dead EARLY TRAINING EARLY TRAINING
σ: ▓ ▓ ▓ ▓ ▓ ▓ ▓ ▓ ↗ ↑ ↖ ←
○─────●─────○ flat, rank = k ↙ · ↘ →
│[DEAD] ↓ ↗ ↑ ↖
× grad=0 LATE TRAINING diverse, span space
│[DEAD] σ: ▓▓▓▓▓▓▓▓
× grad=0 σ₂: ▒ LATE TRAINING
│[LIVE]✓ σ₃: · → → → →
│ σ₄: · ← rank ≈ 1 → → → →
updates collapse collinear, rank ≈ 1
Dead units freeze. onto 1-2 directions. degenerate updates.
Capacity shrinks. Diversity lost. Learning rate falls.
─────────────────────────────────────────────────────────────────────────
FINGERPRINT FINGERPRINT FINGERPRINT
Sparse activations Spiky singular value Rising churn
(GraMa, not τ-dormant) spectrum (SVD rank ratio) (out-of-batch Δ)
─────────────────────────────────────────────────────────────────────────
INTERVENTION INTERVENTION INTERVENTION
CReLU (Abbas 2023) Spectral regularization C-CHAIN
ReGraMa reset keep σ_max ≈ 1 (Tang et al. 2025)
(Liu et al. 2025) (Lewandowski et al.) reduce output Δ
for out-of-batch x
═══════════════════════════════════════════════════════════════════════════
WARNING: The mechanisms are correlated but NOT interchangeable.
Applying continual backpropagation (dead-neuron fix) to a rank-collapse
problem provides only indirect benefit. Match the fix to the fingerprint.
═══════════════════════════════════════════════════════════════════════════
Figure 3: Three distinct mechanisms of plasticity loss, their diagnostic fingerprints, and their matched interventions. Each column is a separate failure mode with a separate cause. Misidentifying the mechanism is the most common implementation error in the plasticity mitigation literature.
Key Takeaway: Three mechanisms, three fingerprints, three interventions. Dead neurons: measure with GraMa, fix with ReGraMa or CReLU. Rank collapse: measure with singular value spectrum, fix with spectral regularization. NTK degeneration: measure with churn, fix with C-CHAIN. Mismatching mechanism to intervention is the most common error in plasticity loss mitigation.
§4 — The Solution Hierarchy: Training Tricks to Architecture
Solutions to plasticity loss exist at three levels of the system. Each level attacks a deeper cause. The progression is not arbitrary.
Level 1: Training Tricks (Partial Fix)
Training-level interventions address the symptoms of plasticity loss without altering the fundamental structure of the network or its optimization trajectory.
L2 regularization + weight perturbation (Dohare et al., arXiv:2306.13812): Adding L2 regularization keeps weights from drifting far from initialization, limiting the degree of rank collapse. Adding weight perturbation (small random noise) injects diversity into the parameter space. Together, these “substantially ease but don’t solve” plasticity loss — the Nature paper’s phrasing. The collapse is slowed, not prevented.
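A sketch of this combination as a post-step hook, regularizing toward the initial weights (consistent with the description above; standard L2 toward zero is the other common variant) and adding small Gaussian noise; the coefficients and names are illustrative:

```python
import torch

@torch.no_grad()
def shrink_toward_init_and_perturb(model, init_params, shrink: float = 1e-4,
                                   noise_std: float = 1e-5):
    """After each optimizer step: pull weights slightly back toward their
    initial values (the L2-style term) and add small Gaussian noise (the
    perturbation that injects diversity).
    """
    for p, p0 in zip(model.parameters(), init_params):
        p.add_(shrink * (p0 - p))                  # shrink toward initialization
        p.add_(noise_std * torch.randn_like(p))    # perturb

# Snapshot once at the start of training:
#   init_params = [p.detach().clone() for p in model.parameters()]
```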
Continual backpropagation (Dohare et al., Nature 2024): The only complete training-level fix identified to date. Random reinit of a small fraction (ρ) of the least-used neurons after each learning step. This is the intervention that maintained plasticity over at least 5,000 tasks — while the baseline had already failed by task 2,000. It works because it directly addresses the diversity loss at the root of all three mechanisms — dead neurons are replaced with fresh neurons, rank-collapsed directions get repopulated, and NTK diversity is maintained.
The limitation of training tricks: they address the symptom at the current task but do not prevent the next cycle of collapse. Continual backpropagation reinitializes neurons every step, so it can maintain diversity indefinitely — but at the cost of constantly discarding some learned features. It is a steady-state solution, not a structural one.
CReLU (Abbas et al., 2023): Replacing ReLU with concatenated ReLU (computing both the positive and negative parts of the activation) ensures that some gradient always flows to every unit. Simple, low-overhead, effective for shallow networks. Does not scale to residual architectures because the skip connections create gradient paths that bypass the dead-neuron bottleneck.
Level 2: Spectral and Frequency-Domain Approaches (Stronger Fix)
Level 2 interventions target the structural causes rather than managing symptoms.
Spectral regularization (Lewandowski et al.): Directly attacks rank collapse by maintaining the singular value spectrum near initialization. The intervention is architecturally agnostic — it can be applied to any layer of any network as a regularization term. The gradient with respect to the singular value regularizer keeps the maximum singular value near 1 and prevents the spectrum from concentrating. Less sensitive to hyperparameter choices than L2, which operates on weight magnitude rather than weight structure.
Deep Fourier features (Lewandowski et al.): The conceptual leap in the plasticity literature. The starting observation: linear function approximation provably does not lose plasticity. A linear model — no hidden layers — trained sequentially on new tasks experiences no dead neurons, no rank collapse, and no NTK degeneration, because there is no nonlinearity to create dead zones. Its expressiveness is capped at what a linear map of the input features can represent — but that capacity never degrades.
The question: can we get the benefits of linearity (no plasticity loss) while retaining the benefits of nonlinearity (expressiveness)? Deep Fourier features answer yes. The approach: replace activation functions with sine-cosine pairs (one sine output and one cosine output per unit). This produces a feature map that approximates the eigenfunction expansion of a kernel, and critically, the resulting features have the gradient-flow properties of a linear network at each layer boundary while retaining the overall expressiveness of a deep network.
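A sketch of a layer in this spirit — a linear map whose pre-activations are passed through a sine-cosine pair per unit; the exact scaling and placement in the network follow the paper, and this is only illustrative:

```python
import torch
import torch.nn as nn

class FourierLayer(nn.Module):
    """Linear map followed by a sine/cosine pair per unit.

    sin and cos are bounded, but there is no input region where both
    gradients vanish, so units cannot die the way ReLU units do.
    Output width is 2x `out_features`.
    """
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.linear(x)
        return torch.cat([torch.sin(z), torch.cos(z)], dim=-1)
```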
The results: dramatic performance improvements on CIFAR-10, CIFAR-100, and TinyImageNet sequential learning. Networks with Fourier features maintain near-constant performance per task across hundreds of sequential tasks, while standard networks collapse. The improvements are consistent across architectures.
The key insight to internalize: the enemy of plasticity is not nonlinearity per se — it is the kind of nonlinearity that creates dead zones. ReLU has a hard zero for negative inputs; that dead zone is where neurons die. Sigmoid and tanh create dead zones at saturation. Fourier features have no dead zones: sine and cosine are bounded, but there is no region of the input where their gradient is identically zero. They maintain expressiveness through periodic variation rather than through hard thresholds.
ReGraMa (Liu et al., 2025): Gradient-guided reset strategy. Identifies neurons for reinit based on gradient magnitude rather than activation magnitude, then applies the continual backpropagation reset strategy to those neurons specifically. Architecture-agnostic and applicable to residual networks and diffusion models where τ-dormant fails. Outperforms activation-based reset strategies on all tested architectures.
Level 3: Architecture Changes (The Deepest Fix)
Architecture-level interventions prevent plasticity loss structurally, rather than managing it after the fact.
Transformer² (2025): SVD decomposition of weight matrices with inference-time adaptation. The approach: decompose each weight matrix W = UΣV^T using SVD; during inference, modify only the singular values Σ, leaving U and V fixed. In practice, Transformer² learns task-specific “expert vectors” that scale the SVD components — the adaptation is lightweight but not a simple Σ rescaling. This is weight adaptation without any gradient steps — the network’s representations are frozen, but their scale can be adjusted at inference time. This sidesteps the plasticity loss cycle entirely: the gradient steps that cause rank collapse don’t happen, because adaptation is performed analytically in the SVD domain.
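A heavily simplified sketch of the core mechanic — a frozen linear layer whose only adaptable quantity is a per-singular-value scaling vector; the real system’s expert-vector training and dispatch are more involved, and the names here are illustrative:

```python
import torch
import torch.nn as nn

class SVDScaledLinear(nn.Module):
    """Frozen linear layer adapted only through its singular values.

    W = U diag(s) V^T is precomputed once; a task-specific vector `z`
    rescales s, so adaptation never takes a gradient step on U or V.
    A simplified sketch of SVD-domain adaptation, not the exact method.
    """
    def __init__(self, weight: torch.Tensor):            # weight shape: [out, in]
        super().__init__()
        U, s, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)                      # frozen
        self.register_buffer("s", s)                      # frozen base spectrum
        self.register_buffer("Vh", Vh)                    # frozen
        self.z = nn.Parameter(torch.ones_like(s))         # the only adapted quantity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ (U diag(s * z) Vh)^T
        return ((x @ self.Vh.T) * (self.s * self.z)) @ self.U.T
```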
Transformer² represents a third paradigm — distinct from training tricks (modify the gradient) and spectral methods (regularize the structure) — for maintaining plasticity. It is not a panacea: inference-time SVD adaptation is limited to adjustments in the scale of existing features, not the learning of new features. But for deployment scenarios where the base model is frozen and task adaptation is incremental, it avoids accumulating plasticity debt entirely.
If plasticity loss is a disease, training tricks are symptom management and spectral methods are antiviral — but architecture changes are the vaccine. Article 6 in this series (The Forgetting Transformer) shows what that vaccine looks like for the transformer attention mechanism, where rank collapse is structurally facilitated by the softmax normalization and Q-K-V projection matrices.
Figure 4 — The Plasticity–Stability Tradeoff Space
STABILITY
(retains prior task knowledge without degradation)
HIGH │ ╔═════════════════════════════════╗
│ ║ ★ IDEAL REGION ║
│ Spectral Reg. ● ║ ║
│ (Lewandowski) ║ ReGraMa ● Fourier ● ║
│ ║ Features ║
│ ╚═════════════════════════════════╝
│ Cont. ↑
│ Backprop ● [→ A6: Forgetting Transformer]
│ approaches this region via
│ L2 Reg ● architectural redesign
MED │ + perturbation
│
│
LOW │ Baseline ●
│ (ReLU, SGD,
│ 1000+ tasks)
└──────────────────────────────────────────────────────────────▶
LOW MED HIGH
PLASTICITY
(learns new tasks quickly; low dead-neuron rate)
─────────────────────────────────────────────────────────────────────────
LEVEL 1 (training tricks): shift toward stability, limited plasticity gain
LEVEL 2 (spectral/Fourier): approach both axes simultaneously
LEVEL 3 (architecture): change the shape of the tradeoff itself [→A6]
─────────────────────────────────────────────────────────────────────────
Figure 4: Solutions to plasticity loss mapped onto the plasticity–stability tradeoff space. Training tricks shift the baseline toward stability at some plasticity cost. Spectral and Fourier approaches move toward the upper-right. Architecture-level fixes [→A6] do not trade along the curve — they change what the curve looks like by structurally preventing collapse rather than managing it.
Key Takeaway: The three levels of solution attack progressively deeper causes. Training tricks manage symptoms but demand constant, steady-state intervention. Spectral methods target structural causes but remain within the gradient-descent paradigm. Architecture changes prevent the problem from occurring. For sequential deployment of deep networks, building to Level 3 from the start is cheaper than accruing and managing plasticity debt at Level 1.
§5 — The 3-Step Diagnostic Framework
A practitioner facing plasticity loss needs a diagnostic protocol, not a literature review. Here is one that translates the mechanistic understanding in §3 directly into actionable steps.
Plasticity Loss Diagnostic Protocol
Step 1 — Identify the mechanism
Is it dead neurons? Use GraMa (gradient magnitude metric), not τ-dormant activation metric. If you’re working with residual networks or diffusion models, τ-dormant will underestimate the problem significantly. GraMa will not.
Is it rank collapse? Track the effective rank of your weight matrices across tasks. Compute the singular value decomposition of each weight matrix after each task and record the ratio of the largest singular value to the sum of all singular values. A growing ratio indicates rank collapse.
Is it NTK degeneration? Measure churn: the average change in model outputs for a held-out set of inputs not in the current training batch, between consecutive gradient steps. Rising churn indicates NTK rank collapse and the beginning of the degenerate update cycle.
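A compact sketch of the two Step 1 measurements that need no per-neuron hooks — the singular-value ratio for the rank-collapse check and the held-out output change for the churn check (the GraMa-style measurement is sketched in §3). Thresholds for “growing” and “rising” are left to the practitioner:

```python
import torch

@torch.no_grad()
def spectral_ratios(model) -> dict:
    """Per weight matrix: largest singular value / total singular value mass.
    A ratio that grows across tasks is the rank-collapse fingerprint.
    """
    ratios = {}
    for name, p in model.named_parameters():
        if p.dim() == 2:
            s = torch.linalg.svdvals(p)
            ratios[name] = (s[0] / s.sum()).item()
    return ratios

@torch.no_grad()
def churn(model, probe_x: torch.Tensor, prev_probe_out: torch.Tensor) -> float:
    """Mean absolute output change on a fixed held-out probe batch between
    consecutive gradient steps. A rising value is the NTK-degeneration fingerprint.
    """
    return (model(probe_x) - prev_probe_out).abs().mean().item()
```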
Step 2 — Locate the source
Dead neurons → examine activation function; consider whether the architecture has bypass paths (skip connections) that create inactive regions
Rank collapse → examine weight matrix initialization; confirm singular values near 1 at start of training; track whether drift is gradual (L2 issues) or sudden (learning rate spikes)
NTK degeneration → examine whether the network is being updated with gradients from a small batch relative to the input space size; consider whether task switches are creating abrupt distribution shifts that spike NTK condition number
Step 3 — Apply the matched intervention
Mechanism                               Matched Intervention                        Compute Cost
─────────────────────────────────────────────────────────────────────────────────────────────────
Dead neurons (simple architecture)      CReLU activation or ReGraMa reset           Low
Dead neurons (residual / diffusion)     ReGraMa reset (gradient-guided)             Low–Med
Rank collapse                           Spectral regularization (Lewandowski)      Low–Med
NTK degeneration                        C-CHAIN (Tang et al.)                       Medium
All three simultaneously                Fourier features (Lewandowski)              Medium
LLM continual fine-tuning               Transformer² SVD inference-time editing     High (inference)
Persistent architecture-level ceiling   Forgetting Transformer [→A6]                High (training)
Two notes on applying this protocol:
Don’t treat the mechanisms as independent. Dead neuron accumulation tends to increase rank collapse (fewer active neurons → fewer independent gradient directions → faster rank collapse). Rank collapse tends to increase NTK degeneration (a collapsed weight matrix has a collapsed NTK by construction). In practice, extended sequential training often triggers all three mechanisms simultaneously. If you observe all three at once, go directly to Fourier features rather than applying three separate interventions.
Measure before you intervene. The most common error is applying continual backpropagation (the most famous fix) to a rank collapse problem. Continual backpropagation addresses dead neurons and maintains gradient diversity — it has only indirect effects on singular value collapse, and it has no direct mechanism for preventing NTK degeneration. It may help somewhat, because gradient diversity reduction underlies all three mechanisms, but it is not the right primary intervention for rank collapse.
Key Takeaway: The diagnostic protocol prevents the most common error in the field — applying the famous fix (continual backpropagation) to problems it doesn’t address (rank collapse, NTK degeneration). Measure the mechanism first. Match intervention to cause.
§6 — What This Means for Practitioners Fine-Tuning LLMs
Plasticity loss is not a CRL research problem with no practical implications. It is the structural explanation for a phenomenon that every ML engineer who fine-tunes deployed LLMs encounters: successive fine-tuning rounds get harder, the model becomes less responsive to new data, and performance on new tasks requires more data than earlier rounds did.
This is the sequential task problem that the Nature paper describes, instantiated in production deployment. Each fine-tuning round is a “task” in the sequential learning sense. The model’s optimization landscape — the relationship between its current parameters and the gradient landscape for new data — degrades with each round. The effective learning rate falls. By the third or fourth successive fine-tuning round, practitioners often notice that they need substantially more data to achieve the same level of task adaptation.
Transformer² (2025) provides the most direct practical solution for this scenario. SVD decomposition of the base model’s weight matrices, followed by inference-time modification of the singular values, provides task-specific adaptation without any gradient steps. There is no accumulation of plasticity debt because there are no gradient updates to cause rank collapse, NTK degeneration, or dead neurons. The base model’s representations remain fixed; the scaling of those representations adapts to the task.
The limitation: inference-time SVD editing can only adjust the scale of existing features. If a new task requires genuinely new features — concepts not representable as weighted combinations of the base model’s existing feature directions — SVD editing will fail to adapt effectively. For this reason, Transformer² is best suited to task adaptation within the base model’s representational capacity, not to extending that capacity.
For practitioners who must use gradient-based fine-tuning (because the task genuinely requires new features), the practical recommendation is simple: measure effective rank after each fine-tuning round. Plot the ratio of the dominant singular value to the total singular value mass for your key weight matrices. If this ratio is growing monotonically across rounds, you are accumulating rank collapse — plasticity debt that will make the next fine-tuning round harder than the last. Spectral regularization (Lewandowski et al.) can be added to the fine-tuning objective to stop the accumulation.
The broader implication: every organization that maintains a continuously updated model — recommendation systems, translation services, content moderation, code assistants — is running a sequential learning system. All of them will encounter plasticity loss if they run long enough. The Nature paper is not a research curiosity; it is a description of a production failure mode that most organizations have not yet named.
Key Takeaway: Continual LLM fine-tuning is the production instance of the sequential learning problem. Plasticity loss manifests as decreasing responsiveness to new fine-tuning data over successive rounds. Measuring effective rank after each round is the diagnostic. Transformer² inference-time adaptation avoids the problem; spectral regularization manages it within gradient-based fine-tuning.
§7 — What Comes Next
← A10: The benchmarks ran too few tasks to see this. Standard 10–50 task sequences look fine on per-task performance metrics; the plasticity collapse manifests at 500 or more sequential tasks, the scale that genuine non-episodic environments produce. This is why the benchmark gap (A10) and the plasticity crisis (this article) are the right opening pair for this series: the benchmarks can’t reveal the problem because they’re too short, and the problem is more severe than the benchmarks suggest.
→ A3 (The Big World Hypothesis): Plasticity loss is not a bug — it is a mathematical inevitability for any agent without a world model. A network that must memorize the world rather than model it will run out of capacity. Plasticity loss is what running out of capacity looks like at the optimization level. Article 3 shows why the only principled solution is not better training tricks, but a different computational architecture: one that explicitly separates world-modeling from task-solving, allowing each component to be updated independently.
→ A6 (The Forgetting Transformer): The architectural fix. The transformer attention mechanism is the specific locus of rank collapse in LLMs: the softmax normalization in attention concentrates gradient flow onto a small number of attention patterns, and the Q-K-V projection matrices collapse in rank over successive fine-tuning rounds. A learnable forget gate in the attention mechanism prevents this structurally.
→ A7 (Stable Deep RL at Scale): KL regularization and gradient interference as related training instabilities in RL-tuned LLMs. The mechanisms are analogous: policy gradient updates suffer the same rank collapse and NTK degeneration dynamics as supervised gradient descent. The RL training instabilities that practitioners attribute to reward hacking or distribution shift often have a plasticity loss component that has not been recognized.
Why does plasticity collapse happen at all? Why doesn’t a network just grow more capacity as needed? The next article answers this — and the answer is more fundamental than you might expect. It is not about neural network architecture. It is about the structure of the world.
Final Key Takeaways
Plasticity loss ≠ catastrophic forgetting. Forgetting is about the past; plasticity loss is about the future. A network can remember everything it learned and still lose the ability to learn new things. The mechanisms are different; the interventions are different.
Gradient descent alone cannot maintain plasticity. The Nature paper demonstrated this empirically at scale across architectures, optimizers, and regularization techniques. A non-gradient component — random diversity injection — is necessary for sustained deep learning. This is the most important negative result in continual learning.
Three mechanisms, three interventions. Mismatching is the most common error. Dead neurons → GraMa-guided reset (not τ-dormant). Rank collapse → spectral regularization. NTK degeneration → C-CHAIN. If all three are present → Fourier features.
Fourier features are the strongest training-level fix. They work because linear networks provably do not lose plasticity, and Fourier features approximate linear gradient flow at each layer boundary while maintaining nonlinear expressiveness. The performance improvements on sequential benchmarks are not incremental — they are the difference between collapse and stability.
Architecture is the real solution. Training tricks manage the symptom. Spectral methods target the structure. Architecture-level fixes — Transformer² for inference-time adaptation, the Forgetting Transformer [→A6] for gradient-based training — prevent the problem from occurring rather than managing it after the fact.
This is a production problem. Every organization running continual fine-tuning at scale is encountering plasticity loss. Measure effective rank after each fine-tuning round. If it is declining, you are accumulating plasticity debt. The tools to stop it exist.
Previous: [← A10 — The Benchmark Gap in Continual RL] Next: [A3 — The Big World Hypothesis →]
Plasticity loss reveals that the network is running out of capacity. Article 3 explains why that is mathematically inevitable — and what it implies about the right architecture for agents operating in open worlds.
References
[1] Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., & Sutton, R. S. (2024). Loss of plasticity in deep continual learning. Nature, 632, 768–774. https://doi.org/10.1038/s41586-024-07711-7
[2] Dohare, S., Hernandez-Garcia, J. F., Rahman, P., Sutton, R. S., & Mahmood, A. R. (2023). Maintaining plasticity in deep continual learning via regenerative regularization. ICML 2023 Workshop on Continual Lifelong Learning. arXiv:2306.13812.
[3] Abbas, Z., Zhao, R., Modayil, J., White, A., & Machado, M. C. (2023). Loss of plasticity in continual deep reinforcement learning. Conference on Lifelong Learning Agents (CoLLAs), 2023. arXiv:2303.07507.
[4] Lewandowski, A., Limbacher, J., Precup, D., & Courville, A. (2024). Continual learning by spectral regularization. arXiv:2406.06811.
[5] Lewandowski, A., et al. (2024). Plastic learning with deep Fourier features. arXiv:2410.20634.
[6] Tang, Y., Obando-Ceron, J. S., et al. (2025). Mitigating plasticity loss with continual-change-aware interference reduction (C-CHAIN). arXiv:2506.00592.
[7] Liu, Y., et al. (2025). Measure gradients not activations: Enhancing plasticity diagnosis and reset with GraMa / ReGraMa. arXiv:2505.24061.
[8] Sokar, G., Agarwal, R., Castro, P. S., & Evci, U. (2023). Dormant neuron phenomenon in deep reinforcement learning. Proceedings of the International Conference on Machine Learning (ICML). arXiv:2302.12902.
[9] Sun, Q., et al. (2025). Transformer²: Self-adaptive LLMs. arXiv:2501.06252. (Transformer² inference-time SVD weight editing as a third plasticity paradigm.)
[10] Khetarpal, K., Riemer, M., Rish, I., & Precup, D. (2022). Towards continual reinforcement learning: A review and perspectives. Journal of Artificial Intelligence Research (JAIR), 75, 1401–1476. arXiv:2012.13490. (Background context for §1 on CRL desiderata.)
Article 3: The Big World Hypothesis: Why Continual Learning Is Inevitable

