Thinking Without Tokens: CTM and Inference-Time Compute Beyond CoT
Article 11 of 12 | [RE][EV][FN] | papers: Darlow et al. arXiv:2505.05522 · Koishekenov et al. arXiv:2510.07358 · Simonds & Yoshiyama arXiv:2503.00735 · Tur et al. arXiv:2602.07845 | Continual Intelli.
Article 10: Darwin-Gödel to ShinkaEvolve: The Case for Open-Ended AI
Language models generate one token at a time. That’s not a feature — it’s a constraint. In 2025, Sakana AI published an architecture where the reasoning happens entirely inside the forward pass, between activations, without producing a single output token. The Continuous Thought Machine (CTM) assigns each neuron its own private processing history and uses the synchronization patterns between neurons — not output tokens — as its latent representation. There are no reasoning tokens. The model thinks without a scratchpad.
The constraint becomes visible by contrast. Take a 3B-parameter language model that scores 1% on undergraduate integration problems. Apply no architectural changes. Add no external training data. Replace the uniform RL training curriculum with one that recursively decomposes hard problems into simpler variants the model can already solve. The same model now scores 82%. Simonds and Yoshiyama (2025) report this result without apology: “strategic self-directed learning can achieve significant capability improvements without relying on architectural scaling or human supervision.” Every reasoning token on the path to that 82% — every scratchpad step, every “Let u = sin(x)” — was compute denominated in the token budget. Quadratic in sequence length. Auditable, inspectable, and expensive. LADDER reaches 82% despite that cost, not because of it.
This article maps the spectrum from CoT to CTM, through two intermediate points — ETD (latent layer iteration) and RD-VLA (recurrent depth for robotics action) — and ends with the economic argument that has practitioners paying attention: if CTM’s architectural principle scales, inference cost decouples from reasoning depth.
§1 — The Token Bottleneck: Why “More CoT” Has Limits
Chain-of-thought reasoning is a remarkable capability and a structural liability at the same time.
The capability is well-established. The previous articles in this series [← A2] and [← A5] document it: CoT enables LLMs to tackle multi-step problems that exceed direct next-token prediction, and RLVR post-training shifts the model’s reasoning distribution toward traces that produce verifiable correct outputs. The liability is less often stated explicitly but is mechanically straightforward.
Token-based reasoning scales quadratically. Attention over a reasoning trace of length L has O(L²) memory cost. A model generating 2,048 reasoning tokens before answering expends four times the attention compute of a model generating 1,024. At the frontier systems [← A9] examined earlier in this series, extended CoT already dominates inference cost. This is not a temporary engineering problem awaiting a hardware fix — it is a structural property of attention-based sequence models.
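The arithmetic behind that claim is worth making concrete. A minimal sketch, counting only query-key interactions (a simplification: real decoding is causal and adds KV-cache effects):

```python
# Back-of-the-envelope scaling for attention over a reasoning trace of
# length L: the number of query-key interactions grows as L^2.

def attention_pairs(L: int) -> int:
    return L * L  # simplification: full pairwise attention, no KV-cache terms

# Doubling the trace quadruples the attention term.
assert attention_pairs(2048) / attention_pairs(1024) == 4.0
```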
Three compounding failure modes emerge at the frontier:
First, format collapse. Yue et al. (2025), examined in [← A5], demonstrate that RLVR consistently compresses the diversity of reasoning traces even as it improves pass@1 accuracy. Extended RL training optimizes the distribution toward reliable but stereotyped reasoning paths. The model stops exploring the space of valid strategies; it exploits the shortest reliable path. At extreme RL compute, the model’s “thinking” becomes a ritual — correctly sequenced, narrowly general.
Second, inference latency. Reasoning models that think for 10,000+ tokens before answering are qualitatively different in latency profile from standard LLMs. For applications requiring near-real-time response — embodied agents, dialog systems, game-playing — token-denominated reasoning imposes a hard ceiling on deployment.
Third, theoretical ceiling. The Invisible Leash argument from [← A2] applies directly here: RLVR amplifies computation the base model can already perform. Extended token generation cannot, in principle, construct reasoning steps that require a qualitatively richer computation class — it can only chain together inference steps the model already knows. The gains from longer CoT eventually plateau.
These three failure modes are convergent pressure. They motivate a different question: not “how many tokens should the model generate?” but “where should the computation happen?”
(see Figure 1)
Figure 1 — The Reasoning Computation Spectrum
══════════════════════════════════════════════════════════════════════════════
THE REASONING COMPUTATION SPECTRUM
──────────────────────────────────────────────────────────────────────────────
◄──────────────── TOKEN-DENOMINATED ─────────────────►   ◄───── LATENT-DENOMINATED ─────►

            Direct       Short CoT      Extended CoT   ETD          RD-VLA        CTM
            next-token   (prompt-level  (RL-trained,   (iterate     (recurrent    (per-neuron
            prediction   engineering)   RLVR)          over layer   weight-tied   temporal
                                                       subset)      action head)  memory)
──────────────────────────────────────────────────────────────────────────────────────────
REASONING   None         Low–Medium     High           Zero         Zero          Zero
TOKEN COST  (0 tokens)   (10s–100s)     (1000s–10k)    (0 tok)      (0 tok)       (0 tok)

LATENT      Standard     Standard      Standard        k × layer    k recurrent   T ticks
COMPUTE     forward      forward       forward         iters        action iters  × D neurons

ADAPTIVE?   No           No            Partial         Per-token    Yes (latent   Per-input
                                       (budget via     depth        convergence)  (native)
                                       sampling)

MEMORY COST Constant     O(L²) attn    O(L²) attn      Constant     Constant      Constant
──────────────────────────────────────────────────────────────────────────────────────────
       [ A2, A5 cover this half ]                  [ A11 covers this half ]
══════════════════════════════════════════════════════════════════════════════
Figure 1: The reasoning computation spectrum. The left half externalizes reasoning as generated tokens (cost scales with sequence length); the right half internalizes it as latent computation (cost determined by architecture depth, not output length). ETD and RD-VLA occupy the middle-right: latent compute with no reasoning tokens, constant memory. CTM is the rightmost point: temporal dynamics replace both token generation and explicit layer iteration. The critical distinction — not captured by most test-time scaling discussions — is that latent compute is architectural, not a decoding strategy. These are different mechanisms.
Key Takeaway: Token-based reasoning faces three compounding limits: quadratic attention cost, format collapse under extended RLVR, and a theoretical ceiling set by the base model’s computation class. The alternative is not “fewer tokens” but a fundamentally different locus of reasoning: inside the forward pass itself. ETD, RD-VLA, and CTM each locate reasoning computation in latent space rather than output space, trading token cost for architectural depth.
§2 — Continuous Thought Machines: Neurons That Think in Time
Biological neurons are not switch-like. They do not simply fire or not fire; they have temporal dynamics — the recent history of a neuron’s activation shapes its next response. This is well-established neuroscience. It has been systematically discarded in artificial neural networks in favor of computational efficiency: activations are stateless, inputs are independent, the forward pass is a function of the current input alone.
Darlow, Regan, Risi, Seely, and Jones (2025) ask what happens when you put the temporal dynamics back.
The Continuous Thought Machine (CTM) is built around two architectural innovations. Both are departures from standard practice.
Innovation 1: Neuron-Level Models (NLMs) with private weights. In a standard transformer or MLP, every neuron receives the same type of processing — typically a linear projection followed by a nonlinearity. In the CTM, every neuron d ∈ {1, ..., D} has its own private model gθ_d: a 1-layer MLP that processes the last M pre-activations generated by that specific neuron. The neuron is not processing a current input; it is processing its own recent history. Each of D neurons has unique parameters θ_d, meaning no two neurons are applying the same transformation to their history.
The consequence: each neuron has a memory. A neuron that fires strongly in response to a spatial feature and weakly in subsequent ticks develops a different temporal state than a neuron that fires in bursts. This temporal state is the neuron’s contribution to the next tick’s computation.
Innovation 2: Neural synchronization as latent representation. Standard architectures derive their output and attention queries from the current activation vector z_t. The CTM derives them from a different quantity: the synchronization matrix S_t, computed as the inner product of the post-activation history matrix Z_t with itself:
S_t = Z_t · Z_t^T ∈ R^{D×D}
where Z_t = [z_1 | z_2 | ... | z_t] ∈ R^{D×t}
S_t measures how correlated each pair of neurons’ activation histories are across the last t ticks. Two neurons that consistently co-activate have high synchronization; two neurons with uncorrelated activity have near-zero synchronization. This synchronization pattern — not the current activation — is what drives output predictions and attention queries.
The biological analogy is direct: binding by synchrony is a long-standing hypothesis in neuroscience (Darlow et al. cite the biological synchronization literature). The CTM instantiates it computationally. Features are bound when their neurons synchronize, not when their activations overlap at a single time step.
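The synchronization matrix is a small computation, easy to make concrete. A toy sketch with hand-picked histories (not from the paper): neurons 0 and 1 co-activate, neuron 2 fires in anti-phase and is orthogonal to both.

```python
import torch

# Toy illustration of neural synchronization S_t = Z_t @ Z_t^T. Neurons whose
# activation histories are correlated get a large entry S_ij; orthogonal
# histories give S_ij = 0. Here Z is (D, t): D neurons, t ticks of history.

Z = torch.tensor([
    [1.0,  1.0, 1.0,  1.0, 1.0,  1.0],   # neuron 0: steady firing
    [1.0,  0.9, 1.1,  1.0, 0.9,  1.1],   # neuron 1: co-activates with neuron 0
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],   # neuron 2: anti-phase, uncorrelated
])                                        # D = 3, t = 6

S = Z @ Z.T  # (D, D) synchronization matrix
print(S[0, 1].item(), S[0, 2].item())  # high sync vs. zero sync
```

Two neurons that consistently co-activate (0 and 1) get a synchronization entry near the squared norm of their shared history; the anti-phase neuron contributes exactly zero.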
The internal tick loop. These two innovations are embedded in an iterative computation loop over T “thought steps” — internal ticks that are decoupled from the input data dimensions. For each tick t:
The synapse model (a shared MLP) takes the current post-activations z_t and the attention output o_t, and produces new pre-activations a_t.
The pre-activation history A_t is updated as a FIFO buffer — the M most recent pre-activations per neuron.
Each NLM gθ_d processes its neuron’s history A^t_d to produce new post-activations z^{t+1}_d.
The synchronization S_t is computed from the full history of post-activations Z_t.
Output predictions y_t and attention queries q_t are projected from sampled neuron pairs in S_t.
(see Figure 2)
Figure 2 — CTM Architecture: Two Scales
══════════════════════════════════════════════════════════════════════════════
SCALE 1: SINGLE NEURON — "What does thinking look like inside one neuron?"
──────────────────────────────────────────────────────────────────────────────
tick t-4 t-3 t-2 t-1 t
│ │ │ │ │
Pre- ●─────●─────●─────●─────● }
Act [a_{t-4} a_{t-3} ... a_t] } History A^t_d (FIFO buffer, depth M)
hist └─────────────────────────┘ }
NLM gθ_d ←── unique weights for neuron d only
(1-layer MLP over A^t_d)
│
z^{t+1}_d ←── new post-activation for neuron d
"This neuron does not process the input. It processes its own recent
activation history through a private model that no other neuron shares."
══════════════════════════════════════════════════════════════════════════════
SCALE 2: FULL NETWORK — "How does coordination happen without tokens?"
──────────────────────────────────────────────────────────────────────────────
External Data ──► Feature Extractor ──► KV Tokens
│
z_t ────────────────────────────────► Synapse MLP ──► pre-activations a_t
│ ▲
│ ┌─────────────────────────┐ │
│ │ NLM₁ NLM₂ ... NLMd │ │ (D private models, parallel)
│ └────────────────────────┘ │
│ │ │
└── Z_t history ─┴─── S_t = Z_t·Z_t^T
│
┌────────────┴────────────┐
Sampled output pairs Sampled action pairs
│ │
y_t ←─ W_out · S^out_t q_t ←─ W_in · S^action_t
(prediction) (attention query)
│
Cross-Attn(q_t, KV) = o_t ──► synapse
──────────────────────────────────────────────────────────────────────────────
"Thinking" happens in synchronization patterns. No token is generated.
Each tick refines the synchronization representation toward a stable prediction.
══════════════════════════════════════════════════════════════════════════════
Figure 2: CTM architecture at two scales. Scale 1 (top) shows the per-neuron temporal memory: each neuron d has its own private NLM that processes its M-step pre-activation history — the neuron’s “memory” of its recent activity. No two neurons share these weights. Scale 2 (bottom) shows how coordination emerges: post-activation histories are collected into Z_t, synchronization S_t = Z_t·Z_t^T captures co-activation patterns across time, and output predictions and attention queries are derived from subsets of S_t. The reasoning computation is distributed across T ticks of this loop — no reasoning tokens are generated.
Adaptive computation — the practical payoff. Because the CTM produces outputs y_t at every internal tick and computes certainty C_t (one minus normalized entropy) at every tick, it can implement adaptive computation natively. The loss function selects two ticks per training example:
t₁ = argmin(loss) — the tick producing the best prediction
t₂ = argmax(certainty) — the tick at maximum confidence
The final loss is L = (L_{t₁} + L_{t₂}) / 2. Since t₁ and t₂ are defined dynamically per data point, the model learns to halt early on easy inputs and allocate more ticks to hard ones — without any explicit stopping module. Darlow et al. (2025) describe this as “native adaptive computation, as opposed to a post-hoc addition.”
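The two-tick selection can be sketched in a few lines. This is a minimal reading of the loss, assuming per-example logits of shape (T, num_classes); helper and variable names are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the CTM two-tick loss: pick the best-prediction tick
# and the most-certain tick per example, then average their losses.

def two_tick_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    T, C = logits.shape
    # Per-tick cross-entropy losses against the (repeated) target
    per_tick = F.cross_entropy(logits, target.expand(T), reduction="none")  # (T,)
    # Certainty C_t = 1 - normalized entropy of the tick's prediction
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)           # (T,)
    certainty = 1.0 - entropy / torch.log(torch.tensor(float(C)))
    t1 = per_tick.argmin()   # tick with the best prediction
    t2 = certainty.argmax()  # tick at maximum confidence
    return 0.5 * (per_tick[t1] + per_tick[t2])
```

Because t1 and t2 are recomputed per data point, nothing forces them to be late ticks: an input the model resolves at tick 2 contributes its tick-2 loss, which is what lets early halting emerge without a stopping module.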
Empirical results. The CTM is evaluated across tasks designed to probe temporal dynamics and sequential reasoning — not to establish new state-of-the-art scores. Darlow et al. (2025) are explicit: the goal is to demonstrate the architecture’s capabilities, not to optimize hyperparameters for benchmark performance.
On ImageNet-1K classification with 50 internal ticks and a ResNet-152 backbone, the CTM achieves 72.47% top-1 and 89.89% top-5 accuracy on uncropped data — while demonstrating native adaptive computation: a certainty threshold of 0.8 allows the model to halt early for the majority of inputs. On parity computation (cumulative parity over 64-length binary sequences), CTMs with 75–100 internal ticks significantly outperform parameter-matched LSTMs, and some seeded runs with 75 or 100 ticks achieve perfect accuracy. On 2D maze navigation (39×39 mazes, 75 internal ticks), the CTM outperforms LSTM baselines of 1, 2, and 3 layers across path lengths, and generalizes to 99×99 mazes not seen during training.
Critically, ablations confirm that both innovations are necessary. A CTM without NLMs and a CTM without synchronization as a representation both perform substantially worse — neither component alone explains the results.
What the CTM does not do (yet). The CTM’s experiments are at small scale: parameters in the 30M range for maze experiments, ImageNet results explicitly stated as preliminary. The paper has not demonstrated the architecture at LLM scale (billions of parameters, language modeling objectives, reasoning benchmarks). Whether per-neuron temporal memory scales to transformer-class models with the same qualitative benefits is an open question — and a central one. Darlow et al. (2025) name language modeling as future work.
The CTM cell in PyTorch. Code makes the architecture concrete (Raschka’s discipline). Below is a minimal implementation of one CTM tick, faithful to Listings 1–3 in Darlow et al. (2025):
import torch
import torch.nn as nn
import torch.nn.functional as F


class CTMCell(nn.Module):
    """
    One tick of a Continuous Thought Machine.
    Darlow, Regan, Risi, Seely & Jones (2025) — arXiv:2505.05522

    Two key innovations:
      (1) Per-neuron MLP (NLM) over M-step pre-activation history — private weights per neuron.
      (2) Neural synchronization S = Z @ Z^T as latent representation for attention and output.

    Args:
        D: Model width — number of neurons.
        M: Memory length — history depth per neuron.
        d_hidden: Width of per-neuron MLPs (NLMs).
        d_input: Dimension of attention output o_t.
    """

    def __init__(self, D: int, M: int, d_hidden: int, d_input: int):
        super().__init__()
        self.D, self.M = D, M
        # Shared synapse model: mix current z_t + attention output → pre-activations
        self.synapses = nn.Sequential(
            nn.Linear(D + d_input, D * 2), nn.GELU(),
            nn.Linear(D * 2, D),
        )
        # Per-neuron MLPs (NLMs): private weights for each of D neurons.
        # Using einsum avoids D separate nn.Linear calls (Listing 2, Darlow et al.).
        self.nlm_w1 = nn.Parameter(torch.randn(M, d_hidden, D) * 0.01)
        self.nlm_w2 = nn.Parameter(torch.randn(d_hidden, D) * 0.01)
        # Learnable initial states
        self.z_init = nn.Parameter(torch.zeros(D))
        self.pre_acts_init = nn.Parameter(torch.zeros(D, M))

    def apply_nlms(self, pre_acts_history: torch.Tensor) -> torch.Tensor:
        """
        Apply D private NLMs via einsum — parallelizes across all neurons.
        pre_acts_history : (B, D, M)
        returns          : (B, D) — new post-activations
        """
        # Each neuron: linear over M history steps → d_hidden latent
        h = torch.einsum('bdm,mhd->bdh', pre_acts_history, self.nlm_w1)
        h = F.gelu(h)
        # Each neuron: compress d_hidden → single activation
        z = torch.einsum('bdh,hd->bd', h, self.nlm_w2)
        return z

    def compute_sync(self, post_acts_history: torch.Tensor) -> torch.Tensor:
        """
        Neural synchronization S_t = Z_t @ Z_t^T.
        post_acts_history : (B, T, D) — full history so far
        returns           : (B, D, D) — synchronization matrix
        """
        Z = post_acts_history  # (B, T, D)
        # Inner product across time dimension: neurons that co-activate have high S_ij
        S = torch.bmm(Z.transpose(1, 2), Z)  # (B, D, D)
        return S

    def forward(
        self,
        pre_acts_history: torch.Tensor,   # (B, D, M) — FIFO pre-activation buffer
        post_acts_history: torch.Tensor,  # (B, T, D) — accumulated post-activations
        attn_out: torch.Tensor,           # (B, d_input) — cross-attention output o_t
        z: torch.Tensor,                  # (B, D) — current post-activations
    ):
        # Step 1: Synapse — shared processing across neurons
        pre_acts = self.synapses(torch.cat([z, attn_out], dim=-1))  # (B, D)
        # Step 2: Update FIFO pre-activation history (drop oldest, append new)
        pre_acts_history = torch.cat(
            [pre_acts_history[:, :, 1:], pre_acts.unsqueeze(-1)], dim=-1
        )  # (B, D, M)
        # Step 3: Per-neuron MLPs — the core innovation; each neuron processes its own history
        z_new = self.apply_nlms(pre_acts_history)  # (B, D)
        # Step 4: Accumulate post-activation history
        post_acts_history = torch.cat(
            [post_acts_history, z_new.unsqueeze(1)], dim=1
        )  # (B, T+1, D)
        # Step 5: Synchronization — "thinking" becomes correlation structure, not a token
        S = self.compute_sync(post_acts_history)  # (B, D, D)
        # In practice: subsample D_out and D_action neuron pairs from S for
        # output prediction (y_t) and attention query (q_t) respectively.
        return pre_acts_history, post_acts_history, z_new, S

The key architectural decision is in apply_nlms: the einsum bdm,mhd->bdh computes a separate linear transformation of the M-step history for each of D neurons in parallel, using weights nlm_w1 of shape (M, d_hidden, D). This is the mechanism that makes each neuron's response depend on its own history through its own parameters — not a shared activation function, but D distinct learned functions.
Key Takeaway: The CTM introduces temporal dynamics into artificial neural networks at the neuron level. Each neuron has a private model (NLM) that processes its own M-step pre-activation history — no two neurons share these parameters. The latent representation is neural synchronization: the correlation structure of neuron activation histories. No reasoning tokens are generated; thinking is entirely encoded in synchronization patterns across T internal ticks. Adaptive computation is native: the model allocates more ticks to hard inputs without any explicit stopping mechanism. The architecture demonstrates these properties at small scale (≤30M parameters); whether the principle transfers to LLM-scale models is the central open question.
§3 — Recursive Latent Thoughts: ETD and the Middle of the Spectrum
Between token-based CoT and CTM’s fully implicit computation sits a practical intermediate: iterate over a subset of the model’s own layers, in place, without generating output tokens.
ETD was introduced in [← A8] in the context of DGM’s self-modification spectrum — as the inner-loop counterpart to DGM’s outer-loop code rewriting. Here we examine its architecture and results from the latent compute lens.
Koishekenov, Lipani, and Cancedda (2025) at FAIR/Meta and UCL introduce Encode–Think–Decode (ETD), a method that makes this precise. Motivated by interpretability research showing that the computation required for reasoning is concentrated in a limited range of layers — not spread uniformly across all depth — ETD partitions a pretrained transformer into three components:
Latent encoder E: the early layers that map input tokens into a latent “reasoning space”
Thinking block T: the middle layers most responsible for reasoning, identified via the Kneedle algorithm applied to angular distances between consecutive layers
Latent decoder D: the final layers that translate the latent representation back to output tokens
At inference time, instead of passing through the thinking block T once, the model iterates over T exactly k times. The architecture, parameter count, and training data are unchanged; the only modification is mid-training on the original training data with the thinking block repeated. For OLMo-2 1B, the Kneedle algorithm identifies a 7-4-5 partitioning: 7 encoder layers, 4 thinking layers repeated k times, 5 decoder layers.
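The inference pass reduces to a short loop. A minimal sketch, assuming `layers` is the ordered list of a pretrained model's transformer blocks (callables); the 7/4/5 split is the paper's OLMo-2 1B partition, but the function itself is illustrative, not Koishekenov et al.'s code.

```python
# Minimal sketch of the Encode–Think–Decode inference pass. The thinking
# block reuses the same weights on every pass and emits zero output tokens.

def etd_forward(layers, x, k: int, n_enc: int = 7, n_think: int = 4):
    encoder = layers[:n_enc]                    # latent encoder E
    thinking = layers[n_enc:n_enc + n_think]    # thinking block T
    decoder = layers[n_enc + n_think:]          # latent decoder D

    for layer in encoder:                       # tokens -> latent reasoning space
        x = layer(x)
    for _ in range(k):                          # iterate T exactly k times
        for layer in thinking:
            x = layer(x)
    for layer in decoder:                       # latent -> output token space
        x = layer(x)
    return x
```

An adaptive variant would choose k per input token rather than globally, which is the paper's adaptive depth strategy.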
The results on OLMo-2 1B are substantial. ETD achieves +28.4% relative accuracy improvement on GSM8K and +36% relative improvement on MATH compared to the non-recursive baseline, across 17 reasoning benchmarks in total. Koishekenov et al. (2025) also demonstrate an adaptive depth strategy that adjusts the number of iterations k per input token, spending less compute on easy tokens and more on hard ones.
ETD’s position in the spectrum matters. It produces no reasoning tokens — there is no token-level scratchpad. But the computation is sequential and layer-indexed: the model iterates over specific identified layers, which is a different mechanism from CTM’s per-neuron temporal dynamics. ETD exploits existing layer structure to add latent compute; CTM adds a new temporal dimension orthogonal to layer depth. Both reduce inference-time token cost to zero. But ETD requires no architectural changes to the base model, whereas CTM is itself a new architecture: it introduces components no pretrained transformer contains.
(see Figure 3)
Figure 3 — ETD Architecture: Encode → Think (×k) → Decode
══════════════════════════════════════════════════════════════════════════════
ENCODE–THINK–DECODE (ETD) — Koishekenov et al. (2025)
OLMo-2 1B configuration: 7-4*k-5
──────────────────────────────────────────────────────────────────────────────
Input Tokens
│
┌───▼───────────────────┐
│ LATENT ENCODER (E) │ 7 layers — maps tokens to latent reasoning space
│ layers 0 → 6 │ Kneedle identifies where angular distance stabilizes
└───┬───────────────────┘
│ x_E (latent representation)
│
┌───▼───────────────────┐ ─┐
│ THINKING BLOCK (T) │ │
│ layers 7 → 10 │ │ Repeated k times (k=2 shown; k is tunable)
└───┬───────────────────┘ │
│ (iterate k times) ─┘
│ x_T (refined latent)
│
┌───▼───────────────────┐
│ LATENT DECODER (D) │ 5 layers — maps back to token space
│ layers 11 → 15 │
└───┬───────────────────┘
│
Output Tokens (final answers only — no scratchpad tokens)
─────────────────────────────────────────────────────────────────────────────
Reasoning cost: 0 output tokens (zero scratchpad)
Latent cost: k × |T| extra layer passes
Memory cost: Constant (no sequence length increase)
Gains (OLMo-2 1B): +28.4% on GSM8K | +36% on MATH
══════════════════════════════════════════════════════════════════════════════
Figure 3: ETD architecture for OLMo-2 1B. The Kneedle algorithm identifies three natural layer groups from the pattern of angular distances between consecutive layers: an encoder mapping inputs to latent reasoning space, a thinking block where the critical reasoning computation is concentrated, and a decoder mapping back to output tokens. At inference, the thinking block is iterated k times, adding latent compute without any output tokens. The +28.4% relative gain on GSM8K and +36% on MATH demonstrates that recursive latent reasoning is a practical path to stronger performance without token overhead.
Key Takeaway: ETD demonstrates that iterating over reasoning-critical layers adds substantial latent compute without generating output tokens. For OLMo-2 1B, repeating 4 identified layers k times yields +28.4% on GSM8K and +36% on MATH across 17 benchmarks, with no additional parameters, no new data, and no changes to hyperparameters. The critical design decision is identifying which layers to iterate — not ad hoc, but via the Kneedle algorithm applied to layer-to-layer representation evolution. ETD occupies the middle of the compute spectrum: more latent depth than standard inference, less architectural novelty than CTM.
§4 — Recurrent Depth for Action: From Text to Robotics
One measure of a principle’s robustness is whether it transfers across domains. ETD and CTM both operate on language or classification tasks. Tur, Naghiyev, Fang, Tsai, Duan, Fox, and Krishna (2026) ask whether the same principle — adaptive latent compute without token generation — can work in continuous action spaces, where token-level reasoning is fundamentally ill-suited.
The context: Vision-Language-Action (VLA) models are built on multimodal LLMs and control robots by predicting action sequences. Chain-of-Thought prompting for VLAs requires generating text tokens before producing actions — an approach that is slow, memory-intensive (O(L²) attention over reasoning tokens), and misaligned with the temporal requirements of continuous control. Simple grip adjustments and complex multi-step manipulation tasks receive the same fixed compute budget, despite vastly different difficulty.
Recurrent-Depth VLA (RD-VLA) replaces token-level CoT with a recurrent, weight-tied action head. The action head iterates over its own computation k times before producing an action, using Truncated Backpropagation Through Time (TBPTT) for training. At inference, an adaptive stopping criterion based on latent convergence determines how many iterations to use: simple tasks stop early (1–2 iterations); complex manipulation tasks iterate further.
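The mechanism can be sketched in a few lines. Module names, the zero initialization, and the stopping threshold below are illustrative assumptions, not Tur et al.'s implementation.

```python
import torch
import torch.nn as nn

# Sketch of a weight-tied recurrent action head with latent-convergence
# stopping, in the spirit of RD-VLA: the same refinement step is applied
# repeatedly until the latent state stabilizes, then decoded to an action.

class RecurrentActionHead(nn.Module):
    def __init__(self, d_model: int, d_action: int, max_iters: int = 8, tol: float = 1e-3):
        super().__init__()
        self.step = nn.Sequential(          # one refinement step; reused every iteration
            nn.Linear(2 * d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        self.to_action = nn.Linear(d_model, d_action)
        self.max_iters, self.tol = max_iters, tol

    def forward(self, ctx: torch.Tensor):
        """ctx: (B, d_model) backbone features. Returns (action, iterations used)."""
        h = torch.zeros_like(ctx)
        for i in range(1, self.max_iters + 1):
            h_next = self.step(torch.cat([h, ctx], dim=-1))  # same weights each pass
            delta = (h_next - h).norm(dim=-1).max()          # latent change this step
            h = h_next
            if delta < self.tol:                             # converged: stop early
                break
        return self.to_action(h), i
```

Training would backpropagate through these iterations with TBPTT; at inference the loop halts as soon as the latent stabilizes, which is what gives easy inputs their 1–2 iteration budget.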
The performance results are stark. On challenging manipulation tasks that fail entirely with single-iteration inference (0% success rate), four iterations achieve 90%+ success. Simpler tasks saturate rapidly. Tur et al. (2026) report up to 80× inference speedup over prior reasoning-based VLA models — the comparison baseline generates text tokens before acting; RD-VLA computes latently with a constant memory footprint.
RD-VLA’s significance extends beyond robotics. It demonstrates that the latent compute principle does not require language modeling as its substrate. Any architecture with a recurrent action head can implement adaptive latent depth. The robotics domain provides an unusually clean test: the performance metric (task success rate) is unambiguous, the cost of failure is observable, and the comparison against token-based reasoning is methodologically clean because both approaches use the same backbone.
Key Takeaway: RD-VLA extends the latent compute principle to robotics, replacing token-based CoT with a recurrent weight-tied action head. Tasks that fail at single-iteration inference (0% success) exceed 90% success at 4 iterations; simpler tasks saturate early, demonstrating genuine adaptive compute allocation. The 80× speedup over token-based reasoning VLAs confirms that the memory efficiency argument is not theoretical: constant memory footprint versus O(L²) token attention is a practical difference at robotics inference timescales. The principle transfers across domains.
§5 — Self-Improving via Recursive Decomposition: LADDER
LADDER occupies a different position in the spectrum. Where ETD and CTM add latent compute to the forward pass, and RD-VLA removes token overhead from robotic control, LADDER adds a different kind of compute: recursive problem decomposition before the model attempts the problem. (LADDER is introduced here in full; [→ A12] extends its logic to the case where the teacher is itself trained by RL to generate better curricula.)
Simonds and Yoshiyama (2025) at Tufa Labs introduce LADDER — Learning through Autonomous Difficulty-Driven Example Recursion. The setting: a language model that cannot solve hard problems because they exceed its current capability. LADDER’s response is not to give the model harder data or more parameters; it is to ask the model to generate simpler variants of the hard problem, recursively, until it reaches problems it can already solve. These simpler variants serve as stepping stones: the model trains on them via RL (using GRPO), the RL updates propagate capability upward through the difficulty gradient, and the model improves on harder problems as a consequence.
The mechanism is self-contained. LADDER requires only a verification signal — in their experiments, a numerical integrator for checking the correctness of symbolic integration solutions. No curated datasets, no human feedback, no demonstrations from a stronger model.
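The verification signal is the load-bearing component, and it is cheap. A sketch of the idea, assuming trapezoidal quadrature as the numerical integrator; Simonds and Yoshiyama's exact checker may differ, and the function names here are ours.

```python
import math

# Numerically verify a proposed antiderivative F of an integrand f by
# comparing F(b) - F(a) against a trapezoidal quadrature of f over [a, b].

def verify_antiderivative(f, F, a=0.1, b=1.0, n=10_000, tol=1e-4) -> bool:
    h = (b - a) / n
    quad = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    quad *= h                                  # trapezoidal rule
    return abs((F(b) - F(a)) - quad) < tol

# Correct solution passes: ∫ x·cos(x) dx = x·sin(x) + cos(x) + C
assert verify_antiderivative(lambda x: x * math.cos(x),
                             lambda x: x * math.sin(x) + math.cos(x))
# A wrong candidate is rejected:
assert not verify_antiderivative(lambda x: x * math.cos(x),
                                 lambda x: x * math.sin(x))
```

This is what makes the loop self-contained: the same check grades the model's attempts on every generated variant, so no human ever labels a solution.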
The results are dramatic. A Llama 3B model’s accuracy on undergraduate-level integration problems improves from 1% to 82% under LADDER training. A 7B model achieves 73% on the 2025 MIT Integration Bee examination, substantially exceeding GPT-4o (42%) and typical human performance (15–30%). Simonds and Yoshiyama (2025) extend LADDER to inference time via TTRL (Test-Time Reinforcement Learning), which generates problem variants at test time and runs brief RL updates on them before answering — boosting the 7B model from 73% to 90% on the MIT Integration Bee, exceeding OpenAI o1 on that benchmark.
LADDER is not latent compute in the architectural sense. The recursive decomposition happens at the token level — the model generates simpler variants as text. The compute budget is denominated in tokens. What LADDER changes is not where the compute happens but how the training signal is structured: a self-generated difficulty curriculum rather than a fixed human-designed one.
This places LADDER at the far right of Figure 1 in a different sense. Its inference compute (TTRL) is still token-based. But its training compute is recursive in a way that fundamentally changes what the model can do — not by adding architectural depth, but by restructuring the problem space. The relationship to [→ A12: RL as Educator] is direct: LADDER is a special case of RL-as-educator where the teacher and student are the same model at different difficulty levels.
Key Takeaway: LADDER demonstrates that self-directed recursive problem decomposition, guided only by a verifiable reward signal, can produce capability jumps that neither architectural scaling nor human supervision achieves at equivalent cost. A Llama 3B model goes from 1% to 82% on undergraduate integration; a 7B model exceeds GPT-4o and human performance on the MIT Integration Bee. TTRL extends the principle to inference time: brief test-time RL on problem variants pushes the 7B model to 90%, surpassing o1. The compute is token-denominated; the capability gain is from curriculum structure, not from more tokens on the hard problem directly.
Figure 4 — Inference-Time Compute Methods: A Comparison
══════════════════════════════════════════════════════════════════════════════════════
INFERENCE-TIME COMPUTE METHODS COMPARED
──────────────────────────────────────────────────────────────────────────────────────
Method │ Compute Type │ Reasoning │ Memory │ Adaptive? │ Demonstrated On
│ │ Tokens │ Cost │ │
─────────┼───────────────┼───────────┼─────────┼───────────┼──────────────────
CoT │ Token-based │ Yes │ O(L²) │ Partial │ Math, code, logic
(RLVR) │ (output seq.) │ │ attn │ (budget) │ (A2, A5)
─────────┼───────────────┼───────────┼─────────┼───────────┼──────────────────
ETD │ Latent │ No │ Const. │ Per-token │ GSM8K +28.4%
(FAIR) │ (layer iter.) │ (0 tokens)│ │ depth k │ MATH +36%
│ │ │ │ │ OLMo-2 1B
─────────┼───────────────┼───────────┼─────────┼───────────┼──────────────────
RD-VLA │ Latent │ No │ Const. │ Yes (conv.│ Robotics: 0%→90%+
(UW/AI2) │ (recur. head) │ (0 tokens)│         │ threshold)│ 4 iters; 80× speedup
─────────┼───────────────┼───────────┼─────────┼───────────┼──────────────────
LADDER │ Token-based │ Yes │ O(L²) │ Via │ Integration: 1%→82%
(Tufa) │ (at train) │ (TTRL: │ attn │ recursion │ 7B: 73%→90% TTRL
│ │ yes) │ │ depth │ MIT Integ. Bee
─────────┼───────────────┼───────────┼─────────┼───────────┼──────────────────
CTM │ Latent │ No │ Const. │ Native │ ImageNet-1K 72.47%
(Sakana) │ (neural sync.)│ (0 tokens)│ │ (per-tick │ top-1; maze; parity
│ │ │ │ certainty)│ (small scale only)
══════════════════════════════════════════════════════════════════════════════════════
[RE] CoT/ETD/LADDER — language reasoning substrate
[WM] RD-VLA — continuous action space substrate
[FN] CTM — architecture-level, task-agnostic
══════════════════════════════════════════════════════════════════════════════════════
Figure 4: Inference-time compute methods compared on compute type, reasoning token overhead, memory cost, adaptive compute capability, and demonstrated application. CoT externalizes reasoning at O(L²) attention cost; ETD and RD-VLA internalize it at constant memory; CTM eliminates reasoning tokens entirely through temporal dynamics. LADDER is a training-time mechanism that generates token-denominated reasoning, but its TTRL variant extends to inference. The “small scale only” note for CTM is deliberate: current CTM experiments operate at 10–30M parameters; scaling behavior to LLM scale is undemonstrated.
§6 — What This Means for the WM + CL Thesis
This series has argued, since [← A3], that the plasticity crisis in continual RL is not a training problem — it is the shape that the absence of a world model takes in a neural network. Temporal computation changes what that argument implies.
If a neuron accumulates state across synchronization steps — as in the CTM — it is doing something that resembles what world models do: building a compressed, predictive representation of input dynamics. The CTM neuron does not predict external future states (that is the world model’s job in A3 and A4); it predicts its own near-term activation history. But the mechanism is structurally similar: state compression over time, predictive structure driving representation quality.
Darlow et al. (2025) recognize this connection explicitly, noting that the maze task — which required the CTM to form an internal world model of spatial relationships without positional encodings — demonstrated the CTM’s capacity for “sequential reasoning, planning, and spatial understanding.” The maze setup is designed to necessitate world model formation: the task requires building an internal representation of environment structure from sequential observation, not pattern-matching at the input level.
The GVF prediction framework from [← A4] provides a theoretical frame: GVFs are structured predictions about how the world will evolve under a given policy. CTM’s per-neuron temporal memories are predictions about how the neuron’s own activity will evolve under the current input. The two are not the same, but they instantiate the same principle: prediction over time as the mechanism that creates useful internal representations.
This suggests a convergence that the series has been building toward: the architectural conditions that produce stable world models (persistent, predictive, compressed representations of environmental dynamics) may be structurally related to the conditions that produce temporally-rich internal representations in a CTM. Both require a mechanism for state persistence. Both use prediction as the training signal. Both benefit from adaptive allocation of compute to uncertain situations.
Whether this convergence produces a unified architecture — a model that is both a continual world model and a temporal compute engine — remains open. The pieces are present in the literature. They have not yet been assembled.
Key Takeaway: CTM’s per-neuron temporal memory is structurally similar to what world models do: state compression over time, predictive structure driving representation quality. The maze experiments demonstrate that temporal compute enables the CTM to form implicit world models of spatial dynamics without explicit positional encoding. The connection to GVFs [← A4] is direct — both are structured prediction mechanisms over time. Whether temporal compute architectures can maintain useful state across episodes (rather than resetting per input) is the open question that connects A11 to the CL/WM thesis of the series.
§7 — Open Questions: Can Temporal Compute and Continual Learning Coexist?
The CTM, as described, resets its internal state per input. Each forward pass begins from learnable initial parameters (z_init, pre_acts_init) and builds its synchronization state from scratch over T ticks. There is no memory across inputs. The temporal computation is rich within a single forward pass; it is amnesiac across forward passes.
This is precisely the setting where the plasticity crisis bites hardest. If the CTM were deployed in a continual setting — processing a stream of inputs over time, with task distribution shifting — the per-neuron temporal memories would be reinitialized at each input boundary. The representation learned during tick 75 of one input does not persist to tick 1 of the next. The CTM’s synchronization state is an episode memory, not an autobiographical memory.
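The per-input reset can be made concrete with a small numpy sketch. The shapes, the synapse stand-in, and the per-neuron update below are illustrative assumptions, not the CTM's actual architecture (which uses attention, learned NLM MLPs, and subsampled synchronization pairs; see Darlow et al. 2025):

```python
import numpy as np

# Invented-shape sketch of a CTM-style forward pass: T internal ticks,
# per-neuron FIFO histories, synchronization as the latent output.
rng = np.random.default_rng(0)
D, M, T = 8, 4, 16        # neurons, FIFO history length, internal ticks

# Learnable initial state: every input starts from the same point.
z_init = rng.normal(size=D)
pre_acts_init = rng.normal(size=(D, M))
W_syn = rng.normal(size=(D, D)) / np.sqrt(D)   # stand-in synapse model
nlm_w = rng.normal(size=(D, M)) / np.sqrt(M)   # one private weight row per neuron

def forward(x):
    """One forward pass: T ticks of latent computation, zero tokens."""
    z, hist = z_init.copy(), pre_acts_init.copy()   # reset per input
    post_hist = []
    for _ in range(T):
        pre = W_syn @ z + x                          # shared synapse step
        hist = np.concatenate([hist[:, 1:], pre[:, None]], axis=1)  # FIFO
        # Each neuron's private NLM maps its own history to its activation.
        z = np.tanh((nlm_w * hist).sum(axis=1))
        post_hist.append(z)
    Z = np.stack(post_hist, axis=1)                  # (D, T) activation traces
    sync = Z @ Z.T / T                               # synchronization matrix:
    return sync                                      # the latent representation

s1 = forward(rng.normal(size=D))
s2 = forward(rng.normal(size=D))
# State resets per input: nothing from s1's ticks persists into s2.
```

Note where the amnesia lives: `z` and `hist` are rebuilt from `z_init` and `pre_acts_init` at the top of every call, so the (D, D) synchronization matrix is an episode-local object.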
Three specific open questions follow:
Q1: Does per-neuron temporal memory suffer plasticity loss? The CTM’s NLMs (private per-neuron MLPs) are trained through backpropagation. If the CTM is trained sequentially on a changing task distribution, the NLMs will face the same gradient interference dynamics that drive plasticity loss in standard networks [← A1]. The temporal compute mechanism does not, by itself, provide any protection against dead neurons or rank collapse. Whether the temporal structure provides an implicit regularizer that slows plasticity loss — because the NLMs are processing diverse histories rather than a single activation value — is an empirical question not yet answered in the literature.
Q2: What would persistent temporal memory look like? The current CTM resets its FIFO pre-activation history at each input. A persistently-temporal CTM would carry the NLM history across inputs — neuron d’s history A^t_d would include activation patterns from previous inputs as well as the current one. This is a radical departure: the neuron’s “memory” would span episodes, making the CTM a genuinely stateful architecture across the input stream. The memory management challenge — how to prevent old episodes from corrupting new ones — is directly analogous to the catastrophic forgetting problem in continual learning. The circular dependency is exact: CL architectures need persistent memory; persistent memory architectures (like a CL-extended CTM) face forgetting.
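One hypothetical shape Q2 could take, with all details (shapes, the decay rule) invented for illustration, is a FIFO history that is never reset at input boundaries, only decayed:

```python
import numpy as np

# Hypothetical sketch of a persistently-temporal history buffer. The
# decay factor is a crude, assumed guard against old episodes dominating;
# nothing here is from the CTM paper.

class PersistentHistory:
    def __init__(self, n_neurons=8, history_len=4, decay=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.hist = rng.normal(size=(n_neurons, history_len))
        self.decay = decay

    def tick(self, pre_activations):
        # FIFO update, but self.hist is NEVER reset between inputs:
        # neuron d's history now spans episodes, not just ticks.
        self.hist = np.concatenate(
            [self.decay * self.hist[:, 1:], pre_activations[:, None]], axis=1
        )
        return self.hist

mem = PersistentHistory()
before = mem.hist.copy()
mem.tick(np.zeros(8))   # a "new input" arrives: old state carries over
assert np.allclose(mem.hist[:, :-1], 0.9 * before[:, 1:])
```

Even this toy exposes the forgetting trade-off the paragraph above describes: a high decay preserves old episodes (and lets them corrupt new ones), a low decay discards them (and reproduces the per-input reset).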
Q3: Can CTM-scale thinking be combined with LLM-scale language? The CTM is trained on non-language tasks in the current paper. Whether the per-neuron temporal memory mechanism transfers to autoregressive language modeling — where the “input” is a token sequence, not an image or a maze state — is entirely open. Language modeling has its own rich temporal structure (context windows, KV cache, positional encoding), and it is not clear whether CTM’s NLMs provide additional benefit on top of existing transformer attention mechanisms, or whether they would need to replace them.
These three questions define the research frontier that sits between this article and the series’ larger thesis. They are not objections to CTM’s results — those results are clear and reproducible. They are the natural next experiments.
Key Takeaway: The CTM demonstrates temporal compute within a single forward pass. It does not demonstrate temporal compute across a stream of inputs — the internal state resets per example. Three open questions follow: (1) does the NLM architecture suffer plasticity loss under sequential training; (2) what would persistent cross-episode temporal memory look like, and does it reproduce the forgetting problem; (3) can the mechanism transfer to language modeling. These questions are not limitations of the CTM’s current results — they are the research program its results make tractable to formulate.
§ What Comes Next
This article has traced a spectrum from token-based CoT to CTM’s fully latent temporal compute, through ETD (recursive latent layer iteration), RD-VLA (recurrent depth for action), and LADDER (recursive problem decomposition). The central finding is that inference-time compute and reasoning depth are not synonymous with token generation. Multiple mechanisms now exist for increasing reasoning depth without proportional token cost: iterating over layer subsets (ETD), adding recurrent depth to action models (RD-VLA), and building temporal dynamics at the neuron level (CTM).
Two questions remain for the final article in this series.
The first: if the compute is latent — if the model can think deeply without producing tokens — what does the teacher in an RL training loop look like? The LADDER result is suggestive: a model that recursively generates its own curriculum can teach itself to solve problems far beyond its initial capability. [→ A12: RL as Educator] extends this logic: when the teacher is trained by RL to generate better training data — not just to solve problems — the self-improvement loop becomes recursive in a different and potentially more powerful sense.
The second: if CTM-style temporal compute is the “inner loop” of a self-modifying agent [← A8], then the evaluator in a DGM-style system might reason about proposed modifications using latent temporal compute rather than token-based reasoning. That would mean a self-modifying agent can evaluate its own modifications deeply, without any of that evaluation being visible in the token stream. What that implies for interpretability and alignment is not a comfortable question — but it is the right one to be asking at the frontier.
Final Key Takeaways
Token generation is one point on a spectrum, not the only location for reasoning compute. ETD, RD-VLA, and CTM all demonstrate that reasoning depth can be increased through latent computation — iterating layers, iterating recurrent action heads, iterating internal ticks — without proportional token cost.
CTM’s two innovations are synergistic and both necessary. Per-neuron temporal memory (NLMs with private weights) and neural synchronization as latent representation each contribute to the CTM’s capabilities; ablations removing either component substantially degrade performance. The combination is what produces adaptive compute and interpretable reasoning patterns.
Adaptive compute is native in CTM, not bolted on. The loss function’s t₁/t₂ selection — minimum loss tick and maximum certainty tick — means the model learns to allocate internal ticks to input difficulty without any explicit stopping module. This is qualitatively different from early-exit networks that add a separate halting classifier.
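That selection rule can be sketched directly, with illustrative tensor shapes and certainty defined as one minus normalized prediction entropy, following the paper's description:

```python
import numpy as np

# Sketch of the t1/t2 loss: compute a loss and a certainty at every
# internal tick, then average the loss at the minimum-loss tick (t1)
# and the maximum-certainty tick (t2). Shapes are illustrative.

def ctm_loss(per_tick_logits, target):
    """per_tick_logits: (T, C) classification logits at each tick."""
    T, C = per_tick_logits.shape
    # Softmax per tick.
    e = np.exp(per_tick_logits - per_tick_logits.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    losses = -np.log(p[:, target] + 1e-9)            # cross-entropy per tick
    entropy = -(p * np.log(p + 1e-9)).sum(axis=1)
    certainty = 1.0 - entropy / np.log(C)            # in [0, 1]
    t1 = int(np.argmin(losses))                      # best-performing tick
    t2 = int(np.argmax(certainty))                   # most confident tick
    return 0.5 * (losses[t1] + losses[t2])
```

Because t1 and t2 are recomputed per example, gradient pressure lands on whichever ticks the model actually used well for that input; no halting classifier is ever trained.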
ETD is the most immediately deployable. It requires only mid-training on existing pretrained models, adds no parameters, uses no new data, and produces +28.4% on GSM8K and +36% on MATH for OLMo-2 1B. The key implementation choice — which layers to iterate — is automated via the Kneedle algorithm.
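The mechanism reduces to re-applying a chosen contiguous block of layers k times per pass. In the sketch below, the affine residual "layers" are stand-ins for transformer blocks, and the fixed block indices replace ETD's automated Kneedle selection; both are assumptions for illustration.

```python
import numpy as np

# Minimal sketch of ETD-style recursive latent iteration: same
# parameters, more forward compute, zero extra tokens.
rng = np.random.default_rng(0)
d_model, n_layers = 16, 6
layers = [
    (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model),
     rng.normal(size=d_model) * 0.01)
    for _ in range(n_layers)
]

def apply_layer(h, layer):
    W, b = layer
    return h + np.tanh(h @ W + b)       # residual block stand-in

def forward(h, think_block=(2, 4), k=1):
    """k = 1 recovers the standard pass; k > 1 iterates the thinking
    block, spending latent compute with no new parameters."""
    lo, hi = think_block
    for i, layer in enumerate(layers):
        reps = k if lo <= i < hi else 1
        for _ in range(reps):
            h = apply_layer(h, layer)
    return h

x = rng.normal(size=d_model)
shallow = forward(x, k=1)
deep = forward(x, k=4)   # same weights, extra latent iterations of layers 2-3
```

The deployment appeal is visible in the sketch: `k` is an inference-time dial, and the only training-time decision, which layers form the thinking block, is what ETD automates with knee detection over per-layer contribution curves.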
LADDER demonstrates recursive self-improvement through problem structure, not architectural depth. A Llama 3B model improves from 1% to 82% on undergraduate integration through self-generated difficulty gradients and RL, without any human-designed curriculum. The verification signal is the only external component.
CTM scales to small experiments; whether it scales to LLMs is the open question. The current results (maze, ImageNet-1K, parity) are at the 10–30M parameter scale. Darlow et al. (2025) explicitly state these are preliminary. The economic argument — inference cost decoupled from reasoning depth — is only as strong as the scaling hypothesis, which remains untested.
Temporal compute connects to the WM + CL thesis. Per-neuron temporal memory is structurally related to what world models do: state compression over time, predictive structure driving representation quality. Whether CL + WM + temporal compute can be integrated into a single architecture that neither forgets nor stagnates is the open engineering problem this series has been building toward.
References
[1] Darlow, L., Regan, C., Risi, S., Seely, J., & Jones, L. (2025). Continuous Thought Machines. 39th Conference on Neural Information Processing Systems (NeurIPS 2025). arXiv:2505.05522.
[2] Koishekenov, Y., Lipani, A., & Cancedda, N. (2025). Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts. Preprint, October 2025. FAIR at Meta / University College London. arXiv:2510.07358.
[3] Simonds, T., & Yoshiyama, A. (2025). LADDER: Self-Improving LLMs Through Recursive Problem Decomposition. Preprint, March 2025. Tufa Labs. arXiv:2503.00735.
[4] Tur, Y., Naghiyev, J., Fang, H., Tsai, W.-C., Duan, J., Fox, D., & Krishna, R. (2026). Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision–Language–Action Models via Latent Iterative Reasoning. Preprint, 2026. Stanford University / University of Washington / Allen Institute for Artificial Intelligence. arXiv:2602.07845.
Article 12: RL as Educator: Training Teachers, Not Just Students

