Research Concepts

Theoretical foundations for structured stochasticity in LLM reasoning

Beyond Single-Trajectory Reasoning

Motivation

Recent work, most notably "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity", demonstrates that contemporary reasoning-oriented language models exhibit a sharp performance collapse as problem complexity increases. This collapse is observed even when models are allowed to generate extensive intermediate reasoning traces and are not constrained by token budgets.

A key empirical observation is that reasoning effort (e.g., length or apparent depth of reasoning traces) initially increases with complexity and then unexpectedly declines, coinciding with accuracy collapse. This suggests a structural limitation in current reasoning models rather than mere resource exhaustion.

Core Hypothesis

We argue that this collapse is not an inherent limitation of neural sequence models, but rather a consequence of a single-trajectory, deterministic inference regime. In current large language models, internal computation is deterministic for a fixed input, and stochasticity is typically injected only at decoding time via token sampling. This confines variability to surface realization rather than internal reasoning pathways.

In contrast, human reasoning appears to involve:

  - exploration of multiple problem framings and decompositions for the same problem,
  - variability in the internal strategy adopted, not only in how an answer is phrased,
  - the ability to reframe or restart when a line of reasoning reaches a dead end.

We hypothesize that enabling structured stochasticity in internal reasoning trajectories can mitigate or delay the observed reasoning collapse.

Weak vs. Strong Stochasticity

Let \(X\) denote an input problem and \(Y\) the output answer.

Weak Stochasticity (Standard Decoding)

\[ \begin{align} h &= f_\theta(X) \\ Y &\sim P(Y \mid h) \end{align} \]

Here, all randomness is confined to the output distribution. The internal representation \(h\)—and thus the reasoning trajectory—remains fixed.

Strong Stochasticity (Proposed)

\[ \begin{align} z &\sim P(z \mid X) \\ h &= f_\theta(X, z) \\ Y &\sim P(Y \mid h) \end{align} \]

The latent variable \(z\) induces distinct internal reasoning trajectories for the same input, enabling multiple problem decompositions and solution strategies to be explored.
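As a rough illustration of the distinction (a toy sketch, not this framework's implementation; `f_theta`, `answer_head`, and the dimensions are made up for the example), the two regimes differ only in whether a latent \(z\) enters the internal computation:

```python
# Toy contrast between weak and strong stochasticity.
import torch
import torch.nn as nn

d_model, latent_dim, vocab = 64, 8, 100

f_theta = nn.Sequential(nn.Linear(d_model + latent_dim, d_model), nn.ReLU(),
                        nn.Linear(d_model, d_model))
answer_head = nn.Linear(d_model, vocab)

x = torch.randn(1, d_model)  # stand-in for an encoded input X

# Weak stochasticity: h is deterministic; randomness only enters at output sampling.
h_weak = f_theta(torch.cat([x, torch.zeros(1, latent_dim)], dim=-1))
y_weak = torch.distributions.Categorical(logits=answer_head(h_weak)).sample()

# Strong stochasticity: a latent z ~ P(z | X) perturbs the internal computation,
# so repeated calls explore different internal trajectories for the same X.
z = torch.randn(1, latent_dim)  # here simply z ~ N(0, I)
h_strong = f_theta(torch.cat([x, z], dim=-1))
y_strong = torch.distributions.Categorical(logits=answer_head(h_strong)).sample()
```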

Why This Matters for Algorithmic Tasks

Tasks such as Tower of Hanoi exhibit:

  - a simple, fully specified recursive algorithm,
  - solution lengths that grow exponentially with problem size (\(2^n - 1\) moves for \(n\) disks),
  - strong sensitivity to the initial problem decomposition, since a long move sequence must be executed consistently.

Empirical failures on such tasks may reflect early commitment to a suboptimal internal representation rather than an inability to represent the underlying algorithm. A single deterministic trajectory cannot recover once such a commitment is made.

Structured internal stochasticity enables trajectory resampling, analogous to how humans naturally reframe or restart reasoning when encountering cognitive dead ends.

Cluster-Aware Perturbations

The Problem with Naive Noise

Naive approaches such as adding isotropic Gaussian noise to hidden states tend to degrade performance, as they push activations off the learned semantic manifold. Transformer hidden states are not uniformly distributed in \(\mathbb{R}^d\). Empirically, they form anisotropic, semantically meaningful clusters corresponding to abstraction level, task structure, reasoning mode, and other latent factors.

Random perturbations ignore this structure and frequently disrupt attention and compositionality.

Structured Perturbations

Instead, we propose injecting perturbations that are:

  - aligned with directions of known semantic or functional clusters in representation space,
  - orthogonalized against the current hidden state, so the representation is reframed rather than rescaled,
  - constrained in magnitude so that the perturbed state remains on the learned manifold,
  - sampled once per trajectory and held fixed thereafter.

The goal is not to inject "noise" per se, but to enable controlled movement between nearby semantic basins in representation space.

Formalization

Let \(X\) denote an input prompt and let \(h_l(X) \in \mathbb{R}^d\) be the hidden representation at layer \(l\) (e.g., the mean over token positions, or the representation at a designated reasoning token).

In standard inference:

\[h_{l+1} = f_l(h_l)\]

We consider a modified update rule:

\[h_{l+1} = f_l(h_l + \epsilon)\]

where the perturbation \(\epsilon\) is drawn from a structured distribution rather than an isotropic one.
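A minimal way to realize this update rule in a PyTorch model is a forward pre-hook that adds a fixed \(\epsilon\) to the input of the module implementing \(f_l\). The `nn.Linear` below merely stands in for a transformer block; the attribute path to the real layer depends on the architecture and is not specified by this document:

```python
import torch
import torch.nn as nn

def make_injection_hook(epsilon: torch.Tensor):
    """Forward pre-hook that replaces the layer input h_l with h_l + epsilon."""
    def pre_hook(module, args):
        return (args[0] + epsilon,) + args[1:]
    return pre_hook

# Toy stand-in for f_l; in practice this would be one block of the base model.
d = 16
layer_module = nn.Linear(d, d)

# Sampled once per trajectory and held fixed; the cluster-aware construction
# described below could be substituted for this plain Gaussian draw.
epsilon = 0.1 * torch.randn(d)
handle = layer_module.register_forward_pre_hook(make_injection_hook(epsilon))

h_l = torch.randn(2, d)
h_next = layer_module(h_l)  # computes f_l(h_l + epsilon)
handle.remove()
```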

Cluster-Aware Perturbation Construction

Assume access to a set of direction vectors \(\{v_k\}\) corresponding to semantic or functional clusters in representation space (e.g., derived via clustering, PCA, or probing).
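One plausible recipe for such a direction bank (an illustrative assumption, not a procedure prescribed by this framework) is to cluster hidden states collected at layer \(l\) over a corpus and use the normalized, mean-centered centroids as \(\{v_k\}\):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_directions(H: np.ndarray, k: int = 8) -> np.ndarray:
    """H: (N, d) array of layer-l hidden states collected over a corpus."""
    mean = H.mean(axis=0, keepdims=True)
    centroids = KMeans(n_clusters=k, n_init=10, random_state=0).fit(H).cluster_centers_
    V = centroids - mean  # directions relative to the global mean
    return V / np.linalg.norm(V, axis=1, keepdims=True)

# Synthetic data standing in for real hidden states:
H = np.random.randn(1000, 64).astype(np.float32)
V = cluster_directions(H, k=8)  # V[k] is a unit direction v_k
```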

A single perturbation is constructed as:

\[ \begin{align} v_k &\leftarrow \text{sample cluster direction} \\ v_k^\perp &= v_k - \frac{h_l^\top v_k}{\|h_l\|^2} h_l \\ \epsilon &= \alpha \cdot \frac{v_k^\perp}{\|v_k^\perp\|} \end{align} \]

The orthogonalization step removes the component of \(v_k\) along \(h_l\), so the perturbation alters the internal framing rather than merely rescaling the representation along its existing dominant direction.
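This construction transcribes directly into code; `cluster_aware_perturbation` is an illustrative helper name, and \(v_k\) would come from a direction bank such as the one sketched earlier. The result can be plugged into the injection hook above:

```python
import torch

def cluster_aware_perturbation(h_l: torch.Tensor, v_k: torch.Tensor, alpha: float) -> torch.Tensor:
    """h_l, v_k: 1-D tensors of size d. Returns epsilon with ||epsilon|| = alpha and epsilon orthogonal to h_l."""
    v_perp = v_k - (h_l @ v_k) / (h_l @ h_l) * h_l  # remove the component along h_l
    return alpha * v_perp / v_perp.norm()

d = 64
h_l = torch.randn(d)
v_k = torch.randn(d)  # sampled cluster direction
epsilon = cluster_aware_perturbation(h_l, v_k, alpha=0.05)
assert torch.isclose(h_l @ epsilon, torch.tensor(0.0), atol=1e-3)
```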

Geometric Constraints

To remain on-manifold, the perturbation magnitude \(\alpha\) is constrained such that:

\[\cos(h_l, h_l + \epsilon) \ge \tau\]

or, since \(\epsilon \perp h_l\) by construction, equivalently \(\|\epsilon\| \le \delta\) for a suitably small \(\delta\) determined by \(\tau\) and \(\|h_l\|\). This preserves the problem semantics while allowing alternative internal decompositions.
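Because \(\epsilon\) is orthogonal to \(h_l\), the two forms of the constraint are linked explicitly (a short derivation, not part of the original formulation):

\[ \begin{align} \cos(h_l, h_l + \epsilon) &= \frac{h_l^\top (h_l + \epsilon)}{\|h_l\| \, \|h_l + \epsilon\|} = \frac{\|h_l\|^2}{\|h_l\| \sqrt{\|h_l\|^2 + \|\epsilon\|^2}} = \frac{1}{\sqrt{1 + \|\epsilon\|^2 / \|h_l\|^2}} \\ \cos(h_l, h_l + \epsilon) \ge \tau &\iff \|\epsilon\| \le \|h_l\| \sqrt{\tau^{-2} - 1} \end{align} \]

so setting \(\delta = \|h_l\| \sqrt{\tau^{-2} - 1}\) (equivalently \(\|h_l\| \tan(\arccos \tau)\)) makes the cosine and norm constraints interchangeable.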

Interpretation

Each sampled perturbation corresponds to a different internal "strategy" or representational framing. Importantly, the perturbation is sampled once per trajectory and held fixed across subsequent layers, producing a coherent alternative reasoning path rather than accumulated noise.

Key Insight: The perturbation enables different "problem framings" at the beginning of reasoning, while maintaining coherent execution once a framing is chosen.

Proposed Experimental Test

We propose a minimal intervention experiment:

  1. Fix a base language model and token budget
  2. Introduce a low-dimensional latent variable sampled once per response
  3. Generate \(K\) parallel reasoning trajectories per input
  4. Select or verify answers via agreement or lightweight checking
  5. Measure maximum solvable problem complexity (e.g., number of Hanoi disks) as a function of \(K\)

Improved scaling with \(K\) under constant token budgets would indicate that prior conclusions about reasoning limits are contingent on inference regime rather than model capacity.
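A compact sketch of steps 2 to 4 of this protocol is given below; `generate_answer` is a placeholder for running the base model under a freshly sampled latent and returning a final answer, and nothing here is an API of this framework:

```python
from collections import Counter
import random

def generate_answer(problem: str, seed: int) -> str:
    # Placeholder: in a real run this would sample a latent z (or perturbation
    # epsilon), generate a reasoning trajectory, and extract the final answer.
    random.seed(seed)
    return random.choice(["answer_A", "answer_B"])

def solve_with_k_trajectories(problem: str, k: int) -> str:
    answers = [generate_answer(problem, seed=i) for i in range(k)]
    # Select by agreement (majority vote); a lightweight verifier could be used
    # instead when the task admits cheap checking (e.g., Tower of Hanoi moves).
    return Counter(answers).most_common(1)[0][0]

print(solve_with_k_trajectories("hanoi_7_disks", k=8))
```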

Implications

If validated, this would suggest that:

  - the reported reasoning collapse is contingent on the single-trajectory, deterministic inference regime rather than on a hard capacity limit of the underlying models,
  - evaluations of reasoning ability should control for the inference regime (number and diversity of internal trajectories), not only model size and token budget,
  - structured stochasticity in internal computation is a practical lever for extending the range of solvable problem complexity.

This proposal reframes recent negative results on LLM reasoning as artifacts of single-path inference. Introducing structured stochasticity into internal computation offers a principled and biologically inspired avenue for scaling reasoning performance with problem complexity.

Implementation in This Framework

This framework implements several noise injection strategies to test these ideas:

| Strategy | Description | Use Case |
| --- | --- | --- |
| `gaussian` | Standard Gaussian noise \(z \sim N(0, \sigma^2)\) | Baseline exploration |
| `uniform` | Bounded uniform noise in \([-\sigma, \sigma]\) | Controlled perturbation range |
| `annealed` | Noise decreasing over generation: \(\sigma_t = \sigma \cdot \gamma^t\) | Strong early exploration, stable execution |
| `once` | Sample once per trajectory, reuse | Coherent alternative paths |
| `layer_selective` | Different scales for different layers | Test early vs. late layer sensitivity |
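As a rough, framework-agnostic illustration of how these schedules behave (the function names below are made up for the sketch; the framework's actual interfaces and parameter names are documented in the API Reference):

```python
import torch

def gaussian_noise(shape, sigma):
    # z ~ N(0, sigma^2)
    return sigma * torch.randn(shape)

def uniform_noise(shape, sigma):
    # bounded uniform noise in [-sigma, sigma]
    return sigma * (2 * torch.rand(shape) - 1)

def annealed_sigma(sigma, gamma, t):
    # sigma_t = sigma * gamma^t, decaying over generation step t
    return sigma * gamma ** t

# "once": sample a single epsilon per trajectory and reuse it at every step/layer.
epsilon_once = gaussian_noise((64,), sigma=0.05)

# "layer_selective": map layer index -> scale, e.g. perturb only early layers.
layer_scales = {l: (0.05 if l < 8 else 0.0) for l in range(32)}
```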

See the API Reference for detailed documentation on using these strategies.