The Problem
Contemporary large language models exhibit a sharp performance collapse as problem complexity increases. Recent research demonstrates that even state-of-the-art reasoning models fail on algorithmic tasks beyond a certain complexity threshold, regardless of how much "thinking" they're allowed to do.
Core Hypothesis
We argue that this collapse is not an inherent limitation of neural sequence models, but rather a consequence of single-trajectory, deterministic inference.
In current LLMs:
- Internal computation is deterministic for a fixed input
- Stochasticity is only injected at decoding time via token sampling
- This confines variability to surface realization, not internal reasoning
In contrast, human reasoning involves:
- Persistent internal state drift
- Implicit hypothesis branching
- Reframing and re-instantiation of problem representations
- Convergence from multiple cognitive trajectories
The Solution: Structured Stochasticity
- Standard (Weak): Input X --> [Deterministic h = f(X)] --> Sample Output Y
- Proposed (Strong): Input X + z ~ P(z|X) --> [Stochastic h = f(X, z)] --> Sample Output Y
By injecting a latent variable z into the model's hidden states, we enable distinct internal reasoning trajectories for the same input. This allows multiple problem decompositions and solution strategies to be explored.
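As a minimal, self-contained illustration of the two regimes (a toy linear map stands in for the network's computation; `noise_scale` is purely illustrative, not part of any released API):

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 8)   # toy stand-in for the network's deterministic computation f
x = torch.randn(8)      # a fixed input X

# Standard: the hidden state is a pure function of the input
h_det = W @ x           # identical on every call

# Proposed: a latent draw z perturbs the hidden state, so repeated runs on the
# same X follow distinct internal trajectories
def stochastic_hidden(x, noise_scale=0.1):
    h = W @ x
    z = noise_scale * torch.randn_like(h)   # z ~ P(z | X), here isotropic Gaussian
    return h + z

print(torch.allclose(stochastic_hidden(x), stochastic_hidden(x)))  # False
```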
Framework Components
Noise Injection
Multiple strategies for injecting noise into hidden states: Gaussian, Uniform, Annealed (noise scale decays over the course of generation), and Layer-Selective (perturbing only chosen layers). A sketch of these strategies follows.
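The sketch below shows what such strategies could look like applied to a hidden-state tensor; function names, signatures, and default values are illustrative assumptions, not the package's actual API.

```python
import torch

def gaussian_noise(h, scale=0.1):
    # Isotropic Gaussian perturbation of a hidden-state tensor
    return h + scale * torch.randn_like(h)

def uniform_noise(h, scale=0.1):
    # Uniform perturbation in [-scale, scale]
    return h + scale * (2 * torch.rand_like(h) - 1)

def annealed_noise(h, scale=0.1, step=0, total_steps=256):
    # Noise that decays linearly as generation progresses
    current = scale * max(0.0, 1.0 - step / total_steps)
    return h + current * torch.randn_like(h)

def layer_selective_noise(h, layer_idx, target_layers=(8, 12, 16), scale=0.1):
    # Perturb only a chosen subset of transformer layers
    if layer_idx in target_layers:
        return h + scale * torch.randn_like(h)
    return h
```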
PyTorch Hooks
Non-invasive modification of hidden states using forward hooks. Supports all major transformer architectures (Llama, Mistral, GPT-2, etc.).
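A minimal sketch of the hook mechanism using Hugging Face `transformers`; the choice of `gpt2`, the layer index, and `noise_scale` are assumptions for illustration, and the package's own helpers may wrap this differently.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small decoder-only model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

noise_scale = 0.1

def add_noise_hook(module, inputs, output):
    # A forward hook can replace a module's output by returning a new value.
    # Depending on the transformers version, a block returns either a tensor
    # or a tuple whose first element is the hidden-state tensor.
    if isinstance(output, tuple):
        hidden = output[0] + noise_scale * torch.randn_like(output[0])
        return (hidden,) + output[1:]
    return output + noise_scale * torch.randn_like(output)

# Attach to one transformer block without modifying model code
handle = model.transformer.h[6].register_forward_hook(add_noise_hook)

inputs = tok("Solve the 3-disk Tower of Hanoi:", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32, do_sample=True,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook when done
```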
Benchmark Tasks
Algorithmic tasks with exact solutions: Tower of Hanoi (flagship), Arithmetic Sequences, and Logical Deduction puzzles.
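Because Tower of Hanoi has a closed-form optimal solution, model outputs can be checked exactly. The sketch below shows a reference generator and verifier; the helper names are hypothetical, not necessarily the package's.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    # Optimal recursive solution: 2^n - 1 moves, listed as (from_peg, to_peg) pairs
    if n == 0:
        return []
    return (hanoi_moves(n - 1, source, spare, target)
            + [(source, target)]
            + hanoi_moves(n - 1, spare, target, source))

def is_correct(predicted_moves, n):
    # Exact-match check against the unique optimal move sequence
    return predicted_moves == hanoi_moves(n)

assert len(hanoi_moves(3)) == 2 ** 3 - 1   # 7 moves for 3 disks
```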
Evaluation
Trajectory aggregation via majority voting, oracle bounds, and K-scaling analysis to measure how performance improves with more trajectories.
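A sketch of how K sampled trajectories might be aggregated (illustrative only): majority voting picks the most common final answer, the oracle bound counts a problem as solved if any trajectory is correct, and K-scaling tracks accuracy as more trajectories are included.

```python
from collections import Counter

def majority_vote(answers):
    # Most common final answer across K trajectories
    return Counter(answers).most_common(1)[0][0]

def oracle_correct(answers, gold):
    # Upper bound: solved if *any* trajectory found the gold answer
    return any(a == gold for a in answers)

def k_scaling(answers, gold, k_values=(1, 5, 10)):
    # Majority-vote correctness restricted to the first k trajectories
    return {k: majority_vote(answers[:k]) == gold for k in k_values}

answers = ["7 moves", "7 moves", "9 moves", "7 moves", "8 moves"]
print(majority_vote(answers))                 # "7 moves"
print(oracle_correct(answers, "7 moves"))     # True
print(k_scaling(answers, "7 moves"))          # {1: True, 5: True, 10: True}
```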
Quick Start
# Install the package
pip install -e .

# Run a quick experiment (Python)
from structured_stochasticity import run_quick_experiment

results = run_quick_experiment(
    model_name="meta-llama/Llama-3.2-1B",
    noise_scale=0.1,
    complexity_range=(3, 5),
    k_values=[1, 5, 10],
    trials=5,
)
Or use the CLI:
# Run with default config
ss-experiment --config configs/default.yaml
# Override parameters
ss-experiment --model meta-llama/Llama-3.2-1B --scale 0.15
Key Question Being Tested
If performance improves as the number of sampled trajectories K increases, this would suggest that:
- Reasoning collapse arises from trajectory determinism, not lack of reasoning representations
- Internal stochasticity is qualitatively different from increased chain-of-thought length
- Future reasoning models should explicitly model latent reasoning trajectories
Documentation
- Research Concepts - Detailed theoretical background from the research papers
- API Reference - Complete documentation of all Python modules