The Problem
Contemporary large language models exhibit a sharp performance collapse as problem complexity increases. Recent research demonstrates that even state-of-the-art reasoning models fail on algorithmic tasks beyond a certain complexity threshold, regardless of how much "thinking" they're allowed to do.
Core Hypothesis
We argue that this collapse is not an inherent limitation of neural sequence models, but rather a consequence of single-trajectory, deterministic inference.
In current LLMs:
- Internal computation is deterministic for a fixed input
- Stochasticity is only injected at decoding time via token sampling
- This confines variability to surface realization, not internal reasoning
In contrast, human reasoning involves:
- Persistent internal state drift
- Implicit hypothesis branching
- Reframing and re-instantiation of problem representations
- Convergence from multiple cognitive trajectories
The Solution: Structured Stochasticity
- Standard (Weak): Input X --> [Deterministic h = f(X)] --> Sample Output Y
- Proposed (Strong): Input X + z ~ P(z|X) --> [Stochastic h = f(X, z)] --> Sample Output Y
By injecting a latent variable z into the model's hidden states, we enable distinct internal reasoning trajectories for the same input. This allows multiple problem decompositions and solution strategies to be explored.
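As a minimal, self-contained illustration of the two regimes (a toy linear map stands in for the network's computation; `noise_scale` is purely illustrative, not part of any released API):

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 8)   # toy stand-in for the network's deterministic computation f
x = torch.randn(8)      # a fixed input X

# Standard: the hidden state is a pure function of the input
h_det = W @ x           # identical on every call

# Proposed: a latent draw z perturbs the hidden state, so repeated runs on the
# same X follow distinct internal trajectories
def stochastic_hidden(x, noise_scale=0.1):
    h = W @ x
    z = noise_scale * torch.randn_like(h)   # z ~ P(z | X), here isotropic Gaussian
    return h + z

print(torch.allclose(stochastic_hidden(x), stochastic_hidden(x)))  # False
```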
Framework Components
Noise Injection
Multiple strategies for injecting noise into hidden states: Gaussian, Uniform, Annealed (noise scale decays over the course of generation), and Layer-Selective (perturbing only chosen layers). A sketch of these strategies follows.
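The sketch below shows what such strategies could look like applied to a hidden-state tensor; function names, signatures, and default values are illustrative assumptions, not the package's actual API.

```python
import torch

def gaussian_noise(h, scale=0.1):
    # Isotropic Gaussian perturbation of a hidden-state tensor
    return h + scale * torch.randn_like(h)

def uniform_noise(h, scale=0.1):
    # Uniform perturbation in [-scale, scale]
    return h + scale * (2 * torch.rand_like(h) - 1)

def annealed_noise(h, scale=0.1, step=0, total_steps=256):
    # Noise that decays linearly as generation progresses
    current = scale * max(0.0, 1.0 - step / total_steps)
    return h + current * torch.randn_like(h)

def layer_selective_noise(h, layer_idx, target_layers=(8, 12, 16), scale=0.1):
    # Perturb only a chosen subset of transformer layers
    if layer_idx in target_layers:
        return h + scale * torch.randn_like(h)
    return h
```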
PyTorch Hooks
Non-invasive modification of hidden states using forward hooks. Supports all major transformer architectures (Llama, Mistral, GPT-2, etc.).
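A minimal sketch of the hook mechanism using Hugging Face `transformers`; the choice of `gpt2`, the layer index, and `noise_scale` are assumptions for illustration, and the package's own helpers may wrap this differently.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small decoder-only model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

noise_scale = 0.1

def add_noise_hook(module, inputs, output):
    # A forward hook can replace a module's output by returning a new value.
    # Depending on the transformers version, a block returns either a tensor
    # or a tuple whose first element is the hidden-state tensor.
    if isinstance(output, tuple):
        hidden = output[0] + noise_scale * torch.randn_like(output[0])
        return (hidden,) + output[1:]
    return output + noise_scale * torch.randn_like(output)

# Attach to one transformer block without modifying model code
handle = model.transformer.h[6].register_forward_hook(add_noise_hook)

inputs = tok("Solve the 3-disk Tower of Hanoi:", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32, do_sample=True,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook when done
```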
Benchmark Tasks
Algorithmic tasks with exact solutions: Tower of Hanoi (flagship), Arithmetic Sequences, and Logical Deduction puzzles.
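Because Tower of Hanoi has a closed-form optimal solution, model outputs can be checked exactly. The sketch below shows a reference generator and verifier; the helper names are hypothetical, not necessarily the package's.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    # Optimal recursive solution: 2^n - 1 moves, listed as (from_peg, to_peg) pairs
    if n == 0:
        return []
    return (hanoi_moves(n - 1, source, spare, target)
            + [(source, target)]
            + hanoi_moves(n - 1, spare, target, source))

def is_correct(predicted_moves, n):
    # Exact-match check against the unique optimal move sequence
    return predicted_moves == hanoi_moves(n)

assert len(hanoi_moves(3)) == 2 ** 3 - 1   # 7 moves for 3 disks
```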
Evaluation
Trajectory aggregation via majority voting, oracle bounds, and K-scaling analysis to measure how performance improves with more trajectories.
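A sketch of how K sampled trajectories might be aggregated (illustrative only): majority voting picks the most common final answer, the oracle bound counts a problem as solved if any trajectory is correct, and K-scaling tracks accuracy as more trajectories are included.

```python
from collections import Counter

def majority_vote(answers):
    # Most common final answer across K trajectories
    return Counter(answers).most_common(1)[0][0]

def oracle_correct(answers, gold):
    # Upper bound: solved if *any* trajectory found the gold answer
    return any(a == gold for a in answers)

def k_scaling(answers, gold, k_values=(1, 5, 10)):
    # Majority-vote correctness restricted to the first k trajectories
    return {k: majority_vote(answers[:k]) == gold for k in k_values}

answers = ["7 moves", "7 moves", "9 moves", "7 moves", "8 moves"]
print(majority_vote(answers))                 # "7 moves"
print(oracle_correct(answers, "7 moves"))     # True
print(k_scaling(answers, "7 moves"))          # {1: True, 5: True, 10: True}
```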
Quick Start
# Install the package
pip install -e .

# Run a quick experiment (Python)
from structured_stochasticity import run_quick_experiment

results = run_quick_experiment(
    model_name="meta-llama/Llama-3.2-1B",
    noise_scale=0.1,
    complexity_range=(3, 5),
    k_values=[1, 5, 10],
    trials=5,
)
Or use the CLI:
# Run with default config
ss-experiment --config configs/default.yaml
# Override parameters
ss-experiment --model meta-llama/Llama-3.2-1B --scale 0.15
Key Question Being Tested
If performance improves as the number of sampled trajectories K increases, this would suggest that:
- Reasoning collapse arises from trajectory determinism, not lack of reasoning representations
- Internal stochasticity is qualitatively different from increased chain-of-thought length
- Future reasoning models should explicitly model latent reasoning trajectories
Documentation
- Research Concepts - Detailed theoretical background from the research papers
- API Reference - Complete documentation of all Python modules