Evaluation & Aggregation

Key insight: with K sampled trajectories per problem, we need a strategy for selecting or combining their answers into a single prediction.
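To make the selection problem concrete, here is an illustrative majority-vote sketch in plain Python. This is not the library's implementation (that lives in TrajectoryAggregator below); it only shows the idea:

from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Illustrative only: pick the most common answer among K trajectories.

    Returns the winning answer and the agreement rate, i.e. the
    fraction of trajectories that produced it.
    """
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

answer, agreement = majority_vote(["42", "42", "41", "42", "40"])
print(answer, agreement)  # 42 0.6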

Result Data Classes

class structured_stochasticity.evaluation.AggregatedResult(selected_response, is_correct, k_trajectories, selection_method, individual_results=<factory>, agreement_rate=0.0, any_correct=False, num_correct=0, metadata=<factory>)[source]

Bases: object

Result of aggregating multiple trajectories.

Parameters:
  • selected_response (str)

  • is_correct (bool)

  • k_trajectories (int)

  • selection_method (str)

  • individual_results (list[TaskResult])

  • agreement_rate (float)

  • any_correct (bool)

  • num_correct (int)

  • metadata (dict)

selected_response: str
is_correct: bool
k_trajectories: int
selection_method: str
individual_results: list[TaskResult]
agreement_rate: float = 0.0
any_correct: bool = False
num_correct: int = 0
metadata: dict
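A minimal sketch of constructing and inspecting an AggregatedResult by hand. In practice TrajectoryAggregator.aggregate() builds these for you; the field values here are made up for illustration:

from structured_stochasticity.evaluation import AggregatedResult

# Hypothetical outcome: 3 of 5 trajectories agreed on "42".
result = AggregatedResult(
    selected_response="42",
    is_correct=True,
    k_trajectories=5,
    selection_method="majority_vote",
    agreement_rate=0.6,
    any_correct=True,
    num_correct=3,
)
print(f"{result.num_correct}/{result.k_trajectories} correct, "
      f"agreement {result.agreement_rate:.0%}")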

class structured_stochasticity.evaluation.EvaluationResult(task_name, k_trajectories, selection_method, accuracy_by_complexity=<factory>, max_solved_complexity=0, all_results=<factory>, oracle_accuracy_by_complexity=<factory>)[source]

Bases: object

Results from evaluating across complexity levels.

Parameters:
  • task_name (str)

  • k_trajectories (int)

  • selection_method (str)

  • accuracy_by_complexity (dict[int, float])

  • max_solved_complexity (int)

  • all_results (list[AggregatedResult])

  • oracle_accuracy_by_complexity (dict[int, float])

task_name: str
k_trajectories: int
selection_method: str
accuracy_by_complexity: dict[int, float]
max_solved_complexity: int = 0
all_results: list[AggregatedResult]
oracle_accuracy_by_complexity: dict[int, float]
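A sketch of inspecting one EvaluationResult, assuming `res` was taken from the dict returned by Evaluator.evaluate() (documented below); the variable names are illustrative:

# `res` is one EvaluationResult, e.g. results[5] from the
# dict returned by Evaluator.evaluate().
for complexity, accuracy in sorted(res.accuracy_by_complexity.items()):
    oracle = res.oracle_accuracy_by_complexity.get(complexity)
    line = f"complexity {complexity}: {accuracy:.1%}"
    if oracle is not None:
        # Oracle accuracy bounds what any selection method could achieve.
        line += f" (oracle {oracle:.1%})"
    print(line)
print(f"max solved complexity: {res.max_solved_complexity}")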

Trajectory Aggregator

class structured_stochasticity.evaluation.TrajectoryAggregator(method='majority_vote', verifier=None)[source]

Bases: object

Aggregates multiple reasoning trajectories into a single answer.

This is where the “structured stochasticity” hypothesis gets tested: if sampling K>1 trajectories systematically improves accuracy over K=1, it suggests reasoning collapse is indeed a trajectory problem.

Aggregation Methods:

  • majority_vote - Select the most common answer

  • first_valid - Take the first response that parses as valid

  • oracle - Select a correct response whenever one exists (an upper bound; requires ground truth, so not usable in practice)

  • verifier - Use a custom verifier function to score responses

Example:

from structured_stochasticity.evaluation import TrajectoryAggregator

# `responses` holds the K sampled completions for `instance`;
# `task` and `instance` come from the task-generation API.
aggregator = TrajectoryAggregator(method="majority_vote")
result = aggregator.aggregate(responses, task, instance)

print(f"Selected: {result.selected_response}")
print(f"Correct: {result.is_correct}")
print(f"Agreement: {result.agreement_rate:.1%}")

__init__(method='majority_vote', verifier=None)[source]
Parameters:
  • method (str) – Aggregation method: “majority_vote” selects the most common answer; “first_valid” takes the first response that parses as valid; “verifier” uses the supplied verifier function to score responses; “oracle” picks a correct response if any exists (upper bound only)

  • verifier (Callable[[str, TaskInstance], float] | None) – Optional function (response, instance) -> score
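The verifier callable scores a response for a given instance, higher being better. A minimal sketch with a made-up heuristic (a real verifier would use `instance` to check task-specific structure):

from structured_stochasticity.evaluation import TrajectoryAggregator

def integer_verifier(response: str, instance) -> float:
    # Illustrative heuristic: prefer responses that parse as an integer.
    try:
        int(response.strip())
        return 1.0
    except ValueError:
        return 0.0

aggregator = TrajectoryAggregator(method="verifier", verifier=integer_verifier)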

aggregate(responses, task, instance)[source]

Aggregate K responses into a single result.

Parameters:
  • responses (list[str]) – List of K response strings

  • task (Task) – Task being evaluated

  • instance (TaskInstance) – The specific task instance

Returns:

AggregatedResult with selected answer and metadata

Return type:

AggregatedResult

Evaluator

class structured_stochasticity.evaluation.Evaluator(task, aggregator=None)[source]

Bases: object

Runs evaluation across complexity levels.

Main experimental loop (a usage sketch follows the evaluate() reference below):

  1. For each complexity level

  2. Generate task instances

  3. For each K value

  4. Generate K trajectories per instance

  5. Aggregate and verify

  6. Report accuracy curves

Parameters:
  • task (Task)

  • aggregator (TrajectoryAggregator | None)
evaluate(generate_fn, complexity_range, k_values=[1, 5, 10], trials_per_complexity=10, verbose=True)[source]

Run full evaluation.

Parameters:
  • generate_fn (Callable[[str, int], list[str]]) – Function (prompt, k) -> list of k responses

  • complexity_range (tuple[int, int]) – (min_complexity, max_complexity)

  • k_values (list[int]) – List of K values to test

  • trials_per_complexity (int) – How many instances per complexity level

  • verbose (bool) – Print progress

Returns:

Dict mapping K -> EvaluationResult

Return type:

dict[int, EvaluationResult]
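A sketch of wiring the loop together. Here `task` is assumed to come from the package's task API, and `sample_model` is a hypothetical placeholder for your own LLM sampling call:

from structured_stochasticity.evaluation import Evaluator, TrajectoryAggregator

def generate_fn(prompt: str, k: int) -> list[str]:
    # Placeholder sampler: call your model k times at temperature > 0.
    # `sample_model` is hypothetical; swap in a real LLM call.
    return [sample_model(prompt) for _ in range(k)]

evaluator = Evaluator(task, aggregator=TrajectoryAggregator(method="majority_vote"))
results = evaluator.evaluate(
    generate_fn,
    complexity_range=(1, 10),
    k_values=[1, 5, 10],
    trials_per_complexity=10,
)
for k, res in results.items():
    print(f"K={k}: max solved complexity = {res.max_solved_complexity}")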

compare_k_scaling(results)[source]

Analyze how performance scales with K.

This is the key analysis: if max_solved_complexity increases with K under a constant token budget, it supports the structured stochasticity hypothesis.

Parameters:

results (dict[int, EvaluationResult])

Return type:

dict
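A sketch of the scaling analysis, continuing from the evaluate() example above. The structure of the returned dict is not specified in these docs, so the grounded comparison below reads the per-K fields directly:

analysis = evaluator.compare_k_scaling(results)
print(analysis)  # keys/values are implementation-defined

# Grounded comparison straight from the EvaluationResult fields:
for k in sorted(results):
    print(f"K={k}: max_solved_complexity={results[k].max_solved_complexity}")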