Evaluation & Aggregation

Key insight: with K sampled trajectories per problem, we need a strategy for selecting or combining their answers into a single prediction.
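To make the selection problem concrete, here is an illustrative majority-vote sketch in plain Python. This is not the library's implementation (that lives in TrajectoryAggregator below); it only shows the idea:

from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Illustrative only: pick the most common answer among K trajectories.

    Returns the winning answer and the agreement rate, i.e. the
    fraction of trajectories that produced it.
    """
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

answer, agreement = majority_vote(["42", "42", "41", "42", "40"])
print(answer, agreement)  # 42 0.6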

Result Data Classes

class structured_stochasticity.evaluation.AggregatedResult(selected_response, is_correct, k_trajectories, selection_method, individual_results=<factory>, agreement_rate=0.0, any_correct=False, num_correct=0, metadata=<factory>)[source]

Bases: object

Result of aggregating multiple trajectories.

Parameters:
  • selected_response (str)

  • is_correct (bool)

  • k_trajectories (int)

  • selection_method (str)

  • individual_results (list[TaskResult])

  • agreement_rate (float)

  • any_correct (bool)

  • num_correct (int)

  • metadata (dict)

selected_response: str
is_correct: bool
k_trajectories: int
selection_method: str
individual_results: list[TaskResult]
agreement_rate: float = 0.0
any_correct: bool = False
num_correct: int = 0
metadata: dict
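A minimal sketch of constructing and inspecting an AggregatedResult by hand. In practice TrajectoryAggregator.aggregate() builds these for you; the field values here are made up for illustration:

from structured_stochasticity.evaluation import AggregatedResult

# Hypothetical outcome: 3 of 5 trajectories agreed on "42".
result = AggregatedResult(
    selected_response="42",
    is_correct=True,
    k_trajectories=5,
    selection_method="majority_vote",
    agreement_rate=0.6,
    any_correct=True,
    num_correct=3,
)
print(f"{result.num_correct}/{result.k_trajectories} correct, "
      f"agreement {result.agreement_rate:.0%}")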

class structured_stochasticity.evaluation.EvaluationResult(task_name, k_trajectories, selection_method, accuracy_by_complexity=<factory>, max_solved_complexity=0, all_results=<factory>, oracle_accuracy_by_complexity=<factory>)[source]

Bases: object

Results from evaluating across complexity levels.

Parameters:
  • task_name (str)

  • k_trajectories (int)

  • selection_method (str)

  • accuracy_by_complexity (dict[int, float])

  • max_solved_complexity (int)

  • all_results (list[AggregatedResult])

  • oracle_accuracy_by_complexity (dict[int, float])

task_name: str
k_trajectories: int
selection_method: str
accuracy_by_complexity: dict[int, float]
max_solved_complexity: int = 0
all_results: list[AggregatedResult]
oracle_accuracy_by_complexity: dict[int, float]
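A sketch of inspecting one EvaluationResult, assuming `res` was taken from the dict returned by Evaluator.evaluate() (documented below); the variable names are illustrative:

# `res` is one EvaluationResult, e.g. results[5] from the
# dict returned by Evaluator.evaluate().
for complexity, accuracy in sorted(res.accuracy_by_complexity.items()):
    oracle = res.oracle_accuracy_by_complexity.get(complexity)
    line = f"complexity {complexity}: {accuracy:.1%}"
    if oracle is not None:
        # Oracle accuracy bounds what any selection method could achieve.
        line += f" (oracle {oracle:.1%})"
    print(line)
print(f"max solved complexity: {res.max_solved_complexity}")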

Trajectory Aggregator

class structured_stochasticity.evaluation.TrajectoryAggregator(method='majority_vote', verifier=None)[source]

Bases: object

Aggregates multiple reasoning trajectories into a single answer.

This is where the “structured stochasticity” hypothesis gets tested: if sampling K>1 trajectories systematically improves accuracy over K=1, it suggests reasoning collapse is indeed a trajectory problem.

Aggregation Methods:

  • majority_vote - Select the most common answer

  • first_valid - Take the first response that parses as valid

  • oracle - Select a correct response whenever one exists (an upper bound; requires ground truth, so not usable in practice)

  • verifier - Use a custom verifier function to score responses

Example:

from structured_stochasticity.evaluation import TrajectoryAggregator

# `responses` holds the K sampled completions for `instance`;
# `task` and `instance` come from the task-generation API.
aggregator = TrajectoryAggregator(method="majority_vote")
result = aggregator.aggregate(responses, task, instance)

print(f"Selected: {result.selected_response}")
print(f"Correct: {result.is_correct}")
print(f"Agreement: {result.agreement_rate:.1%}")

__init__(method='majority_vote', verifier=None)[source]
Parameters:
  • method (str) – Aggregation method: “majority_vote” selects the most common answer; “first_valid” takes the first response that parses as valid; “verifier” uses the supplied verifier function to score responses; “oracle” picks a correct response if any exists (upper bound only)

  • verifier (Callable[[str, TaskInstance], float] | None) – Optional function (response, instance) -> score
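The verifier callable scores a response for a given instance, higher being better. A minimal sketch with a made-up heuristic (a real verifier would use `instance` to check task-specific structure):

from structured_stochasticity.evaluation import TrajectoryAggregator

def integer_verifier(response: str, instance) -> float:
    # Illustrative heuristic: prefer responses that parse as an integer.
    try:
        int(response.strip())
        return 1.0
    except ValueError:
        return 0.0

aggregator = TrajectoryAggregator(method="verifier", verifier=integer_verifier)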

aggregate(responses, task, instance)[source]

Aggregate K responses into a single result.

Parameters:
  • responses (list[str]) – List of K response strings

  • task (Task) – Task being evaluated

  • instance (TaskInstance) – The specific task instance

Returns:

AggregatedResult with selected answer and metadata

Return type:

AggregatedResult

Evaluator

class structured_stochasticity.evaluation.Evaluator(task, aggregator=None)[source]

Bases: object

Runs evaluation across complexity levels.

Main experimental loop (a usage sketch follows the evaluate() reference below):

  1. For each complexity level

  2. Generate task instances

  3. For each K value

  4. Generate K trajectories per instance

  5. Aggregate and verify

  6. Report accuracy curves

Parameters:
  • task (Task)

  • aggregator (TrajectoryAggregator | None)
evaluate(generate_fn, complexity_range, k_values=[1, 5, 10], trials_per_complexity=10, verbose=True)[source]

Run full evaluation.

Parameters:
  • generate_fn (Callable[[str, int], list[str]]) – Function (prompt, k) -> list of k responses

  • complexity_range (tuple[int, int]) – (min_complexity, max_complexity)

  • k_values (list[int]) – List of K values to test

  • trials_per_complexity (int) – How many instances per complexity level

  • verbose (bool) – Print progress

Returns:

Dict mapping K -> EvaluationResult

Return type:

dict[int, EvaluationResult]
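A sketch of wiring the loop together. Here `task` is assumed to come from the package's task API, and `sample_model` is a hypothetical placeholder for your own LLM sampling call:

from structured_stochasticity.evaluation import Evaluator, TrajectoryAggregator

def generate_fn(prompt: str, k: int) -> list[str]:
    # Placeholder sampler: call your model k times at temperature > 0.
    # `sample_model` is hypothetical; swap in a real LLM call.
    return [sample_model(prompt) for _ in range(k)]

evaluator = Evaluator(task, aggregator=TrajectoryAggregator(method="majority_vote"))
results = evaluator.evaluate(
    generate_fn,
    complexity_range=(1, 10),
    k_values=[1, 5, 10],
    trials_per_complexity=10,
)
for k, res in results.items():
    print(f"K={k}: max solved complexity = {res.max_solved_complexity}")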

compare_k_scaling(results)[source]

Analyze how performance scales with K.

This is the key analysis: if max_solved_complexity increases with K under a constant token budget, it supports the structured stochasticity hypothesis.

Parameters:

results (dict[int, EvaluationResult])

Return type:

dict
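A sketch of the scaling analysis, continuing from the evaluate() example above. The structure of the returned dict is not specified in these docs, so the grounded comparison below reads the per-K fields directly:

analysis = evaluator.compare_k_scaling(results)
print(analysis)  # keys/values are implementation-defined

# Grounded comparison straight from the EvaluationResult fields:
for k in sorted(results):
    print(f"K={k}: max_solved_complexity={results[k].max_solved_complexity}")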