Evaluation & Aggregation
Key insight: With K trajectories, we need a strategy to select or combine answers.
Result Data Classes
- class structured_stochasticity.evaluation.AggregatedResult(selected_response, is_correct, k_trajectories, selection_method, individual_results=<factory>, agreement_rate=0.0, any_correct=False, num_correct=0, metadata=<factory>)
Bases: object

Result of aggregating multiple trajectories.
- Parameters:
selected_response (str)
is_correct (bool)
k_trajectories (int)
selection_method (str)
individual_results (list[TaskResult])
agreement_rate (float)
any_correct (bool)
num_correct (int)
metadata (dict)
- selected_response: str
- is_correct: bool
- k_trajectories: int
- selection_method: str
- individual_results: list[TaskResult]
- agreement_rate: float = 0.0
- any_correct: bool = False
- num_correct: int = 0
- metadata: dict
- __init__(selected_response, is_correct, k_trajectories, selection_method, individual_results=<factory>, agreement_rate=0.0, any_correct=False, num_correct=0, metadata=<factory>)
- Parameters:
selected_response (str)
is_correct (bool)
k_trajectories (int)
selection_method (str)
individual_results (list[TaskResult])
agreement_rate (float)
any_correct (bool)
num_correct (int)
metadata (dict)
- Return type:
None
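For orientation, a minimal sketch of the fields in use. The instance below is hand-built with invented values purely to illustrate the field semantics; real instances come from TrajectoryAggregator.aggregate():

from structured_stochasticity.evaluation import AggregatedResult

# Hand-built purely for illustration; real instances come from
# TrajectoryAggregator.aggregate(). All values here are invented.
result = AggregatedResult(
    selected_response="42",
    is_correct=True,
    k_trajectories=5,
    selection_method="majority_vote",
    individual_results=[],
    agreement_rate=0.6,
    any_correct=True,
    num_correct=3,
)
# any_correct vs. is_correct separates "a correct answer existed among the
# K trajectories" from "selection actually picked it".
print(f"{result.num_correct}/{result.k_trajectories} correct, "
      f"agreement {result.agreement_rate:.1%}, selected correctly: {result.is_correct}")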
- class structured_stochasticity.evaluation.EvaluationResult(task_name, k_trajectories, selection_method, accuracy_by_complexity=<factory>, max_solved_complexity=0, all_results=<factory>, oracle_accuracy_by_complexity=<factory>)
Bases: object

Results from evaluating across complexity levels.
- Parameters:
task_name (str)
k_trajectories (int)
selection_method (str)
accuracy_by_complexity (dict[int, float])
max_solved_complexity (int)
all_results (list[AggregatedResult])
oracle_accuracy_by_complexity (dict[int, float])
- task_name: str
- k_trajectories: int
- selection_method: str
- accuracy_by_complexity: dict[int, float]
- max_solved_complexity: int = 0
- all_results: list[AggregatedResult]
- oracle_accuracy_by_complexity: dict[int, float]
- __init__(task_name, k_trajectories, selection_method, accuracy_by_complexity=<factory>, max_solved_complexity=0, all_results=<factory>, oracle_accuracy_by_complexity=<factory>)
- Parameters:
task_name (str)
k_trajectories (int)
selection_method (str)
accuracy_by_complexity (dict[int, float])
max_solved_complexity (int)
all_results (list[AggregatedResult])
oracle_accuracy_by_complexity (dict[int, float])
- Return type:
None
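A short sketch of reading these fields. The instance is hand-built with invented numbers; real instances come from Evaluator.evaluate():

from structured_stochasticity.evaluation import EvaluationResult

# Hand-built with invented numbers; real instances come from Evaluator.evaluate().
ev = EvaluationResult(
    task_name="example_task",  # hypothetical task name
    k_trajectories=5,
    selection_method="majority_vote",
    accuracy_by_complexity={1: 1.0, 2: 0.9, 3: 0.4},
    max_solved_complexity=2,
    oracle_accuracy_by_complexity={1: 1.0, 2: 1.0, 3: 0.8},
)
# The gap between oracle and selected accuracy shows how much the
# selection method loses relative to perfect answer selection.
for c in sorted(ev.accuracy_by_complexity):
    gap = ev.oracle_accuracy_by_complexity[c] - ev.accuracy_by_complexity[c]
    print(f"complexity {c}: accuracy {ev.accuracy_by_complexity[c]:.2f}, oracle gap {gap:.2f}")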
Trajectory Aggregator
- class structured_stochasticity.evaluation.TrajectoryAggregator(method='majority_vote', verifier=None)
Bases: object

Aggregates multiple reasoning trajectories into a single answer.
This is where the “structured stochasticity” hypothesis gets tested: if K>1 trajectories systematically improve accuracy vs K=1, it suggests reasoning collapse is indeed a trajectory problem.
Aggregation Methods:
- majority_vote – Select the most common answer
- first_valid – Take the first response that parses as valid
- oracle – Cheating mode: pick the correct answer if any trajectory found it (for an upper bound)
- verifier – Use a custom verifier function to score responses
Example:

from structured_stochasticity.evaluation import TrajectoryAggregator

aggregator = TrajectoryAggregator(method="majority_vote")
result = aggregator.aggregate(responses, task, instance)
print(f"Selected: {result.selected_response}")
print(f"Correct: {result.is_correct}")
print(f"Agreement: {result.agreement_rate:.1%}")
- Parameters:
method (str)
verifier (Callable[[str, TaskInstance], float] | None)
- __init__(method='majority_vote', verifier=None)
- Parameters:
method (str) – Aggregation method: "majority_vote" (select the most common answer), "first_valid" (take the first response that parses as valid), "verifier" (use a verifier function to score responses), or "oracle" (cheating mode: pick the correct answer if any, for an upper bound)
verifier (Callable[[str, TaskInstance], float] | None) – Optional function (response, instance) -> score
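For method="verifier", any (response, instance) -> float callable can be supplied. A sketch follows; the expected_answer attribute on TaskInstance is an assumption made for illustration, not a documented field:

from structured_stochasticity.evaluation import TrajectoryAggregator

def exactness_verifier(response: str, instance) -> float:
    # Hypothetical scoring rule: +1 if the response contains the instance's
    # expected answer, minus a small length penalty to break ties toward
    # concise responses. `expected_answer` is an ASSUMED TaskInstance
    # attribute, used here only for illustration.
    hit = 1.0 if str(getattr(instance, "expected_answer", "")) in response else 0.0
    return hit - 0.001 * len(response)

aggregator = TrajectoryAggregator(method="verifier", verifier=exactness_verifier)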
- aggregate(responses, task, instance)
Aggregate K responses into a single result.
- Parameters:
responses (list[str]) – List of K response strings
task (Task) – Task being evaluated
instance (TaskInstance) – The specific task instance
- Returns:
AggregatedResult with selected answer and metadata
- Return type:
AggregatedResult
Evaluator
- class structured_stochasticity.evaluation.Evaluator(task, aggregator=None)
Bases: object

Runs evaluation across complexity levels.
Main experimental loop:
1. For each complexity level
2. Generate task instances
3. For each K value
4. Generate K trajectories per instance
5. Aggregate and verify
6. Report accuracy curves
- Parameters:
task (Task)
aggregator (TrajectoryAggregator | None)
- __init__(task, aggregator=None)
- Parameters:
task (Task)
aggregator (TrajectoryAggregator | None)
- evaluate(generate_fn, complexity_range, k_values=[1, 5, 10], trials_per_complexity=10, verbose=True)
Run full evaluation.
- Parameters:
generate_fn (Callable[[str, int], list[str]]) – Function (prompt, k) -> list of k responses
complexity_range (tuple[int, int]) – (min_complexity, max_complexity)
k_values (list[int]) – List of K values to test
trials_per_complexity (int) – How many instances per complexity level
verbose (bool) – Print progress
- Returns:
Dict mapping K -> EvaluationResult
- Return type:
dict[int, EvaluationResult]
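A sketch of wiring a model into evaluate(). Here sample_model and task are hypothetical stand-ins for your LLM client and task object; the Evaluator only requires a (prompt, k) -> list[str] callable:

from structured_stochasticity.evaluation import Evaluator, TrajectoryAggregator

def generate_fn(prompt: str, k: int) -> list[str]:
    # `sample_model` is a hypothetical stand-in for an LLM client call.
    return [sample_model(prompt, temperature=0.8) for _ in range(k)]

evaluator = Evaluator(task, aggregator=TrajectoryAggregator(method="majority_vote"))  # `task` assumed
results = evaluator.evaluate(
    generate_fn,
    complexity_range=(1, 8),
    k_values=[1, 5, 10],
    trials_per_complexity=10,
)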
- compare_k_scaling(results)
Analyze how performance scales with K.
This is the key analysis: if max_solved_complexity increases with K under constant token budgets, it supports the hypothesis.
- Parameters:
results (dict[int, EvaluationResult])
- Return type:
dict
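Continuing the sketch above, the key scaling question can also be checked directly from the per-K results; note that compare_k_scaling() returns a plain dict whose exact keys are not documented here:

# Does max solved complexity grow with K? (the core hypothesis test)
for k, ev in sorted(results.items()):
    print(f"K={k}: max_solved_complexity={ev.max_solved_complexity}")

scaling = evaluator.compare_k_scaling(results)
print(scaling)  # a dict; inspect its keys in your own run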