Evaluation & Aggregation
========================

.. module:: structured_stochasticity.evaluation

Key insight: With K trajectories, we need a strategy to select or combine answers.

Result Data Classes
-------------------

.. autoclass:: AggregatedResult
   :members:
   :show-inheritance:

.. autoclass:: EvaluationResult
   :members:
   :show-inheritance:

Trajectory Aggregator
---------------------

.. autoclass:: TrajectoryAggregator
   :members:
   :show-inheritance:

**Aggregation Methods:**

- ``majority_vote`` - Select most common answer
- ``first_valid`` - Take first response that parses as valid
- ``oracle`` - Cheating mode - pick correct if any (for upper bound)
- ``verifier`` - Use custom verifier function to score responses

**Example:**

.. code-block:: python

   from structured_stochasticity.evaluation import TrajectoryAggregator

   aggregator = TrajectoryAggregator(method="majority_vote")
   result = aggregator.aggregate(responses, task, instance)

   print(f"Selected: {result.selected_response}")
   print(f"Correct: {result.is_correct}")
   print(f"Agreement: {result.agreement_rate:.1%}")

Evaluator
---------

.. autoclass:: Evaluator
   :members:
   :show-inheritance:

Main experimental loop:

1. For each complexity level
2. Generate task instances
3. For each K value
4. Generate K trajectories per instance
5. Aggregate and verify
6. Report accuracy curves
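
The loop described above can be pictured with a small, self-contained sketch. It does not use the ``Evaluator`` or ``TrajectoryAggregator`` classes and should not be read as their implementation. The toy summation task, the ``noisy_solver`` stand-in for a sampled trajectory, and the names ``make_instance``, ``majority_vote`` and ``run_sweep`` are all hypothetical, chosen only to show how accuracy per (complexity, K) pair might be collected under majority voting.

.. code-block:: python

   import random
   from collections import Counter


   def majority_vote(answers):
       """Return the most common answer and the share of answers that match it."""
       winner, count = Counter(answers).most_common(1)[0]
       return winner, count / len(answers)


   def make_instance(complexity, rng):
       # Hypothetical toy task: add up `complexity` single-digit integers.
       numbers = [rng.randint(1, 9) for _ in range(complexity)]
       return numbers, sum(numbers)


   def noisy_solver(numbers, rng, error_rate):
       # Stand-in for one sampled trajectory: off by one with probability error_rate.
       total = sum(numbers)
       return total + 1 if rng.random() < error_rate else total


   def run_sweep(complexities, k_values, n_instances=50, error_rate=0.3, seed=0):
       """Return {(complexity, k): accuracy} for every combination swept."""
       rng = random.Random(seed)
       accuracy = {}
       for complexity in complexities:                       # 1. each complexity level
           instances = [make_instance(complexity, rng)       # 2. generate instances
                        for _ in range(n_instances)]
           for k in k_values:                                # 3. each K value
               n_correct = 0
               for numbers, target in instances:
                   answers = [noisy_solver(numbers, rng, error_rate)
                              for _ in range(k)]             # 4. K trajectories
                   selected, _agreement = majority_vote(answers)
                   n_correct += int(selected == target)      # 5. aggregate and verify
               accuracy[(complexity, k)] = n_correct / n_instances
       return accuracy                                       # 6. data for accuracy curves


   if __name__ == "__main__":
       for (complexity, k), acc in sorted(run_sweep([2, 4], [1, 5]).items()):
           print(f"complexity={complexity} K={k} accuracy={acc:.2f}")

Reusing the same instances for every K value, as this sketch does, keeps the comparison across K paired; whether the real ``Evaluator`` does the same is not stated here.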