Group agg rework #1741

Open
wants to merge 61 commits into
base: main

@lintangsutawika (Contributor) commented Apr 23, 2024

  1. By default, a group no longer reports aggregate scores for its subtasks.
  2. To show an aggregate, the group's yaml must define a group_config containing aggregate_metric (True/False, default False) and weight_by_size (True/False, default False).
  3. ConfigurableGroup and ConfigurableTask now use task_id as their identifier in lieu of the task name/group name.
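
To illustrate what the two group_config flags imply, here is a minimal sketch, not the code in this PR. The helper `aggregate_group`, the subtask names, and the scores and sizes are all hypothetical; only the keys aggregate_metric and weight_by_size come from the description above.

```python
from typing import Dict

def aggregate_group(
    scores: Dict[str, float],
    sizes: Dict[str, int],
    weight_by_size: bool = False,
) -> float:
    """Mean of subtask scores, optionally weighted by subtask size (hypothetical helper)."""
    if not scores:
        raise ValueError("no subtask scores to aggregate")
    if weight_by_size:
        # Each subtask contributes in proportion to its number of documents.
        total = sum(sizes[name] for name in scores)
        return sum(scores[name] * sizes[name] for name in scores) / total
    # Plain (unweighted) mean: every subtask counts equally.
    return sum(scores.values()) / len(scores)

# Illustrative stand-in for a group_config block parsed from yaml.
group_config = {"aggregate_metric": True, "weight_by_size": True}

# Made-up subtask results and sizes.
scores = {"subtask_a": 0.5, "subtask_b": 1.0}
sizes = {"subtask_a": 300, "subtask_b": 100}

group_score = None  # stays None when aggregation is disabled (the new default)
if group_config["aggregate_metric"]:
    group_score = aggregate_group(scores, sizes, group_config["weight_by_size"])
```

With these made-up numbers, the size-weighted aggregate leans toward the larger subtask, while the unweighted mean treats both subtasks equally.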

@lintangsutawika marked this pull request as ready for review April 25, 2024 18:05

@lintangsutawika (Contributor, Author) commented

@haileyschoelkopf I've only added group_config to the MMLU tasks and flan_held_in. Let me know which other benchmarks need the aggregation added back in.

lm_eval/evaluator.py (outdated, resolved)
lm_eval/api/task.py (outdated, resolved)
class GroupConfig(dict):
    group: Optional[Union[str, list]] = None
    aggregate_metric: Optional[str] = False
    aggregate_fn: Optional[str] = "mean"
Contributor

Defining aggregate_fn explicitly is good. I think we also need to be careful about how tasks with (multiple) filters interact with groups.

Currently, if two tasks have filters with different names, a group can't aggregate them correctly. We may want each group to state explicitly, for every metric it reports, which metric and which filter from each subtask go into the aggregation.
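
One way to picture that suggestion, purely as a sketch: the `group_spec` layout, the helper `aggregate_with_spec`, and all task, metric, and filter names and scores below are illustrative assumptions, not an existing lm_eval schema.

```python
from typing import Dict, Tuple

MetricKey = Tuple[str, str]  # (metric name, filter name)

# Made-up per-task results, keyed by (metric, filter).
results: Dict[str, Dict[MetricKey, float]] = {
    "task_a": {
        ("exact_match", "strict-match"): 0.40,
        ("exact_match", "flexible-extract"): 0.60,
    },
    "task_b": {
        ("exact_match", "get-answer"): 0.80,
    },
}

# Hypothetical group spec: for each metric the group reports, name the
# exact (metric, filter) pair to pull from every subtask.
group_spec: Dict[str, Dict[str, MetricKey]] = {
    "exact_match": {
        "task_a": ("exact_match", "flexible-extract"),
        "task_b": ("exact_match", "get-answer"),
    },
}

def aggregate_with_spec(
    results: Dict[str, Dict[MetricKey, float]],
    group_spec: Dict[str, Dict[str, MetricKey]],
) -> Dict[str, float]:
    """Unweighted mean over the explicitly selected (metric, filter) pairs."""
    aggregated = {}
    for reported_metric, sources in group_spec.items():
        # A missing task or key raises KeyError instead of being silently skipped.
        values = [results[task][key] for task, key in sources.items()]
        aggregated[reported_metric] = sum(values) / len(values)
    return aggregated

group_scores = aggregate_with_spec(results, group_spec)
```

Because every subtask names its filter explicitly, differing filter names across tasks no longer block aggregation, and a misconfigured entry fails loudly.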

2 participants