ENH Add ranking metrics #974
base: main
Conversation
Metrics for evaluating rankings
Nice work, @bram49!
Do you want to add a brief example under examples where you have a very simple dataset and show the metrics you defined? Even better would be a user guide that builds on the example (using literalinclude) and explains the relevance of these metrics. Anyway, just thinking out loud here.
@romanlutz
Sounds good! Let me know if you have issues with reStructuredText or literalinclude. The existing rst files should be good examples (for the most part).
Documentation improvements
sorry, forgot to post this.
fairlearn/metrics/__init__.py
exposure,
utility,
exposure_utility_ratio,
allocation_harm_in_ranking_difference,
The doc says "Calculate the difference in exposure allocation"; based on that, the name for me would be exposure_allocation_difference.
Thank you for expanding the example with some more text!
Based on Hilde's review, "exposure allocation" is now called "exposure" and "exposure_utility_ratio" is now called "proportional exposure".
I finished writing the test cases, the notebook example, and the user doc. I think it is almost ready for merging. Would love to get some feedback. Thanks in advance!
I took another look at the user guide and doc strings and made a few small comments!
@@ -194,6 +194,24 @@ group loss primarily seeks to mitigate quality-of-service harms. Equalized
odds and equal opportunity can be used as a diagnostic for both allocation
harms as well as quality-of-service harms.
We might want to move this paragraph further down and add the ranking metrics as representing allocation / quality-of-service harm.
Sorry for taking so long to get back to this PR. My overall impression is quite positive. I've pointed out the larger questions I have which I think should be resolved before diving into detailed feedback (since lots of the details may (or may not) change as a result). I do think @fairlearn/fairlearn-maintainers input is needed here.
@@ -277,6 +277,9 @@ Base metric :code:`group_min` :code:`group_m
:func:`.selection_rate` . . Y Y
:func:`.true_negative_rate` . . Y Y
:func:`.true_positive_rate` . . Y Y
:func:`.exposure` . . Y Y
:func:`.utility` . . Y Y
utility seems a bit too generic. (I have a feeling @MiroDudik feels the same way...) Perhaps ranking_utility?
To some extent I'm wondering if these should be grouped with the other metrics at all, or whether this deserves its own section with ranking metrics.
Since exposure and utility only apply to rankings, I also think it is better to call them ranking_exposure and ranking_utility.
I second the proposed names.
I'm fine with ranking_exposure. Another option would be dcg_exposure. This would allow us to introduce, for example, rbp_exposure in future, see Eq. (2) here:
My impression is that utility or even ranking_utility is not the best naming choice for the objective we are calculating, and it is also not very standard, because in most contexts utility is just a synonym for score, so I would expect that it refers to things like dcg_score or ndcg_score. That is actually how they use the word utility even in the "Fairness of Exposure in Rankings" paper. So, I'd be in favor of using some variation of relevance. Maybe average_relevance or mean_relevance?
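For readers less familiar with the naming, the difference between a dcg_-style and an rbp_-style exposure lies only in the position discount. A sketch of the standard forms (not necessarily the exact equation referenced above, and the normalization may differ):

```latex
% Position weight v_j assigned to the item placed at rank j
\begin{align}
  v_j^{\mathrm{DCG}} &= \frac{1}{\log_2(1 + j)}       \\ % logarithmic discount, as in DCG
  v_j^{\mathrm{RBP}} &= (1 - \gamma)\,\gamma^{\,j-1}      % geometric discount, persistence \gamma \in (0, 1)
\end{align}
```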
@@ -0,0 +1,145 @@
# Copyright (c) Fairlearn contributors.
I'm a bit surprised we're adding an example, but not a user guide section. Sometimes the latter can borrow from the former through literalinclude, but in my mind the first step is always the user guide. That said, I don't think we're making that distinction particularly clear at the moment, which is probably something to discuss on the community call again (topic: "structure of documentation" with a particular focus on user guide vs. examples).
This example looks quite okay to me, we can just make sure we have links to it from the right places in the user guide and the API guides.
@MiroDudik did you hear back from your colleague regarding ranking metrics?
@bram49 did you want to reply to @romanlutz's queries?
Yes, sorry for taking so long, I was waiting for maintainer input to see if this PR has any chance of being merged. If you think that this is a valuable addition to Fairlearn, then I'll implement all feedback and make sure it can get merged.
Tagging @MiroDudik
Thanks for the quick response @bram49. We're thinking about our next (v0.8) release, and were hoping to have this PR included.
@riedgar-ms Great! Then I'll update the PR this week.
Co-authored-by: Roman Lutz <romanlutz13@gmail.com>
Co-authored-by: Roman Lutz <romanlutz13@gmail.com>
Hello everyone, thank you for putting this great metric together. I have been trying to fork the code on my machine for hours, but I can't find the original files, such as the notebook. Please guide me to the location of the ranking metric (Fairness of Exposure in Rankings) in the repository.
Hi! I'm interested in ranking and fairness too. I like the example and scenario for the exposure metric. Unfortunately, I have no experience with how to put these files together from different branches. I have been trying for days, and I keep getting the same errors. Here are the errors: (mn) C:\Users\MN\exp\fairlearn\examples>python plot_ranking.py
@mnalk this is a pull request, so you would have to either fork @bram49's fork of Fairlearn from which he created this PR, or wait for it to be merged into main and fork the main Fairlearn repo. If you're using pip install from PyPI, you have to wait until this is merged and the next release is out. I hope that helps. For further questions I'd suggest you open an issue, start a discussion, or join us on Discord, since we don't want to sidetrack the PR discussion unless it's directly relevant to the PR. If you want to contribute to the ranking work, let us know via Discord so that we can coordinate.
This PR seems quite mature to me. What do we need for it to get merged, @fairlearn/fairlearn-maintainers?
Update: @MiroDudik needs to have a look at it.
Sorry I have been a bit absent, but I would like to make the final changes to make the merge happen. As I understand it now:
@MiroDudik also wanted to double-check against the existing literature and see if there's something we need to change.
*Ranking*:

Fairlearn includes two constraints for rankings, based on exposure: a measure for the amount of
attention an instance is expected to receive, based on its position in the ranking. Exposure is
Is this meant to be agnostic to the use case for ranking? How would the expectations for how much attention an instance might receive change in situations where rankings are bounded at particular intervals (e.g., a limited number of search results returned per page)?
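For context, the group-level quantity being described is, roughly, the average position discount received by a group's members. A sketch of the deterministic-ranking form used in "Fairness of Exposure in Rankings" (notation mine):

```latex
% Average exposure of group G in a ranking, where rank(d) is the position of
% item d and the discount is the usual logarithmic one
\mathrm{Exposure}(G) = \frac{1}{|G|} \sum_{d \in G} \frac{1}{\log_2\bigl(1 + \mathrm{rank}(d)\bigr)}
```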
[#6]_

* *Proportional exposure*: We try to keep the exposure that each item gets proportional to its
  "ground-truth" relevance. Otherwise small differences in relevance can lead to huge differences
Are you able to say anything more about how this ground truth relevance is typically determined? i.e., is this something that data scientists would have access to a priori, or if not, is there guidance in the paper this is adapted from on how to determine this?
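And for "proportional exposure", the constraint being sketched is, roughly, that the exposure-to-relevance ratio should match across groups (following the disparate-treatment idea in the same paper; u denotes average ground-truth relevance, however it is obtained):

```latex
% Proportional exposure: the exposure-to-relevance ratio is equal across
% groups G_a and G_b, so exposure scales with average relevance u
\frac{\mathrm{Exposure}(G_a)}{u(G_a)} = \frac{\mathrm{Exposure}(G_b)}{u(G_b)}
```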
My sense is that there is still some work to do.
I have pointed out the main piece in my comments, which is that we should make sure that we are compatible with sklearn's ranking metrics.
The second piece, which I haven't brought up in the specific comments (and which actually might not be too much work) is that I think we should try to make sure that our API decisions around the ranking fairness metrics are compatible with more recent papers about fairness in rankings. In particular, I'm thinking about Section 3 of the paper:
(This second issue might be nothing, but I haven't yet thought carefully about it.)
Before we proceed, we should discuss how important it is to address the compatibility with sklearn's ranking metrics.
_ZERO_DIVISION_ERROR = "Average utility is 0, which causes a zero division error."


def exposure(y_true,
We should use an API that is similar to sklearn's dcg_score and other ranking metrics. That would entail:
- replace y_pred by y_score
- y_true and y_score should be two-dimensional, with each row corresponding to a different query
- sample_weight would correspond to weighting over queries (i.e., rows), rather than items

I know that we decided to go with the permutation instead of scores (because slicing of scores doesn't allow calculating exposure), but at that time I didn't realize that the existing sklearn metrics for ranking work with scores. If we decide to break away from that convention we should maybe discuss.
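For what it's worth, a minimal sketch of what that convention could look like (hypothetical code, not this PR's implementation; the logarithmic discount is assumed here just for illustration):

```python
import numpy as np
from sklearn.metrics import dcg_score  # existing sklearn ranking metric, for API reference

# Hypothetical sketch: an exposure metric mirroring sklearn's dcg_score convention.
# y_true and y_score are 2-D arrays of shape (n_queries, n_items); sample_weight
# weighs queries (rows), not items.
def exposure(y_true, y_score, *, sample_weight=None):
    y_score = np.asarray(y_score, dtype=float)
    # Position of each item when sorted by descending score (1 = top of the ranking).
    ranks = (-y_score).argsort(axis=1).argsort(axis=1) + 1
    # Logarithmic position discount, averaged over the items of each query.
    # y_true is accepted only for signature symmetry; exposure ignores relevance.
    per_query = (1.0 / np.log2(1 + ranks)).mean(axis=1)
    return float(np.average(per_query, weights=sample_weight))

y_true = [[3, 2, 0], [1, 0, 2]]               # graded relevance, one row per query
y_score = [[0.9, 0.5, 0.1], [0.2, 0.1, 0.7]]  # predicted scores, same shape
print(dcg_score(y_true, y_score))   # sklearn's metric with this calling convention
print(exposure(y_true, y_score))    # the proposed metric called the same way
```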
return np.dot(v, s_w).sum() / len(y_pred)


def utility(y_true,
As with exposure, we should be using an API similar to sklearn's dcg_score.
return np.dot(u, s_w).sum() / len(u)


def proportional_exposure(
As with exposure, we should be using an API similar to sklearn's dcg_score.
return e / u


def exposure_difference(
This one should also follow the API for the ranking metrics.
return result


def exposure_ratio(
This one should also follow the API for the ranking metrics.
return result


def proportional_exposure_difference(
This one should also follow the API for the ranking metrics.
I think that we may want to consider adjusting the definition as they did in Eq. (4) here:
return result


def proportional_exposure_ratio(
The same comments apply as for proportional_exposure_difference.
Closes #959
Design choices:
- exposure, which can be used to show allocation harms
- utility, which is the average y_true of a subgroup
- proportional exposure, which is used to show quality-of-service harms, since you want the exposure of a group to be comparable to its relevance

Problem
This PR has the same problems as discussed in #756: the exposure metric does not use y_true, and y_pred is used to input a ranking, which might be confusing (see the sketch below).
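To illustrate that y_pred-as-ranking convention and how the per-group values would typically be compared, here is a minimal sketch; the exposure helper below is a simplified stand-in (assumed logarithmic discount), not the code from this PR:

```python
import numpy as np
import pandas as pd
from fairlearn.metrics import MetricFrame

# Simplified stand-in for the proposed exposure metric: logarithmic position
# discount 1 / log2(1 + rank), averaged over the items handed to the metric.
# y_true (relevance) is accepted but unused, which is exactly the issue noted above.
def exposure(y_true, y_pred):
    ranks = np.asarray(y_pred)  # here y_pred holds each item's rank, 1 = top
    return float(np.mean(1.0 / np.log2(1 + ranks)))

# Toy ranking of six items from two groups; relevance plays the role of y_true.
data = pd.DataFrame({
    "relevance": [3, 3, 2, 2, 1, 1],
    "rank":      [1, 2, 4, 6, 3, 5],
    "group":     ["A", "A", "A", "A", "B", "B"],
})

mf = MetricFrame(
    metrics=exposure,
    y_true=data["relevance"],
    y_pred=data["rank"],
    sensitive_features=data["group"],
)
print(mf.by_group)      # average exposure per group
print(mf.difference())  # allocation-harm-style difference between groups
```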
Example
TODO