
ENH Add ranking metrics #974

Open · wants to merge 19 commits into main
Conversation

bram49 (Contributor) commented Oct 6, 2021

Closes #959

Design choices:

  • New metric exposure, which can be used to surface allocation harms.
  • New metric utility, which is the average y_true of a subgroup.
  • New metric proportional exposure, which can be used to surface quality-of-service harms, since the exposure of a group should be comparable to its relevance.
  • y_pred is used to pass in the ranking, i.e. a list of integers from 1 to n giving each item's position (see the sketch below).
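A minimal sketch of how these metrics could be computed, assuming the position bias v(j) = 1/log2(1+j) from the "Fairness of Exposure in Rankings" paper (Singh & Joachims, 2018). This is only an illustration of the description above, not the exact code in this PR:

import numpy as np

def exposure(y_true, y_pred):
    # Average position bias of the items; y_pred holds the ranks 1..n.
    # y_true is unused here, which is the issue noted under "Problem" below.
    ranks = np.asarray(y_pred, dtype=float)
    return np.mean(1.0 / np.log2(1.0 + ranks))

def utility(y_true, y_pred):
    # Average ground-truth relevance (y_true) of the items.
    return np.mean(np.asarray(y_true, dtype=float))

def proportional_exposure(y_true, y_pred):
    # Exposure per unit of relevance; a group is treated fairly only if
    # its exposure tracks its relevance.
    return exposure(y_true, y_pred) / utility(y_true, y_pred)

Evaluated per group (for example via MetricFrame with sensitive_features), differences in exposure point at allocation harms, and differences in proportional exposure point at quality-of-service harms.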

Problem

This PR has the same problems as discussed in #756: the exposure metric does not use y_true, and y_pred is used to pass in a ranking, which might be confusing.

Example

TODO

  • Test cases
  • User guide

Metrics for evaluating rankings
romanlutz (Member) left a comment

Nice work, @bram49!

Do you want to add a brief example under examples where you have a very simple dataset and show the metrics you defined? Even better would be a user guide that builds on the example (using literalinclude) and explains the relevance of these metrics. Anyway, just thinking out loud here.

bram49 (Contributor, Author) commented Oct 7, 2021

@romanlutz
Yes, I would love to make a user guide that builds on the example; that is probably better than a notebook. I've never worked with .rst files or literalinclude before, but I'm excited to try it out :)

romanlutz (Member) commented

Sounds good! Let me know if you have issues with restructured text or literalinclude. The existing rst files should be good examples (for the most part).

adrinjalali (Member) left a comment

sorry, forgot to post this.

exposure,
utility,
exposure_utility_ratio,
allocation_harm_in_ranking_difference,

The doc says "Calculate the difference in exposure allocation"; based on that, the name for me would be exposure_allocation_difference.

hildeweerts (Contributor) left a comment

Thank you for expanding the example with some more text!

bram49 (Contributor, Author) commented Oct 31, 2021

I finished writing the test cases, the notebook example, and the user doc.

I think it is almost ready for merging and would love to get some feedback.
@hildeweerts @romanlutz could you review the user guide and example?
@riedgar-ms @adrinjalali could you review the implementation of the exposure metrics and the test cases?

Thanks in advance!

hildeweerts (Contributor) left a comment

I took another look at the user guide and doc strings and made a few small comments!

@@ -194,6 +194,24 @@ group loss primarily seeks to mitigate quality-of-service harms. Equalized
odds and equal opportunity can be used as a diagnostic for both allocation
harms as well as quality-of-service harms.

We might want to move this paragraph further down and add the ranking metrics as representing allocation / quality-of-service harm.

romanlutz (Member) left a comment

Sorry for taking so long to get back to this PR. My overall impression is quite positive. I've pointed out the larger questions I have which I think should be resolved before diving into detailed feedback (since lots of the details may (or may not) change as a result). I do think @fairlearn/fairlearn-maintainers input is needed here.

@@ -277,6 +277,9 @@ Base metric :code:`group_min` :code:`group_m
:func:`.selection_rate` . . Y Y
:func:`.true_negative_rate` . . Y Y
:func:`.true_positive_rate` . . Y Y
:func:`.exposure` . . Y Y
:func:`.utility` . . Y Y

utility seems a bit too generic. (I have a feeling @MiroDudik feels the same way...)
Perhaps ranking_utility?
To some extent I'm wondering if these should be grouped with the other metrics at all, or whether this deserves its own section with ranking metrics.

bram49 (Contributor, Author) replied:

Since exposure and utility only apply to rankings, I also think it is better to call them ranking_exposure and ranking_utility.

Member replied:

I second the proposed names.

MiroDudik (Member) commented May 11, 2022

I'm fine with ranking_exposure. Another option would be dcg_exposure, which would allow us to introduce, for example, rbp_exposure in the future; see Eq. (2) here:

My impression is that utility, or even ranking_utility, is not the best naming choice for the objective we are calculating, and it is also not very standard: in most contexts, utility is just a synonym for score, so I would expect it to refer to things like dcg_score or ndcg_score. That is actually how the word utility is used even in the "Fairness of Exposure in Rankings" paper. So I'd be in favor of using some variation of relevance. Maybe average_relevance or mean_relevance?

@@ -0,0 +1,145 @@
# Copyright (c) Fairlearn contributors.
Member commented:

I'm a bit surprised we're adding an example, but not a user guide section. Sometimes the latter can borrow from the former through literalinclude, but in my mind the first step is always the user guide. That said, I don't think we're making that distinction particularly clear at the moment, which is probably something to discuss on the community call again (topic: "structure of documentation" with a particular focus on user guide vs. examples)

Member commented:

This example looks quite okay to me; we can just make sure we have links to it from the right places in the user guide and the API guides.

hildeweerts (Contributor) commented
@MiroDudik did you hear back from your colleague regarding ranking metrics?

adrinjalali added this to the V.8 milestone Feb 10, 2022
riedgar-ms (Member) commented
@bram49 did you want to reply to @romanlutz's queries?

bram49 (Contributor, Author) commented Feb 14, 2022

Yes, sorry for taking so long; I was waiting for maintainer input to see whether this PR has any chance of being merged. If you think this is a valuable addition to Fairlearn, then I'll implement all the feedback and make sure it can get merged.

hildeweerts (Contributor) commented
Tagging @MiroDudik

riedgar-ms (Member) commented
Thanks for the quick response, @bram49. We're thinking about our next (v0.8) release and were hoping to have this PR included.

bram49 (Contributor, Author) commented Feb 14, 2022

@riedgar-ms Great! Then I'll update the PR this week

bram49 and others added 2 commits March 5, 2022 15:27
Co-authored-by: Roman Lutz <romanlutz13@gmail.com>
Co-authored-by: Roman Lutz <romanlutz13@gmail.com>
mnalk commented Apr 1, 2022

Hello everyone,

Thank you for putting this great metric together. I have been trying to fork the code on my machine for hours, but I can't find the original files, such as the notebook. Please guide me to the location of the ranking metric ("Fairness of Exposure in Rankings") in the repository.

mnalk commented Apr 5, 2022

Hi! I'm interested in ranking and fairness too. I like the example and scenario for the exposure metric. Unfortunately, I have no experience putting these files together from different branches; I have been trying for days and keep getting the same errors.
My question specifically is: how do I make the plot_ranking example work? If there is something I missed, like a notebook that explains how this metric works (I mean this one: https://dl.acm.org/doi/10.1145/3219819.3220088), I would appreciate being guided through that code.

Here is the error:

(mn) C:\Users\MN\exp\fairlearn\examples>python plot_ranking.py
Traceback (most recent call last):
  File "C:\Users\Mn\exp\fairlearn\examples\plot_ranking.py", line 10, in <module>
    from fairlearn.metrics import exposure, utility, proportional_exposure
ImportError: cannot import name 'exposure' from 'fairlearn.metrics' (C:\Users\Mn\Anaconda3\envs\dana\lib\site-packages\fairlearn\metrics\__init__.py)

romanlutz (Member) commented
@mnalk this is a pull request, so you would have to fork @bram49's fork of Fairlearn from which he created this PR, or alternatively wait for it to be merged into main and fork the "main"/general Fairlearn repo. If you're using pip install from PyPI, you have to wait until this is merged and the next release is out. I hope that helps. For further questions I'd suggest you open an issue, start a discussion, or join us on Discord, since we don't want to sidetrack the PR discussion unless it's directly relevant to the PR. If you want to contribute to the ranking work, let us know via Discord so that we can coordinate.

adrinjalali (Member) left a comment

This PR seems quite mature to me. What do we need for it to get merged, @fairlearn/fairlearn-maintainers?


adrinjalali (Member) commented
Update: @MiroDudik needs to have a look at it.

bram49 (Contributor, Author) commented Apr 28, 2022

Sorry I have been a bit absent, but I would like to make the final changes so the merge can happen. As I understand it now:

  • A section needs to be added to the user guide (using literalinclude?)
  • The metric names have to be changed
  • Anything else?

adrinjalali (Member) commented
@MiroDudik also wanted to double-check against the existing literature and see if there's something we need to change.

*Ranking*:

Fairlearn includes two constraints for rankings, based on exposure: a measure for the amount of
attention an instance is expected to receive, based on their position in the ranking. Exposure is
Contributor commented:
Is this meant to be agnostic to the use case for ranking? How would the expectations for how much attention an instance might receive change in situations where rankings are bounded at particular intervals (e.g., a limited number of search results returned per page)?
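For illustration only (not from the PR): one way to express the bounded case described in this question is to truncate the position bias at a page cutoff k, similar to DCG@k, so that items beyond the cutoff receive zero exposure.

import numpy as np

def position_bias(ranks, k=None):
    # Logarithmic position bias; items ranked below the cutoff k get no exposure.
    ranks = np.asarray(ranks, dtype=float)
    bias = 1.0 / np.log2(1.0 + ranks)
    if k is not None:
        bias = np.where(ranks <= k, bias, 0.0)
    return bias

print(position_bias([1, 2, 3, 11, 12]))         # unbounded ranking
print(position_bias([1, 2, 3, 11, 12], k=10))   # only the first 10 results are shown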

[#6]_

* *Proportional exposure*: We try to keep the exposure that each item gets proportional to its
"ground-truth" relevance. Otherwise small differences in relevance can lead to huge differences
Contributor commented:
Are you able to say anything more about how this ground truth relevance is typically determined? i.e., is this something that data scientists would have access to a priori, or if not, is there guidance in the paper this is adapted from on how to determine this?
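A toy numeric sketch of the excerpt above, showing how a small difference in relevance can lead to a large difference in exposure under a logarithmic position bias (hypothetical numbers, not from the PR):

import numpy as np

relevance = np.array([0.80, 0.79])       # nearly identical "ground-truth" relevance
ranks = np.array([1.0, 2.0])             # deterministic ranking by relevance
exposure = 1.0 / np.log2(1.0 + ranks)    # position bias per item

print(exposure)               # [1.0, ~0.63]: the second item gets ~37% less attention
print(exposure / relevance)   # exposure per unit of relevance diverges sharply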

MiroDudik (Member) left a comment

My sense is that there is still some work to do.

I have pointed out the main piece in my comments, which is that we should make sure we are compatible with sklearn's ranking metrics.

The second piece, which I haven't brought up in the specific comments (and which actually might not be too much work), is that I think we should try to make sure that our API decisions around the ranking fairness metrics are compatible with more recent papers about fairness in rankings. In particular, I'm thinking about Section 3 of the paper:

(This second issue might be nothing, but I haven't yet thought carefully about it.)

Before we proceed, we should discuss how important it is to address the compatibility with sklearn's ranking metrics.

_ZERO_DIVISION_ERROR = "Average utility is 0, which causes a zero division error."


def exposure(y_true,
MiroDudik (Member) commented May 11, 2022

We should use an API that is similar to sklearn's dcg_score and the other ranking metrics.

That would entail:

  • replace y_pred by y_score
  • y_true and y_score should be two-dimensional, with each row corresponding to a different query
  • sample_weight would correspond to a weighting over queries (i.e., rows), rather than items

I know that we decided to go with a permutation instead of scores (because slicing of scores doesn't allow calculating exposure), but at that time I didn't realize that the existing sklearn metrics for ranking work with scores. If we decide to break away from that convention, we should maybe discuss it.
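A rough sketch of what the suggested dcg_score-shaped signature could look like (illustrative only: exposure_from_scores is a hypothetical name, and the conversion from scores to ranks as well as tie handling are simplified):

import numpy as np
from scipy.stats import rankdata

def exposure_from_scores(y_true, y_score, sample_weight=None):
    # y_true and y_score have shape (n_queries, n_items), one row per query,
    # mirroring sklearn.metrics.dcg_score; sample_weight weights queries (rows).
    # y_true is accepted for API symmetry but unused by exposure itself.
    y_score = np.asarray(y_score, dtype=float)
    ranks = rankdata(-y_score, method="ordinal", axis=1)   # rank 1 = highest score
    per_query = np.mean(1.0 / np.log2(1.0 + ranks), axis=1)
    return np.average(per_query, weights=sample_weight)

print(exposure_from_scores(y_true=[[3, 2, 1]], y_score=[[0.9, 0.5, 0.1]]))  # single query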

return np.dot(v, s_w).sum() / len(y_pred)


def utility(y_true,

As with exposure, we should use an API similar to sklearn's dcg_score.

return np.dot(u, s_w).sum() / len(u)


def proportional_exposure(

As with exposure, we should use an API similar to sklearn's dcg_score.


return e / u


def exposure_difference(

This one should also follow the API for the ranking metrics.

return result


def exposure_ratio(

This one should also follow the API for the ranking metrics.

return result


def proportional_exposure_difference(

This one should also follow the API for the ranking metrics.

I think that we may want to consider adjusting the definition as they did in Eq. (4) here:

return result


def proportional_exposure_ratio(

The same comments apply as for proportional_exposure_difference.

adrinjalali removed this from the V.8 milestone May 12, 2022
Development

Successfully merging this pull request may close these issues.

FEAT: Fairness of exposure
8 participants