ENH Add ranking metrics #974
base: main
Conversation
Metrics for evaluating rankings
Nice work, @bram49!
Do you want to add a brief example under examples where you have a very simple dataset and show the metrics you defined? Even better would be a user guide that builds on the example (using literalinclude) and explains the relevance of these metrics. Anyway, just thinking out loud here.
@romanlutz
Sounds good! Let me know if you have issues with reStructuredText or literalinclude. The existing rst files should be good examples (for the most part).
Documentation improvements
sorry, forgot to post this.
fairlearn/metrics/__init__.py
exposure,
utility,
exposure_utility_ratio,
allocation_harm_in_ranking_difference,
The doc says "Calculate the difference in exposure allocation"; based on that, the name for me would be exposure_allocation_difference.
Thank you for expanding the example with some more text!
Based on Hilde's review, "exposure allocation" is now called "exposure" and "exposure_utility_ratio" is now called "proportional exposure".
I finished writing the test cases, the notebook example, and the user doc. I think it is almost ready for merging. Would love to get some feedback. Thanks in advance!
I took another look at the user guide and doc strings and made a few small comments!
@@ -194,6 +194,24 @@ group loss primarily seeks to mitigate quality-of-service harms. Equalized
odds and equal opportunity can be used as a diagnostic for both allocation
harms as well as quality-of-service harms.
We might want to move this paragraph further down and add the ranking metrics as representing allocation / quality-of-service harm.
Sorry for taking so long to get back to this PR. My overall impression is quite positive. I've pointed out the larger questions I have which I think should be resolved before diving into detailed feedback (since lots of the details may (or may not) change as a result). I do think @fairlearn/fairlearn-maintainers input is needed here.
@@ -277,6 +277,9 @@ Base metric :code:`group_min` :code:`group_m
:func:`.selection_rate` . . Y Y
:func:`.true_negative_rate` . . Y Y
:func:`.true_positive_rate` . . Y Y
:func:`.exposure` . . Y Y
:func:`.utility` . . Y Y
utility seems a bit too generic. (I have a feeling @MiroDudik feels the same way...) Perhaps ranking_utility?
To some extent I'm wondering if these should be grouped with the other metrics at all, or whether this deserves its own section with ranking metrics.
Since exposure and utility only apply to rankings, I also think it is better to call them ranking_exposure and ranking_utility.
I second the proposed names.
I'm fine with ranking_exposure. Another option would be dcg_exposure. This would allow us to introduce, for example, rbp_exposure in future, see Eq. (2) here:
My impression is that utility or even ranking_utility is not the best naming choice for the objective we are calculating, and it is also not very standard, because in most contexts utility is just a synonym for score, so I would expect that it refers to things like dcg_score or ndcg_score. That is actually how they use the word utility even in the "Fairness of Exposure in Rankings" paper. So, I'd be in favor of using some variation of relevance. Maybe average_relevance or mean_relevance?
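For readers less familiar with the naming, the difference between a dcg_-style and an rbp_-style exposure lies only in the position discount. A sketch of the standard forms (not necessarily the exact equation referenced above, and the normalization may differ):

```latex
% Position weight v_j assigned to the item placed at rank j
\begin{align}
  v_j^{\mathrm{DCG}} &= \frac{1}{\log_2(1 + j)}       \\ % logarithmic discount, as in DCG
  v_j^{\mathrm{RBP}} &= (1 - \gamma)\,\gamma^{\,j-1}      % geometric discount, persistence \gamma \in (0, 1)
\end{align}
```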
@@ -0,0 +1,145 @@
# Copyright (c) Fairlearn contributors.
I'm a bit surprised we're adding an example, but not a user guide section. Sometimes the latter can borrow from the former through literalinclude, but in my mind the first step is always the user guide. That said, I don't think we're making that distinction particularly clear at the moment, which is probably something to discuss on the community call again (topic: "structure of documentation" with a particular focus on user guide vs. examples).
This example looks quite okay to me, we can just make sure we have links to it from the right places in the user guide and the API guides.
@MiroDudik did you hear back from your colleague regarding ranking metrics?
@bram49 did you want to reply to @romanlutz's queries?
Yes, sorry for taking so long, I was waiting for maintainer input to see if this PR has any chance of being merged. If you think that this is a valuable addition to Fairlearn, then I'll implement all feedback and make sure it can get merged.
Tagging @MiroDudik
Thanks for the quick response @bram49. We're thinking about our next (v0.8) release, and were hoping to have this PR included.
@riedgar-ms Great! Then I'll update the PR this week.
Co-authored-by: Roman Lutz <romanlutz13@gmail.com>
Co-authored-by: Roman Lutz <romanlutz13@gmail.com>
Hello everyone, thank you for putting this great metric together. I have been trying to fork the code on my machine for hours, but I can't find the original files, such as the notebook. Please guide me to the location of the ranking metric (Fairness of Exposure in Rankings) in the repository.
Hi! I'm interested in ranking and fairness too. I like the example and scenario for the exposure metric. Unfortunately, I have no experience with how to put these files together from different branches. I have been trying for days, and I keep getting the same errors. Here are the errors: (mn) C:\Users\MN\exp\fairlearn\examples>python plot_ranking.py
@mnalk this is a pull request, so you would have to either fork @bram49's fork of Fairlearn from which he created this PR, or wait for it to be merged into main and fork the main Fairlearn repo. If you're using pip install from PyPI, you have to wait until this is merged and the next release is out. I hope that helps. For further questions I'd suggest you open an issue, start a discussion, or join us on Discord, since we don't want to sidetrack the PR discussion unless it's directly relevant to the PR. If you want to contribute to the ranking work, let us know via Discord so that we can coordinate.
This PR seems quite mature to me. What do we need for it to get merged, @fairlearn/fairlearn-maintainers?
Update: @MiroDudik needs to have a look at it.
Sorry I have been a bit absent, but I would like to make the final changes to make the merge happen. As I understand it now:
@MiroDudik also wanted to double-check against the existing literature and see if there's something we need to change.
*Ranking*:

Fairlearn includes two constraints for rankings, based on exposure: a measure for the amount of
attention an instance is expected to receive, based on its position in the ranking. Exposure is
Is this meant to be agnostic to the use case for ranking? How would the expectations for how much attention an instance might receive change in situations where rankings are bounded at particular intervals (e.g., a limited number of search results returned per page)?
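For context, the group-level quantity being described is, roughly, the average position discount received by a group's members. A sketch of the deterministic-ranking form used in "Fairness of Exposure in Rankings" (notation mine):

```latex
% Average exposure of group G in a ranking, where rank(d) is the position of
% item d and the discount is the usual logarithmic one
\mathrm{Exposure}(G) = \frac{1}{|G|} \sum_{d \in G} \frac{1}{\log_2\bigl(1 + \mathrm{rank}(d)\bigr)}
```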
[#6]_

* *Proportional exposure*: We try to keep the exposure that each item gets proportional to its
  "ground-truth" relevance. Otherwise small differences in relevance can lead to huge differences
Are you able to say anything more about how this ground truth relevance is typically determined? i.e., is this something that data scientists would have access to a priori, or if not, is there guidance in the paper this is adapted from on how to determine this?
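And for "proportional exposure", the constraint being sketched is, roughly, that the exposure-to-relevance ratio should match across groups (following the disparate-treatment idea in the same paper; u denotes average ground-truth relevance, however it is obtained):

```latex
% Proportional exposure: the exposure-to-relevance ratio is equal across
% groups G_a and G_b, so exposure scales with average relevance u
\frac{\mathrm{Exposure}(G_a)}{u(G_a)} = \frac{\mathrm{Exposure}(G_b)}{u(G_b)}
```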
My sense is that there is still some work to do.
I have pointed out the main piece in my comments, which is that we should make sure that we are compatible with sklearn's ranking metrics.
The second piece, which I haven't brought up in the specific comments (and which actually might not be too much work) is that I think we should try to make sure that our API decisions around the ranking fairness metrics are compatible with more recent papers about fairness in rankings. In particular, I'm thinking about Section 3 of the paper:
(This second issue might be nothing, but I haven't yet thought carefully about it.)
Before we proceed, we should discuss how important it is to address the compatibility with sklearn's ranking metrics.
_ZERO_DIVISION_ERROR = "Average utility is 0, which causes a zero division error."


def exposure(y_true,
We should use an API that is similar to sklearn's dcg_score and other ranking metrics. That would entail:
- replace y_pred by y_score
- y_true and y_score should be two-dimensional, with each row corresponding to a different query
- sample_weight would correspond to weighting over queries (i.e., rows), rather than items

I know that we decided to go with the permutation instead of scores (because slicing of scores doesn't allow calculating exposure), but at that time I didn't realize that the existing sklearn metrics for ranking work with scores. If we decide to break away from that convention we should maybe discuss.
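For what it's worth, a minimal sketch of what that convention could look like (hypothetical code, not this PR's implementation; the logarithmic discount is assumed here just for illustration):

```python
import numpy as np
from sklearn.metrics import dcg_score  # existing sklearn ranking metric, for API reference

# Hypothetical sketch: an exposure metric mirroring sklearn's dcg_score convention.
# y_true and y_score are 2-D arrays of shape (n_queries, n_items); sample_weight
# weighs queries (rows), not items.
def exposure(y_true, y_score, *, sample_weight=None):
    y_score = np.asarray(y_score, dtype=float)
    # Position of each item when sorted by descending score (1 = top of the ranking).
    ranks = (-y_score).argsort(axis=1).argsort(axis=1) + 1
    # Logarithmic position discount, averaged over the items of each query.
    # y_true is accepted only for signature symmetry; exposure ignores relevance.
    per_query = (1.0 / np.log2(1 + ranks)).mean(axis=1)
    return float(np.average(per_query, weights=sample_weight))

y_true = [[3, 2, 0], [1, 0, 2]]               # graded relevance, one row per query
y_score = [[0.9, 0.5, 0.1], [0.2, 0.1, 0.7]]  # predicted scores, same shape
print(dcg_score(y_true, y_score))   # sklearn's metric with this calling convention
print(exposure(y_true, y_score))    # the proposed metric called the same way
```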
return np.dot(v, s_w).sum() / len(y_pred)


def utility(y_true,
As with exposure, we should be using an API similar to sklearn's dcg_score.
return np.dot(u, s_w).sum() / len(u)


def proportional_exposure(
As with exposure, we should be using an API similar to sklearn's dcg_score.
return e / u


def exposure_difference(
This one should also follow the API for the ranking metrics.
return result


def exposure_ratio(
This one should also follow the API for the ranking metrics.
return result


def proportional_exposure_difference(
This one should also follow the API for the ranking metrics.
I think that we may want to consider adjusting the definition as they did in Eq. (4) here:
return result


def proportional_exposure_ratio(
The same comments apply as for proportional_exposure_difference.
Closes #959
Design choices:
- exposure, which can be used to show allocation harms
- utility, which is the average y_true of a subgroup
- proportional exposure, which is used to show quality-of-service harms, since you want the exposure of a group to be comparable to its relevance

Problem
This PR has the same problems as discussed in #756: the exposure metric does not use y_true, and y_pred is used to input a ranking, which might be confusing (see the sketch below).
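To illustrate that y_pred-as-ranking convention and how the per-group values would typically be compared, here is a minimal sketch; the exposure helper below is a simplified stand-in (assumed logarithmic discount), not the code from this PR:

```python
import numpy as np
import pandas as pd
from fairlearn.metrics import MetricFrame

# Simplified stand-in for the proposed exposure metric: logarithmic position
# discount 1 / log2(1 + rank), averaged over the items handed to the metric.
# y_true (relevance) is accepted but unused, which is exactly the issue noted above.
def exposure(y_true, y_pred):
    ranks = np.asarray(y_pred)  # here y_pred holds each item's rank, 1 = top
    return float(np.mean(1.0 / np.log2(1 + ranks)))

# Toy ranking of six items from two groups; relevance plays the role of y_true.
data = pd.DataFrame({
    "relevance": [3, 3, 2, 2, 1, 1],
    "rank":      [1, 2, 4, 6, 3, 5],
    "group":     ["A", "A", "A", "A", "B", "B"],
})

mf = MetricFrame(
    metrics=exposure,
    y_true=data["relevance"],
    y_pred=data["rank"],
    sensitive_features=data["group"],
)
print(mf.by_group)      # average exposure per group
print(mf.difference())  # allocation-harm-style difference between groups
```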
Example
TODO