
Add Abstention tasks #717

Closed

Conversation

ManuelFay
Contributor

@ManuelFay ManuelFay commented May 14, 2024

Hello MTEB team !
In this PR, jointly with @hgissbkh , we propose the introduction of a new task to the MTEB leaderboard: Abstention.

The motivation is quite simple. Over the last few years, Neural Retrieval has significantly improved upon heuristic-based IR systems. Yet no model is perfect, and models are often unable to retrieve documents relevant to the user's query. Abstention mechanisms would therefore go a long way toward improving the usability of such models, increasing confidence in retrieved results and giving users some control over the recall/precision tradeoff!

Going further, "simple" black-box abstention mechanisms (based on class logit distributions) have been shown to produce strong abstention baselines in classification tasks. In a recent work (https://arxiv.org/abs/2402.12997), we extend this to embedding models and show that simple abstention heuristics common in the abstention literature (the score of the best retrieved passage, the score difference between the best and second-best retrieved passages, etc.) yield non-trivial abstention mechanisms. In our work (soon to be published at TMLR), we go further and propose more complex calibration-based abstention mechanisms, but we feel MTEB is meant to evaluate models, not mechanisms. As such, in this PR, we only implement 3 simple abstention heuristics that work out of the box on retrieval and reranking tasks.

What we aim to assess with this PR is whether or not retrieval models (bi-encoder embedding models here) yield sufficiently calibrated (query, document) scores to not only rank the documents and surface the best ones (the retrieval task), but also, given a ranking and the associated document scores, compute a "confidence score" that enables abstention when it is too low. Better-calibrated embedding models will perform better, and we feel this task opens up new and very interesting properties of embedding models!

While still new (and so not necessarily a task to add to the leaderboard right away), we feel having the codebase in MTEB is a nice first step. In this PR, we add the abstention task for 13 datasets spanning retrieval and reranking across 4 languages !

I'll further let @hgissbkh describe the metrics used in more detail !

  • I have tested that the task runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

@ManuelFay
Contributor Author

Tagging @imenelydiaker for the review, don't hesitate in case you have any doubts or questions !
Cheers !

@imenelydiaker
Contributor

Tagging @imenelydiaker for the review, don't hesitate in case you have any doubts or questions ! Cheers !

This is very interesting thank you! I'll just leave one comment before reading the paper and try to review everything tomorrow.
Also adding other reviewers that may be interested @KennethEnevoldsen @Muennighoff @orionw

@imenelydiaker
Contributor

imenelydiaker commented May 14, 2024

[Didn't review code yet, just read the paper, very interesting btw]

So if I get this right, for a search query: first a retrieval step is performed, then you get similarity scores and assign confidence scores to them using a function like the max or std; then if your confidence is higher than a threshold you add the doc to the reranker pipeline, else you discard it. Is this right?

Some questions:

  1. Is the main idea about discarding uncertain documents in a search/IR pipeline?
  2. How does the linear model that assesses confidence scores generalize to other datasets? Or is it trained specifically for each dataset?
  3. From my understanding this works in a pipeline combining retrieval and reranking, so when applied to Retrieval only, what should happen after assessing confidence scores? Does it just discard less relevant documents? Or maybe I'm missing something here.

@hgissbkh
Contributor

hgissbkh commented May 15, 2024

Hello Imene and thanks Manu for the explanations!
Here are some further details regarding your questions.

  1. Is the main idea about discarding uncertain documents in a search/IR pipeline?

Not exactly: the idea is about discarding a whole instance, i.e., a query and its set of candidate documents. Assume that for a given query we have K candidate documents; first we compute the K query-document relevance scores (e.g., cosine similarity on embeddings) and then build our confidence function on them (the simplest approach is to take the max). I have attached a diagram to make this clearer.

[Screenshot: diagram illustrating instance-level abstention]
  2. How does the linear model that assesses confidence scores generalize to other datasets? Or is it trained specifically for each dataset?

In our paper, the linear confidence function is trained in-domain, i.e., for each specific dataset. However, the confidence functions we propose here are simple heuristics that don’t need to be trained (maximum score, standard deviation, highest score minus second highest score). For example, taking the maximum score has been proven to be a good proxy for how much a model is confident on a given classification instance (Hendrycks et al., 2016). We show in our paper that this observation transfers well to document ranking in general.
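In code, these heuristics boil down to something like the following minimal sketch (plain NumPy, reusing the "max" / "std" / "P1P2" names from this PR; the actual implementation may differ in its details):

```python
import numpy as np

def confidence_scores(scores):
    """Simple, training-free confidence heuristics computed from the K
    query-document relevance scores of a single instance."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]  # decreasing order
    return {
        "max": float(s[0]),          # highest relevance score (Hendrycks et al., 2016)
        "std": float(np.std(s)),     # spread of the scores (population std)
        "P1P2": float(s[0] - s[1]),  # margin between the top-1 and top-2 scores
    }

# e.g. five query-document cosine similarities
print(confidence_scores([0.9, 0.8, 0.7, 0.3, 0.2]))
# {'max': 0.9, 'std': ~0.28, 'P1P2': ~0.1}
```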

  3. From my understanding this works in a pipeline combining retrieval and reranking, so when applied to Retrieval only, what should happen after assessing confidence scores? Does it just discard less relevant documents? Or maybe I'm missing something here.

The code implementation we propose is built to work when performing retrieval and reranking independently, not necessarily in a pipeline. In a practical setting in which we have one query and K documents, we can compute K similarity scores. We can then compute the maximum score and use it as a confidence estimator. If it is below a certain threshold (the higher the threshold, the more we abstain), then the whole instance is discarded, i.e., the query and the K documents.

Hope this helps!
Thanks again for your questions and don’t hesitate if you have any additional doubts!

@orionw
Contributor

orionw commented May 15, 2024

In our paper, the linear confidence function is trained in-domain, i.e., for each specific dataset. However, the confidence functions we propose here are simple heuristics that don’t need to be trained (maximum score, standard deviation, highest score minus second highest score). For example, taking the maximum score has been proven to be a good proxy for how much a model is confident on a given classification instance (Hendrycks et al., 2016). We show in our paper that this observation transfers well to document ranking in general.

Really interesting task @hgissbkh!

I am still a little confused on the task definition, so I will explain what I understand:

  • The model does standard search and returns K documents with scores S_1, ..., S_K
  • Some heuristic is applied on top of that to remove any scores less than a standard deviation away (or perhaps a number threshold)
  • We then have some number of documents left that passed this heuristic D, where |D| <= |K|

I guess my questions are:

  1. How are you defining ground truth here? Are there annotations for what queries should be abstained from and which ones should return documents?
  2. Can new models implement their own functions for abstention or do they use these heuristics the same? I think this may get rather complicated if models need to define a new function for calibration along with retrieval
  3. Is there an evaluation of calibration in the metrics or just the documents that are returned?

I've typically seen this abstention done as a 2nd stage problem after the initial retriever is used, so it's taking me a bit to adjust to how this could be used for a first stage model.

@hgissbkh
Contributor

hgissbkh commented May 15, 2024

Hi @orionw and thanks for your comment!
I will try to clarify things a bit.

The question we are trying to address with abstention is not to evaluate confidence at the document level but at the whole instance level. Let’s take your example again:

  • The model does standard search and returns the K most-relevant documents D_1, …, D_K with respect to query Q. Those documents have scores S_1 > … > S_K. Note that what I call an instance here is the tuple (Q, D_1, …, D_K) (the query and the K retrieved documents).
  • The confidence heuristic is then applied on the tuple of scores (S_1, …, S_K), not to each of them separately. For instance, we can take the maximum score as a confidence estimator (Hendrycks et al., 2016): c(S_1, …, S_K) = max(S_1, …, S_K).
  • Then, we decide whether to return the retrieved documents or not, depending on whether c(S_1, …, S_K) is greater than a given threshold that controls the abstention probability (the higher the threshold, the more we abstain overall). Put more formally, say a retriever is a function R that takes as input a query Q, a document database DD, and a number of documents to return K, and that returns the top-K documents: R(Q, DD, K) = (D_1, …, D_K). Then, if c(S_1, …, S_K) < tau, R_abst(Q, DD, K) = "I'm not sure", otherwise R_abst(Q, DD, K) = (D_1, …, D_K) (see the small sketch right after this list).
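As a tiny sketch of this decision rule (hypothetical function names, not the PR's actual API):

```python
def retrieve_with_abstention(retrieve, query, doc_db, k, confidence, tau):
    """Hypothetical sketch of R_abst: run retrieval, then abstain if the
    confidence computed on the K scores falls below the threshold tau."""
    docs, scores = retrieve(query, doc_db, k)  # R(Q, DD, K) -> top-K docs and scores
    if confidence(scores) < tau:
        return None                            # abstain: "I'm not sure"
    return docs                                # confident enough: return the ranking
```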

Now, regarding your questions more specifically:

  1. How are you defining ground truth here? Are there annotations for what queries should be abstained from and which ones should return documents?

Good question! No, we do not need any additional annotations for abstention evaluation. We evaluate abstention quality by computing a normalized area under the metric-abstention curve. Let me explain:
Let’s take a test dataset T consisting of N instances (Q_1, D_1,1, …, D_1,K, Y_1), …, (Q_N, D_N,1, …, D_N,K, Y_N). In the vanilla case, we would simply evaluate the retrieval system by assessing the relevance of the K returned documents with respect to ground truth Y, for instance using NDCG@K. After doing this for each of the N instances, we can finally compute an average NDCG@K on T.
Now, let’s assume we want to keep only the instances for which the confidence function is above a certain threshold tau. We would get a new dataset T_tau included in T (|T_tau| ≤ |T|) for which we could also compute the average NDCG@K (hopefully greater than without abstention). Doing this for increasing values of tau, we would get an increasing curve. We finally compute the area under the curve to assess overall abstention quality.
To get further details, you can have a look at Figure 2 and Section 4.3 of our paper (https://arxiv.org/abs/2402.12997).
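As a minimal sketch of this computation, assuming the per-instance metric values (e.g., NDCG@K) and per-instance confidence scores have already been computed (the actual MTEB metric is a normalized version of this area, which the sketch omits):

```python
import numpy as np

def abstention_auc(metric_per_instance, confidence_per_instance):
    """Sketch: area under the metric-vs-abstention-rate curve.

    For each abstention rate r = k/N, drop the k least-confident instances,
    average the metric over the remaining ones, then integrate over r.
    """
    metric = np.asarray(metric_per_instance, dtype=float)
    conf = np.asarray(confidence_per_instance, dtype=float)
    metric = metric[np.argsort(conf)]  # least-confident instances first

    n = len(metric)
    rates = np.array([k / n for k in range(n)])               # abstention rates 0, 1/n, ...
    values = np.array([metric[k:].mean() for k in range(n)])  # mean metric on kept instances

    # Trapezoidal rule over the (abstention rate, metric) curve.
    return float(np.sum(np.diff(rates) * (values[1:] + values[:-1]) / 2.0))
```

For instance, with per-instance NDCGs [0.63, 0.92, 1.0] and max-score confidences [0.5, 0.7, 0.9] (the toy example discussed later in this thread), the curve passes through (0, 0.85), (1/3, 0.96), (2/3, 1.0).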

  2. Can new models implement their own functions for abstention or do they use these heuristics the same? I think this may get rather complicated if models need to define a new function for calibration along with retrieval

The heuristics we use are deliberately general and can be applied to any model without any need for adjustment. The three confidence functions we use are the max, the std, and S_1 - S_2 (highest score minus second-highest score). These are common functions used for confidence estimation in the classification setting (Narayanan et al., 2012; Hendrycks et al., 2016; Pang et al., 2021; Gawlikowski et al., 2023). In our paper (Gisserot-Boukhlef et al., 2024), we show that those heuristics transfer well to the document ranking setting.

  3. Is there an evaluation of calibration in the metrics or just the documents that are returned?

I guess my response to question 1 answers this one too! :)

Hope this clarifies a bit and of course don’t hesitate if you have additional questions!

@ManuelFay
Contributor Author

Hello,
Any thoughts on this @orionw ?

@orionw
Contributor

orionw commented May 17, 2024

Thanks for explaining @hgissbkh and for the ping @ManuelFay! This is a novel idea and I'm still trying to parse the details of your statement. Perhaps I'll have to give the paper a full read to understand.

I'm still a little worried about the automatic evaluation without gold data, as each embedding model will have a different calibration, but I also admit I don't fully understand the example you gave. If @imenelydiaker understands, she can continue with the review; otherwise I'll need to take some time to work through an example.

Thanks and sorry for the delay.

@ManuelFay
Contributor Author

To be clear, there is gold data! It corresponds to the original dataset's gold data.
Basically, given a retriever, we compute the (query, document) pair scores (as in retrieval).
Given these scores, we then output a scalar "confidence value" (using simple techniques).
This confidence value enables us to compute AUC scores that assess whether abstention improves the results!
Basically, if abstaining on the 20% of queries with the least "confident" scores improves MAP values, the AUC score will be > 0.

In conclusion, this is not a reference-free task! We are using the gold (query, document) pairs used in retrieval and reranking to look at the performance delta at different abstention ratios.

@KennethEnevoldsen
Contributor

@orionw did you have a chance to look over an example? (I am going over PRs to get all datasets in)

@hgissbkh
Contributor

Hey @orionw, I suggest we look at a concrete example!

Let’s assume we have a query Q and want to retrieve the top-5 documents using a retrieval system R. We retrieve those 5 documents, let’s say with the following scores: 0.9, 0.8, 0.7, 0.3, 0.2. If we choose the max as the confidence function (Hendrycks et al., 2016), then we get a confidence score of 0.9 (maximum of the 5 retrieved documents’ scores). Assume also that I have previously chosen a confidence threshold of 0.7. As 0.9 > 0.7, we decide to return the 5 retrieved documents. On the contrary, if we had set a confidence threshold of 0.95, we would have discarded the instance and returned an abstention message to the user.

If we extend this rationale to the whole dataset, we see that the number of abstained instances changes depending on the threshold we choose. And that’s how we construct AUC!

  • Assume we have the following test dataset, consisting of 3 instances (each instance consists of a query, the 5 retrieved documents’ scores, and the ground truth, where 1 represents a relevant document and 0 an irrelevant one):
    {(Q_1, [0.5, 0.4, 0.3, 0.2, 0.1], [0, 1, 0, 0, 0])
    (Q_2, [0.7, 0.5, 0.3, 0.2, 0.2], [1, 0, 1, 0, 0])
    (Q_3, [0.9, 0.8, 0.3, 0.3, 0.2], [1, 1, 0, 0, 0])}
  • We can first evaluate the 3 instances using NDCG@5: we get 0.63, 0.92, 1 respectively.
  • Then, let’s make the confidence threshold vary. If we set it equal to 0.4, no instance is discarded and the test NDCG is equal to (0.63 + 0.92 + 1) / 3 = 0.85. If we now set it to 0.6, the first instance is discarded (max score equal to 0.5) and the test NDCG becomes (0.92 + 1) / 2 = 0.96. Finally, setting the threshold to 0.8, the first two instances are discarded and the test NDCG becomes 1.
  • We get an increasing NDCG-abstention curve: (0%, 0.85), (33%, 0.96), (66%, 1).
  • We finally compute the area under this NDCG-abstention curve to evaluate abstention on the test dataset (a small sketch reproducing these numbers follows right after this list).
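For concreteness, here is a minimal sketch reproducing the numbers above (it relies on scikit-learn's ndcg_score for the per-instance metric; this is illustrative code, not the PR's implementation):

```python
import numpy as np
from sklearn.metrics import ndcg_score

# The three toy instances: (retrieved scores, ground-truth relevance).
instances = [
    ([0.5, 0.4, 0.3, 0.2, 0.1], [0, 1, 0, 0, 0]),
    ([0.7, 0.5, 0.3, 0.2, 0.2], [1, 0, 1, 0, 0]),
    ([0.9, 0.8, 0.3, 0.3, 0.2], [1, 1, 0, 0, 0]),
]

# Per-instance NDCG@5 and max-score confidence.
ndcgs = np.array([ndcg_score([rel], [scores], k=5) for scores, rel in instances])
confs = np.array([max(scores) for scores, _ in instances])
print(ndcgs.round(2))  # approx. [0.63, 0.92, 1.0]

# Average NDCG@5 over the instances kept at increasing confidence thresholds.
for tau in [0.4, 0.6, 0.8]:
    kept = confs >= tau
    print(tau, round(ndcgs[kept].mean(), 2))
# 0.4 -> 0.85 (no abstention), 0.6 -> 0.96, 0.8 -> 1.0
```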

@orionw
Contributor

orionw commented May 21, 2024

Thanks @hgissbkh, that is very helpful and I definitely see the appeal of the task.

A couple of follow-up questions: it seems like this is reducing the length of the returned list (since the least confident entries will be at the bottom). So in essence, this is a metric computing the area under the nDCG@k score curve at various confidence scores -- is that correct?

Does that metric (nDCG@5 say) stay fixed or is it nDCG@length of the list? Looking at the paper it seems fixed? So it's like taking the nDCG@5 score (or mAP score, etc.) at various score thresholds.

I'm also trying to think about the edge cases here: if we have a retrieval system whose minimum score is 0.9, the nDCG score will be the same until the last confidence values, correct?

This seems like we're asking the models to have absolute scores (between 0-1) whereas perhaps the model does well but uses a relative range (0.9-1 for example). Is there a reason to prefer the one with the broader range?

I would also be interested in hearing @imenelydiaker or @KennethEnevoldsen's thoughts on this PR so we could get a perspective other than mine. This is a unique approach to using the retrieval scores, which is super neat!


Separately from a code perspective: could we make it so that this abstract task can wrap other tasks?

It seems like this could take in a RetrievalTask (or RerankingTask), do the normal retrieval/reranking, get the run file, and then perform the AbstentionTask on top of it. If this understanding is correct, we wouldn't need a separate AbstentionTask for each normal task and instead could re-use every existing retrieval and re-ranking task with it, which would be really nice.

I could be missing the reason why you'd do it differently, so please let me know!

@imenelydiaker
Contributor

imenelydiaker commented May 21, 2024

Hey @hgissbkh and @ManuelFay, thank you for the detailed explanations. If abstention becomes a standard in RAG systems for example, this task would be very helpful I suppose.

Currently, I may be failing to see something obvious:

  • Is it correct to have a fixed threshold for all models? As @orionw mentioned, some models may use relative score ranges.
  • If a model gets a higher abstention score than another one, what would this mean?

Help me understand if I'm not getting this right, please:
The way I see abstention is as a top layer after retrieval (or any other task), so what I'm failing to see is how it is an embedding evaluation task. It relies mostly on the scores of the task it is applied to (retrieval in your example).

According to my understanding, abstention scores will depend on the initial retrieval task scores and a fixed threshold. So there is a correlation between an embedder's performance on the initial task and its abstention score: an embedder that is good at retrieval will be good at abstention. Is this correct?

Can we say the same about other evaluation tasks of MTEB? e.g., is there a correlation between classification and STS tasks or any other tasks?

@ManuelFay
Contributor Author

Is it correct to have a fixed threshold for all models? As @orionw mentioned, some models may use relative score ranges.

---> No, there is no fixed threshold; we are computing an "area under the curve" score. Basically, you vary the threshold and see how much better the result is than a baseline random policy. This is a classic metric for classification tasks requiring a threshold.
[Image: example ROC curve]

If a model gets a higher abstention score than another one, what would this mean?

It would mean it is better calibrated. A "calibrated" retrieval model does not only need to rank documents in the correct order, but also to output score magnitudes that are coherent. This property is not captured by ranking metrics at the moment! Typically, we would like the "best" document for a given query to have a low score if it is not actually that relevant, and vice versa.

So there is a correlation between an embedder's performance on the initial task and its abstention score: an embedder that is good at retrieval will be good at abstention. Is this correct?

We have found this to be mostly the case, but it is not true by design. Again, since this is essentially a measure of calibration, a better-calibrated model might not necessarily be a better retrieval model; this will depend on the loss function, regularization, retriever architecture, etc. However, a very bad retrieval model will have a hard time being correctly calibrated, since it is inherently bad.

Can we say the same about other evaluation tasks of MTEB? e.g., is there a correlation between classification and STS tasks or any other tasks?

There definitely is a strong correlation between some tasks in MTEB.

To sum it up, what we are proposing here is a tractable way of measuring retriever calibration in a practical and useful setting.

Having said that, from the discussions in this PR, I feel the task is still explained a bit unclearly and we were probably not great at conveying the core concepts! Thanks a ton for taking the time; it is super valuable feedback in any case!

@ManuelFay
Contributor Author

@orionw I hadn't seen your comment about the code.

In essence, the abstention task is a wrapper around other tasks (reranking, retrieval). It inherits all methods and properties of the underlying tasks; essentially the only thing that changes is the associated scorer class. Since all tasks have their custom input format and methods, the wrapper just routes to the correct underlying task, retrieves the scores, and computes the abstention scores.

I decided to make abstention a separate task in order to contain all modifications to a single directory and avoid having to make big breaking changes in the evaluation methods of the other tasks. I put some thought into it and figured it should be treated as a task on its own to guarantee consistency and allow for granular code modifications, rather than adding another set of metrics to the other tasks. I also feel it makes the task more readable.
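To illustrate the routing idea only (hypothetical class and method names, not the actual code in this PR), the wrapper roughly amounts to something like:

```python
class AbstentionTaskWrapper:
    """Hypothetical sketch: wrap an existing Retrieval or Reranking task, reuse
    its data loading and metadata, and only swap in an abstention-aware scorer."""

    def __init__(self, base_task, abstention_evaluator_cls):
        self.base_task = base_task                            # e.g. an existing Retrieval task
        self.abstention_evaluator_cls = abstention_evaluator_cls

    def __getattr__(self, name):
        # Route data loading, metadata, splits, etc. to the wrapped task.
        return getattr(self.base_task, name)

    def evaluate(self, model, split="test", **kwargs):
        # Run the underlying task's retrieval/reranking, then score abstention
        # on top of the resulting (query, document) scores.
        evaluator = self.abstention_evaluator_cls(self.base_task, split=split, **kwargs)
        return evaluator(model)
```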

@hgissbkh
Contributor

hgissbkh commented May 22, 2024

Hi @orionw , regarding your follow-up questions:

-> So in essence, this is a metric computing the area under the nDCG@k score curve at various confidence scores -- is that correct?
At various confidence thresholds I would say, but you got the idea!

-> Does that metric (nDCG@5 say) stay fixed or is it nDCG@length of the list?
Yes it is fixed. First you choose your metric, say NDCG@5, and then you compute AUC based on this metric. But you can of course compute the AUC for any metric you want (NDCG@k, MAP…).

-> If we have a retrieval system whose minimum score is 0.9, the nDCG score will be the same until the last confidence values, correct?
We took care of this edge case in our implementation so that it is flexible to any retrieval system. We first have a look at all the NDCGs in the test set and then select the thresholds adequately. Taking my example from above, if the confidence scores are 0.9, 0.95 and 0.99 instead of 0.5, 0.7 and 0.9, we would simply choose the confidence thresholds differently and take for example 0.8, 0.92 and 0.96.
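One way to picture this threshold selection (an assumption about the mechanism, sketched with simple quantiles; the implementation in the PR may choose thresholds differently):

```python
import numpy as np

def adaptive_thresholds(confidence_scores, n_thresholds=10):
    """Sketch: pick confidence thresholds from the empirical distribution of the
    observed confidence scores, so the abstention rate sweeps from 0 towards 1
    regardless of the model's score range (e.g. 0.5-0.9 vs. 0.9-0.99)."""
    conf = np.asarray(confidence_scores, dtype=float)
    return np.quantile(conf, np.linspace(0.0, 1.0, n_thresholds, endpoint=False))

print(adaptive_thresholds([0.5, 0.7, 0.9], n_thresholds=3))    # approx. [0.5, 0.63, 0.77]
print(adaptive_thresholds([0.9, 0.95, 0.99], n_thresholds=3))  # approx. [0.9, 0.93, 0.96]
```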

-> This seems like we're asking the models to have absolute scores (between 0-1) whereas perhaps the model does well but uses a relative range (0.9-1 for example).
I am not sure I get this one, sorry 😅

@orionw
Contributor

orionw commented May 22, 2024

@ManuelFay, thanks, that makes sense. I think my suggestion would be to make the task classes themselves simpler, since we're just wrapping the other ones -- like if there was any way to wrap GerDaLIRSmall without having to redefine all those fields. That way we don't have the same task repeated twice, if possible.


@hgissbkh thanks for the detailed response.

We first have a look at all the NDCGs in the test set and then select the thresholds adequately.

This was basically the question I had at the end, how do we determine these thresholds?

It seems tricky because on the one hand if you use a standard range (0-1) there may be some thresholds where nothing is removed and the scores penalize models which have smaller ranges (like a model that predicts between 0.9-1 always). On the other hand, if you dynamically calculate it (e.g. 0.91, 0.92, ... 0.99) it doesn't seem like it measures confidence but rather just the normal ranking stat (e.g. a dynamic nDCG@k curve where you take every value of k and calculate the AUC). Of these, the former seems more suited to calibration even if it ends up punishing the model for not fully using the range of similarity scores.

I could still be misunderstanding, so thanks for your patience :)

First you choose your metric, say NDCG@5, and then you compute AUC based on this metric.

Aren't metrics like NDCG@5 ill-suited for this, since the top five documents won't change until you remove nearly all the documents?

@hgissbkh
Contributor

Hi @orionw,
From what I understand, you are arguing that to measure calibration, we should have a fixed range of thresholds that does not vary depending on the domain. By dynamically adjusting our thresholding (basically varying abstention rates from 0 to 1, rather than the threshold from 0 to 1), we intended to obtain a clearer and more meaningful signal, which would often be squashed if computed on the 0-to-1 threshold interval for every domain.
We understand that this is an added complexity that complicates the readability of a leaderboard like MTEB, and although we feel it makes the most sense, it definitely is an arbitrary choice.
Given the discussions in this PR, we do not intend to insist on merging this PR if you all feel it's not the right place for it at the moment.
In any case, thanks for the valuable feedback and the great work!
Cheers,
Hippo and Manu

@KennethEnevoldsen KennethEnevoldsen left a comment

Some general thoughts:

I generally find the task both promising and meaningful. It follows a meaningful trend where we are not only interested in performing well, but also in failing gracefully if either the answer doesn't exist or the model is unable to find it.

While still new (and so not necessarily a task to add to the leaderboard right away), we feel having the codebase in MTEB is a nice first step. In this PR, we add the abstention task for 13 datasets spanning retrieval and reranking across 4 languages !

I very much agree with this point: at least for now it seems meaningful to add the task without necessarily adding it to a benchmark as the first thing.

Reading through the very detailed comments above (a big thanks to everyone who spent the time on these):

  • There are some concerns about the use of AUC. I believe it is reasonable in this case, assuming we allow models to produce their own confidence scores (see comment below) and have a reasonable default.
  • There seems to be some concern about the task wrapping, which I can reasonably understand: essentially we are duplicating a task. Am I wrong to argue that we could extend the existing retrieval and reranking tasks to also measure abstention? If so, I would argue that it might be a more promising avenue. This would avoid duplicates in the benchmark while introducing what I believe to be a good measure to a wider set of tasks.

I have some comments on the code as well (but I have only added those that are relevant to whether or not the task should be added); if we decide it should be added, I will give a new review.

Comment on lines +286 to +294
conf_scores[i] = {
    "max": pred_scores_sort[0],  # highest query-document score
    "std": (
        sum((sc - pred_scores_mean) ** 2 for sc in pred_scores)
        / len(pred_scores)
    )
    ** (1 / 2),  # population standard deviation of the scores
    "P1P2": pred_scores_sort[0] - pred_scores_sort[1],  # margin between top-1 and top-2
}

I would allow the model to choose its own confidence score, but with a meaningful default. An option would be to check whether the model has a model.confidence_abstention(...) method (or even returns a boolean abstention score).
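Something along these lines, for example (a sketch of the suggestion; confidence_abstention is the hypothetical method name mentioned above, not an existing mteb API):

```python
def get_confidence_fn(model):
    """Hypothetical helper: use the model's own confidence function if it
    defines one, otherwise fall back to a sensible default heuristic."""
    if hasattr(model, "confidence_abstention"):
        return model.confidence_abstention   # model-provided confidence
    return lambda scores: max(scores)        # default: max-score heuristic
```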

category="s2p",
eval_splits=["test"],
eval_langs=["fra-Latn"],
main_score="map",

Suggested change
main_score="map",
main_score="map",

Unsure if you are proposing map as the main score (given the discussion around AUC)

@hgissbkh
Contributor

hgissbkh commented May 29, 2024

Hi @KennethEnevoldsen and many thanks for your remarks!
We have incorporated abstention as an evaluation metric rather than as a task in this new PR: #841. For the moment, we have implemented abstention metrics for Retrieval and Reranking only but they could also be relevant for all classification tasks.
We would be very happy to hear your opinion on this new implementation!

@KennethEnevoldsen
Contributor

Wonderful @hgissbkh, I will take a look. Will close this one as well.
