Add Abstention tasks #717
Tagging @imenelydiaker for the review, don't hesitate in case you have any doubts or questions!
This is very interesting, thank you! I'll just leave one comment before reading the paper and try to review everything tomorrow.
[Didn't review the code yet, just read the paper; very interesting btw.] So if I get this right, for a search query: first a retrieval step is performed, then you get similarity scores, then you assign a confidence score to them using a function like the max or std; then, if your confidence is higher than a threshold, you add the docs to the reranker pipeline, else you discard them. Is this right? Some questions:
Really interesting task @hgissbkh! I am still a little confused on the task definition, so I will explain what I understand:
I guess my questions are:
I've typically seen this abstention done as a 2nd-stage problem after the initial retriever is used, so it's taking me a bit to adjust to how this could be used for a first-stage model.
Hi @orionw and thanks for your comment! The question we are trying to address with abstention is not to evaluate confidence at the document level but at the whole instance level. Let’s take your example again:
Now, regarding your questions more specifically:
Good question! No, we do not need any additional annotations for abstention evaluation. We evaluate abstention quality by computing a normalized area under the metric-abstention curve. Let me explain:
The heuristics we use are deliberately general and can be implemented on any model without any need for adjustment. The three confidence functions we use are the max, the std, and S_1 - S_2 (highest score minus second-highest score). These are common functions used for confidence estimation in the classification setting (Narayanan et al., 2012; Hendrycks et al., 2016; Pang et al., 2021; Gawlikowski et al., 2023). In our paper (Gisserot-Boukhlef et al., 2024), we show that those heuristics transfer well to the document ranking setting.
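For concreteness, here is a minimal sketch of the three heuristics described above (max, std, and S_1 - S_2) in plain Python. The function name and the std convention (population std, matching the snippet reviewed below) are my assumptions, not the PR's actual code:

```python
import statistics

def confidence_scores(scores: list[float]) -> dict[str, float]:
    """Black-box confidence heuristics over one query's retrieval scores.
    Sketch only; names and conventions are illustrative."""
    ranked = sorted(scores, reverse=True)
    return {
        "max": ranked[0],                  # score of the best document
        "std": statistics.pstdev(scores),  # spread of all document scores
        "P1P2": ranked[0] - ranked[1],     # best minus second-best score
    }

# Example with the scores used in the discussion below
confidence_scores([0.9, 0.8, 0.7, 0.3, 0.2])
```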
I guess my response to question 1 answers this one too! :) Hope this clarifies a bit, and of course don't hesitate if you have additional questions!
Hello,
Thanks for explaining @hgissbkh and for the ping @ManuelFay! This is a novel idea and I'm still trying to parse the details of your statement. Perhaps I'll have to give the paper a full read to understand. I'm still a little worried about the automatic evaluation without gold data, as each embedding model will have a different calibration, but I also admit I don't fully understand the example you gave. If @imenelydiaker understands, she can continue with the review; otherwise I'll need to take some time to work through an example. Thanks, and sorry for the delay.
To be clear, there is gold data! It corresponds to the original dataset's gold data. In other words, this is not a reference-free task! We are using the gold (query, document) pairs underlying the retrieval and reranking scores to look at the performance delta at different abstention ratios.
@orionw did you have a chance to look over an example? (I am going over PRs to get all datasets in) |
Hey @orionw, I suggest we look at a concrete example! Let's assume we have a query Q and want to retrieve the top-5 documents using a retrieval system R. We retrieve those 5 documents, say with the following scores: 0.9, 0.8, 0.7, 0.3, 0.2. If we choose the max as the confidence function (Hendrycks et al., 2016), we get a confidence score of 0.9 (the maximum of the 5 retrieved documents' scores). Assume also that we have previously chosen a confidence threshold of 0.7. As 0.9 > 0.7, we decide to return the 5 retrieved documents. On the contrary, if we had set a confidence threshold of 0.95, we would have discarded the instance and returned an abstention message to the user. If we extend this rationale to the whole dataset, we see that the number of abstained instances changes depending on the threshold we choose. And that's how we construct the AUC!
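The worked example above can be sketched in a few lines (function and variable names are illustrative, not from the PR):

```python
def should_answer(scores, threshold, conf_fn=max):
    """Return True to answer with the retrieved list, False to abstain.
    conf_fn is a black-box confidence heuristic, max by default."""
    return conf_fn(scores) > threshold

scores = [0.9, 0.8, 0.7, 0.3, 0.2]
should_answer(scores, 0.70)  # max = 0.9 > 0.70 -> True, return the documents
should_answer(scores, 0.95)  # max = 0.9 < 0.95 -> False, abstain
```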
Thanks @hgissbkh, that is very helpful and I definitely see the appeal of the task. A couple of follow-up questions: it seems like this is reducing the length of the returned list (since the least confident entries will be at the bottom). So in essence, this is a metric computing the area under the nDCG@k score curve at various confidence scores -- is that correct? Does that metric (nDCG@5, say) stay fixed, or is it nDCG@(length of the list)? Looking at the paper it seems fixed? So it's like taking the nDCG@5 score (or mAP score, etc.) at various score thresholds.

I'm also trying to think about the edge cases here: if we have a retrieval system whose minimum score is 0.9, the nDCG score will be the same until the last confidence values, correct? This seems like we're asking the models to have absolute scores (between 0 and 1), whereas perhaps a model does well but uses a relative range (0.9-1, for example). Is there a reason to prefer the one with the broader range?

I would also be interested in hearing @imenelydiaker's or @KennethEnevoldsen's thoughts on this PR, so we could get a perspective other than mine. This is a unique approach to using the retrieval scores, which is super neat!

Separately, from a code perspective: could we make it so that this abstract task can wrap other tasks? It seems like this could take in a RetrievalTask (or RerankingTask), do the normal retrieval/reranking, get the run file, and then perform the AbstentionTask on top of it. If this understanding is correct, we wouldn't need a separate AbstentionTask for each normal task and could instead re-use every existing retrieval and reranking task with it, which would be really nice. I could be missing the reason why you'd do it differently, so please let me know!
Hey @hgissbkh and @ManuelFay, thank you for the detailed explanations. If abstention becomes a standard in RAG systems, for example, this task would be very helpful, I suppose. Currently, maybe I am failing to see something obvious:
Help me understand if I'm not getting this right, please. According to my understanding, abstention scores will depend on the initial retrieval task's scores and a fixed threshold. So there is a correlation between an embedder's performance on the initial task and its abstention scores: an embedder that is good at retrieval will be good at abstention. Is this correct? Can we say the same about other evaluation tasks of MTEB? E.g., is there a correlation between classification and STS tasks, or any other tasks?
> Is it correct to have a fixed threshold for all models? As @orionw mentioned, some models may use relative score ranges.

No, there is no fixed threshold; we are computing an "area under the curve" score. Basically, you vary the threshold and see how much better the model does than a baseline random abstention policy. This is a classic metric for classification tasks requiring a threshold.

> If a model gets a higher abstention score than another one, what would this mean?

It would mean it is better calibrated. A "calibrated" retrieval model does not only need to rank documents in the correct order, but also to output score magnitudes that are coherent. This is not captured by ranking metrics at the moment! Typically, we would like the "best" document for a given query to have a low score if it is not that relevant, and vice versa.

> So there is a correlation between an embedder's performance on the initial task and abstention scores; an embedder that is good at retrieval will be good at abstention. Is this correct?

We have found this to be mostly the case, but it is not true by design. Again, since this is essentially a measure of calibration, a better-calibrated model might not necessarily be a better retrieval model; this will depend on the loss function, regularization, retriever architecture, etc. However, a very bad retrieval model will have a hard time being correctly calibrated, since it's inherently bad.

> Can we say the same about other evaluation tasks of MTEB? E.g., is there a correlation between classification and STS tasks or any other tasks?

There definitely is a strong correlation between some tasks in MTEB.

To sum it up, what we are proposing here is a tractable way of measuring retriever calibration in a practical and useful setting. Having said that, from the discussion in this issue, I feel the task is still explained a bit unclearly, and we were probably very bad at explaining the core concepts!

Thanks a ton for taking the time, it is super valuable feedback in all cases!
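For readers following along, here is one way the threshold-sweeping AUC discussed above could be computed. This is a sketch under my own assumptions (trapezoidal area over abstention rates, with a random-abstention baseline subtracted); the exact normalization in the PR may differ:

```python
def abstention_auc(conf, metric):
    """Area under the metric-vs-abstention-rate curve, minus the random
    baseline. Positive values mean abstaining on low-confidence queries
    helps more than abstaining at random (i.e. better calibration).

    conf:   per-query confidence scores (lowest abstained first)
    metric: per-query quality scores (e.g. nDCG@10) on the full dataset
    """
    n = len(conf)
    # sort per-query metrics by confidence, least confident first
    m = [x for _, x in sorted(zip(conf, metric))]
    # mean metric over queries kept when abstaining on the k least confident
    kept = [sum(m[k:]) / (n - k) for k in range(n)]
    rates = [k / n for k in range(n)]
    # trapezoidal area under the (abstention rate, kept-metric) curve
    auc = sum(
        (kept[i] + kept[i + 1]) / 2 * (rates[i + 1] - rates[i])
        for i in range(n - 1)
    )
    baseline = (sum(m) / n) * rates[-1]  # random abstention keeps the mean flat
    return auc - baseline
```

With confidence aligned to quality the result is positive; with confidence anti-correlated to quality it is negative, matching the intuition that a well-calibrated model gains from abstention.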
@orionw, I hadn't seen your comment about code. In essence, the abstention task is a wrapper around other tasks (reranking, retrieval). It inherits all methods and properties of the underlying tasks; essentially, the only thing that changes is the associated scorer class. Since all tasks have their own custom input formats and methods, the wrapper just routes to the correct underlying task, retrieves the scores, and computes the abstention scores. I decided to make abstention a separate task in order to contain all modifications to a single directory and avoid big breaking changes in the evaluation methods of the other tasks. I put some thought into it and figured it should be treated as a task of its own to guarantee consistency and allow for granular code modifications, rather than adding another set of metrics to the other tasks. I also feel it makes the task more readable.
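A minimal sketch of the routing design described here; the class and method names are illustrative, not the actual mteb API:

```python
class AbstentionWrapper:
    """Run the wrapped retrieval/reranking task as usual, then compute
    abstention metrics on its raw per-query scores. Sketch only."""

    def __init__(self, inner_task, scorer):
        self.inner_task = inner_task  # any object with an .evaluate(model) method
        self.scorer = scorer          # turns the inner run into abstention metrics

    def evaluate(self, model):
        run = self.inner_task.evaluate(model)  # normal retrieval/reranking run
        return self.scorer(run)                # abstention scores on top
```

The wrapper itself stays tiny because everything task-specific lives in the inner task; only the scorer changes.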
Hi @orionw, regarding your follow-up questions:

- So in essence, this is a metric computing the area under the nDCG@k score curve at various confidence scores -- is that correct?
- Does that metric (nDCG@5, say) stay fixed, or is it nDCG@(length of the list)?
- If we have a retrieval system whose minimum score is 0.9, the nDCG score will be the same until the last confidence values, correct?
- This seems like we're asking the models to have absolute scores (between 0 and 1), whereas perhaps the model does well but uses a relative range (0.9-1, for example).
@ManuelFay, thanks, that makes sense. I think my suggestion would be to make the task classes themselves simpler, since we're just wrapping the other ones -- like if there was any way to wrap

@hgissbkh thanks for the detailed response.
This was basically the question I had at the end, how do we determine these thresholds? It seems tricky because on the one hand if you use a standard range (0-1) there may be some thresholds where nothing is removed and the scores penalize models which have smaller ranges (like a model that predicts between 0.9-1 always). On the other hand, if you dynamically calculate it (e.g. 0.91, 0.92, ... 0.99) it doesn't seem like it measures confidence but rather just the normal ranking stat (e.g. a dynamic nDCG@k curve where you take every value of k and calculate the AUC). Of these, the former seems more suited to calibration even if it ends up punishing the model for not fully using the range of similarity scores. I could still be misunderstanding, so thanks for your patience :)
Aren't metrics like nDCG@5 ill-suited for this, as the top five documents won't change until you remove nearly all the documents?
Hi @orionw,
Some general thoughts:
I generally find the task both promising and meaningful. It follows a meaningful trend where we are not only interested in performing well but also in failing gracefully if either the answer doesn't exist or the model is unable to find it.
> While still new (and so not necessarily a task to add to the leaderboard right away), we feel having the codebase in MTEB is a nice first step. In this PR, we add the abstention task for 13 datasets spanning retrieval and reranking across 4 languages!
I very much agree with this point: at least for now, it seems meaningful to add the task without necessarily adding it to a benchmark as the first thing.
Reading through the very detailed comments above (a big thanks to everyone who spent the time on these):
- There are some concerns about the use of AUC. I believe it is reasonable in this case, assuming we allow models to produce their own confidence scores (see comment below) and have a reasonable default.
- There seems to be some concern about the task wrapping, which I can reasonably understand: essentially, we are duplicating a task. Am I wrong to argue that we could extend the existing retrieval and reranking tasks to also measure abstention? If so, I would argue that it might be a more promising avenue. This would avoid duplicates in the benchmark while introducing what I believe to be a good measure to a wider set of tasks.
I have some comments on the code as well (but I have only added those which are important to whether or not the task should be added); if we decide it should be added, I will give a new review.
conf_scores[i] = {
    "max": pred_scores_sort[0],
    "std": (
        sum((sc - pred_scores_mean) ** 2 for sc in pred_scores)
        / len(pred_scores)
    )
    ** (1 / 2),
    "P1P2": pred_scores_sort[0] - pred_scores_sort[1],
}
I would allow the model to choose its own confidence score, but with a meaningful default. An option would be to check if the model has a model.confidence_abstention(...) method (or even returns a boolean abstention score).
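This suggestion could look roughly like the sketch below; `confidence_abstention` is the hypothetical hook name from the review, and the default falls back to the max heuristic:

```python
def query_confidence(model, scores):
    """Use the model's own confidence function if it defines one,
    with the max retrieval score as a meaningful default.
    `confidence_abstention` is a hypothetical hook name."""
    if hasattr(model, "confidence_abstention"):
        return model.confidence_abstention(scores)
    return max(scores)  # default black-box heuristic
```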
category="s2p", | ||
eval_splits=["test"], | ||
eval_langs=["fra-Latn"], | ||
main_score="map", |
Unsure if you are proposing map as the main score (given the discussion around AUC)
Hi @KennethEnevoldsen and many thanks for your remarks!
Wonderful @hgissbkh I will take a look - Will close this here as well |
Hello MTEB team !
In this PR, jointly with @hgissbkh, we propose the introduction of a new task to the MTEB leaderboard: **Abstention**.

The motivation is quite simple. Over the last few years, neural retrieval has significantly improved upon heuristic-based IR systems. Yet no model is perfect, and models are often unable to retrieve documents relevant to the user's query. This indicates that abstention mechanisms would go a long way toward improving the usability of such models, increasing confidence in retrieved results and letting users control the recall/precision tradeoff!
Going further, "simple" black-box abstention mechanisms (based on the class logit distribution) have been shown to produce strong abstention baselines in classification tasks. In a recent work (https://arxiv.org/abs/2402.12997), we extend this to embedding models and show that, using simple abstention heuristics common in the abstention literature (score of the best retrieved passage, score difference between the best and second-best retrieved passages, etc.), we are able to get non-trivial abstention mechanisms. In our work (soon to be published at TMLR), we go further, proposing more complex calibration-based abstention mechanisms, but we feel MTEB is made to evaluate models, not mechanisms. As such, in this PR, we only implement 3 simple abstention heuristics that work out-of-the-box on retrieval and reranking tasks.
What we aim to assess with this PR is whether retrieval models (bi-encoder embedding models here) yield sufficiently calibrated (query, document) scores to not only rank the documents and surface the best ones (the retrieval task), but also, given a ranking and the associated document scores, compute a "confidence score" that enables abstention when it is too low. Better-calibrated embedding models will perform better, and we feel this task opens up new and very interesting properties of embedding models!
While still new (and so not necessarily a task to add to the leaderboard right away), we feel having the codebase in MTEB is a nice first step. In this PR, we add the abstention task for 13 datasets spanning retrieval and reranking across 4 languages !
I'll further let @hgissbkh describe the metrics used in more detail !
- I have tested that the dataset runs with the `mteb` package.
- I have run the task with the `mteb run -m {model_name} -t {task_name}` command (using `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`).
- I have run the tests locally using `make test`.
- I have run the formatter using `make lint`.
- I have added points for my submission to the points folder (`438.jsonl`).