
Add Abstention tasks #717

Closed

Conversation

ManuelFay
Contributor

@ManuelFay ManuelFay commented May 14, 2024

Hello MTEB team !
In this PR, jointly with @hgissbkh , we propose the introduction of a new task to the MTEB leaderboard: Abstention.

The motivation is quite simple. Over the last few years, Neural Retrieval has significantly improved upon heuristic-based IR systems. Yet no model is perfect, and models are often unable to retrieve documents relevant to the user's query. Abstention mechanisms would therefore go a long way toward improving the usability of such models, increasing confidence in retrieved results and giving users some control over the recall/precision tradeoff!

Going further, "simple" black-box abstention mechanisms (based on class logit distributions) have been shown to produce strong abstention baselines in classification tasks. In a recent work (https://arxiv.org/abs/2402.12997), we extend this to embedding models and show that simple abstention heuristics common in the abstention literature (the score of the best retrieved passage, the score difference between the best and second-best retrieved passages, etc.) yield non-trivial abstention mechanisms. In our work (soon to be published at TMLR), we go further and propose more complex calibration-based abstention mechanisms, but we feel MTEB is meant to evaluate models, not mechanisms. As such, in this PR, we only implement 3 simple abstention heuristics that work out of the box on retrieval and reranking tasks.

What we aim to assess with this PR is whether or not retrieval models (bi-encoder embedding models here) yield sufficiently calibrated (query, document) scores to not only rank the documents and surface the best ones (the retrieval task), but also, given a ranking and the associated document scores, compute a "confidence score" that enables abstention when it is too low. Better-calibrated embedding models will perform better, and we feel this task opens up new and very interesting properties of embedding models!

While still new (and so not necessarily a task to add to the leaderboard right away), we feel having the codebase in MTEB is a nice first step. In this PR, we add the abstention task for 13 datasets spanning retrieval and reranking across 4 languages !

I'll further let @hgissbkh describe the metrics used in more detail !

  • I have tested that the task runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

@ManuelFay
Contributor Author

Tagging @imenelydiaker for the review, don't hesitate in case you have any doubts or questions !
Cheers !

@imenelydiaker
Contributor

Tagging @imenelydiaker for the review, don't hesitate in case you have any doubts or questions ! Cheers !

This is very interesting thank you! I'll just leave one comment before reading the paper and try to review everything tomorrow.
Also adding other reviewers that may be interested @KennethEnevoldsen @Muennighoff @orionw

@imenelydiaker
Contributor

imenelydiaker commented May 14, 2024

[Didn't review code yet, just read the paper, very interesting btw]

So if I get this right, for a search query: first a retrieval step is performed, then you get similarity scores and assign confidence scores to them using a function like the max or std; then if your confidence is higher than a threshold you add the doc to the reranker pipeline, else you discard it. Is this right?

Some questions:

  1. Is the main idea about discarding uncertain documents in a search/IR pipeline?
  2. How does the linear model that assesses confidence scores generalize to other datasets? Or is it trained specifically for each dataset?
  3. From my understanding this works in a pipeline combining retrieval and reranking, so when applied to Retrieval only, what should happen after assessing confidence scores? Does it just discard less relevant documents? Or maybe I'm missing something here.

@hgissbkh
Contributor

hgissbkh commented May 15, 2024

Hello Imene and thanks Manu for the explanations!
Here are some further details regarding your questions.

  1. Is the main idea about discarding uncertain documents in a search/IR pipeline?

Not exactly: the idea is about discarding a whole instance, i.e., a query and its set of candidate documents. Assume that for a given query we have K candidate documents; first we compute the K query-document relevance scores (e.g., cosine similarity on embeddings) and then build our confidence function on them (the simplest approach is to take the max). I have attached a diagram to make this clearer.

[Screenshot: diagram illustrating instance-level abstention]
  2. How does the linear model that assesses confidence scores generalize to other datasets? Or is it trained specifically for each dataset?

In our paper, the linear confidence function is trained in-domain, i.e., for each specific dataset. However, the confidence functions we propose here are simple heuristics that don’t need to be trained (maximum score, standard deviation, highest score minus second highest score). For example, taking the maximum score has been proven to be a good proxy for how much a model is confident on a given classification instance (Hendrycks et al., 2016). We show in our paper that this observation transfers well to document ranking in general.
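In code, these heuristics boil down to something like the following minimal sketch (plain NumPy, reusing the "max" / "std" / "P1P2" names from this PR; the actual implementation may differ in its details):

```python
import numpy as np

def confidence_scores(scores):
    """Simple, training-free confidence heuristics computed from the K
    query-document relevance scores of a single instance."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]  # decreasing order
    return {
        "max": float(s[0]),          # highest relevance score (Hendrycks et al., 2016)
        "std": float(np.std(s)),     # spread of the scores (population std)
        "P1P2": float(s[0] - s[1]),  # margin between the top-1 and top-2 scores
    }

# e.g. five query-document cosine similarities
print(confidence_scores([0.9, 0.8, 0.7, 0.3, 0.2]))
# {'max': 0.9, 'std': ~0.28, 'P1P2': ~0.1}
```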

  3. From my understanding this works in a pipeline combining retrieval and reranking, so when applied to Retrieval only, what should happen after assessing confidence scores? Does it just discard less relevant documents? Or maybe I'm missing something here.

The code implementation we propose is built to work when performing retrieval and reranking independently, not necessarily in a pipeline. In a practical setting in which we have one query and K documents, we can compute K similarity scores. We can then compute the maximum score and use it as a confidence estimator. If it is below a certain threshold (the higher the threshold, the more we abstain), then the whole instance is discarded, i.e., the query and the K documents.

Hope this helps!
Thanks again for your questions and don’t hesitate if you have any additional doubts!

@orionw
Contributor

orionw commented May 15, 2024

In our paper, the linear confidence function is trained in-domain, i.e., for each specific dataset. However, the confidence functions we propose here are simple heuristics that don’t need to be trained (maximum score, standard deviation, highest score minus second highest score). For example, taking the maximum score has been proven to be a good proxy for how much a model is confident on a given classification instance (Hendrycks et al., 2016). We show in our paper that this observation transfers well to document ranking in general.

Really interesting task @hgissbkh!

I am still a little confused on the task definition, so I will explain what I understand:

  • The model does standard search and returns K documents with scores S_1, ..., S_K
  • Some heuristic is applied on top of that to remove any scores less than a standard deviation away (or perhaps a number threshold)
  • We then have some number of documents left that passed this heuristic D, where |D| <= |K|

I guess my questions are:

  1. How are you defining ground truth here? Are there annotations for what queries should be abstained from and which ones should return documents?
  2. Can new models implement their own functions for abstention or do they use these heuristics the same? I think this may get rather complicated if models need to define a new function for calibration along with retrieval
  3. Is there an evaluation of calibration in the metrics or just the documents that are returned?

I've typically seen this abstention done as a 2nd stage problem after the initial retriever is used, so it's taking me a bit to adjust to how this could be used for a first stage model.

@hgissbkh
Contributor

hgissbkh commented May 15, 2024

Hi @orionw and thanks for your comment!
I will try to clarify things a bit.

The question we are trying to address with abstention is not to evaluate confidence at the document level but at the whole instance level. Let’s take your example again:

  • The model does standard search and returns the K most-relevant documents D_1, …, D_K with respect to query Q. Those documents have scores S_1 > … > S_K. Note that what I call an instance here is the tuple (Q, D_1, …, D_K) (the query and the K retrieved documents).
  • The confidence heuristic is then applied on the tuple of scores (S_1, …, S_K), not to each of them separately. For instance, we can take the maximum score as a confidence estimator (Hendrycks et al., 2016): c(S_1, …, S_K) = max(S_1, …, S_K).
  • Then, we decide whether to return the retrieved documents or not, depending on whether c(S_1, …, S_K) is greater than a given threshold that controls the abstention probability (the higher the threshold, the more we abstain overall). Put more formally, say a retriever is a function R that takes as input a query Q, a document database DD, and a number of documents to return K, and that returns the top-K documents: R(Q, DD, K) = (D_1, …, D_K). Then, if c(S_1, …, S_K) < tau, R_abst(Q, DD, K) = "I'm not sure", otherwise R_abst(Q, DD, K) = (D_1, …, D_K) (see the small sketch right after this list).
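As a tiny sketch of this decision rule (hypothetical function names, not the PR's actual API):

```python
def retrieve_with_abstention(retrieve, query, doc_db, k, confidence, tau):
    """Hypothetical sketch of R_abst: run retrieval, then abstain if the
    confidence computed on the K scores falls below the threshold tau."""
    docs, scores = retrieve(query, doc_db, k)  # R(Q, DD, K) -> top-K docs and scores
    if confidence(scores) < tau:
        return None                            # abstain: "I'm not sure"
    return docs                                # confident enough: return the ranking
```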

Now, regarding your questions more specifically:

  1. How are you defining ground truth here? Are there annotations for what queries should be abstained from and which ones should return documents?

Good question! No, we do not need any additional annotations for abstention evaluation. We evaluate abstention quality by computing a normalized area under the metric-abstention curve. Let me explain:
Let’s take a test dataset T consisting of N instances (Q_1, D_1,1, …, D_1,K, Y_1), …, (Q_N, D_N,1, …, D_N,K, Y_N). In the vanilla case, we would simply evaluate the retrieval system by assessing the relevance of the K returned documents with respect to ground truth Y, for instance using NDCG@K. After doing this for each of the N instances, we can finally compute an average NDCG@K on T.
Now, let’s assume we want to keep only the instances for which the confidence function is above a certain threshold tau. We would get a new dataset T_tau included in T (|T_tau| ≤ |T|) for which we could also compute the average NDCG@K (hopefully greater than without abstention). Doing this for increasing values of tau, we would get an increasing curve. We finally compute the area under the curve to assess overall abstention quality.
To get further details, you can have a look at Figure 2 and Section 4.3 of our paper (https://arxiv.org/abs/2402.12997).
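As a minimal sketch of this computation, assuming the per-instance metric values (e.g., NDCG@K) and per-instance confidence scores have already been computed (the actual MTEB metric is a normalized version of this area, which the sketch omits):

```python
import numpy as np

def abstention_auc(metric_per_instance, confidence_per_instance):
    """Sketch: area under the metric-vs-abstention-rate curve.

    For each abstention rate r = k/N, drop the k least-confident instances,
    average the metric over the remaining ones, then integrate over r.
    """
    metric = np.asarray(metric_per_instance, dtype=float)
    conf = np.asarray(confidence_per_instance, dtype=float)
    metric = metric[np.argsort(conf)]  # least-confident instances first

    n = len(metric)
    rates = np.array([k / n for k in range(n)])               # abstention rates 0, 1/n, ...
    values = np.array([metric[k:].mean() for k in range(n)])  # mean metric on kept instances

    # Trapezoidal rule over the (abstention rate, metric) curve.
    return float(np.sum(np.diff(rates) * (values[1:] + values[:-1]) / 2.0))
```

For instance, with per-instance NDCGs [0.63, 0.92, 1.0] and max-score confidences [0.5, 0.7, 0.9] (the toy example discussed later in this thread), the curve passes through (0, 0.85), (1/3, 0.96), (2/3, 1.0).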

  2. Can new models implement their own functions for abstention or do they use these heuristics the same? I think this may get rather complicated if models need to define a new function for calibration along with retrieval

The heuristics we use are deliberately general and can be applied to any model without any need for adjustment. The three confidence functions we use are the max, the std, and S_1 - S_2 (highest score minus second-highest score). These are common functions used for confidence estimation in the classification setting (Narayanan et al., 2012; Hendrycks et al., 2016; Pang et al., 2021; Gawlikowski et al., 2023). In our paper (Gisserot-Boukhlef et al., 2024), we show that those heuristics transfer well to the document ranking setting.

  3. Is there an evaluation of calibration in the metrics or just the documents that are returned?

I guess my response to question 1 answers this one too! :)

Hope this clarifies a bit and of course don’t hesitate if you have additional questions!

@ManuelFay
Contributor Author

Hello,
Any thoughts on this @orionw ?

@orionw
Contributor

orionw commented May 17, 2024

Thanks for explaining @hgissbkh and for the ping @ManuelFay! This is a novel idea and I'm still trying to parse the details of your statement. Perhaps I'll have to give the paper a full read to understand.

I'm still a little worried about the automatic evaluation without gold data, as each embedding model will have a different calibration, but I also admit I don't fully understand the example you gave. If @imenelydiaker understands, she can continue with the review; otherwise I'll need to take some time to work through an example.

Thanks and sorry for the delay.

@ManuelFay
Contributor Author

To be clear, there is gold data! It corresponds to the original dataset's gold data.
Basically, given a retriever, we compute the (query, document) pair scores (as in retrieval).
Given these scores, we then output a scalar "confidence value" (using simple techniques).
This confidence value enables us to compute AUC scores that assess whether abstention improves the results!
Basically, if abstaining on the 20% of queries with the least "confident" scores improves MAP values, the AUC score will be > 0.

In conclusion, this is not a reference-free task! We are using the gold (query, document) pairs used in retrieval and reranking to look at the performance delta at different abstention ratios.

@KennethEnevoldsen
Contributor

@orionw did you have a chance to look over an example? (I am going over PRs to get all datasets in)

@hgissbkh
Contributor

Hey @orionw, I suggest we look at a concrete example!

Let’s assume we have a query Q and want to retrieve the top-5 documents using a retrieval system R. We retrieve those 5 documents, let’s say with the following scores: 0.9, 0.8, 0.7, 0.3, 0.2. If we choose the max as the confidence function (Hendrycks et al., 2016), then we get a confidence score of 0.9 (maximum of the 5 retrieved documents’ scores). Assume also that I have previously chosen a confidence threshold of 0.7. As 0.9 > 0.7, we decide to return the 5 retrieved documents. On the contrary, if we had set a confidence threshold of 0.95, we would have discarded the instance and returned an abstention message to the user.

If we extend this rationale to the whole dataset, we see that the number of abstained instances changes depending on the threshold we choose. And that’s how we construct AUC!

  • Assume we have the following test dataset, consisting of 3 instances (each instance consists of a query, the 5 retrieved documents’ scores, and the ground truth, where 1 represents a relevant document and 0 an irrelevant one):
    {(Q_1, [0.5, 0.4, 0.3, 0.2, 0.1], [0, 1, 0, 0, 0])
    (Q_2, [0.7, 0.5, 0.3, 0.2, 0.2], [1, 0, 1, 0, 0])
    (Q_3, [0.9, 0.8, 0.3, 0.3, 0.2], [1, 1, 0, 0, 0])}
  • We can first evaluate the 3 instances using NDCG@5: we get 0.63, 0.92, 1 respectively.
  • Then, let’s make the confidence threshold vary. If we set it equal to 0.4, no instance is discarded and the test NDCG is equal to (0.63 + 0.92 + 1) / 3 = 0.85. If we now set it to 0.6, the first instance is discarded (max score equal to 0.5) and the test NDCG becomes (0.92 + 1) / 2 = 0.96. Finally, setting the threshold to 0.8, the first two instances are discarded and the test NDCG becomes 1.
  • We get an increasing NDCG-abstention curve: (0%, 0.85), (33%, 0.96), (66%, 1).
  • We finally compute the area under this NDCG-abstention curve to evaluate abstention on the test dataset (a small sketch reproducing these numbers follows right after this list).
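For concreteness, here is a minimal sketch reproducing the numbers above (it relies on scikit-learn's ndcg_score for the per-instance metric; this is illustrative code, not the PR's implementation):

```python
import numpy as np
from sklearn.metrics import ndcg_score

# The three toy instances: (retrieved scores, ground-truth relevance).
instances = [
    ([0.5, 0.4, 0.3, 0.2, 0.1], [0, 1, 0, 0, 0]),
    ([0.7, 0.5, 0.3, 0.2, 0.2], [1, 0, 1, 0, 0]),
    ([0.9, 0.8, 0.3, 0.3, 0.2], [1, 1, 0, 0, 0]),
]

# Per-instance NDCG@5 and max-score confidence.
ndcgs = np.array([ndcg_score([rel], [scores], k=5) for scores, rel in instances])
confs = np.array([max(scores) for scores, _ in instances])
print(ndcgs.round(2))  # approx. [0.63, 0.92, 1.0]

# Average NDCG@5 over the instances kept at increasing confidence thresholds.
for tau in [0.4, 0.6, 0.8]:
    kept = confs >= tau
    print(tau, round(ndcgs[kept].mean(), 2))
# 0.4 -> 0.85 (no abstention), 0.6 -> 0.96, 0.8 -> 1.0
```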

@orionw
Contributor

orionw commented May 21, 2024

Thanks @hgissbkh, that is very helpful and I definitely see the appeal of the task.

A couple of follow-up questions: it seems like this is reducing the length of the returned list (since the least confident entries will be at the bottom). So in essence, this is a metric computing the area under the nDCG@k score curve at various confidence scores -- is that correct?

Does that metric (nDCG@5 say) stay fixed or is it nDCG@length of the list? Looking at the paper it seems fixed? So it's like taking the nDCG@5 score (or mAP score, etc.) at various score thresholds.

I'm also trying to think about the edge cases here: if we have a retrieval system whose minimum score is 0.9, the nDCG score will be the same until the last confidence values, correct?

This seems like we're asking the models to have absolute scores (between 0-1) whereas perhaps the model does well but uses a relative range (0.9-1 for example). Is there a reason to prefer the one with the broader range?

I would also be interested in hearing @imenelydiaker or @KennethEnevoldsen's thoughts on this PR so we could get a perspective other than mine. This is a unique approach to using the retrieval scores, which is super neat!


Separately from a code perspective: could we make it so that this abstract task can wrap other tasks?

It seems like this could take in a RetrievalTask (or RerankingTask), do the normal retrieval/reranking, get the run file, and then perform the AbstentionTask on top of it. If this understanding is correct, we wouldn't need a separate AbstentionTask for each normal task and instead could re-use every existing retrieval and re-ranking task with it, which would be really nice.

I could be missing the reason why you'd do it differently, so please let me know!

@imenelydiaker
Contributor

imenelydiaker commented May 21, 2024

Hey @hgissbkh and @ManuelFay, thank you for the detailed explanations. If abstention becomes a standard in RAG systems for example, this task would be very helpful I suppose.

Currently, I may be failing to see something obvious:

  • Is it correct to have a fixed threshold for all models? As @orionw mentioned, some models may use relative score ranges.
  • If a model gets a higher abstention score than another one, what would this mean?

Help me understand if I'm not getting this right, please:
The way I see abstention is as a top layer after retrieval (or any other task), so what I'm failing to see is how it is an embedding evaluation task. It relies mostly on the scores of the task it is applied to (retrieval in your example).

According to my understanding, abstention scores will depend on the initial retrieval task scores and a fixed threshold. So there is a correlation between an embedder's performance on the initial task and its abstention score: an embedder that is good at retrieval will be good at abstention. Is this correct?

Can we say the same about other evaluation tasks of MTEB? e.g., is there a correlation between classification and STS tasks or any other tasks?

@ManuelFay
Contributor Author

Is it correct to have a fixed threshold for all models? As @orionw mentioned, some models may use relative score ranges.

---> No, there is no fixed threshold; we are computing an "area under the curve" score. Basically, you vary the threshold and see how much better the result is than a baseline random policy. This is a classic metric for classification tasks requiring a threshold.
[Image: example ROC curve]

If a model gets a higher abstention score than another one, what would this mean?

It would mean it is better calibrated. A "calibrated" retrieval model does not only need to rank documents in the correct order, but also to output score magnitudes that are coherent. This property is not captured by ranking metrics at the moment! Typically, we would like the "best" document for a given query to have a low score if it is not actually that relevant, and vice versa.

So there is a correlation between an embedder's performance on the initial task and its abstention score: an embedder that is good at retrieval will be good at abstention. Is this correct?

We have found this to be mostly the case, but it is not true by design. Again, since this is essentially a measure of calibration, a better-calibrated model might not necessarily be a better retrieval model; this will depend on the loss function, regularization, retriever architecture, etc. However, a very bad retrieval model will have a hard time being correctly calibrated, since it is inherently bad.

Can we say the same about other evaluation tasks of MTEB? e.g., is there a correlation between classification and STS tasks or any other tasks?

There definitely is a strong correlation between some tasks in MTEB.

To sum it up, what we are proposing here is a tractable way of measuring retriever calibration in a practical and useful setting.

Having said that, from the discussions in this PR, I feel the task is still explained a bit unclearly and we were probably not great at conveying the core concepts! Thanks a ton for taking the time; it is super valuable feedback in any case!

@ManuelFay
Contributor Author

@orionw I hadn't seen your comment about the code.

In essence, the abstention task is a wrapper around other tasks (reranking, retrieval). It inherits all methods and properties of the underlying tasks; essentially the only thing that changes is the associated scorer class. Since all tasks have their custom input format and methods, the wrapper just routes to the correct underlying task, retrieves the scores, and computes the abstention scores.

I decided to make abstention a separate task in order to contain all modifications to a single directory and avoid having to make big breaking changes in the evaluation methods of the other tasks. I put some thought into it and figured it should be treated as a task on its own to guarantee consistency and allow for granular code modifications, rather than adding another set of metrics to the other tasks. I also feel it makes the task more readable.
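To illustrate the routing idea only (hypothetical class and method names, not the actual code in this PR), the wrapper roughly amounts to something like:

```python
class AbstentionTaskWrapper:
    """Hypothetical sketch: wrap an existing Retrieval or Reranking task, reuse
    its data loading and metadata, and only swap in an abstention-aware scorer."""

    def __init__(self, base_task, abstention_evaluator_cls):
        self.base_task = base_task                            # e.g. an existing Retrieval task
        self.abstention_evaluator_cls = abstention_evaluator_cls

    def __getattr__(self, name):
        # Route data loading, metadata, splits, etc. to the wrapped task.
        return getattr(self.base_task, name)

    def evaluate(self, model, split="test", **kwargs):
        # Run the underlying task's retrieval/reranking, then score abstention
        # on top of the resulting (query, document) scores.
        evaluator = self.abstention_evaluator_cls(self.base_task, split=split, **kwargs)
        return evaluator(model)
```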

@hgissbkh
Contributor

hgissbkh commented May 22, 2024

Hi @orionw , regarding your follow-up questions:

-> So in essence, this is a metric computing the area under the nDCG@k score curve at various confidence scores -- is that correct?
At various confidence thresholds I would say, but you got the idea!

-> Does that metric (nDCG@5 say) stay fixed or is it nDCG@length of the list?
Yes it is fixed. First you choose your metric, say NDCG@5, and then you compute AUC based on this metric. But you can of course compute the AUC for any metric you want (NDCG@k, MAP…).

-> If we have a retrieval system whose minimum score is 0.9, the nDCG score will be the same until the last confidence values, correct?
We took care of this edge case in our implementation so that it is flexible to any retrieval system. We first have a look at all the NDCGs in the test set and then select the thresholds adequately. Taking my example from above, if the confidence scores are 0.9, 0.95 and 0.99 instead of 0.5, 0.7 and 0.9, we would simply choose the confidence thresholds differently and take for example 0.8, 0.92 and 0.96.
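One way to picture this threshold selection (an assumption about the mechanism, sketched with simple quantiles; the implementation in the PR may choose thresholds differently):

```python
import numpy as np

def adaptive_thresholds(confidence_scores, n_thresholds=10):
    """Sketch: pick confidence thresholds from the empirical distribution of the
    observed confidence scores, so the abstention rate sweeps from 0 towards 1
    regardless of the model's score range (e.g. 0.5-0.9 vs. 0.9-0.99)."""
    conf = np.asarray(confidence_scores, dtype=float)
    return np.quantile(conf, np.linspace(0.0, 1.0, n_thresholds, endpoint=False))

print(adaptive_thresholds([0.5, 0.7, 0.9], n_thresholds=3))    # approx. [0.5, 0.63, 0.77]
print(adaptive_thresholds([0.9, 0.95, 0.99], n_thresholds=3))  # approx. [0.9, 0.93, 0.96]
```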

-> This seems like we're asking the models to have absolute scores (between 0-1) whereas perhaps the model does well but uses a relative range (0.9-1 for example).
I am not sure I get this one, sorry 😅

@orionw
Contributor

orionw commented May 22, 2024

@ManuelFay, thanks, that makes sense. I think my suggestion would be to make the task classes themselves simpler, since we're just wrapping the other ones -- like if there was any way to wrap GerDaLIRSmall without having to redefine all those fields. That way we don't have the same task repeated twice, if possible.


@hgissbkh thanks for the detailed response.

We first have a look at all the NDCGs in the test set and then select the thresholds adequately.

This was basically the question I had at the end, how do we determine these thresholds?

It seems tricky because on the one hand if you use a standard range (0-1) there may be some thresholds where nothing is removed and the scores penalize models which have smaller ranges (like a model that predicts between 0.9-1 always). On the other hand, if you dynamically calculate it (e.g. 0.91, 0.92, ... 0.99) it doesn't seem like it measures confidence but rather just the normal ranking stat (e.g. a dynamic nDCG@k curve where you take every value of k and calculate the AUC). Of these, the former seems more suited to calibration even if it ends up punishing the model for not fully using the range of similarity scores.

I could still be misunderstanding, so thanks for your patience :)

First you choose your metric, say NDCG@5, and then you compute AUC based on this metric.

Aren't metrics like NDCG@5 ill-suited for this, since the top five documents won't change until you remove nearly all the documents?

@hgissbkh
Contributor

Hi @orionw,
From what I understand, you are arguing that to measure calibration, we should have a fixed range of thresholds that does not vary depending on the domain. By dynamically adjusting our thresholding (basically varying abstention rates from 0 to 1, rather than the threshold from 0 to 1), we intended to obtain a clearer and more meaningful signal, which would often be squashed if computed on the 0-to-1 threshold interval for every domain.
We understand that this is an added complexity that complicates the readability of a leaderboard like MTEB, and although we feel it makes the most sense, it definitely is an arbitrary choice.
Given the discussions in this PR, we do not intend to insist on merging this PR if you all feel it's not the right place for it at the moment.
In any case, thanks for the valuable feedback and the great work!
Cheers,
Hippo and Manu

@KennethEnevoldsen KennethEnevoldsen left a comment

Some general thoughts:

I generally find the task both promising and meaningful. It follows a meaningful trend where we are not only interested in performing well, but also in failing gracefully if either the answer doesn't exist or the model is unable to find it.

While still new (and so not necessarily a task to add to the leaderboard right away), we feel having the codebase in MTEB is a nice first step. In this PR, we add the abstention task for 13 datasets spanning retrieval and reranking across 4 languages !

I very much agree with this point: at least for now it seems meaningful to add the task without necessarily adding it to a benchmark as the first thing.

Reading through the very detailed comments above (a big thanks to everyone who spent the time on these):

  • There are some concerns about the use of AUC. I believe it is reasonable in this case, assuming we allow models to produce their own confidence scores (see comment below) and have a reasonable default.
  • There seems to be some concern about the task wrapping, which I can reasonably understand: essentially we are duplicating a task. Am I wrong to argue that we could extend the existing retrieval and reranking tasks to also measure abstention? If so, I would argue that it might be a more promising avenue. This would avoid duplicates in the benchmark while introducing what I believe to be a good measure to a wider set of tasks.

I have some comments on the code as well (but I have only added those that are relevant to whether or not the task should be added); if we decide it should be added, I will give a new review.

Comment on lines +286 to +294
conf_scores[i] = {
    "max": pred_scores_sort[0],  # highest query-document score
    "std": (
        sum((sc - pred_scores_mean) ** 2 for sc in pred_scores)
        / len(pred_scores)
    )
    ** (1 / 2),  # population standard deviation of the scores
    "P1P2": pred_scores_sort[0] - pred_scores_sort[1],  # margin between top-1 and top-2
}

I would allow the model to choose its own confidence score, but with a meaningful default. An option would be to check whether the model has a model.confidence_abstention(...) method (or even returns a boolean abstention score).
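Something along these lines, for example (a sketch of the suggestion; confidence_abstention is the hypothetical method name mentioned above, not an existing mteb API):

```python
def get_confidence_fn(model):
    """Hypothetical helper: use the model's own confidence function if it
    defines one, otherwise fall back to a sensible default heuristic."""
    if hasattr(model, "confidence_abstention"):
        return model.confidence_abstention   # model-provided confidence
    return lambda scores: max(scores)        # default: max-score heuristic
```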

category="s2p",
eval_splits=["test"],
eval_langs=["fra-Latn"],
main_score="map",

Suggested change
main_score="map",
main_score="map",

Unsure if you are proposing map as the main score (given the discussion around AUC)

@hgissbkh
Contributor

hgissbkh commented May 29, 2024

Hi @KennethEnevoldsen and many thanks for your remarks!
We have incorporated abstention as an evaluation metric rather than as a task in this new PR: #841. For the moment, we have implemented abstention metrics for Retrieval and Reranking only but they could also be relevant for all classification tasks.
We would be very happy to hear your opinion on this new implementation!

@KennethEnevoldsen
Contributor

Wonderful @hgissbkh, I will take a look. Will close this one as well.
