Feature-based stopping criterion (??) #1344
Replies: 9 comments 3 replies
-
From my understanding, this approach can be used for large datasets where the assumption of a clear separation between relevant and irrelevant records holds, as in the van de Schoot 2017 and Bos 2018 datasets. There, the relevant_irrelevant cosine similarity curve (green) lies well below the unlabelled_irrelevant curve (orange), and the green curve drops sharply and then stays at a low cosine similarity value for a significant number of consecutive records.
-
Nice work @rohitgarud! However, the plots are probably hard to read for users without a data science background. Could we leverage this algorithm and transform its output into something human-readable? Furthermore, what does it add compared to "if we have read N consecutive irrelevant articles, we stop screening", with N at 10, 20, or 50?
-
A great deal of discussion is happening in #557 (comment) around stopping criteria and when or how to stop screening. One idea discussed there is randomly sampling a few records and estimating, from the sample, the distribution of relevant records or the total recall in the dataset. However, there is something we already know about the entire dataset in advance: the feature vectors. So I would like to propose a possible way to use these feature vectors to reliably estimate the stopping point for screening. (I don't know whether this approach has already been proposed.)
The idea is very simple. We start by taking the resultant (vector sum) of the feature vectors of the entire dataset, and we maintain two other resultant vectors, one for the relevant and one for the irrelevant records. As we screen, at each iteration we subtract the record's feature vector from the unlabelled resultant and add it to the appropriate resultant (relevant or irrelevant). We then calculate the cosine similarity between these resultant vectors. The assumption is that the cosine similarity between the unlabelled and irrelevant resultants will slowly plateau once no relevant records are left in the unlabelled group to pull its resultant towards the relevant direction. Based on this unlabelled-irrelevant cosine similarity, we can use some metric to identify the point after which screening can be safely stopped.
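To make the steps above concrete, here is a minimal sketch of the incremental resultant-vector computation. The function names (`cosine`, `screen`) and the toy feature matrix are my own illustrative choices, not ASReview's API or the actual implementation linked below:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors (0 if either is the zero vector)."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

def screen(features, labels, order):
    """Return the unlabelled-irrelevant cosine similarity after each screened record.

    features: (n_records, n_features) array of feature vectors
    labels:   1 = relevant, 0 = irrelevant
    order:    record indices in the order they are screened
    """
    unlabelled = features.sum(axis=0)            # resultant of the whole dataset
    relevant = np.zeros(features.shape[1])
    irrelevant = np.zeros(features.shape[1])
    sims = []
    for i in order:
        unlabelled = unlabelled - features[i]    # move the record out of unlabelled
        if labels[i] == 1:
            relevant = relevant + features[i]    # add to the relevant resultant
        else:
            irrelevant = irrelevant + features[i]  # add to the irrelevant resultant
        sims.append(cosine(unlabelled, irrelevant))
    return sims

# Toy usage: random non-negative features with ~10% relevant records.
rng = np.random.default_rng(0)
X = rng.random((100, 16))
y = (rng.random(100) < 0.1).astype(int)
sims = screen(X, y, np.arange(100))
```

On real data, the expectation is that once all relevant records have been found, removing further (irrelevant) records no longer shifts the unlabelled resultant away from the irrelevant direction, so the curve in `sims` plateaus.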
Here is the portion of the code implementing this approach. The entire code is here; you can use it for experimentation.