Feature-based stopping criterion (??) #1344
Replies: 9 comments 3 replies
-
From my understanding, this approach can be used for large datasets where the assumption of a clear separation between relevant and irrelevant records holds, as in the van de Schoot 2017 and Bos 2018 datasets. There, the relevant_irrelevant cosine similarity curve (green) lies well below the unlabelled_irrelevant curve (orange), and the green curve drops sharply and then stays at a low cosine similarity value for a significant number of consecutive records.
-
Nice work @rohitgarud! However, the plots are probably hard to read for users without a data science background. Could we leverage this algorithm and transform its output into something human-readable? Furthermore, what does it add compared to "if we have read N consecutive irrelevant articles, we stop screening", with N at 10, 20, or 50?
-
A great deal of discussion is happening in #557 (comment) around stopping criteria and when or how to stop screening. One idea discussed there is randomly sampling a few records and estimating, from the sample, the distribution of relevant records or the total recall in the dataset. However, there is something we already know about the entire dataset in advance: the feature vectors. So I would like to propose a possible way to use these feature vectors to reliably estimate the stopping point for screening. (I don't know whether this approach has already been proposed.)
The idea is very simple. We start by taking the resultant (vector sum) of the feature vectors of the entire dataset, and we maintain two other resultant vectors, one for the relevant and one for the irrelevant records. As we screen, at each iteration we subtract the record's feature vector from the unlabelled resultant and add it to the appropriate resultant (relevant or irrelevant). We then calculate the cosine similarity between these resultant vectors. The assumption is that the cosine similarity between the unlabelled and irrelevant resultants will slowly plateau once no relevant records are left in the unlabelled group to pull its resultant towards the relevant direction. Based on this unlabelled-irrelevant cosine similarity, we can use some metric to identify the point after which screening can be safely stopped.
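To make the steps above concrete, here is a minimal sketch of the incremental resultant-vector computation. The function names (`cosine`, `screen`) and the toy feature matrix are my own illustrative choices, not ASReview's API or the actual implementation linked below:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors (0 if either is the zero vector)."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

def screen(features, labels, order):
    """Return the unlabelled-irrelevant cosine similarity after each screened record.

    features: (n_records, n_features) array of feature vectors
    labels:   1 = relevant, 0 = irrelevant
    order:    record indices in the order they are screened
    """
    unlabelled = features.sum(axis=0)            # resultant of the whole dataset
    relevant = np.zeros(features.shape[1])
    irrelevant = np.zeros(features.shape[1])
    sims = []
    for i in order:
        unlabelled = unlabelled - features[i]    # move the record out of unlabelled
        if labels[i] == 1:
            relevant = relevant + features[i]    # add to the relevant resultant
        else:
            irrelevant = irrelevant + features[i]  # add to the irrelevant resultant
        sims.append(cosine(unlabelled, irrelevant))
    return sims

# Toy usage: random non-negative features with ~10% relevant records.
rng = np.random.default_rng(0)
X = rng.random((100, 16))
y = (rng.random(100) < 0.1).astype(int)
sims = screen(X, y, np.arange(100))
```

On real data, the expectation is that once all relevant records have been found, removing further (irrelevant) records no longer shifts the unlabelled resultant away from the irrelevant direction, so the curve in `sims` plateaus.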
Here is the portion of the code implementing this approach. The entire code is here; you can use it for experimentation.