How to stop screening? #557
13 comments · 30 replies
-
It is recommended to decide on a stopping rule before starting the screening process. Your stopping rule can be time-based, data-driven, or a mix of the two.
Time-based strategy: If you choose a time-based strategy, you decide to stop after a fixed amount of time. This strategy can be useful when you have a limited amount of time to screen.
Data-driven strategy: When using a data-driven strategy, you decide to stop after, for example, a certain number of consecutive irrelevant papers (this number can be found in the statistics panel). Whether you choose 50, 100, 250, 500, etc. depends on the size of the dataset and the goal of the user. Ask yourself: how important is it to find all the relevant papers?
Mixed strategy: Another option is to stop after a fixed amount of time, unless you exceed the predetermined threshold of consecutive irrelevant papers before that time.
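To make the data-driven rule concrete, here is a minimal sketch of what such a check could look like when applied to an exported labeling history; the function and the 0/1 encoding are illustrative assumptions, not ASReview functionality.

```python
def should_stop(labels, threshold=100):
    """Return True once the last `threshold` screened records were all irrelevant.

    `labels` is the screening history in labeling order, with
    1 = relevant and 0 = irrelevant (a hypothetical representation,
    not an ASReview data structure).
    """
    if len(labels) < threshold:
        return False
    return all(label == 0 for label in labels[-threshold:])


# Small threshold purely for illustration:
history = [1, 0, 1, 0, 0, 0]
print(should_stop(history, threshold=3))  # True: the last three records were irrelevant
```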
-
@MartijnUX, where can I find the statistics panel showing the 42 irrelevant records you were referring to?
-
For a scoping review, I was wondering whether a rule like the following would be advisable: screening saturation (the point at which we stop screening) is defined in this scoping review as the moment ASReview returns 1% of the total number of found papers (with a minimum of 25) as consecutive non-relevant papers. When this point is reached, we end the screening. (For example, if 1,500 articles are found, screening ends after 25 consecutive irrelevant articles; if 3,000 articles are found, screening ends after 30 consecutive irrelevant articles.)
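A short sketch of how that threshold could be computed under the rule described above (the function name and defaults are just for illustration):

```python
import math

def saturation_threshold(n_records, fraction=0.01, minimum=25):
    """Number of consecutive irrelevant records after which screening stops."""
    return max(minimum, math.ceil(fraction * n_records))

print(saturation_threshold(1500))  # 25  (1% would be 15, so the minimum applies)
print(saturation_threshold(3000))  # 30
```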
-
Same as Jan. I don't see why a scoping review should be different from any other review.
That said, if we could generate a credible estimate of recall, then we could stop whenever that estimate reaches the recall that is sufficient for the particular project, and we would be able to attach some measure of confidence to that estimate.
Work has been published on estimators of recall, and at least one team has implemented theirs to provide quantitative stopping information (Howard et al. 2020, https://doi.org/10.1016/j.envint.2020.105623). They consider the estimate they generate valid and conservative, but I would love to have the opinion of the ASReview developers about that.
Note that Howard et al.'s estimator of recall is based on the statistical properties of the gaps between successive positive records in an ordered queue, which is also the basis for the heuristic proposed by Jan.
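To make the "gap" idea concrete, here is a toy illustration (not Howard et al.'s actual estimator) of the quantity such estimators reason about: the number of records screened between successive relevant records, which tends to widen as the queue is exhausted.

```python
def gaps_between_relevant(labels):
    """Gaps (number of screened records) between successive relevant records.

    `labels` is the screening history in screening order, 1 = relevant,
    0 = irrelevant. Toy illustration only.
    """
    positions = [i for i, label in enumerate(labels) if label == 1]
    return [b - a for a, b in zip(positions, positions[1:])]


history = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(gaps_between_relevant(history))  # [1, 2, 4] -- widening gaps as screening proceeds
```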
-
In their implementation, we repeatedly notice the "clumping" you mention, where the positive records are not spread evenly throughout the queue. We often have "streaks" of positive records followed by "droughts".
The recall estimate gets recomputed every time the ranking gets recomputed (every 30 records or by user request). It seems as if the recall estimation just gets adjusted as the work passes through plateaus.
I personally notice the similarity between screening data and time-to-event data, such as survival data or reliability data. I have used time-to-event methods to compare the efficiency of different ranking models. Parametric time-to-event methods can generate an estimate of the endpoint of the process (100% events). I wonder if this has been described.
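As an illustration of that analogy, here is a rough sketch using the lifelines library, under two assumptions that are mine rather than the thread's: "time" is the screening rank at which a record was labeled, and an external estimate of the total number of relevant records is available (e.g. from a preliminary random sample), so the not-yet-found relevant records can be treated as right-censored at the current rank.

```python
import numpy as np
from lifelines import WeibullFitter

ranks_of_found_relevant = np.array([3, 7, 12, 30, 55, 90, 160, 240])  # made-up data
n_screened = 400            # records screened so far
n_relevant_total_est = 12   # assumed external estimate of total relevant records

# Found relevant records are "events" at their screening rank; the rest of the
# estimated relevant population is right-censored at the current rank.
n_censored = n_relevant_total_est - len(ranks_of_found_relevant)
durations = np.concatenate([ranks_of_found_relevant, np.full(n_censored, n_screened)])
observed = np.concatenate([np.ones(len(ranks_of_found_relevant)), np.zeros(n_censored)])

wf = WeibullFitter().fit(durations, event_observed=observed)

# Rank by which ~99% of relevant records are expected to be found, from the
# fitted Weibull survival function S(t) = exp(-(t / lambda)^rho).
rank_99 = wf.lambda_ * (-np.log(0.01)) ** (1.0 / wf.rho_)
print(f"Estimated rank to reach ~99% of relevant records: {rank_99:.0f}")
```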
-
Thank you for the offer!!
I am on the US east coast, so wouldn't that be 5:00 am?
That is way too early for me, but the offer is extremely tempting.
Add me to the list and I'll see if I can manage to wake up.
-
Those are three great sources. Thank you.
Your suggestion of using a preliminary random sample to estimate the population frequency of positives is intuitively very attractive.
However, our work routinely deals with populations of 100,000-300,000 references, among which there may be only 1,000-3,000 positives. The size of the random sample needed to give a credible and reasonably accurate estimate of total recall may be prohibitive. Or it may be acceptable, especially if the margin of error we require is not too small.
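For a rough sense of those numbers, here is a sketch of the standard sample-size calculation for estimating a proportion, with a finite population correction; the prevalence and margin-of-error values are illustrative assumptions.

```python
import math

def sample_size_for_proportion(population, expected_prevalence, margin_of_error, z=1.96):
    """Sample size for estimating a proportion with the given absolute margin of error
    (95% confidence by default), including the finite population correction."""
    p = expected_prevalence
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2   # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))    # finite population correction

# ~1% prevalence in a 200,000-record corpus:
print(sample_size_for_proportion(200_000, 0.01, 0.005))  # margin +/-0.5% -> about 1,510 records
print(sample_size_for_proportion(200_000, 0.01, 0.002))  # margin +/-0.2% -> about 9,080 records
```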
-
You can also check the stopping rules used by other researchers. We have a list of systematic reviews where ASReview was used as a screening tool and the authors have reported their stopping rule.
-
See also the discussion in #1115 about calculating the 'knee' criterion based on the output of the recall plot.
-
Hereby I would like to share my stopping rules for two rounds of screening in my review protocol. The stopping rule of the first screening phase is three-fold:
(1) Screening in this phase will be stopped when at least 25% of the records have been screened. Van de Schoot et al. (2021) showed that 95% of the eligible studies are found after screening only 8% to 33% of the total number of records.
(2) Screening will be stopped only when all key papers have been marked as relevant.
(3) Based on the results of screening 25% of the records, a 'knee method' stopping criterion (Cormack & Grossman, 2016) will be applied. The knee method is a geometric stopping procedure based on the shape of the gain curve (i.e. recall versus effort). The recall plot generated by the software, which plots the number of identified relevant records against the number of viewed records, will be visually inspected after screening batches of 5% of the total number of records, to see whether a plateau (or 'knee') has been reached. When a plateau is visually identified, we will check mathematically whether the slope before that point is more than a predefined cutoff ratio (e.g. 6) times as high as the slope after it. When this point is reached, we end the active learning screening phase.
Screening phase 2: Deep learning
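For readers who want to automate step (3), here is a simplified sketch of a slope-ratio check on the gain curve, loosely following Cormack & Grossman (2016); the +1 smoothing term and the example cutoff are illustrative choices, not ASReview code.

```python
def knee_slope_ratio(relevant_counts):
    """Max ratio of the slope before a candidate knee to the slope after it.

    `relevant_counts[i]` is the cumulative number of relevant records after
    screening i + 1 records (i.e. the gain curve).
    """
    s = len(relevant_counts) - 1  # current position in the screening order
    best = 0.0
    for i in range(1, s):
        slope_before = relevant_counts[i] / (i + 1)
        # +1 in the numerator avoids a zero slope when no relevant records
        # were found after the candidate knee.
        slope_after = (relevant_counts[s] - relevant_counts[i] + 1) / (s - i)
        best = max(best, slope_before / slope_after)
    return best

# Stop when the ratio exceeds the predefined cutoff (6 in the protocol above).
gain_curve = [1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
print(knee_slope_ratio(gain_curve) >= 6)  # True: the curve has clearly flattened
```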
-
Hello everyone, I hope you're doing well. I have recently conducted a simulation study on utilizing extracted features to develop a stopping criterion for screening. I would be extremely grateful for your expert opinions and any feedback you might have. Thank you in advance for taking the time to review my work.
-
See also the preprint "The SAFE Procedure: A Practical Stopping Heuristic for Active Learning-Based Screening in Systematic Reviews and Meta-Analyses": https://psyarxiv.com/c93gq
-
Hi, what do you think about combining a stopping rule with the human error rate (10.76% [95% CI: 7.43% to 14.09%]) from doi: 10.1371/journal.pone.0227742?
All the above numbers are imaginary. What do you think? Emanuel
-
This first post is continuously updated based on the discussions in this thread.
In the active learning cycle, the model incrementally improves its predictions on the remaining unlabeled records, but hopefully, all relevant records are identified as early in the process as possible. The reviewer decides to stop at some point during the process to conserve resources, or when all records have been labeled. In the latter case no time was saved, and therefore the main question is when to stop: i.e., to determine the point at which the cost of labeling more papers by the reviewer is greater than the cost of the errors made by the current model (e.g., Cohen, 2011). Finding 100% of the relevant papers appears to be almost impossible, even for human annotators (Wang, Nayfeh, Tetzlaff, O’Blenis, & Murad, 2020). Therefore, we typically aim to find 95% of the inclusions. However, with an unlabeled dataset you don’t know how many relevant papers are left to be found. So researchers might either stop too early and potentially miss many relevant papers, or stop too late, causing unnecessary further reading (Z. Yu, N. Kraft, & T. Menzies, 2018a).
There are potential stopping rules that still have to be implemented, for example estimating the number of potentially relevant papers or finding an inflection point (Cormack & Grossman, 2015, 2016; Kastner, Straus, McKibbon, & Goldsmith, 2009; Stelfox, Foster, Niven, Kirkpatrick, & Goldsmith, 2013; Ros, Bjarnason, & Runeson, 2017; Wallace et al., 2010, 2012; Webster & Kemp, 2013; Yu & Menzies, 2019).
Another option is to use heuristics (Bloodgood & Vijay-Shanker, 2014; Olsson & Tomanek, 2009; Vlachos, 2008), for example:
Time-based strategy: If you choose a time-based strategy, you decide to stop after a fixed amount of time. This strategy can be useful when you have a limited amount of time to screen.
Data-driven strategy: When using a data-driven strategy, you decide to stop after, for example, a certain number of consecutive irrelevant papers (this number can be found in the statistics panel). Whether you choose 50, 100, 250, 500, etc. depends on the size of the dataset and the goal of the user. Ask yourself: how important is it to find all the relevant papers?
Mixed strategy: Another option is to stop after a fixed amount of time, unless you exceed the predetermined threshold of consecutive irrelevant papers before that time.
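As an illustration of the mixed strategy, here is a minimal sketch that combines an elapsed-time budget with a consecutive-irrelevant threshold; the class and its parameters are assumptions for the example, not ASReview functionality.

```python
import time

class MixedStoppingRule:
    """Stop when either the time budget is spent or the run of consecutive
    irrelevant records reaches the threshold (illustrative sketch only)."""

    def __init__(self, max_hours=8.0, max_consecutive_irrelevant=100):
        self.start = time.monotonic()
        self.max_seconds = max_hours * 3600
        self.max_consecutive = max_consecutive_irrelevant
        self.consecutive_irrelevant = 0

    def record(self, is_relevant):
        # Reset the run on a relevant record, extend it otherwise.
        self.consecutive_irrelevant = 0 if is_relevant else self.consecutive_irrelevant + 1

    def should_stop(self):
        out_of_time = time.monotonic() - self.start >= self.max_seconds
        saturated = self.consecutive_irrelevant >= self.max_consecutive
        return out_of_time or saturated


# Usage: call record() after each screening decision and check should_stop().
rule = MixedStoppingRule(max_hours=4, max_consecutive_irrelevant=250)
rule.record(is_relevant=False)
print(rule.should_stop())  # False: neither limit has been reached yet
```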
Below we discuss more options in detail. Join the discussion!!
Some useful references: