-
Great suggestions @bramiozo. Although the default is the max query strategy, the mixed query strategy is also available, and it can introduce the diversity and serendipity into the record selection that you mention. The mixture ratio can also be used to strengthen these effects; the default for the mixed query strategy is 95% max with 5% random.
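For illustration, a minimal sketch of how such a mixed query strategy could work, written as generic numpy rather than any library's actual API; the 0.95/0.05 split mirrors the default mixture ratio mentioned above:

```python
import numpy as np

def mixed_query(relevance_scores, n_to_query, mix_ratio=0.95, rng=None):
    """Pick records for labelling: a fraction by highest predicted
    relevance ("max"), the remainder uniformly at random."""
    rng = rng or np.random.default_rng()
    n_max = int(round(mix_ratio * n_to_query))
    order = np.argsort(relevance_scores)[::-1]      # highest score first
    max_picks = order[:n_max]
    remaining = np.setdiff1d(np.arange(len(relevance_scores)), max_picks)
    rand_picks = rng.choice(remaining, size=n_to_query - n_max, replace=False)
    return np.concatenate([max_picks, rand_picks])

# Example: query 20 records out of 1000 unlabelled ones, 95% max / 5% random
scores = np.random.default_rng(0).random(1000)
print(mixed_query(scores, 20))
```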
-
Is there any consideration of paper diversity and self-reinforcing model bias?
I don't think "% of relevant publications found" is a sufficient metric in and of itself. It may be that the criteria that are easiest to fulfil lead to a selection bias that reinforces certain model weights, so that, de facto, only part of the initial (implicit) paper selection criteria is fulfilled.
I would definitely add a diversity metric to augment the precision metric: highly precise selections are not very useful if they only pick papers within a small area of the "acceptable" joint probability distribution, and likewise a high coverage of this "acceptable" distribution is not very useful if it produces many selections outside it. You can get an estimate of diversity from the entropy over the cluster assignments of the current paper selection, with the clustering based on the abstract embeddings. Clustering is expensive, so that might be an impediment (bootstrapping helps). I would also add serendipity, if you have not already done this, to occasionally select papers from the unlikely pool.
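A rough sketch of that entropy-based diversity metric, assuming abstract embeddings are already available; KMeans stands in here for whichever clustering is actually used:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import entropy

def selection_diversity(abstract_embeddings, selected_idx, n_clusters=20, seed=0):
    """Entropy of the cluster distribution of the selected papers,
    normalised by log(n_clusters) so 1.0 means a perfectly even spread."""
    clusters = KMeans(n_clusters=n_clusters, random_state=seed,
                      n_init=10).fit_predict(abstract_embeddings)
    counts = np.bincount(clusters[selected_idx], minlength=n_clusters)
    return entropy(counts / counts.sum()) / np.log(n_clusters)

# Example with random embeddings: close to 1.0 for a broad selection,
# close to 0.0 when all selected papers fall in one cluster.
emb = np.random.default_rng(1).normal(size=(5000, 384))
print(selection_diversity(emb, np.arange(200)))
```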
One suggestion regarding including diversity in the selection is to approach s.vrijenhoek@uva.nl; she is working on diversity in recommender systems.
I would also include the possibility to add discrete/explicit selection criteria. This might greatly reduce the prior probability of false negatives, but it does require that the papers are pre-parsed, which might be easy to generalise for certain criteria; regex templates might help here (see the sketch below).
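A sketch of what such regex-template criteria could look like on pre-parsed records; the field names and patterns are purely illustrative, not an existing schema:

```python
import re

# Illustrative explicit criteria: each maps a parsed field to a regex
# and a flag saying whether the regex must match (True) or must not (False).
CRITERIA = {
    "language": (re.compile(r"^(en|english)$", re.I), True),
    "pub_type": (re.compile(r"editorial|erratum", re.I), False),
    "abstract": (re.compile(r"randomi[sz]ed controlled trial", re.I), True),
}

def passes_criteria(record: dict) -> bool:
    """Apply the discrete criteria before the active-learning model sees the record."""
    for field, (pattern, must_match) in CRITERIA.items():
        value = record.get(field, "")
        if bool(pattern.search(value)) != must_match:
            return False
    return True

record = {"language": "en", "pub_type": "journal article",
          "abstract": "A randomised controlled trial of ..."}
print(passes_criteria(record))  # True
```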
I would also consider the option to chain multiple simple models. You would have to monitor model saturation, i.e. whether the model itself is still being updated significantly; this is of course an interplay between model-selection bias and model capacity. If you decide a model is saturated, you append a new model. This new model starts from scratch, but its paper selection excludes papers that are similar to those selected by the prior model. You keep chaining until some overall diversity criterion is reached (and since the precision criterion is met per model, that is nice). One thing that pops into my mind is using a probabilistic clustering technique such as GMMs to pre-seed the data selection for the model chain. This increases predictability, but also makes it less clever, since you don't necessarily focus on the most relevant aspects.
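A schematic of that chaining loop, with hypothetical callables (`new_model`, `label_round`, `similar_to`, `diversity`) and a crude weight-change test standing in for a real saturation criterion:

```python
import numpy as np

def saturated(old_w, new_w, eps=1e-3):
    """Crude saturation test: relative change in the model weights below eps."""
    return np.linalg.norm(new_w - old_w) / (np.linalg.norm(old_w) + 1e-12) < eps

def chain_models(pool, new_model, label_round, similar_to, diversity,
                 target_diversity=0.8, max_models=5):
    """Append a fresh model each time the current one saturates; every new
    model starts from scratch on the pool minus papers similar to the
    previous model's selections, until the overall diversity target is met."""
    chain, selected, remaining = [], [], list(pool)
    for _ in range(max_models):
        model, model_picks = new_model(), []
        prev_w = model.weights.copy()                     # hypothetical weight attribute
        while True:
            model_picks += label_round(model, remaining)  # one active-learning round
            if saturated(prev_w, model.weights):          # model no longer updating much
                break
            prev_w = model.weights.copy()
        chain.append(model)
        selected += model_picks
        if diversity(selected) >= target_diversity:
            break                                         # overall diversity criterion met
        # exclude papers similar to this model's selections for the next model
        exclude = set(similar_to(model_picks, remaining))
        remaining = [p for p in remaining if p not in exclude]
    return chain, selected
```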
Just some ideas.