-
Great suggestions @bramiozo. Although the default is the max query strategy, the mixed query strategy is also available, and it can introduce the diversity and serendipity into the record selection that you mention. The mixture ratio can also be used to strengthen these effects; the default for the mixed query strategy is 95% max with 5% random.
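For illustration, a minimal sketch of how such a mixed query strategy could work, written as generic numpy rather than any library's actual API; the 0.95/0.05 split mirrors the default mixture ratio mentioned above:

```python
import numpy as np

def mixed_query(relevance_scores, n_to_query, mix_ratio=0.95, rng=None):
    """Pick records for labelling: a fraction by highest predicted
    relevance ("max"), the remainder uniformly at random."""
    rng = rng or np.random.default_rng()
    n_max = int(round(mix_ratio * n_to_query))
    order = np.argsort(relevance_scores)[::-1]      # highest score first
    max_picks = order[:n_max]
    remaining = np.setdiff1d(np.arange(len(relevance_scores)), max_picks)
    rand_picks = rng.choice(remaining, size=n_to_query - n_max, replace=False)
    return np.concatenate([max_picks, rand_picks])

# Example: query 20 records out of 1000 unlabelled ones, 95% max / 5% random
scores = np.random.default_rng(0).random(1000)
print(mixed_query(scores, 20))
```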
-
Is there any consideration of paper diversity and self-reinforcing model bias?
I don't think "% of relevant publications found" is a sufficient metric in and of itself. It may be that the criteria that are easiest to fulfil lead to a selection bias that reinforces certain model weights, so that, de facto, only part of the initial (implicit) paper selection criteria is fulfilled.
I would definitely add a diversity metric to augment the precision metric: highly precise selections are not very useful if they only pick papers within a small area of the "acceptable" joint probability distribution, and likewise a high coverage of this "acceptable" distribution is not very useful if it produces many selections outside it. You can get an estimate of diversity from the entropy over the cluster assignments of the current paper selection, with the clustering based on the abstract embeddings. Clustering is expensive, so that might be an impediment (bootstrapping helps). I would also add serendipity, if you have not already done this, to occasionally select papers from the unlikely pool.
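A rough sketch of that entropy-based diversity metric, assuming abstract embeddings are already available; KMeans stands in here for whichever clustering is actually used:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import entropy

def selection_diversity(abstract_embeddings, selected_idx, n_clusters=20, seed=0):
    """Entropy of the cluster distribution of the selected papers,
    normalised by log(n_clusters) so 1.0 means a perfectly even spread."""
    clusters = KMeans(n_clusters=n_clusters, random_state=seed,
                      n_init=10).fit_predict(abstract_embeddings)
    counts = np.bincount(clusters[selected_idx], minlength=n_clusters)
    return entropy(counts / counts.sum()) / np.log(n_clusters)

# Example with random embeddings: close to 1.0 for a broad selection,
# close to 0.0 when all selected papers fall in one cluster.
emb = np.random.default_rng(1).normal(size=(5000, 384))
print(selection_diversity(emb, np.arange(200)))
```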
One suggestion regarding including diversity in the selection is to approach s.vrijenhoek@uva.nl; she is working on diversity in recommender systems.
I would also include the possibility to add discrete/explicit selection criteria. This might greatly reduce the prior probability of false negatives, but it does require that the papers are pre-parsed, which might be easy to generalise for certain criteria; regex templates might help here (see the sketch below).
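A sketch of what such regex-template criteria could look like on pre-parsed records; the field names and patterns are purely illustrative, not an existing schema:

```python
import re

# Illustrative explicit criteria: each maps a parsed field to a regex
# and a flag saying whether the regex must match (True) or must not (False).
CRITERIA = {
    "language": (re.compile(r"^(en|english)$", re.I), True),
    "pub_type": (re.compile(r"editorial|erratum", re.I), False),
    "abstract": (re.compile(r"randomi[sz]ed controlled trial", re.I), True),
}

def passes_criteria(record: dict) -> bool:
    """Apply the discrete criteria before the active-learning model sees the record."""
    for field, (pattern, must_match) in CRITERIA.items():
        value = record.get(field, "")
        if bool(pattern.search(value)) != must_match:
            return False
    return True

record = {"language": "en", "pub_type": "journal article",
          "abstract": "A randomised controlled trial of ..."}
print(passes_criteria(record))  # True
```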
I would also consider the option to chain multiple simple models. You would have to monitor model saturation, i.e. whether the model itself is still being updated significantly; this is of course an interplay between model-selection bias and model capacity. If you decide a model is saturated, you append a new model. This new model starts from scratch, but its paper selection excludes papers that are similar to those selected by the prior model. You keep chaining until some overall diversity criterion is reached (and since the precision criterion is met per model, that is nice). One thing that pops into my mind is using a probabilistic clustering technique such as GMMs to pre-seed the data selection for the model chain. This increases predictability, but also makes it less clever, since you don't necessarily focus on the most relevant aspects.
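A schematic of that chaining loop, with hypothetical callables (`new_model`, `label_round`, `similar_to`, `diversity`) and a crude weight-change test standing in for a real saturation criterion:

```python
import numpy as np

def saturated(old_w, new_w, eps=1e-3):
    """Crude saturation test: relative change in the model weights below eps."""
    return np.linalg.norm(new_w - old_w) / (np.linalg.norm(old_w) + 1e-12) < eps

def chain_models(pool, new_model, label_round, similar_to, diversity,
                 target_diversity=0.8, max_models=5):
    """Append a fresh model each time the current one saturates; every new
    model starts from scratch on the pool minus papers similar to the
    previous model's selections, until the overall diversity target is met."""
    chain, selected, remaining = [], [], list(pool)
    for _ in range(max_models):
        model, model_picks = new_model(), []
        prev_w = model.weights.copy()                     # hypothetical weight attribute
        while True:
            model_picks += label_round(model, remaining)  # one active-learning round
            if saturated(prev_w, model.weights):          # model no longer updating much
                break
            prev_w = model.weights.copy()
        chain.append(model)
        selected += model_picks
        if diversity(selected) >= target_diversity:
            break                                         # overall diversity criterion met
        # exclude papers similar to this model's selections for the next model
        exclude = set(similar_to(model_picks, remaining))
        remaining = [p for p in remaining if p not in exclude]
    return chain, selected
```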
Just some ideas.