Duplicates in datafile #860
-
I have tried to remove all duplicates from my dataset using EndNote. However, once I started reviewing, I found that a substantial number of duplicates were left in my dataset. I was wondering to what extent this influences the algorithm, and whether you consider this to be a problem?
-
We have not done research into the effect of duplicates on the performance of ASReview. However, we expect that the risk lies mainly in the records that you include and that have duplicates.

First of all, you will have to include them twice (you will probably never see the duplicate of an exclusion, because it is pushed to the back of your set). But this is only inconvenient; it does not harm your results.

More importantly, a duplicated inclusion gets more weight than an inclusion without duplicates, which may hurt performance. For instance, if inclusions with duplicates represent a specific subset of your results, that subset will become more prominent in your inclusions because of the duplicates. If, for example, you have searched a general database/search engine plus a very specific one, all the duplicates will come from the very specific database, and inclusions found in that specific database get more weight than inclusions found only in the general database (or only in the specific one).

It's hard to say whether this is really a problem in your case. It probably depends mainly on the distribution of your duplicates (random, or representing specific subsets). You could try additional deduplication methods (e.g. https://osf.io/qbjpa/), or just be a little more careful choosing your stopping rule (see #557).
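As a starting point for such extra deduplication, here is a minimal sketch in Python using pandas (this is not the method from the link above). It assumes a CSV export with `title` and `doi` columns; the column names and file names are placeholders for whatever your export actually uses:

```python
# Minimal deduplication sketch: normalise DOIs and titles,
# then keep the first occurrence of each record.
# Assumes columns named "title" and "doi"; adjust to your export.
import pandas as pd


def deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    """Drop records whose normalised DOI or title already occurred."""
    df = df.copy()

    # Normalise titles: lowercase, collapse non-alphanumerics, strip whitespace.
    norm_title = (
        df["title"]
        .fillna("")
        .str.lower()
        .str.replace(r"[^a-z0-9]+", " ", regex=True)
        .str.strip()
    )
    # Normalise DOIs: lowercase and strip whitespace.
    norm_doi = df["doi"].fillna("").str.lower().str.strip()

    # Flag a record as a duplicate if its normalised DOI or its
    # normalised title matches an earlier record (empty values are ignored).
    dup_doi = norm_doi.duplicated() & (norm_doi != "")
    dup_title = norm_title.duplicated() & (norm_title != "")
    return df[~(dup_doi | dup_title)]


records = pd.read_csv("dataset.csv")  # hypothetical file name
print(f"{len(records)} records before deduplication")
deduped = deduplicate(records)
print(f"{len(deduped)} records after deduplication")
deduped.to_csv("dataset_dedup.csv", index=False)
```

Exact title matching after normalisation is deliberately conservative; near-duplicates with small title variations would need fuzzy matching on top of this.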