Duplicates in datafile #860
-
I have tried to remove all duplicates from my dataset using EndNote. However, once I started reviewing, I found that a substantial number of duplicates were left in my dataset. I was wondering to what extent this influences the algorithm, and whether you consider this to be a problem?
-
We have not done research into the effect of duplicates on the performance of ASReview. However, we expect that the risk lies mainly in the records that you include and that have duplicates.

First of all, you will have to include them twice (you will probably never see the duplicate of an exclusion, because it is pushed to the back of your set). But this is only inconvenient; it does not harm your results.

More importantly, a duplicated inclusion gets more weight than an inclusion without duplicates, which may hurt performance. For instance, if inclusions with duplicates represent a specific subset of your results, that subset will become more prominent in your inclusions because of the duplicates. If, for example, you have searched a general database/search engine plus a very specific one, all the duplicates will come from the very specific database, and inclusions found in that specific database get more weight than inclusions found only in the general database (or only in the specific one).

It's hard to say whether this is really a problem in your case. It probably depends mainly on the distribution of your duplicates (random, or representing specific subsets). You could try additional deduplication methods (e.g. https://osf.io/qbjpa/), or just be a little more careful choosing your stopping rule (see #557).
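As a starting point for such extra deduplication, here is a minimal sketch in Python using pandas (this is not the method from the link above). It assumes a CSV export with `title` and `doi` columns; the column names and file names are placeholders for whatever your export actually uses:

```python
# Minimal deduplication sketch: normalise DOIs and titles,
# then keep the first occurrence of each record.
# Assumes columns named "title" and "doi"; adjust to your export.
import pandas as pd


def deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    """Drop records whose normalised DOI or title already occurred."""
    df = df.copy()

    # Normalise titles: lowercase, collapse non-alphanumerics, strip whitespace.
    norm_title = (
        df["title"]
        .fillna("")
        .str.lower()
        .str.replace(r"[^a-z0-9]+", " ", regex=True)
        .str.strip()
    )
    # Normalise DOIs: lowercase and strip whitespace.
    norm_doi = df["doi"].fillna("").str.lower().str.strip()

    # Flag a record as a duplicate if its normalised DOI or its
    # normalised title matches an earlier record (empty values are ignored).
    dup_doi = norm_doi.duplicated() & (norm_doi != "")
    dup_title = norm_title.duplicated() & (norm_title != "")
    return df[~(dup_doi | dup_title)]


records = pd.read_csv("dataset.csv")  # hypothetical file name
print(f"{len(records)} records before deduplication")
deduped = deduplicate(records)
print(f"{len(deduped)} records after deduplication")
deduped.to_csv("dataset_dedup.csv", index=False)
```

Exact title matching after normalisation is deliberately conservative; near-duplicates with small title variations would need fuzzy matching on top of this.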