removing different versions of the same paper #1542
-
Hi, I have a dataset (around 3.5k articles) that includes duplicates in the form of different versions of the same paper. For some papers, the dataset contains the final published version as well as the working paper versions. These sometimes have exactly the same title (after normalization) and abstract, but sometimes the abstract in particular differs because the paper evolved during the review process. Is there any way to determine which version of the paper will be removed if I use the datatools deduplication feature? Ideally, it would always keep the newer version. Or is this not possible because these might not even be considered duplicates by datatools? If datatools is not suitable for this, I would also appreciate any tips on other software that could help me with this task. Currently I am using Rayyan, but this will require me to make around 700 manual duplicate decisions. Thanks in advance!
-
Hi @kurtenbach, I don't know which version datatools picks. Here are two other options that could be helpful: https://camarades.shinyapps.io/RDedup/
-
The algorithm used in Datatools first removes all duplicates based on a persistent identifier (PID). It then concatenates the title and abstract of each record and removes all non-alphanumeric tokens. Finally, records whose resulting text is identical are removed as duplicates.
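To make the steps above concrete, here is a minimal sketch in plain Python. It is not the actual Datatools code: the field names `doi`, `title` and `abstract`, the helper names, the lowercasing step, and the reading of "non-alphanumeric tokens" as individual characters are all assumptions made for illustration.

```python
import re


def normalize(record):
    """Concatenate title and abstract, then keep only letters and digits.

    Lowercasing is an assumption; the description above does not mention it.
    """
    text = record.get("title", "") + " " + record.get("abstract", "")
    return re.sub(r"[^a-z0-9]", "", text.lower())


def drop_duplicates(records, key):
    """Keep the first record for each key; records with an empty key are always kept."""
    seen, kept = set(), []
    for record in records:
        k = key(record)
        if k and k in seen:
            continue  # a record with this key was already kept
        if k:
            seen.add(k)
        kept.append(record)
    return kept


def deduplicate(records):
    """Step 1: deduplicate on the PID. Step 2: deduplicate on the normalized text."""
    by_pid = drop_duplicates(records, key=lambda r: r.get("doi"))  # "doi" is a hypothetical PID field
    return drop_duplicates(by_pid, key=normalize)
```

Under this reading, two versions of a paper with the same title but a revised abstract would not be matched by the text step, because their normalized strings differ.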
-
Hi @kurtenbach, in addition to what was said already, VU University's Kirsten Ziesemer and colleagues were working on generalizing their approach to deduplication of records. Maybe they can point you to some tools.
Indeed, the first record is considered unique, while the others are removed from the dataset. By sorting your dataset in advance (on the type of publication), you can ensure that published versions show up before other versions of the same work. Deduplication will then keep the published version and remove the others.
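The sorting idea can be sketched as follows. This is only an illustration under stated assumptions: the `type` field, its labels (`journal_article`, `working_paper`), and the title-only matching in `keep_first_per_title` are hypothetical stand-ins for whatever your export and deduplication tool actually use.

```python
import re

# Hypothetical publication-type labels; lower numbers sort first.
TYPE_PRIORITY = {"journal_article": 0, "working_paper": 1}


def sort_published_first(records):
    """Stable sort that moves published versions ahead of other versions."""
    return sorted(
        records,
        key=lambda r: TYPE_PRIORITY.get(r.get("type"), len(TYPE_PRIORITY)),
    )


def keep_first_per_title(records):
    """Toy 'first record wins' step that matches on the normalized title only."""
    seen, kept = set(), []
    for record in records:
        key = re.sub(r"[^a-z0-9]", "", record["title"].lower())
        if key in seen:
            continue  # an earlier (higher-priority) version was already kept
        seen.add(key)
        kept.append(record)
    return kept


records = [
    {"title": "Trade and Growth", "type": "working_paper"},
    {"title": "Trade and growth.", "type": "journal_article"},
    {"title": "Other Paper", "type": "working_paper"},
]
deduplicated = keep_first_per_title(sort_published_first(records))
```

Because the sort is stable, records of the same type keep their original relative order, so only the published-versus-other ordering changes before the first-record-wins step runs.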