Treat identical citations always as duplicates? #160

Open
LukasWallrich opened this issue Jun 15, 2023 · 3 comments

Comments

@LukasWallrich
Collaborator

Currently, CiteSource does not always treat identical citations as duplicates: if the records are not complete enough, ASySD does not reach sufficient confidence to match them. For instance, if we import the working example final.ris (242 results) twice, ASySD finds 272 unique citations before manual deduplication.

I would be minded to add a default in CiteSource that treats identical entries as duplicates, if this appears too risky for ASySD itself. As it stands, summaries across stages are predictably misleading until one completes the manual deduplication, which makes CiteSource less useful for quick exploration than it could be.

@kaitlynhair @TNRiley what are your thoughts?
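
As a rough illustration of the default proposed above, an exact-match pass could run before (or after) the confidence-based matching. This is a minimal sketch in Python, not CiteSource or ASySD code: the record layout (a dict of field names to strings), the `source` field name, and the helper names `normalise` and `collapse_identical` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def normalise(value: str) -> str:
    """Lower-case and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(value.lower().split())


def collapse_identical(records: List[Record], ignore: Tuple[str, ...] = ("source",)) -> List[Record]:
    """Keep the first occurrence of each record whose compared fields are all identical.

    Fields listed in `ignore` (here the hypothetical `source` label) are not compared.
    """
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted((k, normalise(v)) for k, v in rec.items() if k not in ignore))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# A sparse record (no DOI, abstract, or authors) imported twice.
sparse = {"title": "A sparse record", "year": "2020"}
records = [dict(sparse, source="import_1"), dict(sparse, source="import_2")]
print(len(collapse_identical(records)))  # 1
```

A real implementation would presumably merge the source labels of collapsed records rather than drop the later copy, since the per-source summaries depend on them.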

@TNRiley
Collaborator

TNRiley commented Jun 15, 2023

I'd like to take a look at that example of importing final.ris twice. I'm interested in what "enough" means exactly and what metadata those records are missing. We should give users instructions to ensure specific fields are complete, but I could also see this being an argument in the dedup.

@TNRiley
Collaborator

TNRiley commented Jun 21, 2023

I ran the same file of 242 final articles twice. There were 34 pairs that were not identified as duplicates; however, they did come up in the manual deduplication.

If you proceed without the manual deduplication, you get the following pop-up, which is confusing. I'm not sure where the 272 number is coming from. The 484 makes sense, as it is 242 x 2; the 34 also makes sense, as it is the number of pairs I mentioned above.

[Screenshot: captures_chrome-capture-2023-5-21]

The upset plot and the individual record table are both off too: each shows 30 citations that are unique to each source. I'm not sure why these show 30 unique citations instead of the 34 that were identified as potential duplicates.

[Screenshot: captures_chrome-capture-2023-5-21 (1)]

This is more of a metadata quality issue; however, I do agree that we should reach some consensus on exact matches. For example, if at least x fields are exact matches, the records are identified as duplicates. There may also be specific combinations we want to identify (e.g., if title and DOI are both an exact match).
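
To make the two rules above concrete, here is a minimal sketch in Python. It is not CiteSource or ASySD code: the field names, the default threshold of 4 (standing in for "x"), and the helper names `exact_matches` and `is_exact_duplicate` are assumptions for illustration, and the DOI shown is a made-up placeholder.

```python
from typing import Dict, Iterable

Record = Dict[str, str]


def _norm(value: str) -> str:
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(value.lower().split())


def exact_matches(a: Record, b: Record, fields: Iterable[str]) -> int:
    """Count fields that are non-empty in both records and identical after normalisation."""
    count = 0
    for f in fields:
        left, right = _norm(a.get(f, "")), _norm(b.get(f, ""))
        if left and left == right:
            count += 1
    return count


def is_exact_duplicate(
    a: Record,
    b: Record,
    fields: Iterable[str] = ("title", "doi", "year", "journal", "author"),
    min_matches: int = 4,  # the "x" from the comment above; value is an assumption
) -> bool:
    # Rule 1: a specific combination (title AND DOI both match exactly).
    if exact_matches(a, b, ("title", "doi")) == 2:
        return True
    # Rule 2: at least `min_matches` of the compared fields match exactly.
    return exact_matches(a, b, fields) >= min_matches


a = {"title": "Some Title", "doi": "10.1000/xyz"}
b = {"title": "some  title", "doi": "10.1000/XYZ", "year": "2021"}
print(is_exact_duplicate(a, b))  # True (title and DOI match)
```

Note that this sketch does not count a field that is empty in both records as a match. For the sparse records discussed in this issue, that choice matters: two copies of a record with only a title and a year would fall below the threshold, so whether "missing equals missing" should count is part of what needs agreeing on.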

@TNRiley
Collaborator

TNRiley commented Jun 21, 2023

I'm going to add a discussion thread about building a test .ris file. This file should include known duplicates and known false positives. We can then label these in CiteSource to test various deduplication changes.
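
Once such a labelled file exists, scoring a deduplication change against it could look roughly like the sketch below. It is written in Python purely for illustration and assumes (hypothetically) that both the known duplicates and the tool's output can be expressed as pairs of record IDs; the function name `evaluate` and the IDs are made up.

```python
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]


def evaluate(predicted_pairs: Iterable[Pair], true_pairs: Iterable[Pair]) -> Dict[str, int]:
    """Compare predicted duplicate pairs with the labelled ones, ignoring pair order."""
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}
    return {
        "true_positives": len(predicted & truth),
        "false_positives": len(predicted - truth),   # merged, but not real duplicates
        "false_negatives": len(truth - predicted),   # real duplicates that were missed
    }


truth = [("r1", "r2"), ("r3", "r4")]
predicted = [("r2", "r1"), ("r5", "r6")]
print(evaluate(predicted, truth))
# {'true_positives': 1, 'false_positives': 1, 'false_negatives': 1}
```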

TNRiley mentioned this issue Jun 23, 2023