Treat identical citations always as duplicates? #160

LukasWallrich · 2023-06-15T21:40:56Z

Currently, CiteSource does not always treat identical citations as duplicates: if the records are not complete enough, ASySD does not reach sufficient confidence to match them. For instance, if we import final.ris from the working example (242 results) twice, ASySD finds 272 unique citations before manual deduplication.

I would be inclined to add a default in CiteSource that treats identical entries as duplicates, if this appears too risky to do in ASySD itself. As it stands, summaries across stages are predictably misleading until one completes the manual deduplication, which makes CiteSource less useful for quick exploration than it could be.

@kaitlynhair @TNRiley what are your thoughts?

TNRiley · 2023-06-15T21:55:27Z

I'd like to take a look at that example of importing final.ris twice. I'm interested in what "enough" means exactly and which metadata those records are missing. We should give users instructions for making sure specific fields are complete, but I could also see this as an argument in the dedup step.

TNRiley · 2023-06-21T12:59:56Z

I ran the same file of 242 final articles twice and found 34 pairs that were not identified as duplicates automatically; they did, however, come up in the manual deduplication.

If you proceed without the manual deduplication, you get the following pop-up, which is confusing. I'm not sure where the 272 number is coming from. The 484 makes sense, as that is 2 × 242, and the 34 also makes sense, as it is the number of pairs I mentioned above.

The upset plot and the individual record table are both off too: each shows 30 unique citations in each source. I'm not sure why they show 30 unique rather than the 34 that were identified as potential duplicates.
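One reading that is consistent with the numbers above, offered as a hypothesis rather than a confirmed explanation: if each automatically merged pair removes one record from the total, then 272 unique citations out of 484 implies 212 merged pairs, which leaves 30 unmatched records in each copy of the file. The short Python check below only does that arithmetic. The variable names are illustrative, and it does not explain why the manual review lists 34 pairs rather than 30.

```python
# Numbers reported in this thread.
records_per_import = 242
total_records = 2 * records_per_import      # 484, as shown in the pop-up
unique_reported = 272                       # unique citations before manual dedup
unmatched_per_source = 30                   # "unique" per source in the upset plot

# Assumption: every automatically merged pair collapses two records into one,
# so each merge lowers the unique count by exactly one.
merged_pairs = total_records - unique_reported

# Under that assumption, unique = merged pairs + unmatched records in both copies.
implied_unique = merged_pairs + 2 * unmatched_per_source
implied_unmatched = records_per_import - merged_pairs

print(merged_pairs, implied_unique, implied_unmatched)
```

If this reading holds, the 272 and the 30 come from the same automatic matching step, and only the 34 in the manual review is counted differently.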

This is more of a metadata quality issue; however, I do agree that we should reach some consensus on exact matches. Something like: if at least x fields are exact matches, the records are identified as duplicates. There may also be specific combinations we want to identify (e.g. title and DOI are both an exact match).
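The rule proposed above could be sketched roughly as follows. This is a minimal Python illustration and not CiteSource or ASySD code: the field names, the default threshold of 4, and the decision to ignore case and surrounding whitespace are all assumptions made for the example.

```python
# Hypothetical set of fields to compare; a real citation table may differ.
COMPARE_FIELDS = ("title", "doi", "author", "year", "journal", "abstract")


def _norm(value):
    """Return a trimmed, case-folded string, or None if the value is missing or empty."""
    if value is None:
        return None
    text = str(value).strip().casefold()
    return text or None


def _same(a, b, field):
    """True only if both records have the field and the normalised values are equal."""
    va = _norm(a.get(field))
    return va is not None and va == _norm(b.get(field))


def exact_match_count(a, b, fields=COMPARE_FIELDS):
    """Count the fields that are present in both records and match exactly."""
    return sum(_same(a, b, f) for f in fields)


def is_exact_duplicate(a, b, min_fields=4, key_combo=("title", "doi")):
    """Duplicate if the key combination matches, or at least `min_fields` fields match.

    `min_fields` plays the role of "x" in the comment above; 4 is an arbitrary default.
    """
    if all(_same(a, b, f) for f in key_combo):
        return True
    return exact_match_count(a, b) >= min_fields


full = {"title": "A study", "doi": "10.1/x", "year": "2020"}
sparse = {"title": "A study", "year": "2020"}
print(is_exact_duplicate(full, dict(full)))      # title and DOI both match
print(is_exact_duplicate(sparse, dict(sparse)))  # only two fields available
```

Note that missing fields never count as matches here, so two identical but sparse records can still fall below the threshold. Whether such records should be merged anyway is the open question in this issue.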

TNRiley · 2023-06-21T13:01:46Z

I'm going to add a discussion thread about building a test .ris file. The file should include known duplicates and known false positives, which we can easily label in CiteSource to test various deduplication changes.
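Once such a labelled file exists, a deduplication change could be scored against it along these lines. This is a sketch in Python under stated assumptions: the record IDs, the pair-based labelling, and the function names are hypothetical and not part of CiteSource.

```python
def canonical(pair):
    """Order a pair of record IDs so that (a, b) and (b, a) compare equal."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def score_dedup(known_duplicates, known_non_duplicates, predicted):
    """Compare predicted duplicate pairs against the labelled test set.

    known_duplicates: pairs that must be merged.
    known_non_duplicates: look-alike pairs that must NOT be merged.
    predicted: pairs the deduplication step actually merged.
    """
    truth = {canonical(p) for p in known_duplicates}
    traps = {canonical(p) for p in known_non_duplicates}
    found = {canonical(p) for p in predicted}
    return {
        "correct": sorted(found & truth),
        "missed": sorted(truth - found),
        "false_positives": sorted(found & traps),
    }


result = score_dedup(
    known_duplicates=[("r1", "r2"), ("r3", "r4")],
    known_non_duplicates=[("r5", "r6")],
    predicted=[("r2", "r1"), ("r6", "r5")],
)
print(result)
```

Reporting missed duplicates and false positives separately makes it easier to see whether a change such as an exact-match default trades one kind of error for the other.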

LukasWallrich mentioned this issue Jun 15, 2023

shiny: metadata editing fails when files have the same name #158

Closed

TNRiley mentioned this issue Jun 23, 2023

Release CiteSource 0.1.0 #117

Open

27 tasks
