Different results from unique() and difference of deduplicated set #68

Open
larsgrobe opened this issue Dec 8, 2023 · 2 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@larsgrobe
Contributor

larsgrobe commented Dec 8, 2023

Dear all,
I have a document set that contains a duplicate according to unique():
len(docset) -> 1014
len(docset.unique()) -> 1013
However, len(docset-docset.unique()) -> 0
I found this when I wanted to output the title of the duplicate that is supposedly eliminated by unique(). However, I do not get any title, since the difference contains zero documents.
Best, Lars.

larsgrobe changed the title from "Dj" to "Different results from unique() and difference of deduplicated set" on Dec 8, 2023
@stijnh
Member

stijnh commented Dec 12, 2023

Hi! Thanks for the bug report.

I've given this some careful thought, and although this behavior might seem counterintuitive, it is indeed correct.

The - operator relies on "fuzzy" matching to determine which documents from the left-hand set should be excluded, based on the right-hand set. In the case you described, where there are two identical documents, both copies match the single copy that docset.unique() retains (likely due to a matching DOI). Every document on the left-hand side therefore has a match on the right-hand side, and docset-docset.unique() results in an empty set.
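To illustrate the point, here is a minimal sketch of how fuzzy matching can produce these numbers. It is not the library's implementation: `Doc`, `same_document`, `unique`, and `difference` are hypothetical stand-ins, and the matching rule (equal DOIs, otherwise equal normalized titles) is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Doc:
    # Hypothetical stand-in for a document record, not the library's class.
    title: str
    doi: Optional[str] = None


def same_document(a: Doc, b: Doc) -> bool:
    """Assumed fuzzy rule: equal DOIs, otherwise equal normalized titles."""
    if a.doi and b.doi:
        return a.doi.lower() == b.doi.lower()
    return a.title.strip().lower() == b.title.strip().lower()


def unique(docs):
    """Keep the first document of every group of fuzzy matches."""
    kept = []
    for doc in docs:
        if not any(same_document(doc, k) for k in kept):
            kept.append(doc)
    return kept


def difference(left, right):
    """Keep the documents from `left` that match nothing in `right`."""
    return [d for d in left if not any(same_document(d, r) for r in right)]


docset = [
    Doc("Paper A", doi="10.1000/a"),
    Doc("Paper B", doi="10.1000/b"),
    Doc("Paper A (preprint)", doi="10.1000/A"),  # same DOI as the first entry
]

deduplicated = unique(docset)
removed = difference(docset, deduplicated)

print(len(docset), len(deduplicated), len(removed))  # 3 2 0
```

The dropped copy still matches the copy that was kept, so the difference excludes it along with everything else.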

Nonetheless, I can see how it is odd that there is no way to retrieve which documents were removed by unique.

Would it work for you if we were to add a duplicates() method? This method would specifically return the duplicate documents, ensuring that len(docset) = len(docset.unique()) + len(docset.duplicates()).
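As a rough sketch of what such a method could do, the snippet below splits a list into kept and dropped documents in a single pass. Everything in it is hypothetical: `Doc`, `same_document`, and `split_duplicates` are illustration names, and the matching rule (equal DOIs, otherwise equal normalized titles) is an assumption rather than the library's actual logic.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Doc:
    # Hypothetical stand-in for a document record, not the library's class.
    title: str
    doi: Optional[str] = None


def same_document(a: Doc, b: Doc) -> bool:
    """Assumed fuzzy rule: equal DOIs, otherwise equal normalized titles."""
    if a.doi and b.doi:
        return a.doi.lower() == b.doi.lower()
    return a.title.strip().lower() == b.title.strip().lower()


def split_duplicates(docs: List[Doc]) -> Tuple[List[Doc], List[Doc]]:
    """Return (kept, dropped): first occurrences and their later matches."""
    kept: List[Doc] = []
    dropped: List[Doc] = []
    for doc in docs:
        if any(same_document(doc, k) for k in kept):
            dropped.append(doc)  # matches an earlier document
        else:
            kept.append(doc)     # first document of its group
    return kept, dropped


docset = [
    Doc("Paper A", doi="10.1000/a"),
    Doc("Paper B", doi="10.1000/b"),
    Doc("Paper A (preprint)", doi="10.1000/A"),  # same DOI as the first entry
]

kept, dropped = split_duplicates(docset)

# Each document lands in exactly one list, so the sizes always add up.
assert len(docset) == len(kept) + len(dropped)
print([d.title for d in dropped])  # ['Paper A (preprint)']
```

Because the split is made by position in the input rather than by matching one set against another, the size invariant holds by construction, and the titles of the removed documents can be printed directly.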

stijnh added the enhancement and help wanted labels on Dec 12, 2023
@larsgrobe
Contributor Author

Hi, yes, that would help. That was exactly the idea: I just wanted to see what had been identified as a duplicate. Best, Lars.
