Merge works with same title and spelling differences in author name #1928

Closed
nemobis opened this issue Feb 25, 2019 · 5 comments
Labels
Affects: Data - Issues that affect book/author metadata or user/account data. [managed]
Module: Merging - Record merging
Needs: Triage - This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed]
Priority: 3 - Issues that we can consider at our leisure. [managed]
Type: Bug - Something isn't working. [managed]

@nemobis

nemobis commented Feb 25, 2019

Description

Some editions are not merged into the work they belong to (and unnecessary new work pages are created) because of minor differences in the spelling of the author name.

Evidence

Lacapra vs. LaCapra kept these two separate:
https://openlibrary.org/works/OL8382164W
https://openlibrary.org/works/OL2731955W

Expectation

I think an automatic merge is in order for such minor mistakes/differences in spelling of either title or author name.

Proposal & Constraints

A case-insensitive comparison would fix this specific case, I believe; computing a Levenshtein distance may be trickier, or should be very restrictive (max 1 character difference?) given middle names, cf. #77 (comment)

Doing merges manually is very tedious, if at all feasible; cf. #684 #805
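
The proposal above (a case-insensitive comparison, optionally relaxed to a very small Levenshtein distance) could be sketched roughly as follows. This is only an illustration, not Open Library's matching code; `edit_distance` and `names_probably_match` are hypothetical names introduced here.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def names_probably_match(a: str, b: str, max_distance: int = 0) -> bool:
    """Hypothetical helper: compare two author names ignoring case.

    max_distance=0 is a pure case-insensitive comparison;
    max_distance=1 also tolerates a single-character difference.
    """
    return edit_distance(a.casefold(), b.casefold()) <= max_distance
```

With the default `max_distance=0`, `names_probably_match("Lacapra", "LaCapra")` is true. Raising the limit to 1 already shows the middle-name risk: `names_probably_match("John A. Smith", "John B. Smith", max_distance=1)` is also true, although those may well be different people.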

@LeadSongDog

The issue isn't just capitalization. It is also a matter of accents, whitespace, translations, transliterations, and codespace normalizations. We simply must move away from using spelling as the identifier for an authority. There's a sound reason for using VIAF, ISNI, or Wikidata identifiers: simple spelling cannot reliably distinguish author identities.
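
To make the normalization part of this concrete, here is a minimal sketch of a comparison key that folds case, accents, whitespace, and Unicode compatibility forms, using only Python's standard `unicodedata` module. `name_key` is a hypothetical helper, and the key is meant for matching only, never for display.

```python
import unicodedata


def name_key(name: str) -> str:
    """Hypothetical comparison key for an author name (for matching, not display)."""
    # NFKD splits accented letters into base letter + combining mark and
    # folds compatibility forms such as full-width letters.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks so that "é" compares equal to "e".
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower(); split()/join() collapses
    # runs of whitespace and trims the ends.
    return " ".join(stripped.casefold().split())
```

A key like this makes "LaCapra" and "Lacapra" compare equal, but it does nothing for translations or transliterations: `name_key("Толстой")` and `name_key("Tolstoy")` still differ, which is the case for identifiers rather than spelling.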

@nemobis
Author

nemobis commented Feb 28, 2019 via email

@LeadSongDog

Even identical spelling of author and title does not reliably indicate that the works are the same. We have many problem titles that are very common, such as "Journal" or "Works". We also have some very common (often incomplete) author names such as "Smith" or "Brown". Unless a human user makes the comparison between two author records, we won't be able to trust that they refer to the same identity.
I agree that ISNI or Wikidata would be more reliable than VIAF, but any of them would be better than the simple text comparison we have now. This is not a new issue; see #853 for instance, or even earlier.
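
As a rough sketch of what identifier-based comparison could look like, the function below decides identity only from shared identifiers and reports "unknown" when none are available, leaving those cases to a human. The record layout and the key names `wikidata`, `isni`, and `viaf` are assumptions made for illustration, not the actual author schema.

```python
from typing import Optional

# Hypothetical identifier keys; the real field names are not given in this thread.
ID_KEYS = ("wikidata", "isni", "viaf")


def same_author(a: dict, b: dict) -> Optional[bool]:
    """Return True/False when shared identifiers decide it, None when they cannot."""
    verdict = None
    for key in ID_KEYS:
        id_a, id_b = a.get(key), b.get(key)
        if id_a and id_b:
            if id_a != id_b:
                # Any conflicting identifier is treated conservatively
                # as "different people".
                return False
            verdict = True
    # None means no identifier was present on both records: the names
    # alone ("Smith" vs. "Smith") are deliberately not consulted.
    return verdict
```

Two records named "Smith" with different identifiers come out as different, and two records with no identifier in common come out as `None` rather than as a match.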

@hornc added the Module: Merging label Apr 17, 2019
@brad2014 added the Affects: Data, Type: Bug, and Needs: Triage labels Jun 11, 2019
@brad2014
Collaborator

I'll lean on @hornc's assessment to decide whether to subsume this under #853 (this also relates to work @cdrini is doing on Solr), or whether there is bandwidth to do a stopgap solution for this specific case.

@xayhewalo added this to Un-Triaged in Triage Oct 18, 2019
@xayhewalo added the Priority: 3 and State: Backlogged labels Nov 14, 2019
@xayhewalo moved this from Un-Triaged to Needs: Assessment in Triage Nov 14, 2019
@mekarpeles
Member

We have ~10 issues all surrounding merging (works, editions, authors). I think this is somewhat blocked on our merging infrastructure (e.g. #2553). Let's track this as related to #2114 and close this issue.

There is no clear beginning and end to this issue -- it is a proposal that we merge works w/ similar title and author name. We can also use isbn, ocaid, lccn, year, and several other fields to do this at scale.
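
For illustration only, matching on those extra fields could start from something like the sketch below, which groups records that share at least one identifier value. The field names `isbn`, `ocaid`, and `lccn` come from the comment above; the record layout (a `key` plus lists of strings per field) and the `merge_candidates` helper are assumptions, and year is left out because it is not an identifier on its own.

```python
from collections import defaultdict

# Field names from the comment above; treating each as a list of strings
# is an assumption about the record layout.
ID_FIELDS = ("isbn", "ocaid", "lccn")


def merge_candidates(records: list[dict]) -> list[set[str]]:
    """Group record keys that share at least one identifier value."""
    by_identifier = defaultdict(set)
    for record in records:
        for field in ID_FIELDS:
            for value in record.get(field, []):
                by_identifier[(field, value)].add(record["key"])
    # Only identifiers seen on more than one record suggest a possible merge;
    # frozenset de-duplicates groups found via several identifiers.
    groups = {frozenset(keys) for keys in by_identifier.values() if len(keys) > 1}
    return sorted((set(group) for group in groups), key=sorted)
```

The result is a list of candidate groups for review or for a later automated step, not a merge. The sketch does not chain groups together (A shares an ISBN with B, B shares an LCCN with C), which a real pipeline would likely need to handle.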

Closing for now.

Triage automation moved this from Needs: Assessment to Closed Dec 13, 2019

6 participants