Normalize Unicode #149
Comments
Related issue, possibly due to same lack of Unicode normalisation: #135
Coming up on the 8 year anniversary on Friday - the Launchpad bug report is from April 28, 2009 (although only 5 yr 10 mo in this tracker). While search normalization is arguably more important, having the actual title normalized is useful as well.
I'm guessing perhaps you meant #178?
Sorry, yes. Also seen as a problem in author names, as OL4967990A vs OL1505796A vs OL5769314A all look the same and refer to same person but are encoded differently. This bug needs squashing, it is a colossal time waster.
Per @tfmorris Apr 27 comment, it would also be useful to consistently downcase allcaps titles, authors, and publishers to titlecase, sentencecase, or even lowercase for the purpose of indexing, deduplicating, and merging.
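To make the deduplication idea concrete, here is a minimal sketch of a comparison key that combines Unicode normalization with case folding. The helper name `match_key` is hypothetical and is not part of the Open Library codebase or its Solr schema; it only illustrates how allcaps and differently encoded variants of a title or author name could be mapped to one key before merging.

```python
import unicodedata


def match_key(text: str) -> str:
    """Build a comparison key for deduplication (hypothetical helper).

    Steps: normalize to NFC, case-fold, normalize again (case folding can
    occasionally produce non-normalized output), then collapse whitespace.
    """
    folded = unicodedata.normalize("NFC", text).casefold()
    folded = unicodedata.normalize("NFC", folded)
    return " ".join(folded.split())


# The two Exocet titles from this thread differ only in case.
print(match_key("EXOCET") == match_key("Exocet"))

# A decomposed (NFD-style) name and a precomposed allcaps one also match.
print(match_key("Jose\u0301") == match_key("JOS\u00c9"))
```

The key is only meant for matching; the stored, displayed title would keep its original casing.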
@LeadSongDog They are currently lowercased for indexing; i.e. searching for openlibrary/conf/solr/conf/schema.xml Lines 220 to 251 in 6f25e65
Well, something's amiss. https://openlibrary.org/search?q=Jack+Higgins+Exocet&mode=everything finds some but misses more. Most specifically it misses https://openlibrary.org/works/OL157850W/Exocet
Oh, and it misses https://openlibrary.org/works/OL9313162W/EXOCET
@LeadSongDog Those are the first two results for me:
Weird. I see it now too. Earlier I saw only 4 works listed vice 7 now. Of course they should just be one.
@mekarpeles I disagree that this is unactionable. "Normalize Unicode" is a request for action (though it could probably be clearer). Please reopen this issue, or create more specific issues and link them before closing this issue. Compare the issue to a water leakage: you must fix the pipes to stop the leak, but if you don't address the water that already came through, you're going to get mould (or worse) in your house. Mould is one reason that I'm not so involved with OpenLibrary anymore.
@bencomp the only actionable statement I see is "Can a bot please normalize everything according to NFC?" This is in fact the opposite of what @anandology suggested in the Launchpad issue, which was:
To which @EdwardBetts agreed:
TL;DR: I'm not singling out this issue.
In response:
(b) We just did this [prioritization] yesterday for #845. We had 7 people on the call. You're welcome to join and partake in prioritization! That's where this issue was raised by @tfmorris (nominated also by @LeadSongDog) and we brought it up for discussion. I think several people from the community -- including @LeadSongDog, @tfmorris, and @hornc -- agree with the sentiment of this issue: that Unicode normalization, writ large, could help us improve search for titles and authors and improve our user experience. It's also worth conceding, though (re: prioritization), that @JeffKaplan receives OpenLibrary email every day and our 225k daily uniques are not complaining predominantly about how Unicode is being rendered on book + author pages.
(c) Reading this issue was not enough to help me understand:
My main contention with this issue is that I feel it represents an ideology (e.g. there should be no Unicode inconsistencies or problems anywhere in Open Library -- i.e. mold). I'm being hyperbolic when I say this, but my point is that the issue as it stands requires further discussion in order to scope, plan, and turn into something actionable. It feels as if we may be making the statement, "Issue #149 is not complete until we can no longer find cases of NFC inconsistencies happening". But to me, this begs the question we should be outlining in the issue.
(d) This [a deficit of technical volunteers] was true for several years; we're fortunate that the community is really pushing forward right now to make OL great -- we'd also love to dust off the mold (in all the many places it still exists) and make the service better for users. A part of this demolding process, to me, is going through our issues (which have also collected some mold) and making sure we're focusing on the right issues, clearly describing our mold problems, and outlining solutions.
Path Forward
By closing this issue, I'm asking for those who care about this issue to step up and give it a chance to succeed. I'm asserting that this issue -- as it stands -- must fight against the current to find a developer who's willing to champion it. And the reason this issue is getting attention and being discussed is that members of the community (like you @bencomp, @LeadSongDog, @tfmorris, @hornc, et al.) are asking to see forward progress on issues like this, and I honestly care about giving them the greatest chance to succeed and be implemented. I am asking that, if there are old issues which are not receiving attention, we figure out what's preventing them from moving forward: update titles, add scope, discuss files + code, add success criteria.
Otherwise, I'd rather close the issue and wait for the community to edit it (so we can re-open) or have the community re-create it at a later point because it becomes newly important again.
From Launchpad - https://bugs.launchpad.net/openlibrary/+bug/598204
Many names and titles on openlibrary.org are not normalized according to Normalization Form C. It causes Firefox on Windows to misplace some diacritical marks. But I guess having NFC and NFD in the same data is not pretty.
Edward Betts provided this in the Launchpad bug report, but apparently there are still 100,000s of items with NFD.
Can a bot please normalize everything according to NFC?
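As a rough illustration of what such a bot would do per record, here is a minimal sketch using Python's standard `unicodedata` module. The function names `to_nfc` and `needs_nfc` are hypothetical and not taken from the Open Library codebase; fetching and saving records is deliberately left out.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the text in Normalization Form C (composed characters)."""
    return unicodedata.normalize("NFC", text)


def needs_nfc(text: str) -> bool:
    """True if the stored text differs from its NFC form (hypothetical check)."""
    return text != to_nfc(text)


# "é" stored as "e" + U+0301 COMBINING ACUTE ACCENT (decomposed, as in NFD)
decomposed = "Jose\u0301"
# "é" stored as the single code point U+00E9 (composed, as in NFC)
composed = "Jos\u00e9"

print(decomposed == composed)          # the raw strings differ
print(to_nfc(decomposed) == composed)  # they agree after normalization
print(needs_nfc(decomposed), needs_nfc(composed))
```

A bot could use a check like `needs_nfc` to skip records that are already in NFC, so only affected names and titles get rewritten.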