Normalize Unicode #149

bencomp · 2012-06-21T20:53:04Z

From Launchpad - https://bugs.launchpad.net/openlibrary/+bug/598204

Many names and titles on openlibrary.org are not normalized according to Normalization Form C. It causes Firefox on Windows to misplace some diacritical marks. But I guess having NFC and NFD in the same data is not pretty.

Edward Betts provided this in the Launchpad bug report, but apparently there are still 100,000s of items with NFD.

from unicodedata import normalize

def norm(s):
    return normalize('NFC', s)
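
To illustrate what the function above changes, here is a minimal sketch. The sample strings are made up for the demonstration and are not taken from Open Library data.

```python
from unicodedata import normalize

def norm(s):
    """Return s in Normalization Form C (composed)."""
    return normalize('NFC', s)

composed = "\u00e9"      # 'é' stored as one code point (NFC)
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT (NFD)

# The two strings render alike but are different code point sequences.
same_before = composed == decomposed

# After normalization they compare equal.
same_after = norm(composed) == norm(decomposed)

print(same_before, same_after)
```

The comparison fails before normalization and succeeds afterwards, which is why mixed NFC and NFD data can behave inconsistently in display and matching.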

Can a bot please normalize everything according to NFC?


bencomp · 2013-02-24T00:14:14Z

Related issue, possibly due to same lack of Unicode normalisation: #135

tfmorris · 2017-04-27T04:04:14Z

Coming up on the 8 year anniversary on Friday - the Launchpad bug report is from April 28, 2009 (although only 5 yr 10 mo in this tracker).

While search normalization is arguably more important, having the actual title normalized is useful as well.

LeadSongDog · 2017-06-01T15:15:13Z

Related to ~~#128~~#178 which is tagged priority. Any progress?

tfmorris · 2017-06-01T16:51:59Z

I'm guessing perhaps you meant #178?

LeadSongDog · 2017-06-02T20:39:00Z

Sorry, yes. Also seen as a problem in author names, as OL4967990A vs OL1505796A vs OL5769314A all look the same and refer to same person but are encoded differently. This bug needs squashing, it is a colossal time waster.
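
A rough sketch of how such look-alike records could be detected: group names by their NFC form and report any group with more than one key. The keys and names below are hypothetical placeholders, not the actual records mentioned above, and `group_by_nfc` is an illustrative helper rather than existing Open Library code.

```python
from collections import defaultdict
from unicodedata import normalize

# Hypothetical (key, name) pairs; not real Open Library data.
authors = [
    ("A1", "Jos\u00e9 Mart\u00ed"),     # precomposed characters (NFC)
    ("A2", "Jose\u0301 Marti\u0301"),   # base letters plus combining accents (NFD)
    ("A3", "Jane Doe"),
]

def group_by_nfc(records):
    """Map each NFC-normalized name to the keys sharing it, keeping only duplicates."""
    groups = defaultdict(list)
    for key, name in records:
        groups[normalize("NFC", name)].append(key)
    return {name: keys for name, keys in groups.items() if len(keys) > 1}

duplicates = group_by_nfc(authors)
print(duplicates)
```

This only finds records that differ in normalization form; names that differ in spelling, case, or punctuation would need additional matching rules.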

LeadSongDog · 2017-11-01T22:22:12Z

Per @tfmorris Apr 27 comment, it would also be useful to consistently downcase allcaps titles, authors, and publishers to titlecase, sentencecase, or even lowercase for the purpose of indexing, deduplicating, and merging.
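
As a sketch of what such a comparison key might look like, assuming it is used only for matching and never for display: `match_key` is a hypothetical helper that case-folds, NFC-normalizes, and collapses whitespace.

```python
from unicodedata import normalize

def match_key(text):
    """Hypothetical comparison key: case-fold, NFC-normalize, collapse whitespace."""
    folded = normalize("NFC", text.casefold())
    return " ".join(folded.split())

# All-caps and mixed-case titles map to the same key.
key_upper = match_key("EXOCET")
key_mixed = match_key("  Exocet ")

# Composed and decomposed forms of the same accented title also match.
key_nfc = match_key("CAF\u00c9")
key_nfd = match_key("Cafe\u0301")

print(key_upper == key_mixed, key_nfc == key_nfd)
```

The stored title is left untouched here; converting all-caps text to title case or sentence case for display is a separate and harder problem.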

cdrini · 2017-11-01T23:51:11Z

@LeadSongDog They are currently lowercased for indexing; i.e. searching for publisher:PENGUIN is converted to publisher:penguin, both before indexing and before querying. The publisher field (for example) is defined here in solr to have a type of text. Here's the definition of type="text":

openlibrary/conf/solr/conf/schema.xml

Lines 220 to 251 in 6f25e65

    
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
          add enablePositionIncrements=true in both the index and query
          analyzers to leave a 'gap' for more accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
    </fieldType>

LeadSongDog · 2017-11-02T00:35:34Z

Well, something's amiss. https://openlibrary.org/search?q=Jack+Higgins+Exocet&mode=everything finds some but misses more. Most specifically it misses https://openlibrary.org/works/OL157850W/Exocet

LeadSongDog · 2017-11-02T00:37:57Z

Oh, and it misses https://openlibrary.org/works/OL9313162W/EXOCET

cdrini · 2017-11-02T00:39:07Z

@LeadSongDog Those are the first two results for me:

LeadSongDog · 2017-11-02T01:00:38Z

Weird. I see it now too. Earlier I saw only 4 works listed vice 7 now. Of course they should just be one.

mekarpeles · 2018-03-13T21:30:29Z

I agree that this is an important issue, and I think @tfmorris, @hornc and @cdrini all agree. However, I am going to close this specific issue on account of it being unactionable. Let's see how we can phrase this in specific ways which will treat the source of the issue.

bencomp · 2018-03-14T07:35:58Z

@mekarpeles I disagree that this is unactionable. "Normalize Unicode" is a request for action (though it could probably be clearer). Please reopen this issue, or create more specific issues and link them before closing this issue.

Compare the issue to a water leakage: you must fix the pipes to stop the leak, but if you don't address the water that already came through, you're going to get mould (or worse) in your house. Mould is one reason that I'm not so involved with OpenLibrary anymore.

mekarpeles · 2018-03-14T19:53:46Z

@bencomp the only actionable statement I see is "Can a bot please normalize everything according to NFC?"

This in fact is opposite of what @anandology suggested in the launchpad issue, which was:

I don't think we should fix it in the database. what if such strings
come later? I think, the best approach is to write code to handle all cases.

How about having a public function to normalize strings and call it
before displaying strings in the browser?

To which @EdwardBetts agreed:

Good idea. Here is the code:

from unicodedata import normalize

def norm(s):
    return normalize('NFC', s)
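
For context, a minimal sketch of the display-time approach described in those quotes. `for_display` is a hypothetical helper name, not an existing Open Library function, and the stored value is invented for the example.

```python
from unicodedata import normalize

def norm(s):
    return normalize('NFC', s)

def for_display(value):
    """Hypothetical template helper: NFC-normalize text, pass other values through."""
    return norm(value) if isinstance(value, str) else value

# A decomposed string as it might sit in the database: 'e' + COMBINING DIAERESIS.
stored = "Zoe\u0308"
shown = for_display(stored)

# Normalization is idempotent, so already-NFC data is unaffected.
shown_again = for_display(shown)

print(len(stored), len(shown), shown == shown_again)
```

With this approach the database keeps whatever form was imported, so every code path that displays or compares strings has to remember to call the helper.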

TL;DR
If this is something the community wants, we are going to have to be more explicit with our scope. I personally am not going to re-open the issue or rephrase it as there are 270 other issues I'm struggling to help triage. I do agree with the sentiments, as do I support the community interest in making our titles and author names display correctly (though honestly this specific case I don't think pays dividends / highly impacts our bottom lines), and making it possible to search for authors and titles which contain special characters / diacritics.

I'm not singling out this issue
I believe any issue which is left open for 6 years merits some degree of challenge / investigation. To me it may suggest something isn't right. Either (a) we have a lot of / too many open issues and not enough engineers, (b) we are not prioritizing correctly, (c) the issue is not clearly actionable or is too abstract in scope -- e.g. possible disconnect between skills entailed in library science v. the engineering work, or (d) there aren't engineers willing to work on the issue or there is a priority mismatch or miscommunication between submitter

In response
re: (a) this is kind of life. We can do a better job triaging -- we meet every Tuesday @ 11:30am PT to triage, go over code reviews, and set priorities.

(b) We just did this [prioritization] yesterday for #845. We had 7 people on the call. You're welcome to join and partake in prioritization! That's where this issue was raised by @tfmorris (nominated also by @LeadSongDog) and we brought it up for discussion. I think several people from the community agree -- including @LeadSongDog, @tfmorris, and @hornc -- with the sentiment of this issue: that Unicode normalization, writ large, could help us improve search for titles and authors and improve our user experience. It's also worth conceding though (re: prioritization) that @JeffKaplan receives OpenLibrary email every day and our 225k daily uniques are not complaining predominantly about how Unicode is being rendered on book + author pages.

(c) Reading this issue was not enough to help me understand:

a scoping of the problem, more specific than "everything". e.g. "all templates, starting with books, authors, search"
the strategy / a clearly proposed implementation. From this issue especially, it has become unclear whether the goal is to fix template rendering to display Unicode, update our database to store only NFC, or fix solr-updater to read as NFC. It seems as if @anandology was advocating for the first of these three, and if so, this strikes me as being a different conversation and issue than the one currently open. Let's create a new issue for that.
the significance of the problem (to help us prioritize) -- e.g. are libraries specifically not adding their book records to OL's catalog because we display unicode wrong? Which library / potential partner?
which file(s) one might look into
what it looks like when the problem is solved:

My main contention with this issue is, I feel it represents an ideology (e.g. there should be no unicode inconsistencies or problems anywhere in openlibrary -- i.e. mold). I'm being hyperbolic when I say this, but my point is, the issue as is requires further discussion in order to scope, plan, and turn into something actionable. It feels as if we may be making the statement, "Issue #149 is not complete until we can no longer find cases of NFC inconsistencies happening". But to me, this begs the question we should be outlining in the issue.
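
One way to make such a completion criterion measurable is a check that reports which string fields are not already in NFC. This is a sketch only: the records are hypothetical plain dicts, and `non_nfc_fields` is an illustrative helper, not part of the Open Library codebase.

```python
from unicodedata import normalize

def non_nfc_fields(record):
    """Return the names of string fields whose value is not already in NFC."""
    return [
        field
        for field, value in record.items()
        if isinstance(value, str) and normalize("NFC", value) != value
    ]

# Hypothetical records, not real Open Library data.
records = [
    {"key": "W1", "title": "Caf\u00e9"},    # already NFC
    {"key": "W2", "title": "Cafe\u0301"},   # NFD: 'e' + combining acute accent
]

# Collect only the records that have at least one non-NFC field.
offenders = {}
for record in records:
    fields = non_nfc_fields(record)
    if fields:
        offenders[record["key"]] = fields

print(offenders)
```

Running a check like this over a data dump would give a count to track over time, which could serve as both a scope estimate and a success criterion.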

(d) This [a deficit of technical volunteers] was true for several years, we're fortunate that the community is really pushing forward right now to make OL great -- we'd also love to dust off the mold (in all the many places it still exists) and make the service better for users. A part of this demolding process, to me, is going through our issues (which have also collected some mold) and making sure we're focusing on the right issues, clearly describing our mold problems, and outlining solutions.

Path Forward
I'm sure everyone would have preferred if I had just opened a new issue for this and/or used this time responding to the Unicode issue instead of ranting about why I don't plan on re-opening this issue. I'm not going to re-open it because I don't think this issue is sufficient as it stands and I don't know enough about the scope / severity of the problem to open a new issue and propose a solution.

By closing this issue, I'm asking for those who care about this issue to step up and give it a chance to succeed. I'm asserting that this issue -- as it stands -- must fight against the current to find a developer who's willing to champion it. And the reason this issue is getting attention / is being discussed is because members of the community (like you @bencomp, @LeadSongDog, @tfmorris, @hornc, et al) are asking to see forward progress on issues like this and I honestly care about giving them the greatest chance to succeed and be implemented.

I am asking that if there are old issues which are not receiving attention, let's figure out what's preventing them from moving forward, update titles, add scope, discuss files + code, add success criteria. Otherwise, I'd rather close the issue and wait for the community to edit it (so we can re-open) or have the community re-create it at a later point because it becomes newly important again.

bfalling mentioned this issue Sep 22, 2016

Remediation: Unicode Book titles mangled during Import #135

Open

hornc added the unicode label May 8, 2017

hornc added the hackathon label Nov 1, 2017

LeadSongDog mentioned this issue Mar 12, 2018

Roadmapping 2018 Q2 #845

Closed

mekarpeles added this to @LeadSongDog in 2018 Q2 Mar 13, 2018

mekarpeles closed this as completed Mar 13, 2018

mekarpeles added the unactionable label Mar 13, 2018

mekarpeles removed this from To Do in On-Going Tasks Mar 13, 2018

mekarpeles removed this from @LeadSongDog in 2018 Q2 Mar 13, 2018

LeadSongDog mentioned this issue Nov 9, 2020

Fixing unicode urls in python3 #4049

Merged

tfmorris mentioned this issue Jan 22, 2022

Search for exact title with different encoding fails #6059

Closed
