Normalize Unicode #149

Closed
bencomp opened this issue Jun 21, 2012 · 14 comments

Comments

@bencomp
Contributor

bencomp commented Jun 21, 2012

From Launchpad - https://bugs.launchpad.net/openlibrary/+bug/598204

Many names and titles on openlibrary.org are not normalized according to Normalization Form C. It causes Firefox on Windows to misplace some diacritical marks. But I guess having NFC and NFD in the same data is not pretty.

Edward Betts provided this in the Launchpad bug report, but apparently there are still 100,000s of items with NFD.

from unicodedata import normalize

def norm(s):
    return normalize('NFC', s)

Can a bot please normalize everything according to NFC?
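To make the NFC/NFD difference concrete, here is a minimal sketch built on the `norm` function above. The sample name is made up for illustration and is not taken from Open Library data.

```python
from unicodedata import normalize

def norm(s):
    return normalize('NFC', s)

# Hypothetical sample: the same name in composed (NFC) and decomposed (NFD) form.
composed = 'Zo\u00eb'      # 'ë' as a single code point, U+00EB
decomposed = 'Zoe\u0308'   # 'e' followed by a combining diaeresis, U+0308

print(composed == decomposed)              # False
print(len(composed), len(decomposed))      # 3 4
print(norm(composed) == norm(decomposed))  # True
```

The two strings render identically in most fonts but compare unequal until both are normalized.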

@bencomp
Contributor Author

bencomp commented Feb 24, 2013

Related issue, possibly due to same lack of Unicode normalisation: #135

@tfmorris
Contributor

Coming up on the 8 year anniversary on Friday - the Launchpad bug report is from April 28, 2009 (although only 5 yr 10 mo in this tracker).

While search normalization is arguably more important, having the actual title normalized is useful as well.

@hornc hornc added the unicode label May 8, 2017
@LeadSongDog

LeadSongDog commented Jun 1, 2017

Related to #128#178 which is tagged priority. Any progress?

@tfmorris
Contributor

tfmorris commented Jun 1, 2017

I'm guessing perhaps you meant #178?

@LeadSongDog

Sorry, yes. Also seen as a problem in author names, as OL4967990A vs OL1505796A vs OL5769314A all look the same and refer to same person but are encoded differently. This bug needs squashing, it is a colossal time waster.
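As a rough sketch of how such look-alike records could be surfaced, the snippet below groups names that become equal after NFC normalization. It assumes the difference is one of normalization form, which the comment above does not confirm. The keys and names are placeholders, not real Open Library records, and `group_by_nfc` is a hypothetical helper.

```python
from collections import defaultdict
from unicodedata import normalize

def group_by_nfc(records):
    """Group (key, name) pairs whose names are equal after NFC normalization."""
    groups = defaultdict(list)
    for key, name in records:
        groups[normalize('NFC', name)].append(key)
    # Keep only names that occur under more than one key.
    return {name: keys for name, keys in groups.items() if len(keys) > 1}

# Hypothetical records; keys and names are placeholders for illustration.
records = [
    ('A1', 'Bront\u00eb'),    # composed form
    ('A2', 'Bronte\u0308'),   # decomposed form
    ('A3', 'Bronte'),         # no diacritic at all: a genuinely different string
]
candidates = group_by_nfc(records)
print(candidates)
```

Only `A1` and `A2` end up in the same group. Whether grouped records really describe the same person would still need a human or a merge rule to decide.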

@hornc hornc added the hackathon label Nov 1, 2017
@LeadSongDog

Per @tfmorris Apr 27 comment, it would also be useful to consistently downcase allcaps titles, authors, and publishers to titlecase, sentencecase, or even lowercase for the purpose of indexing, deduplicating, and merging.

@cdrini
Collaborator

cdrini commented Nov 1, 2017

@LeadSongDog They are currently lowercased for indexing; i.e. searching for publisher:PENGUIN is converted to publisher:penguin, both before indexing and before querying. The publisher field (for example) is defined here in solr to have a type of text. Here's the definition of type="text":

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>
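For readers less familiar with Solr, here is a very rough Python stand-in for the whitespace-tokenize and lowercase steps only. It is not Solr's implementation and ignores the stop word, word delimiter, and stemming filters. It illustrates that lowercasing unifies `PENGUIN` and `penguin` but does not unify NFC and NFD forms; as far as the definition quoted above shows, no filter in this chain performs Unicode normalization. The function names and sample strings are hypothetical.

```python
from unicodedata import normalize

def rough_tokens(text):
    """Rough stand-in for the analyzer: split on whitespace, then lowercase."""
    return [token.lower() for token in text.split()]

def rough_tokens_nfc(text):
    """Same, but normalize to NFC first (a hypothetical extra step)."""
    return rough_tokens(normalize('NFC', text))

# Case differences disappear after lowercasing.
print(rough_tokens('PENGUIN Books') == rough_tokens('penguin books'))  # True

nfc_title = 'Caf\u00e9'    # composed form
nfd_title = 'Cafe\u0301'   # decomposed form
print(rough_tokens(nfc_title) == rough_tokens(nfd_title))          # False
print(rough_tokens_nfc(nfc_title) == rough_tokens_nfc(nfd_title))  # True
```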

@LeadSongDog

Well, something's amiss. https://openlibrary.org/search?q=Jack+Higgins+Exocet&mode=everything finds some but misses more. Most specifically it misses https://openlibrary.org/works/OL157850W/Exocet

@LeadSongDog

Oh, and it misses https://openlibrary.org/works/OL9313162W/EXOCET

@cdrini
Collaborator

cdrini commented Nov 2, 2017

@LeadSongDog Those are the first two results for me:
image

@LeadSongDog

Weird. I see it now too. Earlier I saw only 4 works listed vice 7 now. Of course they should just be one.

@mekarpeles mekarpeles added this to @LeadSongDog in 2018 Q2 Mar 13, 2018
@mekarpeles
Member

I agree that this is an important issue, and I think @tfmorris, @hornc and @cdrini all agree. However, I am going to close this specific issue on account of it being unactionable. Let's see how we can phrase this in specific ways which will treat the source of the issue.

@mekarpeles mekarpeles removed this from To Do in On-Going Tasks Mar 13, 2018
@mekarpeles mekarpeles removed this from @LeadSongDog in 2018 Q2 Mar 13, 2018
@bencomp
Contributor Author

bencomp commented Mar 14, 2018

@mekarpeles I disagree that this is unactionable. "Normalize Unicode" is a request for action (though it could probably be clearer). Please reopen this issue, or create more specific issues and link them before closing this issue.

Compare the issue to a water leakage: you must fix the pipes to stop the leak, but if you don't address the water that already came through, you're going to get mould (or worse) in your house. Mould is one reason that I'm not so involved with OpenLibrary anymore.

@mekarpeles
Member

mekarpeles commented Mar 14, 2018

@bencomp the only actionable statement I see is "Can a bot please normalize everything according to NFC?"

This in fact is opposite of what @anandology suggested in the launchpad issue, which was:

I don't think we should fix it in the database. what if such strings
come later? I think, the best approach is to write code to handle all cases.

How about having a public function to normalize strings and call it
before displaying strings in the browser?

To which @EdwardBetts agreed:

Good idea. Here is the code:

from unicodedata import normalize

def norm(s):
    return normalize('NFC', s)

TL;DR
If this is something the community wants, we are going to have to be more explicit about our scope. I personally am not going to re-open the issue or rephrase it, as there are 270 other issues I'm struggling to help triage. I do agree with the sentiments, and I support the community interest in making our titles and author names display correctly (though honestly I don't think this specific case pays dividends / highly impacts our bottom line), and in making it possible to search for authors and titles which contain special characters / diacritics.

I'm not singling out this issue
I believe any issue which is left open for 6 years merits some degree of challenge / investigation. To me it may suggest something isn't right. Either (a) we have a lot of / too many open issues and not enough engineers, (b) we are not prioritizing correctly, (c) the issue is not clearly actionable or is too abstract in scope -- e.g. possible disconnect between skills entailed in library science v. the engineering work, or (d) there aren't engineers willing to work on the issue or there is a priority mismatch or miscommunication between submitter

In response
re: (a) this is kind of life. We can do a better job triaging -- we meet every Tuesday @ 11:30am PT to triage, go over code reviews, and set priorities.

(b) We just did this [prioritization] yesterday for #845. We had 7 people on the call. You're welcome to join and partake in prioritization! That's where this issue was raised by @tfmorris (nominated also by @LeadSongDog) and we brought it up for discussion. I think several people from the community agree -- including @LeadSongDog, @tfmorris, and @hornc -- with the sentiment of this issue: that unicode normalization, writ large, could help us improve search for titles and authors and improve our user experience. It's also worth conceding though (re: prioritization) that @JeffKaplan receives OpenLibrary email every day and our 225k daily uniques are not complaining predominantly about how unicode is being rendered on book + author pages.

(c) Reading this issue was not enough to help me understand:

  • a scoping of the problem, more specific than "everything". e.g. "all templates, starting with books, authors, search"
  • the strategy / a clearly proposed implementation. From this issue especially, it's become unclear whether the goal is to fix template rendering to display unicode, update our database to store only NFC, or fix solr-updater to read as NFC. It seems as if @anandology was advocating for the first of these three, and if so, this strikes me as being a different conversation and issue than the one currently open. Let's create a new issue for that.
  • the significance of the problem (to help us prioritize) -- e.g. are libraries specifically not adding their book records to OL's catalog because we display unicode wrong? Which library / potential partner?
  • which file(s) one might look into
  • what it looks like when the problem is solved:

My main contention with this issue is, I feel it represents an ideology (e.g. there should be no unicode inconsistencies or problems anywhere in openlibrary -- i.e. mold). I'm being hyperbolic when I say this, but my point is, the issue as is requires further discussion in order to scope, plan, and turn into something actionable. It feels as if we may be making the statement, "Issue #149 is not complete until we can no longer find cases of NFC inconsistencies happening". But to me, this begs the question we should be outlining in the issue.
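One way to turn "we can no longer find cases of NFC inconsistencies" into a checkable criterion is a small audit that reports which string fields are not in NFC. This is only a sketch under stated assumptions: the record shape, the field names `title` and `name`, and the helper `non_nfc_fields` are all hypothetical, and `unicodedata.is_normalized` requires Python 3.8 or later.

```python
from unicodedata import is_normalized  # available in Python 3.8+

def non_nfc_fields(record, fields=('title', 'name')):
    """Return the names of string fields in `record` that are not in NFC."""
    return [
        field for field in fields
        if isinstance(record.get(field), str)
        and not is_normalized('NFC', record[field])
    ]

# Hypothetical records; keys and field names are assumptions for illustration.
records = [
    {'key': 'X1', 'title': 'Les Mis\u00e9rables'},   # already NFC
    {'key': 'X2', 'title': 'Les Mise\u0301rables'},  # NFD
]
offenders = {r['key']: non_nfc_fields(r) for r in records if non_nfc_fields(r)}
print(offenders)
```

An empty result over a full data dump would be one possible definition of "done" for the stored data. It says nothing about template rendering or search, which would need their own criteria.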

(d) This [a deficit of technical volunteers] was true for several years, we're fortunate that the community is really pushing forward right now to make OL great -- we'd also love to dust off the mold (in all the many places it still exists) and make the service better for users. A part of this demolding process, to me, is going through our issues (which have also collected some mold) and making sure we're focusing on the right issues, clearly describing our mold problems, and outlining solutions.

Path Forward
I'm sure everyone would have preferred if I had just opened a new issue for this and/or used this time responding to the unicode issue instead of ranting about why I don't plan on re-opening this issue. I'm not going to re-open it because I don't think this issue is sufficient as it stands, and I don't know enough about the scope / severity of the problem to open a new issue and propose a solution.

By closing this issue, I'm asking for those who care about this issue to step up and give it a chance to succeed. I'm asserting that this issue -- as it stands -- must fight against the current to find a developer who's willing to champion it. And the reason this issue is getting attention / is being discussed is because members of the community (like you @bencomp, @LeadSongDog, @tfmorris, @hornc, et al) are asking to see forward progress on issues like this, and I honestly care about giving them the greatest chance to succeed and be implemented.

I am asking that if there are old issues which are not receiving attention, let's figure out what's preventing them from moving forward, update titles, add scope, discuss files + code, add success criteria. Otherwise, I'd rather close the issue and wait for the community to edit it (so we can re-open) or have the community re-create it at a later point because it becomes newly important again.
