Remediation: Unicode Book titles mangled during Import #135

anandology · 2012-01-23T15:48:18Z

This issue was reported on the ol-tech mailing list.

I don't know how widespread this problem is, but I noticed that these
two records have messed up book titles, but if you click through to
the associated MARC records on IA, the titles get rendered correctly.

http://openlibrary.org/books/OL7155555M/The_M%C2%A9%C3%98alavik%C2%A9%C3%98agnimitra
http://openlibrary.org/books/OL7165183M/The_Vikramorva%C2%A9%C3%98s%C2%A9%C4%90iyam


anandology · 2012-01-23T15:52:03Z

Both these records came from archive.org.

I've looked at the _marc.xml files on archive.org. The last-modified times of both marc.xml files are in 2007, and these records were created in 2008. It looks like the issue is with the script that parsed those titles.

amillar503 · 2012-06-04T05:46:29Z

There are thousands of MARC import records where accented characters have been mangled or handled incorrectly. Another common scenario is that accents or other diacritical marks have been replaced by a space before or after the vowel.

See for example:

http://openlibrary.org/authors/OL4459814A/Heinrich_Schro_der

http://openlibrary.org/works/OL10684450W/Tonbandgera_te-Messpraxis

http://openlibrary.org/show-records/talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:299045317:529

In both the author and title, the umlaut has been changed to a space after the vowel. The linked MARC record shows correctly in the browser.

bfalling · 2016-09-22T20:23:52Z

Should we consider re-importing? And is #149, which also references https://bugs.launchpad.net/openlibrary/+bug/598204, a dependency?

hornc · 2017-05-08T11:06:31Z

Some I have found recently: https://openlibrary.org/works/OL17670297W
https://openlibrary.org/works/OL17677126W
multiple works by this author:
https://openlibrary.org/authors/OL2450531A/Matthieu_Joseph_Bonaventure_Orfila

LeadSongDog · 2017-10-30T13:51:10Z

https://openlibrary.org/search?q=title%3A+%22©♭%22&mode=everything still finds well over 17 million matches. This is for "é", likely the most common accented letter. Edits like https://openlibrary.org/books/OL26303038M/Anatomie_générale_appliquée_à_la_physiologie_et_à_la_médecine?b=3&a=1&_compare=Compare&m=diff should not necessarily have to be made manually.

LeadSongDog · 2017-11-01T22:10:54Z

@hornc re your May 8 comment, those works were created from editions created from importing
https://openlibrary.org/show-records/ia:b28044277_0001
and
https://openlibrary.org/show-records/ia:b2202010x
Until they're fixed in the IA MARC records, there's no value in reimporting unless the import passes them through normalization.

hornc · 2017-11-03T21:30:15Z

@LeadSongDog interesting, the MARC display you link to shows the characters garbled, but if you click through to the XML representation https://ia800202.us.archive.org/34/items/b28044277_0001/b28044277_0001_marc.xml the accented e's display correctly. There may be an issue with encoding types set incorrectly? I will pick this up shortly, the new openlibrary-client is now in a state where it can be used to make bulk data corrections.

hornc · 2017-11-03T23:25:36Z

@LeadSongDog I may have figured out how the mangling is happening, in this example marc xml
https://ia600208.us.archive.org/25/items/b2202010x/b2202010x_marc.xml

the a-grave of "Secours à donner" displays correctly in utf-8 encoding

a-grave is U+00E0, which UTF-8 encodes as the two bytes \xC3\xA0 (pythonic notation)

if those bytes were interpreted as MARC8 and "converted", C3 becomes the copyright symbol, and 'A0' becomes a space, which is exactly what we see on the OL pages with "Secours © donner"

I now think these MARC records have utf-8 character encodings, but were imported to OL as if they were MARC8, which explains the mangling.
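The mechanism described above can be sketched in a few lines of Python. This is only an illustration, not the actual import code: `MARC8_SAMPLE` and `misread_utf8_as_marc8` are hypothetical names, and the lookup table covers just the three bytes mentioned in this thread (C3, A0, A9) rather than the full MARC-8 tables. Real MARC-8 decoding also handles combining diacritics and character-set switching, which this ignores.

```python
# Partial MARC-8 lookup: only the three bytes discussed in this thread.
# (Assumption: a real converter needs the full Library of Congress tables.)
MARC8_SAMPLE = {
    0xC3: "\u00a9",  # copyright sign
    0xA0: " ",       # shows up as a space on the OL pages
    0xA9: "\u266d",  # musical flat sign, the second half of the mangled "é"
}


def misread_utf8_as_marc8(text: str) -> str:
    """Encode text as UTF-8, then map every byte as if it were MARC-8."""
    out = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            out.append(chr(byte))  # basic ASCII is left unchanged
        else:
            # Bytes missing from the sample table become U+FFFD.
            out.append(MARC8_SAMPLE.get(byte, "\ufffd"))
    return "".join(out)


print("à".encode("utf-8"))
print(misread_utf8_as_marc8("Secours à donner"))
print(misread_utf8_as_marc8("empoisonnées"))
```

Note that the misread a-grave becomes a copyright sign followed by a space, so the raw string has two consecutive spaces after the copyright sign; a browser would collapse them when rendering the page.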

I did the MARC8 conversion manually from the tables found here https://memory.loc.gov/diglib/codetables/45.html I'll need to use yaz or something to test this out properly, but this will provide a good path to fixing the MARC errors programmatically.
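A programmatic fix along these lines might reverse the misreading: map each mangled character pair back to its two bytes and decode them as UTF-8. The sketch below is a rough illustration under stated assumptions, not a tested remediation tool. `LEAD`, `TRAIL`, and `repair` are hypothetical names, and the tables hold only the pairs seen in this thread.

```python
# Reverse lookup for the bytes seen in this thread (assumption: incomplete).
LEAD = {"\u00a9": 0xC3}                # copyright sign came from byte C3
TRAIL = {" ": 0xA0, "\u266d": 0xA9}    # space came from A0, flat sign from A9


def repair(text: str) -> str:
    """Turn mangled two-character pairs back into the original UTF-8 letter."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair[0] in LEAD and pair[1] in TRAIL:
            raw = bytes([LEAD[pair[0]], TRAIL[pair[1]]])
            try:
                out.append(raw.decode("utf-8"))
                i += 2
                continue
            except UnicodeDecodeError:
                pass  # not a valid UTF-8 sequence, keep the text as-is
        out.append(text[i])
        i += 1
    return "".join(out)


print(repair("empoisonn\u00a9\u266des"))
print(repair("Secours \u00a9  donner"))
```

Two caveats: a legitimate copyright sign followed by a space (as in a copyright notice) would be wrongly rewritten to "à", and if the stored data has had its whitespace collapsed, the separating space after a repaired "à" is lost. Any bulk correction would need a review step or a stricter heuristic.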

I know that there are other unicode mangling errors affecting Amazon imported records, but I think that is from incorrect conversion from Windows or ISO charsets

Thanks for your comment @LeadSongDog, in trying to figure out whether the MARC records were actually wrong or not I think I have stumbled upon the root cause of the issue!

mekarpeles · 2018-03-13T21:31:45Z

@hornc any updates on MARC mangling and/or if we resolved this issue?

tfmorris · 2018-03-14T05:26:53Z

The issue is definitely not resolved. When the import script is fixed, @bfalling's suggestion of reimporting will most likely be necessary.

From the point of view of triage, it would probably be useful to get an actual count. "Thousands" isn't a very big percentage of 25 million editions.

cclauss · 2020-12-06T08:56:26Z

Has this been resolved with our Python 3 changes or can someone provide steps-to-reproduce on Python 3?

LeadSongDog · 2020-12-06T17:16:29Z

Well https://openlibrary.org/books/OL12903648M/Etudes_Conomiques_De_L'Ocde certainly isn’t fixed, but perhaps we’re done with digging the hole...
There were at least three problem classes:

1. Bad import of good data
2. Literal import of bad data
3. Bad data in place from old cases of 1 or 2 since remedied.

The move to py3 will at most fix number 1.

cclauss · 2020-12-06T17:24:52Z

Steps-to-reproduce problem class 1?

tfmorris · 2020-12-06T19:30:25Z

The earlier examples are better than the most recent one, which is an import from crappy Amazon data (which we should not be importing).
https://openlibrary.org/books/OL7165183M/The_Vikramorva%C2%A9%C3%98s%C2%A9%C4%90iyam
https://openlibrary.org/authors/OL4459814A/Heinrich_Schro_der
https://openlibrary.org/books/OL13956174M/Tonbandgera_te-Messpraxis
https://openlibrary.org/books/OL26280693M/Secours_%C2%A9_donner_aux_personnes_empoisonn%C2%A9%E2%99%ADes_ou_asphyxi%C2%A9%E2%99%ADes_suivis_des_moyens_propres_%C2%A9_reconna%C2%A9%CA%BEtre

If the bug has been fixed, reimporting the records should result in the correct encoding. Then the task simply becomes reimporting the millions of corrupted records.

The search which was claimed to return 17+ million records before: https://openlibrary.org/search?q=title%3A+%22%C2%A9%E2%99%AD%22&mode=everything
now returns 23.4M results, but I think that's actually a separate bug and it's just returning all works in the database.
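To see exactly what that search URL asks for, the percent-encoded `q` parameter can be decoded with the standard library. This is a small sketch for inspection only; it says nothing about how the search backend tokenizes the string.

```python
from urllib.parse import parse_qs, urlsplit

url = (
    "https://openlibrary.org/search"
    "?q=title%3A+%22%C2%A9%E2%99%AD%22&mode=everything"
)

# parse_qs percent-decodes as UTF-8 and turns "+" into a space.
query = parse_qs(urlsplit(url).query)["q"][0]
print(query)

# List the code points of the non-ASCII characters in the query.
print([f"U+{ord(ch):04X}" for ch in query if ord(ch) > 0x7F])
```

The quoted search term consists only of a copyright sign and a musical flat sign, with no letters or digits.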

LeadSongDog · 2020-12-06T21:31:48Z

@tfmorris As https://openlibrary.org/search?q=title%3A+%22+%22&mode=everything gets the same result, it seems that yes, it is a simple case of title search for an effectively-blank string.

tfmorris · 2020-12-07T00:38:09Z

I created #4223 for the search bug.

hornc · 2023-09-15T23:03:36Z

Update, running a current search
https://openlibrary.org/search?q=%C2%A9%E2%99%AD&mode=everything
with the new "Solr Editions Beta" checkbox disabled gives 1,798 results. (0 results with the beta feature enabled -- I don't know what the expected difference should be).

I believe all MARC record character encoding import issues are resolved, so there is just data clean up to perform.
The number of these affected records seems to have decreased considerably over time, so that is good.

I'll try to come up with some clear queries to identify classes of MARC encoding caused character issues

Accents etc incorrectly moved from MARC encoding to UTF-8 like ©♭
Spaces and dropped letters in place of an accented character
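For the first class, one possible detection heuristic is a regular expression that looks for a copyright sign immediately followed by one of the characters that appear after it in the mangled examples in this thread. This is a sketch only: `MANGLED_PAIR` and `looks_mangled` are hypothetical names, and the character set is taken from the example URLs above, so it is certainly incomplete. It does not attempt the second class (spaces and dropped letters), which has no such distinctive marker.

```python
import re

# Copyright sign followed by a character seen in this thread's mangled titles:
# U+266D (musical flat), U+00D8 (O with stroke), U+0110 (D with stroke),
# U+02BE (modifier letter right half ring).
MANGLED_PAIR = re.compile("\u00a9[\u266d\u00d8\u0110\u02be]")


def looks_mangled(title: str) -> bool:
    """Return True if the title contains a known mangled character pair."""
    return MANGLED_PAIR.search(title) is not None


print(looks_mangled("The M\u00a9\u00d8alavik\u00a9\u00d8agnimitra"))
print(looks_mangled("Heinrich Schro der"))
print(looks_mangled("\u00a9 2012 Example Press"))
```

A plain space is deliberately left out of the character class, since a copyright sign followed by a space is common in legitimate text; mangled a-grave cases would need a separate check.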

I think these issues are limited to work/edition titles and author names, but it'd pay to check whether locations or publishers are affected too.

I'm also not 100% sure a reimport will overwrite all the fields if they are already populated, unfortunately.

Garbled Amazon titles should be dealt with separately as the cause of mangling is more variable there.

I'll take another look at MARC sourced mangled data remediation and see how much is left to close this after a decade...

I have just created a project wiki page to track my notes on this: https://github.com/internetarchive/openlibrary/wiki/Mangled-MARC

cdrini · 2023-11-20T16:26:05Z

I'm finding about 16,000 works affected on open library: https://openlibrary.org/search?q=title_suggest%3A%C2%A9+AND+ia%3A*&mode=everything

bencomp mentioned this issue Feb 24, 2013

Normalize Unicode #149

Closed

hornc added the unicode label May 8, 2017

hornc self-assigned this Nov 9, 2017

hornc changed the title ~~Fix the book titles that were imported incorrectly~~ Fix book titles with mangled Unicode May 5, 2019

hornc added the openlibrary-client label May 5, 2019

hornc added Affects: Data Issues that affect book/author metadata or user/account data. [managed] and removed openlibrary-client labels Jun 4, 2019

xayhewalo added this to Un-Triaged in Triage Oct 18, 2019

xayhewalo moved this from Un-Triaged to Triaged in Triage Nov 14, 2019

xayhewalo removed the State: Backlogged label Mar 17, 2020

cdrini added the Needs: Lead label Apr 20, 2020

mekarpeles unassigned hornc Apr 22, 2020

mekarpeles added Lead: @hornc Issues overseen by Charles (Staff: Data Engineering Lead) [managed] and removed Needs: Lead labels Apr 22, 2020

LeadSongDog mentioned this issue Jun 25, 2020

Error when searching for a book title with french accent. #3491

Closed

hornc removed the CH: unicode label Nov 16, 2020

mekarpeles changed the title ~~Fix book titles with mangled Unicode~~ Remediation: Unicode Book titles mangled during Import Jan 25, 2021

cclauss added the Theme: Unicode Issues and pull requests related to Unicode characters label Mar 9, 2021

cdrini mentioned this issue Sep 14, 2021

Move oldump.sh from olsystem in scripts #5656

Merged

tfmorris mentioned this issue Dec 7, 2022

Modifying author names does not cause works in search results to show new name #7222

Open

hornc self-assigned this Sep 15, 2023

mekarpeles added the Data Cleanup label Sep 16, 2023

cdrini mentioned this issue Feb 7, 2024

Create unit test to detect marc unicode encoding issues #8798

Open

mekarpeles unassigned hornc Mar 5, 2024
