Remediation: Unicode Book titles mangled during Import #135
Both of these records came from archive.org. I've looked at the _marc.xml files there; the last-modified times of both marc.xml files are in 2007, and these records were created in 2008. It looks like the issue is with the script that parsed those titles.
There are thousands of MARC import records where accented characters have been mangled or handled incorrectly. Another common scenario is that accents or other diacritical marks have been replaced by a space before or after the vowel. See for example:

http://openlibrary.org/authors/OL4459814A/Heinrich_Schro_der
http://openlibrary.org/works/OL10684450W/Tonbandgera_te-Messpraxis

In both the author and title, the umlaut has been changed to a space after the vowel. The linked MARC record shows correctly in the browser.
Should we consider re-importing? And is #149, which also references https://bugs.launchpad.net/openlibrary/+bug/598204, a dependency?
Some I have found recently: https://openlibrary.org/works/OL17670297W |
https://openlibrary.org/search?q=title%3A+%22©♭%22&mode=everything still finds well over 17 million matches. This is for "é", likely the most common accented letter. Edits like https://openlibrary.org/books/OL26303038M/Anatomie_générale_appliquée_à_la_physiologie_et_à_la_médecine?b=3&a=1&_compare=Compare&m=diff should not need to be done manually.
@hornc re your May 8 comment, those works were created from editions created from importing.
@LeadSongDog interesting, the MARC display you link to shows the characters garbled, but if you click through to the XML representation https://ia800202.us.archive.org/34/items/b28044277_0001/b28044277_0001_marc.xml the accented e's display correctly. There may be an issue with encoding types being set incorrectly? I will pick this up shortly; the new openlibrary-client is now in a state where it can be used to make bulk data corrections.
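One quick way to check what encoding a record claims is the MARC 21 Leader: position 09 is `a` for UCS/Unicode and blank for MARC-8. A minimal sketch (the `coding_scheme` helper is hypothetical, not part of openlibrary-client) that reads the declared scheme from a MARCXML string:

```python
# Sketch: report the character coding scheme a MARCXML record declares.
# In MARC 21, Leader position 09 is 'a' (UCS/Unicode) or blank (MARC-8).
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

def coding_scheme(marcxml):
    root = ET.fromstring(marcxml)
    leader = root.find(f".//{MARC_NS}leader")
    if leader is None:                      # records without a namespace
        leader = root.find(".//leader")
    return "UTF-8" if leader.text[9] == "a" else "MARC-8"
```

If a record's leader says UTF-8 but the importer treated the bytes as MARC-8 anyway, that would explain the garbling without the source record being wrong.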
@LeadSongDog I may have figured out how the mangling is happening. In this example MARC XML, the a-grave of "Secours à donner" displays correctly in UTF-8. A-grave is U+00E0, which UTF-8 encodes as the two bytes 0xC3 0xA0 (`b'\xc3\xa0'` in Python notation). If those bytes are instead interpreted as MARC8 and "converted", each byte gets looked up as a separate MARC8 character, producing exactly the kind of garbage we're seeing.

I now think these MARC records have UTF-8 character encodings, but were imported to OL as if they were MARC8, which explains the mangling. I did the MARC8 conversion manually from the tables found here: https://memory.loc.gov/diglib/codetables/45.html

I'll need to use yaz or something to test this out properly, but this will provide a good path to fixing the MARC errors programmatically. I know that there are other unicode mangling errors affecting Amazon-imported records, but I think those come from incorrect conversion from Windows or ISO charsets.

Thanks for your comment @LeadSongDog; in trying to figure out whether the MARC records were actually wrong or not, I think I have stumbled upon the root cause of the issue!
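The failure mode above can be sketched in a few lines. This is not the importer's code, just a toy model: it carries only two entries from the LoC ANSEL code tables (0xC3 → ©, 0xA9 → ♭), and the treat-unassigned-bytes-as-space behavior is an assumption, but it reproduces the "©♭" artifact that the title searches above turn up for "é":

```python
# Toy model of the suspected bug: UTF-8 bytes pushed through a MARC-8
# (ANSEL) byte-to-character lookup. Only two table entries are included,
# taken from https://memory.loc.gov/diglib/codetables/45.html
ANSEL_PARTIAL = {
    0xC3: "\u00A9",  # copyright sign ©
    0xA9: "\u266D",  # musical flat sign ♭
}

def misread_as_marc8(text):
    """Encode text as UTF-8, then decode each byte as if it were MARC-8."""
    out = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            out.append(chr(byte))              # ASCII passes through
        else:
            # Unassigned high bytes mapped to space here -- an assumption,
            # but consistent with the "space after the vowel" records.
            out.append(ANSEL_PARTIAL.get(byte, " "))
    return "".join(out)

print(misread_as_marc8("générale"))  # the é pairs (0xC3 0xA9) come out as ©♭
```

The same mapping run in reverse (multi-character ANSEL artifacts back to one UTF-8 code point) is what a programmatic fix would need.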
@hornc any updates on MARC mangling, and/or have we resolved this issue?
The issue is definitely not resolved. When the import script is fixed, @bfalling's suggestion of reimporting will most likely be necessary. From the point of view of triage, it would probably be useful to get an actual count. "Thousands" isn't a very big percentage of 25 million editions.
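Getting an actual count could be scripted against the public search API, which returns a `numFound` field from `search.json`. A sketch (the helper names are mine; the mangled-title query is just one candidate probe):

```python
# Sketch: count suspect records via the Open Library search API.
# search.json responses include a "numFound" total.
import json
import urllib.parse
import urllib.request

def build_search_url(query):
    return ("https://openlibrary.org/search.json?limit=0&q="
            + urllib.parse.quote(query))

def count_matches(query):
    with urllib.request.urlopen(build_search_url(query)) as resp:
        return json.load(resp)["numFound"]

# e.g. count_matches('title:"©♭"') for the é-mangled-as-©♭ class
```

Note `count_matches` makes a live network call, and a count like this is only as trustworthy as the search index (see the blank-string search bug found below).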
Has this been resolved with our Python 3 changes or can someone provide steps-to-reproduce on Python 3? |
Well https://openlibrary.org/books/OL12903648M/Etudes_Conomiques_De_L'Ocde certainly isn’t fixed, but perhaps we’re done with digging the hole...
|
Steps-to-reproduce problem class 1? |
The earlier examples are better than the most recent one, which is an import from crappy Amazon data (which we should not have imported). If the bug has been fixed, reimporting the records should result in the correct encoding. Then the task simply becomes reimporting the millions of corrupted records. The search which was claimed to return 17+ million records before: https://openlibrary.org/search?q=title%3A+%22%C2%A9%E2%99%AD%22&mode=everything
@tfmorris As https://openlibrary.org/search?q=title%3A+%22+%22&mode=everything gets the same result, it seems that yes, it is a simple case of title search for an effectively-blank string. |
I created #4223 for the search bug. |
Update, running a current search
I'll try to come up with some clear queries to identify classes of character issues caused by MARC encoding.
I think these issues are limited to work/edition titles and author names, but it'd pay to check whether locations or publishers are affected too. I'm also not 100% sure a reimport will overwrite all the fields if they are already populated, unfortunately. Garbled Amazon titles should be dealt with separately, as the cause of mangling is more variable there. I'll take another look at MARC-sourced mangled data remediation and see how much is left to close this after a decade... I have just created a project wiki page to track my notes on this: https://github.com/internetarchive/openlibrary/wiki/Mangled-MARC
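For bulk remediation it helps to flag candidates locally before touching records. A minimal sketch of a detector: the character set is illustrative (a few ANSEL symbols like ©, ♭, ®, ± that UTF-8 lead/continuation bytes land on and that rarely occur in real titles), not an exhaustive fingerprint:

```python
# Sketch: flag titles showing byte-pair artifacts characteristic of
# UTF-8 text that was decoded as MARC-8. Character list is illustrative.
import re

# Adjacent runs of these symbols are a strong mangling signal;
# a lone © in a legitimate title would not match.
MANGLE_RE = re.compile(r"[\u00A9\u266D\u00AE\u00B1]{2}")

def looks_mangled(title):
    """Return True if the title contains a suspect artifact sequence."""
    return bool(MANGLE_RE.search(title))
```

Requiring a run of two keeps false positives down, though it would miss the space-after-vowel class, which probably needs its own heuristic.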
I'm finding about 16,000 works affected on open library: https://openlibrary.org/search?q=title_suggest%3A%C2%A9+AND+ia%3A*&mode=everything |
This issue is reported in the ol-tech mailing list.