Create unit test to detect marc unicode encoding issues #8798
Comments
@cdrini The source of this particular character issue is that the source record has an incorrect encoding flag in the MARC binary. It claims to be MARC-8 encoded, but the data is UTF-8 encoded... treating a UTF-8 For some reason, the MARC XML shows the correct UTF-8 encoding in the leader and content. I think OL is doing the correct operations with bad data. Looking into why archive.org got incorrect data for the MARC binary (but not MARC XML) would be useful. It seems all actions on this item are recent.
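To illustrate the mechanism described above, here is a minimal sketch of how UTF-8 bytes read as MARC-8 produce the `©♭` pair. The table `MARC8_SAMPLE` and the function `misread_as_marc8` are hypothetical names made up for this example, and the table holds only the two MARC-8 (ANSEL) code points needed here. It is not Open Library's decoder.

```python
# Assumption: a two-entry excerpt of the MARC-8 (ANSEL) table, just enough
# for this demonstration. A real MARC-8 table is much larger.
MARC8_SAMPLE = {0xC3: "\u00a9", 0xA9: "\u266d"}  # copyright sign, musical flat


def misread_as_marc8(raw: bytes) -> str:
    """Decode bytes as if they were MARC-8, using only the sample table.

    ASCII bytes pass through; unknown high bytes become U+FFFD.
    """
    return "".join(
        chr(b) if b < 0x80 else MARC8_SAMPLE.get(b, "\ufffd") for b in raw
    )


raw = "Abregé".encode("utf-8")  # the "é" becomes the two bytes C3 A9
print(misread_as_marc8(raw))    # Abreg©♭
print(raw.decode("utf-8"))      # Abregé
```

Each accented Latin letter is two bytes in UTF-8, so a decoder that trusts the MARC-8 flag emits two unrelated symbols per accent.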
A similar item scanned around the same time has accents displayed correctly: https://openlibrary.org/books/OL50976370M/Suppl%C3%A9ment_de_l'Abreg%C3%A9_de_toute_la_m%C3%A9decine_pratique_ou_tome_VI_de_cet_ouvrage_..._premiere_partie
Here are some other recent ones:
I'm not sure what pattern caused these to regress, but could we sniff the file and look for certain characters? Or use the MARC XML instead of the binary?
The file that imported correctly https://openlibrary.org/books/OL50976370M doesn't have a binary MRC file, just a MARC XML file. I'm not sure what conditions cause that in the processing pipeline. The additional 5 files identified by @cdrini follow the same pattern as identified by @hornc for the first example.
Where are these files being sourced or derived from? Clearly something in the pipeline is broken. Interestingly, the two MARC files are the oldest files in the directory https://archive.org/download/b30530921_0001. Having said that, MarcEdit displays all the "broken" binary MARC files without any problem, so it must have some heuristic to override the encoding flag.
Further, the same is appearing in author names, such as:
I confirmed with the author of MarcEdit that he uses a heuristic for encoding detection because the MARC encoding flag isn't reliable.
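For discussion, here is one possible sniffing heuristic as a rough sketch. This is not MarcEdit's actual heuristic, which isn't described in this thread. The function name `sniff_marc_encoding` and the returned labels are assumptions for illustration. It relies on two facts: leader position 9 is blank for MARC-8 and `a` for Unicode, and MARC-8 text containing high bytes is very unlikely to be valid UTF-8.

```python
def sniff_marc_encoding(record: bytes) -> str:
    """Guess the encoding of a binary MARC record.

    Returns the label "utf-8" or "marc8". These are plain labels for the
    caller to act on; "marc8" is not a Python codec name.
    """
    # Leader position 9: b"a" means UCS/Unicode, blank means MARC-8.
    if record[9:10] == b"a":
        return "utf-8"

    body = record[24:]  # everything after the 24-byte leader
    if any(b >= 0x80 for b in body):
        # Flag says MARC-8, but if the high bytes form valid UTF-8
        # the flag is probably wrong.
        try:
            body.decode("utf-8")
        except UnicodeDecodeError:
            return "marc8"
        return "utf-8"

    # Pure ASCII decodes the same either way, so trust the flag.
    return "marc8"


# A fake record: 24-byte leader with a blank at position 9, then UTF-8 data.
leader = b"00000nam  2200000   4500"
print(sniff_marc_encoding(leader + "Abregé".encode("utf-8")))  # utf-8
```

Trusting the data over the flag only when the bytes are unambiguous keeps genuinely MARC-8 records on their existing path.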
Here is a recent import from IA into OL:
The long-standing issue (#135) of mysterious characters like ©♭ appearing in the Open Library record!

The purpose of this issue is to create a unit test of the smallest possible piece that is breaking. Likely, that is the piece that takes in the MARC record. That way this error should never resurface!
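As a starting point, a regression test along these lines might look like the sketch below, written as a plain pytest-style function. `decode_title` is a hypothetical stand-in for whatever function actually takes in the MARC record, which this issue doesn't name, and its Latin-1 fallback is only a placeholder, not real MARC-8 decoding. The sample title comes from the correctly imported record linked in the comments.

```python
MOJIBAKE = "\u00a9\u266d"  # "©♭", the pair reported in this issue


def decode_title(raw: bytes) -> str:
    """Hypothetical stand-in for the function under test.

    A real test would import the MARC-reading code instead of this.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Placeholder fallback only; this is NOT MARC-8 decoding.
        return raw.decode("latin-1")


def test_utf8_titles_have_no_mojibake():
    titles = [
        "Supplément de l'Abregé de toute la médecine pratique",
        "Abregé",
    ]
    for title in titles:
        decoded = decode_title(title.encode("utf-8"))
        assert decoded == title
        assert MOJIBAKE not in decoded
```

A fuller version would feed in one of the mislabeled binary MARC records from archive.org as a fixture, so the test exercises the leader flag as well as the field data.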
Stakeholders
@hornc