Create unit test to detect marc unicode encoding issues #8798

cdrini · 2024-02-07T03:32:18Z

Here is a recent import from IA into OL:

IA Record: https://archive.org/details/b30530921_0001
OL Record: https://openlibrary.org/books/OL50976356M/Abreg%C2%A9%E2%99%AD_de_toute_la_medecine_pratique_...

The long-standing issue (#135) of mysterious characters like ©♭ appearing in the Open Library record!

The purpose of this issue is to create a unit test of the smallest possible piece that is breaking. Likely, that is the piece that takes in the MARC record. That way this error should never resurface!

Stakeholders

@hornc

hornc · 2024-02-07T04:30:28Z

@cdrini The source of this particular character issue is that the source record has an incorrect encoding flag in the MARC binary. It claims to be MARC-8 encoded, but the data is actually UTF-8 encoded. Treating the UTF-8 bytes of é (0xC3 0xA9) as if they were MARC-8 produces ©♭.
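To illustrate the mix-up, here is a minimal sketch in Python. It is not Open Library's import code: `MARC8_SUBSET` and `misread_as_marc8` are hypothetical names, and the table holds only the two MARC-8 (ANSEL) byte values needed here. A real MARC-8 decoder also handles combining diacritics and escape sequences, which this ignores.

```python
# Partial MARC-8 (ANSEL) table: only the two byte values needed for this demo.
MARC8_SUBSET = {
    0xA9: "\u266d",  # MUSIC FLAT SIGN (♭)
    0xC3: "\u00a9",  # COPYRIGHT SIGN (©)
}


def misread_as_marc8(data: bytes) -> str:
    """Decode bytes one at a time as if they were MARC-8.

    ASCII bytes pass through unchanged; high bytes missing from the
    partial table become U+FFFD.
    """
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))
        else:
            out.append(MARC8_SUBSET.get(b, "\ufffd"))
    return "".join(out)


# "é" is the two bytes C3 A9 in UTF-8; each one is read as a separate
# MARC-8 character.
utf8_bytes = "Abregé".encode("utf-8")
print(misread_as_marc8(utf8_bytes))  # Abreg©♭
```

An assertion like `misread_as_marc8("é".encode("utf-8")) == "©♭"` could seed the unit test this issue asks for, with the real MARC reader swapped in for the toy function.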

For some reason, the MARC XML shows the correct UTF-8 encoding in the leader and content.

I think OL is doing the correct operations with bad data. Looking into why archive.org got incorrect data for the MARC binary (but not MARC XML) would be useful. It seems all actions on this item are recent.

hornc · 2024-02-07T04:55:07Z

A similar item scanned around the same time has accents displayed correctly: https://openlibrary.org/books/OL50976370M/Suppl%C3%A9ment_de_l'Abreg%C3%A9_de_toute_la_m%C3%A9decine_pratique_ou_tome_VI_de_cet_ouvrage_..._premiere_partie

cdrini · 2024-02-07T16:43:55Z

Here are some other recent ones:

I'm not sure what pattern caused these to regress, but can we perhaps sniff the file and look for certain characters? Or use the MARC XML instead of the binary?

tfmorris · 2024-02-07T18:06:41Z

The record that imported correctly (https://openlibrary.org/books/OL50976370M) doesn't have a binary MRC file, just a MARC XML file. I'm not sure what conditions cause that in the processing pipeline.

The additional 5 files identified by @cdrini follow the same pattern as identified by @hornc for the first example.

> For some reason, the MARC XML shows the correct UTF-8 encoding in the leader and content.

Where are these files being sourced/derived from? Clearly something in the pipeline is broken. Interestingly, the two MARC files are the oldest files in the directory https://archive.org/download/b30530921_0001

Having said that, MARCedit displays all the "broken" binary MARC files without any problem, so it must have some heuristic to override the encoding flag.

LeadSongDog · 2024-03-06T02:58:10Z

Further, the same is appearing in author names, such as:
https://openlibrary.org/search?q=author%3A©+AND+ia%3A*&mode=everything
or simply
https://openlibrary.org/search/authors?q=©

tfmorris · 2024-03-06T14:36:18Z

> Having said that, MARCedit displays all the "broken" binary MARC files without any problem, so it must have some heuristic to override the encoding flag.

I confirmed with the author of MARCedit that he uses a heuristic for encoding detection because the MARC encoding flag isn't reliable.
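One possible heuristic, sketched below in Python, is to distrust a MARC-8 flag when the record contains non-ASCII bytes that are all valid UTF-8. This is an assumption for illustration only: it is not MARCedit's actual heuristic, which is not described in this thread, and `guess_marc_encoding` is a hypothetical name. It relies on leader position 9 being `a` for Unicode and blank for MARC-8.

```python
def guess_marc_encoding(record: bytes) -> str:
    """Return 'utf-8' or 'marc8' for a raw binary MARC record.

    Trusts a declared Unicode flag, but overrides a MARC-8 flag when the
    bytes after the 24-byte leader contain non-ASCII data that decodes
    cleanly as UTF-8.
    """
    declared = "utf-8" if record[9:10] == b"a" else "marc8"
    if declared == "utf-8":
        return declared

    body = record[24:]
    if not any(b >= 0x80 for b in body):
        # Pure ASCII reads the same under either encoding.
        return declared

    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so the MARC-8 flag is plausible.
        return declared
    return "utf-8"


# A blank leader (flag says MARC-8) followed by UTF-8 data, as in the
# records discussed in this issue.
fake_leader = b" " * 24
print(guess_marc_encoding(fake_leader + "Abregé".encode("utf-8")))
```

The check works because MARC-8 puts a combining diacritic byte before an ASCII base letter, for example 0xE2 then `e` for é, and that sequence is not valid UTF-8. Accidental matches are therefore unlikely, though not impossible.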
