Create unit test to detect marc unicode encoding issues #8798

Open
cdrini opened this issue Feb 7, 2024 · 6 comments
Labels
Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Priority: 3 Issues that we can consider at our leisure. [managed] Type: Bug Something isn't working. [managed]

Comments

@cdrini
Collaborator

cdrini commented Feb 7, 2024

Here is a recent import from IA into OL:

The long-standing issue ( #135 ) of mysterious characters like ©♭ appearing in the Open Library record!

The purpose of this issue is to create a unit test for the smallest piece of code that is breaking. Most likely, that is the piece that reads in the MARC record. With that test in place, this error should never resurface!

Stakeholders

@hornc

@cdrini added the Type: Bug, Module: Import, Priority: 2, Lead: @cdrini, and Priority: 3 labels, and removed the Priority: 2 label, on Feb 7, 2024
@hornc
Collaborator

hornc commented Feb 7, 2024

@cdrini The source of this particular character issue is that the source record has an incorrect encoding flag in the MARC binary. It claims to be MARC-8 encoded, but the data is UTF-8 encoded... treating a UTF-8 é as if it were MARC-8 produces ©♭
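
The mismatch described above can be reproduced in a few lines. This is only a minimal sketch: `MARC8_ANSEL_SUBSET` and `decode_as_marc8_subset` are hypothetical names, not Open Library code, and the table holds just the two MARC-8 (ANSEL) code points needed here. It also assumes no MARC-8 escape sequences or combining characters are present.

```python
# Partial, illustrative slice of the MARC-8 (ANSEL) extended Latin table.
# Only the two code points needed for this example are included.
MARC8_ANSEL_SUBSET = {
    0xA9: "\u266D",  # MUSIC FLAT SIGN
    0xC3: "\u00A9",  # COPYRIGHT SIGN
}


def decode_as_marc8_subset(data: bytes) -> str:
    """Decode bytes as if they were MARC-8, using only the tiny table above."""
    out = []
    for byte in data:
        if byte < 0x80:
            # Bytes below 0x80 are treated as plain ASCII in this sketch.
            out.append(chr(byte))
        else:
            # Unknown high bytes become U+FFFD; a real decoder has a full table.
            out.append(MARC8_ANSEL_SUBSET.get(byte, "\uFFFD"))
    return "".join(out)


utf8_bytes = "é".encode("utf-8")  # the two bytes 0xC3 0xA9
garbled = decode_as_marc8_subset(utf8_bytes)
print(garbled)
```

The two bytes that UTF-8 uses for a single é are each a standalone character in MARC-8, which is why one accented letter turns into two unrelated symbols.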

For some reason, the MARC XML shows the correct UTF-8 encoding in the leader and content.

I think OL is doing the correct operations with bad data. Looking into why archive.org got incorrect data for the MARC binary (but not MARC XML) would be useful. It seems all actions on this item are recent.

@hornc
Collaborator

hornc commented Feb 7, 2024

@tfmorris
Contributor

tfmorris commented Feb 7, 2024

The file that imported correctly ( https://openlibrary.org/books/OL50976370M ) doesn't have a binary MRC file, just a MARC XML file. I'm not sure what conditions cause that in the processing pipeline.

The additional 5 files identified by @cdrini follow the same pattern as identified by @hornc for the first example.

> For some reason, the MARC XML shows the correct UTF-8 encoding in the leader and content.

Where are these files being sourced/derived from? Clearly something in the pipeline is broken. Interestingly, the two MARC files are the oldest files in the directory https://archive.org/download/b30530921_0001

Having said that MARCedit displays all the "broken" binary MARC files without any problem, so it must have some heuristic to override the encoding flag.

@LeadSongDog

LeadSongDog commented Mar 6, 2024

Further, the same is appearing in author names, such as:
https://openlibrary.org/search?q=author%3A©+AND+ia%3A*&mode=everything
or simply
https://openlibrary.org/search/authors?q=©

@tfmorris
Contributor

tfmorris commented Mar 6, 2024

> Having said that MARCedit displays all the "broken" binary MARC files without any problem, so it must have some heuristic to override the encoding flag.

I confirmed with the author of MARCedit that he uses a heuristic for encoding detection because the MARC encoding flag isn't reliable.
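
One possible shape for such a heuristic, together with the kind of checks a unit test could make, is sketched below. This is an assumption about how a check might look, not MARCedit's actual logic and not existing Open Library code; `looks_like_mislabeled_utf8` and `make_record` are hypothetical names. It relies only on the MARC leader convention that position 9 is blank for MARC-8 and `a` for Unicode.

```python
LEADER_LENGTH = 24


def looks_like_mislabeled_utf8(record: bytes) -> bool:
    """Return True when the leader claims MARC-8 but the data is valid non-ASCII UTF-8."""
    if len(record) < LEADER_LENGTH:
        return False  # too short to contain a MARC leader
    if record[9:10] != b" ":
        return False  # leader does not claim MARC-8, nothing to second-guess
    body = record[LEADER_LENGTH:]
    if body.isascii():
        return False  # pure ASCII reads the same under both encodings
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return False  # high bytes are not valid UTF-8, so plausibly real MARC-8
    return True


def make_record(encoding_flag: bytes, body: bytes) -> bytes:
    """Build a fake record for testing: a 24-byte leader followed by raw data."""
    leader = b"00000nam " + encoding_flag + b"2200000   4500"
    assert len(leader) == LEADER_LENGTH
    return leader + body


# UTF-8 data wrongly flagged as MARC-8: the case reported in this issue.
mislabeled = make_record(b" ", "café".encode("utf-8"))
# The same data with the correct Unicode flag.
correct = make_record(b"a", "café".encode("utf-8"))
# MARC-8 style data: combining acute (0xE2) placed before the base letter.
real_marc8 = make_record(b" ", b"caf\xe2e")

print(looks_like_mislabeled_utf8(mislabeled))
print(looks_like_mislabeled_utf8(correct))
print(looks_like_mislabeled_utf8(real_marc8))
```

The check leans on the fact that genuine MARC-8 text with high bytes rarely happens to form valid UTF-8 sequences, so a successful strict UTF-8 decode is strong evidence that the flag is wrong. It is still a heuristic and could misfire on short or unusual records.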
