Diacritics in UTF-8 Glossary File Not Working #44

Open
DigitalProf opened this issue Aug 13, 2022 · 3 comments
Labels
api change: Requires changes to the DeepL API

Comments

@DigitalProf

DigitalProf commented Aug 13, 2022

I have attempted several times to upload a glossary so that French diacritics are handled correctly. I am sure the file is in UTF-8 and have verified this in various ways.

A related issue is that Excel by default includes a Byte Order Mark (BOM), which I have seen is included in the first item in the glossary. Using Notepad++ I have converted the encoding of the file to exclude the BOM, but that does not correct the problem. Here is the link to the glossary file that I uploaded to DeepL using Python.

I am also attaching below:

  1. A screen grab of the "Save as" screen from Excel
  2. The results of a dump of the most recent glossary that I uploaded, as well as a screen grab of the problems in the translated document
  3. The actual CSV
    SaveTestGlossaryV2
    GlossaryContents(V2)
    Sample Page of Character Encoding Problems
@daniel-jones-deepl
Member

Hi @DigitalProf Mike, thanks for creating this issue.

I reproduced the problem that the first term contains the byte order mark (BOM); there seems to be an error on our side. I've reported this issue to the team.

I also looked into Excel. Exporting using "CSV UTF-8 (Comma delimited)" gives the correct encoding, but includes the BOM. Unfortunately I could not find an easy way to omit the BOM.

As a workaround (until we can resolve the BOM issue on our side), could you try entering a dummy first entry in the CSV? For example "entry-to-be-ignored,entry-to-be-ignored". Your remaining glossary entries should be unaffected and work correctly. Please make sure the entries appear correctly in Excel -- when I open your link above, many of the entries already include wrong characters (I guess because Excel assumed the wrong encoding as the file does not contain a BOM).
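For anyone uploading from Python, another way to sidestep the BOM is to strip it while reading the file, before the contents are sent anywhere. The following is a minimal sketch, not something confirmed in this thread: it relies on Python's standard `utf-8-sig` codec, which decodes UTF-8 and drops a leading BOM if one is present. The function name, file name, and glossary entry are hypothetical, and the sketch stops at producing the CSV text; passing that text on to DeepL is left out.

```python
import codecs
import tempfile
from pathlib import Path


def read_csv_without_bom(path):
    """Return the file's text decoded as UTF-8, dropping a leading BOM if present."""
    # "utf-8-sig" behaves like "utf-8" but skips a BOM at the start of the file.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        return f.read()


# Simulate an Excel "CSV UTF-8" export: BOM bytes followed by UTF-8 text.
# The entry below is a made-up example, not taken from the glossary in this issue.
demo_path = Path(tempfile.mkdtemp()) / "glossary_with_bom.csv"
demo_path.write_bytes(codecs.BOM_UTF8 + "coffee,café\n".encode("utf-8"))

csv_data = read_csv_without_bom(demo_path)
print(repr(csv_data))
```

With this approach the first glossary term no longer starts with the BOM character, so the dummy first entry should not be needed on the client side.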

@daniel-jones-deepl added the "api change" label (Requires changes to the DeepL API) on Aug 15, 2022
@DigitalProf
Author

DigitalProf commented Aug 15, 2022

Thanks for the action on this issue, Daniel! I apologize for how the link to the Excel file works. The access to a OneDrive file via the browser does not give one a chance to state that the file is in fact in UTF-8. When the file is opened in Excel, the software asks for confirmation that the file is indeed in UTF-8. I should have tested the link myself. Sorry about that! :-)

As to the Byte Order Mark (BOM), Excel does in fact place that into the file by default. I have checked, but do not see how to export from Excel without it. I have, however, tested this aspect of the problem by opening the file in Notepad++ and changing the encoding to remove the BOM. That does not change how DeepL handles the file.

Last night, I sent along the Python code I used to upload the glossary. The code is from your site, but in copying the code into my message last evening, I believe that I now see the problem. I have checked this out, but I am thinking that I simply need to open the file in Python with UTF-8 encoding by adding this:

, encoding="utf-8"

If this is in fact the issue, I suggest that the sample code be changed on GitHub. It appears about halfway down the page on the GitHub site, in the section “Creating a glossary.”
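To illustrate why the missing `encoding` argument could matter: when `open()` is called in text mode without `encoding`, Python falls back to a platform-dependent default, which on many Windows setups is a single-byte code page such as cp1252. The sketch below assumes cp1252 as that default (an assumption; it is not confirmed for the machine in this thread) and shows how a UTF-8 "é" turns into two wrong characters, then reads a file with the encoding stated explicitly. All names and the sample entry are hypothetical.

```python
import tempfile
from pathlib import Path


def as_misread(text, wrong_encoding="cp1252"):
    """Show how UTF-8 text looks when its bytes are decoded with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def read_glossary_text(path):
    """Read a glossary CSV as text, stating the encoding instead of relying on the default."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# "é" is two bytes in UTF-8 (0xC3 0xA9); cp1252 maps them to two separate characters.
garbled = as_misread("é")
print(garbled)  # Ã©

# Made-up sample file, written as UTF-8 without a BOM.
sample_path = Path(tempfile.mkdtemp()) / "sample_glossary.csv"
sample_path.write_bytes("pupil,élève\n".encode("utf-8"))

glossary_text = read_glossary_text(sample_path)
print(glossary_text)
```

If the garbled characters in a translated document look like the `Ã©` pattern above, a wrong decoding step on the client side is a plausible cause, though the screenshots in this issue would need to be checked to confirm it.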

Cheers,

Mike

Python Code for Creating a Glossary


@DigitalProf
Author

DigitalProf commented Aug 15, 2022 via email
