XML tags with attributes change the order of the translation when tagHandling is XML but are OK when tagHandling is off #30

Open
imanirajdoost opened this issue Jul 10, 2023 · 5 comments


imanirajdoost commented Jul 10, 2023

Describe the bug
I am translating text with the DeepL API. The text contains XML tags, and some of these tags include custom attributes, for example:

That’s the <fontcolor="#007af2">timer</fontcolor>! It measures the time you spend in a module OR the time you have left to complete a challenge!

However, the structure of the XML tag is not preserved when the text is translated to Slovenian or Italian (I have not tested other languages, but they may be affected as well). The result looks like this:

Slovenian: To je časovnik <fontcolor="#007af2"></fontcolor> ! Meri čas, ki ga porabite v modulu, ALI čas, ki vam je ostal do konca izziva!

Italian: Questo è il timer <fontcolor="#007af2"></fontcolor> ! Misura il tempo trascorso in un modulo O il tempo rimasto per completare una sfida!

In other words, the word timer is moved out of the tag, leaving the tag empty. This happens when the tagHandling option is set to either XML or HTML. If I set tagHandling to off, this result is OK, but other problems occur in my text because tags are no longer handled.

To Reproduce
Steps to reproduce the behavior:
This can be reproduced in the DeepL API Simulator: https://www.deepl.com/en/docs-api/simulator/

  1. Go to the DeepL API Simulator.
  2. Paste the text That’s the <fontcolor="#007af2">timer</fontcolor>! It measures the time you spend in a module OR the time you have left to complete a challenge! into the Text field.
  3. Set the target language to Slovenian or Italian.
  4. Set tagHandling to XML.
  5. Click Send and compare the result with the one returned when tagHandling is set to off.
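
For reference, the same comparison can be sketched against the REST endpoint instead of the simulator. This is only a sketch: the free-plan URL, the placeholder auth key, and the helper name build_translate_request are assumptions, and the request is built but deliberately not sent.

```python
import urllib.parse
import urllib.request

# Assumption: free-plan endpoint; paid plans use a different host.
API_URL = "https://api-free.deepl.com/v2/translate"

TEXT = (
    'That’s the <fontcolor="#007af2">timer</fontcolor>! It measures the time '
    "you spend in a module OR the time you have left to complete a challenge!"
)


def build_translate_request(text, target_lang, auth_key, tag_handling=None):
    """Build (but do not send) a POST request for the translate endpoint.

    Omitting tag_handling corresponds to tagHandling = off in the simulator.
    """
    params = [("text", text), ("target_lang", target_lang)]
    if tag_handling is not None:
        params.append(("tag_handling", tag_handling))
    data = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=data,
        headers={"Authorization": f"DeepL-Auth-Key {auth_key}"},
    )


# "your-auth-key" is a placeholder, not a real key.
with_xml = build_translate_request(TEXT, "SL", "your-auth-key", tag_handling="xml")
without_tags = build_translate_request(TEXT, "SL", "your-auth-key")

# Passing either request to urllib.request.urlopen() would send it;
# comparing the two responses mirrors step 5 above.
```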

Expected behavior
The correct text should be:

Slovenian: To je <fontcolor="#007af2">časomer</fontcolor>! Meri čas, ki ga porabite v modulu, ALI čas, ki vam je ostal za dokončanje izziva!

Italian: È il <fontcolor="#007af2">timer</fontcolor>! Misura il tempo trascorso in un modulo O il tempo rimanente per completare una sfida!

This is the output when tagHandling is set to off, but it should also be the output when tagHandling is set to XML.

What has been tested

I tried combining different options, but none of the combinations gave the intended result. These are the parameters I changed:

SentenceSplitting=on,off,noNewLines
preserveFormatting=on,off
nonSplittingTags=fontcolor,null

UPDATE 07/10/2023 11:47 AM
The problem seems to be that the API treats the ="#007af2" part as part of the tag name, so it does not find a matching closing tag. If we add a space: <fontcolor "=#007af2"> , it works as expected. I don't know whether a fix is necessary, but support for custom attributes like this would be nice.

imanirajdoost (Author) commented

It seems that standard XML requires whitespace between the tag name and an attribute, so I guess the source text could be the problem.
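
If the source text cannot be changed, one option is to give the bare value an attribute name before sending the text and to strip it again afterwards. A minimal sketch, assuming tags always have the exact shape <name="value">; the function names and the placeholder attribute name value are hypothetical and not part of any DeepL library.

```python
import re

# Opening tag written as <name="value">, i.e. a value with no attribute name.
BARE_VALUE_TAG = re.compile(r'<([A-Za-z_][\w.-]*)=("[^"]*")>')


def add_attribute_name(text, attr="value"):
    """Rewrite <name="v"> as <name attr="v"> so the tag is well-formed XML."""
    return BARE_VALUE_TAG.sub(rf'<\1 {attr}=\2>', text)


def remove_attribute_name(text, attr="value"):
    """Undo add_attribute_name on the translated text."""
    pattern = re.compile(rf'<([A-Za-z_][\w.-]*) {re.escape(attr)}=("[^"]*")>')
    return pattern.sub(r'<\1=\2>', text)


source = 'That’s the <fontcolor="#007af2">timer</fontcolor>!'
prepared = add_attribute_name(source)
# prepared == 'That’s the <fontcolor value="#007af2">timer</fontcolor>!'
restored = remove_attribute_name(prepared)
# restored == source
```

The round trip only works if the translation returns the opening tag unchanged, which is an assumption worth checking on real output.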

JanEbbing (Member) commented Jul 10, 2023

I think the problem with this XML example is that you are using the tag name (fontcolor) as an attribute, which is not allowed. The following is an invalid XML document (you can check this with various online validators or against the XML standard):

<?xml version = "1.0" encoding = "UTF-8"?>
<note>
That’s the <fontcolor="#007af2">timer</fontcolor>!
</note>

This is a valid XML document (I added an attribute col with your color):

<?xml version = "1.0" encoding = "UTF-8"?>
<note>
That’s the <fontcolor col="#007af2">timer</fontcolor>!
</note>
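
The difference between the two documents can also be checked locally instead of with an online validator. A small sketch using Python's standard-library parser; the helper name is_well_formed is made up for illustration, and it tests well-formedness only, not validity against a schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses once wrapped in a <note> root."""
    try:
        ET.fromstring(f"<note>{fragment}</note>")
    except ET.ParseError:
        return False
    return True


bare_value = 'That’s the <fontcolor="#007af2">timer</fontcolor>!'
named_attribute = 'That’s the <fontcolor col="#007af2">timer</fontcolor>!'

print(is_well_formed(bare_value))       # False
print(is_well_formed(named_attribute))  # True
```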

imanirajdoost (Author) commented

@JanEbbing That is correct. However, there is still an issue when a tag listed in ignoreTags is nested inside another XML tag (which should be valid according to the XML standard).

This example demonstrates the problem:

Welcome back to the <gs><ignore>[SCHOOL_NAME]</ignore>!</gs> We missed you! Are you ready for this path?

Here, ignore is added to the ignoreTags list. In this case, the result in Slovenian is:

Dobrodošli nazaj na <gs><ignore>[SCHOOL_NAME]</ignore>! Pogrešali smo vas! Ste pripravljeni na to pot?</gs>

Whereas it should be:

Dobrodošli nazaj na <gs><ignore>[SCHOOL_NAME]</ignore>!</gs> Pogrešali smo vas! Ste pripravljeni na to pot?

So the closing </gs> tag is misplaced. Again, if we set tagHandling to off, the <gs> tag stays in place, but [SCHOOL_NAME] gets translated because tag handling is disabled.

Any ideas for this problem?
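
In the meantime, a misplaced closing tag like this can at least be detected after translation. A rough sketch, assuming both strings are well-formed once wrapped in a root element; the helper names are hypothetical. It only compares whether text follows each closing tag, so it catches this case but not every possible shift.

```python
import xml.etree.ElementTree as ET


def tail_presence(fragment, tag):
    """For each <tag> element, report whether non-blank text follows its closing tag."""
    root = ET.fromstring(f"<root>{fragment}</root>")
    return [bool((el.tail or "").strip()) for el in root.iter(tag)]


def tag_scope_changed(source, translated, tag):
    """True if a closing tag gained or lost trailing text during translation."""
    return tail_presence(source, tag) != tail_presence(translated, tag)


source = ("Welcome back to the <gs><ignore>[SCHOOL_NAME]</ignore>!</gs> "
          "We missed you! Are you ready for this path?")
actual = ("Dobrodošli nazaj na <gs><ignore>[SCHOOL_NAME]</ignore>! "
          "Pogrešali smo vas! Ste pripravljeni na to pot?</gs>")
expected = ("Dobrodošli nazaj na <gs><ignore>[SCHOOL_NAME]</ignore>!</gs> "
            "Pogrešali smo vas! Ste pripravljeni na to pot?")

print(tag_scope_changed(source, actual, "gs"))    # True
print(tag_scope_changed(source, expected, "gs"))  # False
```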

imanirajdoost reopened this Jul 10, 2023

JanEbbing (Member) commented Jul 10, 2023

Putting the exclamation mark outside the gs tag fixes this for me. I assume this is because we do sentence splitting, and the exclamation mark that ends this sentence is inside the XML tag. Maybe that helps?

Welcome back to the <gs><ignore>[SCHOOL_NAME]</ignore></gs>! We missed you! Are you ready for this path?
=>
Dobrodošli nazaj v <gs><ignore>[SCHOOL_NAME]</ignore></gs>! Pogrešali smo te! Ste pripravljeni na to pot?

But I agree this is not a good response from our API. I will check internally with another team.
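
The manual workaround above can be applied automatically before sending text. A minimal sketch, assuming it is acceptable for sentence-ending punctuation to end up outside the enclosing tags; the function name is hypothetical and the pattern only covers ., ! and ?.

```python
import re

# Sentence-ending punctuation immediately followed by one or more closing tags.
PUNCT_BEFORE_CLOSING_TAGS = re.compile(r'([.!?]+)((?:</[A-Za-z_][\w.-]*>)+)')


def move_punctuation_out(text):
    """Move sentence-ending punctuation from inside closing tags to just after them."""
    return PUNCT_BEFORE_CLOSING_TAGS.sub(r'\2\1', text)


source = ("Welcome back to the <gs><ignore>[SCHOOL_NAME]</ignore>!</gs> "
          "We missed you! Are you ready for this path?")
print(move_punctuation_out(source))
# Welcome back to the <gs><ignore>[SCHOOL_NAME]</ignore></gs>! We missed you! Are you ready for this path?
```

This changes which element the punctuation belongs to, so it is only suitable when the tags carry styling rather than meaning.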

imanirajdoost (Author) commented

You're right, in this case that resolves the problem. I suspect there will be other examples with the same issue; I'll add them here if I find them, to help the team resolve it.
