Discrepancy between foliapy and libfolia in stripping control characters in normalize_spaces() #55

Open
proycon opened this issue Mar 29, 2024 · 4 comments
@proycon
Member

proycon commented Mar 29, 2024

normalize_spaces() is used in text validation. Currently, foliapy (v2.5.11) and libfolia handle control characters differently here:

  • foliapy (>= v2.5.11) strips all control characters.
  • foliapy (< v2.5.11) left them as-is; this was wrong.
  • libfolia treats a control character the same as a space character. I think this is not correct, because control characters don't imply whitespace (in fact, sometimes they are explicitly zero-width).

This issue arose from @martinreynaert's data, where we see for example:

Expected: Vierstellen-Prädikate bildende Operator „ “ mit dem Zweistellen-Prädikat
Found: Vierstellen-Prädikate bildende Operator „“ mit dem Zweistellen-Prädikat     
******* DEVIATION POINT: Operator „<*HERE*>“ mit dem       

The character in question is 0x7f (DELETE).

It also happens in an instance of Hebrew text (I transliterate the Hebrew because browsers are too smart in RTL rendering and mess up the point): <0x202d>Tun-<0x202d>Idash, which libfolia turns into Tun- Idash (it inserts an unwanted space). 0x202d (U+202D) is the left-to-right override control character.
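To make the two behaviours concrete, here is a minimal Python sketch that reproduces both outcomes on the examples above. It is not the actual foliapy or libfolia code: the function names are hypothetical, and it assumes that "control character" means Unicode general categories Cc and Cf (0x7f is Cc, U+202D is Cf), with real whitespace such as tabs and newlines still counting as whitespace.

```python
import unicodedata

# Assumption: "control character" = general category Cc or Cf,
# excluding anything Python itself treats as whitespace (tab, newline, ...).
CONTROL_CATEGORIES = {"Cc", "Cf"}


def is_control(ch):
    return not ch.isspace() and unicodedata.category(ch) in CONTROL_CATEGORIES


def normalize_strip(text):
    """Hypothetical variant: drop control characters, then collapse whitespace."""
    kept = "".join(ch for ch in text if not is_control(ch))
    return " ".join(kept.split())


def normalize_as_space(text):
    """Hypothetical variant: treat control characters as spaces, then collapse."""
    replaced = "".join(" " if is_control(ch) else ch for ch in text)
    return " ".join(replaced.split())


german = "Operator „\x7f“ mit dem"
translit = "\u202dTun-\u202dIdash"

print(normalize_strip(german))       # Operator „“ mit dem
print(normalize_as_space(german))    # Operator „ “ mit dem
print(normalize_strip(translit))     # Tun-Idash
print(normalize_as_space(translit))  # Tun- Idash
```

The two variants differ only in whether a control character is deleted or replaced by a space before whitespace is collapsed, which is enough to produce the deviation shown in the validation output.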

@kosloot
Contributor

kosloot commented Mar 29, 2024

Yes, you are right. Strange oversight. But until now it never caused problems.
I am working on a fix (not that difficult), but this change has some ramifications.
In particular, it means that every file that bothered @martinreynaert should be rerun with ucto based on the fixed libfolia.
Manually correcting them is also possible, of course.
But to be sure, it would be helpful if @martinreynaert provided me with the original input files of the Wittgenstein and Kierkegaard examples BEFORE ucto was run.
I would like to be able to check whether ucto does the right thing now.
Thanks

kosloot added a commit that referenced this issue Mar 29, 2024
@kosloot
Contributor

kosloot commented Mar 30, 2024

@proycon This introduces another interesting issue: should we preserve (some?) BiDi information?
I think that this is in fact the right thing to do. But it is tricky.
I experimented with keeping the LeftToRightOverride character in libfolia, which seems to work fine BUT it also implies amending ucto.
So it is something to do in a major release cycle, IF we want it.
Any thoughts on this?
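As a rough illustration of what preserving BiDi information could look like, the Python sketch below strips control characters but keeps an explicit allow-list of directional controls. This is not the libfolia experiment described above: the function name, the `keep` parameter, and the choice of characters in the allow-list are all assumptions made for the example.

```python
import unicodedata

# Illustrative allow-list (an assumption, not an agreed set):
# LRM, RLM, the embedding/override controls U+202A..U+202E,
# and the isolate controls U+2066..U+2069.
BIDI_CONTROLS = frozenset(
    "\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)


def normalize_keep_bidi(text, keep=BIDI_CONTROLS):
    """Hypothetical variant: strip Cc/Cf characters except those in `keep`."""
    out = []
    for ch in text:
        if ch in keep or ch.isspace():
            out.append(ch)  # preserved BiDi control or real whitespace
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            continue        # any other control/format character is dropped
        else:
            out.append(ch)
    # Collapse runs of whitespace to single spaces.
    return " ".join("".join(out).split())


sample = "\u202dTun-\u202dIdash\x7f"
print(repr(normalize_keep_bidi(sample)))                    # override kept, DELETE dropped
print(repr(normalize_keep_bidi(sample, keep=frozenset())))  # everything stripped
```

Passing an empty `keep` set falls back to stripping every control character, so both behaviours can be compared against the same input.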

@proycon
Member Author

proycon commented Mar 30, 2024 via email

@kosloot
Contributor

kosloot commented Apr 2, 2024

Aren't the files from @martinreynaert examples of a use case?
After removing the Right to Left stuff, the display will be wrong, I assume.
Still, I can live with that. But a fix seems possible, and not THAT complicated.
And I would be very surprised if it broke a lot of FoLiA in the wild.
