Specify FLAG UTF-8 when converting to UTF-8, if there was no explicit FLAG option #25

Open
donnerpeter opened this issue Feb 4, 2021 · 8 comments

@donnerpeter

Hunspell reads the affix file byte by byte and decodes UTF-8 only on demand. Unless it's instructed to do so for flags, it doesn't. So a non-ASCII character like "ý" is treated as several characters, and due to another bug Hunspell silently takes just the first one and ignores the rest. As a result, words can end up with unexpected flags.

Example: pt contains FORBIDDENWORD ý, and the perfectly valid word trabalhar/akYMjLÀÚ is treated as having this flag and thus considered misspelled.
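
To make the example concrete, here is a minimal Python sketch of the two ways of reading a flag field. It is not Hunspell's code: the helper names `flags_bytewise` and `flags_utf8` are made up for this illustration, and keeping only the first byte of the FORBIDDENWORD value simply mimics the behaviour described above. The point is that the UTF-8 encodings of "ý", "À" and "Ú" all start with the byte 0xC3.

```python
def flags_bytewise(flag_field: str) -> list[int]:
    """Hypothetical helper: every byte of the UTF-8 text counts as one flag."""
    return list(flag_field.encode("utf-8"))


def flags_utf8(flag_field: str) -> list[str]:
    """Hypothetical helper: every decoded character counts as one flag."""
    return list(flag_field)


forbidden = "ý"            # value of FORBIDDENWORD in the pt affix file
word_flags = "akYMjLÀÚ"    # flag field of trabalhar/akYMjLÀÚ

# Byte-wise reading: "ý" is the two bytes C3 BD, and only the first is kept.
forbidden_first_byte = forbidden.encode("utf-8")[0]
bytewise_hit = forbidden_first_byte in flags_bytewise(word_flags)

# UTF-8 aware reading: "ý" is one flag, and the word does not carry it.
utf8_hit = forbidden in flags_utf8(word_flags)

print(hex(forbidden_first_byte), bytewise_hit, utf8_hit)
```

With the byte-wise reading the word appears to carry the forbidden flag, because "À" (C3 80) and "Ú" (C3 9A) contribute the same leading byte. With the character-wise reading it does not.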

@wooorm
Owner

wooorm commented Feb 4, 2021

Yeah, good idea. I do remember thinking about this, but it never came up. Perhaps a sed expression in crawl.sh could do the trick. PR welcome!

@donnerpeter
Author

Yes, some combination of bash and unix text processing utilities should help. Neither of them is my strong suit, so I wouldn't hold my breath for a PR from me in the very near future :)
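
For reference, a rough sketch of that step, written in Python rather than with shell tools. It assumes the affix file has already been converted to UTF-8 and read into a string. `ensure_flag_utf8` is a hypothetical helper, not something that exists in crawl.sh, and putting the new line right after SET is just a choice made for this sketch.

```python
import re


def ensure_flag_utf8(aff_text: str) -> str:
    """Add a `FLAG UTF-8` line unless the affix data has an explicit FLAG option.

    Hypothetical helper; assumes `aff_text` is already UTF-8 decoded text.
    """
    lines = aff_text.split("\n")

    # An explicit FLAG option (e.g. `FLAG long`, `FLAG num`) is left untouched.
    if any(re.match(r"FLAG\s", line) for line in lines):
        return aff_text

    # Insert after the SET line if there is one, otherwise at the very top.
    for index, line in enumerate(lines):
        if re.match(r"SET\s", line):
            lines.insert(index + 1, "FLAG UTF-8")
            break
    else:
        lines.insert(0, "FLAG UTF-8")

    return "\n".join(lines)


print(ensure_flag_utf8("SET UTF-8\nFORBIDDENWORD ý\n"))
```

Files that already declare a flag format are returned unchanged, so the step only touches dictionaries that relied on the default single-character flags.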

@wooorm
Owner

wooorm commented Jun 23, 2021

Shouldn’t this issue be about setting SET UTF-8 instead of using FLAG UTF-8? 🤔

@donnerpeter
Author

No, that's not enough. At the time of submission pt already had SET UTF-8, but Hunspell parses flags byte by byte and needs to be told that they're in UTF-8, too.

@wooorm
Owner

wooorm commented Jun 23, 2021

That sounds more complex than I thought...

But, then this is a bug in Portuguese though? It should either use ASCII flags, or SET UTF-8?

@donnerpeter
Author

Well, it was. By now pt already has FLAG UTF-8, but there might be other dictionaries with the same issue.

@wooorm
Owner

wooorm commented Jun 23, 2021

Hmm, that still seems like an issue for them, though? Shouldn't that be fixed upstream rather than patched here?

@donnerpeter
Author

The issue should be addressed wherever the dictionaries are converted into UTF-8. My understanding was that this happens here, at least partly. If I'm mistaken, then this is indeed the wrong repo :)
