Specify FLAG UTF-8 when converting to UTF-8, if there was no explicit FLAG option #25

donnerpeter · 2021-02-04T18:03:31Z

Hunspell reads the affix file byte by byte and decodes UTF-8 only on demand. Unless it's instructed to do so for flags, it doesn't. So a non-ASCII character like "ý" is treated as several single-byte characters, and due to another bug Hunspell silently takes just the first of them and ignores the rest. As a result, words can end up with unexpected flags.

Example: pt contains FORBIDDENWORD ý, and the perfectly valid word trabalhar/akYMjLÀÚ is treated as having this flag and thus considered misspelled.
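To make the collision concrete, here is a minimal Python sketch of the behaviour described above. It is a simplified model, not Hunspell's actual code: the helper names `flags_bytewise` and `flags_utf8` are hypothetical, and the model assumes that without FLAG UTF-8 every byte of the flag field counts as its own flag while a multi-byte flag in a directive is truncated to its first byte.

```python
def flags_bytewise(flag_field: str) -> set:
    """Simplified model of single-byte flag parsing: each byte is a flag."""
    return set(flag_field.encode("utf-8"))


def flags_utf8(flag_field: str) -> set:
    """Model of parsing with FLAG UTF-8: each code point is a flag."""
    return set(flag_field)


forbidden = "ý"            # from `FORBIDDENWORD ý` in the pt affix file
word_flags = "akYMjLÀÚ"    # from `trabalhar/akYMjLÀÚ`

# Assumption: only the first byte of the multi-byte flag is kept.
forbidden_byte = forbidden.encode("utf-8")[0]

# "ý", "À" and "Ú" all start with the same UTF-8 lead byte, so the
# byte-wise view sees the forbidden flag on a word that never had it.
collides_bytewise = forbidden_byte in flags_bytewise(word_flags)
collides_utf8 = forbidden in flags_utf8(word_flags)

print(hex(forbidden_byte), collides_bytewise, collides_utf8)
```

In this model the byte-wise view reports a collision and the code-point view does not, which matches the symptom reported for trabalhar.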


wooorm · 2021-02-04T18:09:56Z

Yeah, good idea. I do remember thinking about this, but it never came up. Perhaps a sed expression in crawl.sh could do the trick. PR welcome!

donnerpeter · 2021-02-04T18:24:54Z

Yes, some combination of bash and unix text processing utilities should help. Neither of them is my strong suit, so I wouldn't hold my breath for a PR from me in the very near future :)

wooorm · 2021-06-23T16:20:59Z

Shouldn’t this issue be about setting SET UTF-8 instead of using FLAG UTF-8? 🤔

donnerpeter · 2021-06-23T17:15:41Z

No, it's not enough. At the moment of submission pt already had SET UTF-8, but Hunspell parses flags byte by byte, and needs to know that they're in UTF-8, too.

wooorm · 2021-06-23T17:36:02Z

That sounds more complex than I thought...

But, then this is a bug in Portuguese though? It should either use ASCII flags, or SET UTF-8?

donnerpeter · 2021-06-23T18:30:44Z

Well, it was so. Now pt already has FLAG UTF-8, but there might be other dictionaries with this issue.

wooorm · 2021-06-23T18:35:34Z

Hmm, that still seems like an issue for them though? That should be fixed in the upstream, rather than patched here?

donnerpeter · 2021-06-23T19:03:47Z

The issue should be addressed where the dictionaries are converted into UTF-8. My understanding was that it was here, at least partly. If I'm mistaken, then this is the wrong repo indeed :)
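For illustration, the step proposed in the title could look roughly like the Python sketch below, run on the affix file text after it has been converted to UTF-8. The function name `ensure_flag_utf8` is hypothetical and this is not code from this repository. It assumes the text is already decoded, has no byte order mark, and that placing the new line directly after the SET line (or at the top when there is none) is early enough for Hunspell to pick it up before any flag is read.

```python
import re


def ensure_flag_utf8(aff_text: str) -> str:
    """Add `FLAG UTF-8` to affix text unless a FLAG option already exists."""
    # An explicit FLAG option (UTF-8, long, num, ...) is left untouched.
    if re.search(r"^FLAG\s", aff_text, flags=re.MULTILINE):
        return aff_text

    lines = aff_text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if re.match(r"SET\s", line):
            # Make sure the SET line is terminated before appending after it.
            if not line.endswith("\n"):
                lines[i] = line + "\n"
            lines.insert(i + 1, "FLAG UTF-8\n")
            return "".join(lines)

    # No SET line found: put the option at the very top instead.
    return "FLAG UTF-8\n" + aff_text


print(ensure_flag_utf8("SET UTF-8\nFORBIDDENWORD ý\n"))
```

Dictionaries whose flags are all ASCII should be unaffected by the extra line, since an ASCII character is a single byte in UTF-8 either way.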
