
Disallowed words, boundaries, case #41

Open
jaumeortola opened this issue Aug 2, 2019 · 3 comments


jaumeortola commented Aug 2, 2019

https://github.com/Common-Voice/common-voice-wiki-scraper/blob/a23abced7713c2260f78fc77252727fe719d6eca/src/checker.rs#L37

Here you split words only on white space. You should use word boundaries instead (in regexp: \b, or something equivalent, such as a list of common separators). Otherwise the word is not detected in many contexts. For example, I have the word oster-monath in the disallowed words file, but in a sentence it can appear between quotation marks ("oster-monath") or next to a comma (oster-monath,) and it is not detected.
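
For illustration, here is a minimal Rust sketch of the difference, assuming the regex crate (the function names are made up for the example and are not the ones in checker.rs):

use regex::Regex;

// Splitting on white space misses a disallowed word attached to punctuation:
// the token is `"oster-monath",`, not `oster-monath`.
fn found_by_whitespace_split(sentence: &str, word: &str) -> bool {
    sentence.split_whitespace().any(|token| token == word)
}

// Matching with word boundaries (\b) finds the word even when it sits
// next to quotation marks or a comma.
fn found_by_word_boundary(sentence: &str, word: &str) -> bool {
    let pattern = format!(r"\b{}\b", regex::escape(word));
    Regex::new(&pattern).unwrap().is_match(sentence)
}

fn main() {
    let sentence = r#"It was known as the "oster-monath", in older texts."#;
    assert!(!found_by_whitespace_split(sentence, "oster-monath"));
    assert!(found_by_word_boundary(sentence, "oster-monath"));
}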

https://github.com/Common-Voice/common-voice-wiki-scraper/blob/a23abced7713c2260f78fc77252727fe719d6eca/src/checker.rs#L42

Case is very important for spelling in most languages. I think the disallowed words should be case-sensitive. Case-insensitive matching is sometimes used in NLP, but not in spell checking! It could be made optional for each language.

jaumeortola commented

The word tokenization issue is a bit more complicated than I said, especially if you want it to be useful for all languages.

The most important thing is that the word tokenization used here must be exactly the same as the one used when generating the disallowed words list. (In addition, if different kinds of apostrophes are normalized during blacklist generation, the same conversion must be done here when extracting the sentences.)

I see these possibilities:

  1. Use only white space as a separator. The blacklist generation will be more complex, and it will require more language expertise, but it is still feasible.
  2. Use white space plus a few other common separators (comma, semicolon, quotation marks, question marks, etc.), but not hyphens or apostrophes.
  3. Use word boundaries (\b in regular expressions). This is probably too greedy and not good for some languages.

I would choose option 2. The list of separators could be a configuration option for each language.

Hyphens and apostrophes should be considered separators only when they are next to other separators. To implement this, separators should be multicharacter strings, not just single characters.
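
Roughly something like this Rust sketch (the separator list and function name are placeholders; each language would configure its own list):

// Plain punctuation is always a separator; a hyphen or apostrophe only acts
// as a separator as part of a longer, multicharacter separator string
// (e.g. -" or ,'), so oster-monath keeps its internal hyphen.
fn split_with_separators(sentence: &str, separators: &[&str]) -> Vec<String> {
    let mut normalized = sentence.to_string();
    // Replace the longer (multicharacter) separators first.
    let mut ordered: Vec<&str> = separators.to_vec();
    ordered.sort_by_key(|s| std::cmp::Reverse(s.len()));
    for sep in ordered {
        normalized = normalized.replace(sep, " ");
    }
    normalized.split_whitespace().map(str::to_string).collect()
}

fn main() {
    let separators = ["\"", ",", ";", "?", "!", "-\"", "\"-", "',", ",'"];
    let tokens = split_with_separators(r#"the "oster-monath", in spring"#, &separators);
    assert_eq!(tokens, vec!["the", "oster-monath", "in", "spring"]);
}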

Tell me what you decide, and I will redo the blacklist accordingly.

jaumeortola commented

We need to clarify the issue of word tokenization. I will make a concrete proposal.

Add a setting to the language configuration file with replacements to be applied (in order) before splitting sentences on white space. To be general enough, the replacements must be regular expressions. Apostrophe replacements can be included here. For example:

replacements = [
     [ "’", "'" ],
     [ "[,;?!.]", " " ],
     [ "'[,;?!.]", " " ],
     [ "^['-]", "" ]
]

These replacements should be made exactly the same way when extracting sentences and when creating the blacklist.
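
As a rough Rust sketch of the idea, assuming the regex crate and a replacements list parsed from the configuration file (the function name is only illustrative):

use regex::Regex;

// Apply the configured replacements in order, then split on white space.
// Calling the same function both when selecting sentences and when
// generating the blacklist guarantees that the tokens match.
fn tokenize(sentence: &str, replacements: &[(String, String)]) -> Vec<String> {
    let mut normalized = sentence.to_string();
    for (pattern, replacement) in replacements {
        let re = Regex::new(pattern).expect("invalid replacement regex");
        normalized = re.replace_all(&normalized, replacement.as_str()).into_owned();
    }
    normalized.split_whitespace().map(str::to_string).collect()
}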

The words in the blacklist should be case-sensitive. Or at least make it optional:

blacklist_caseinsensitive = false
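
The lookup could then be guarded by that flag, roughly like this (a sketch with made-up names; only the option name comes from the proposal above):

use std::collections::HashSet;

// Case-sensitive matching by default; with the flag on, both the word and
// the blacklist entries are assumed to have been lowercased when loaded.
fn is_disallowed(word: &str, blacklist: &HashSet<String>, case_insensitive: bool) -> bool {
    if case_insensitive {
        blacklist.contains(&word.to_lowercase())
    } else {
        blacklist.contains(word)
    }
}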

jaumeortola commented

Looking again into the word tokenization issue, I find that the problem was not "oster-monath" or oster-monath, but l'"oster-monath", which is not detected if the word in the blacklist is just oster-monath.
Anyway, this kind of problem will be solved by using, as I said, exactly the same tokenization and the same replacements when creating the blacklist and when selecting sentences.
