Self-censoring & accents does not work with custom non English words #8

Open
Priler opened this issue Jan 20, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@Priler

Priler commented Jan 20, 2023

When adding a custom non-English word, everything works fine except self-censoring and accents:

unsafe {
    add_word("плохоеслово", Type::PROFANE & Type::SEVERE);
    add_word("badword", Type::PROFANE & Type::SEVERE);
}

assert!("b*d w***r-d тест".is(Type::INAPPROPRIATE)); // true
assert!("badwörd тест".is(Type::INAPPROPRIATE)); // true

assert!("плохоеслово тест".is(Type::INAPPROPRIATE)); // true
assert!("п л о х о е с   л о  в о тест".is(Type::INAPPROPRIATE)); // true
assert!("плоооохоооое слово тест".is(Type::INAPPROPRIATE)); // true
assert!("п__л--о о о о х_о_о_о_о-е слово тест".is(Type::INAPPROPRIATE)); // true

assert!("пл*х*есл*во тест".is(Type::INAPPROPRIATE)); // false
assert!("плöхöеслöвö тест".is(Type::INAPPROPRIATE)); // false

Also, is there a way to add custom confusable characters?
Or should we generate custom variants for each added word?

Context

I am using rustrict version 0.5.11 (latest version)

@Priler Priler added the bug Something isn't working label Jan 20, 2023
@finnbear
Owner

Thanks for the issue!

> except self-censoring
> assert!("пл*х*есл*во тест".is(Type::INAPPROPRIATE)); // false

The main way self-censoring is currently implemented is by manually adding variations for common/likely shortenings. For example, the wordlist contains fuk, which should also cover fu*k. You could, for example, add плхеслво тест to your wordlist. The exception is ASCII vowels (a, e, i, etc.), which are handled automatically, e.g. fuck should cover f*ck.
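
Such extra wordlist entries could be generated instead of typed by hand. The following is only a sketch under assumptions, not part of rustrict: `censored_variants` and `CYRILLIC_VOWELS` are hypothetical names, and it assumes that self-censoring mostly replaces vowels. It returns every form of a word with one or more vowels dropped, so `плохоеслово` yields `плхеслво` among others.

```rust
use std::collections::BTreeSet;

// Assumed set of Cyrillic vowels that users tend to replace with `*`.
const CYRILLIC_VOWELS: [char; 10] = ['а', 'е', 'ё', 'и', 'о', 'у', 'ы', 'э', 'ю', 'я'];

/// Hypothetical helper: every variant of `word` with a non-empty subset
/// of its vowels removed. The count doubles with each vowel, so this is
/// only practical for short words.
fn censored_variants(word: &str) -> BTreeSet<String> {
    let chars: Vec<char> = word.chars().collect();
    // Indices (into `chars`) of the removable vowels.
    let vowel_idx: Vec<usize> = (0..chars.len())
        .filter(|&i| CYRILLIC_VOWELS.contains(&chars[i]))
        .collect();
    assert!(vowel_idx.len() < 16, "too many vowels to enumerate");

    let mut variants = BTreeSet::new();
    // Each bit of `mask` decides whether the corresponding vowel is dropped.
    for mask in 1u32..(1u32 << vowel_idx.len()) {
        let mut variant = String::new();
        for (i, &c) in chars.iter().enumerate() {
            let dropped = vowel_idx
                .iter()
                .position(|&v| v == i)
                .map_or(false, |bit| (mask & (1u32 << bit)) != 0);
            if !dropped {
                variant.push(c);
            }
        }
        variants.insert(variant);
    }
    variants
}
```

Each returned string could then be passed to `add_word` with the same `Type` as the original word. Whether the filter needs all of these variants or only a few likely ones is an open question; a shorter hand-picked list may work just as well.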

> except accents
> assert!("плöхöеслöвö тест".is(Type::INAPPROPRIATE)); // false

Using a Unicode inspector reveals that the о is Cyrillic but the ö is Latin. Making the filter consider the possibility of Cyrillic letters every time it sees Latin letters would make it much slower. I recommend trying to use ASCII in your wordlist if you want both self-censoring-rejection and accent-rejection to work better (e.g. use nnoxoecnobo instead of плохоеслово).
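
The ASCII spelling suggested above can be derived mechanically. The sketch below is an illustration only: `ascii_lookalike` and `to_ascii_lookalikes` are hypothetical names, and the table is reverse-engineered from the `nnoxoecnobo` example, so it covers just the letters of that one word.

```rust
/// Hypothetical lookalike table, inferred from the `nnoxoecnobo` example.
/// A real table would need an entry for every letter of the alphabet.
fn ascii_lookalike(c: char) -> Option<char> {
    match c {
        'в' => Some('b'),
        'е' => Some('e'),
        'л' | 'п' => Some('n'),
        'о' => Some('o'),
        'с' => Some('c'),
        'х' => Some('x'),
        _ => None,
    }
}

/// Returns the ASCII spelling of `word`, or `None` if any character
/// has no lookalike in the table.
fn to_ascii_lookalikes(word: &str) -> Option<String> {
    word.chars().map(ascii_lookalike).collect()
}
```

Returning `None` for unmapped letters makes gaps in the table visible, rather than silently producing a half-converted wordlist entry.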

@Priler
Author

Priler commented Jan 20, 2023

Yeah, I see how self-censoring is implemented.
So I would have to add N variations of the same word in order to cover such cases.

As for accents, I wanted to say that there should be some way to extend replacements, for example.
Then I could just add a custom table, and ö would be replaced with Cyrillic о.

And yes, I do understand that the same ö can also be replaced with ASCII o.
So I suggest adding some kind of mode option that can be switched to make rustrict work for a specific language (with Cyrillic support, for example).
This is how it's implemented in py-censure (via the lang argument).

Cuz, AFAIK, the current implementation of rustrict is closely tied to ASCII/English profanity filtering.
It lacks localization options.

P.S. These are my thoughts and suggestions on how rustrict could be improved.
I mean, if there were more localization options, I could then provide you with the respective profanity dictionaries (for Russian, for example, cuz it is spoken in many countries, not only in Russia itself).

@finnbear
Owner

finnbear commented Jan 20, 2023

I am open to expanding rustrict to additional languages, to the extent that it doesn't add too much complexity or overhead*.

*adding more words/replacements is probably never too much overhead, but adding more filter steps/features might be.

> As for accents, I wanted to say that there should be some way to extend replacements

I could add that in a future update, but it likely wouldn't help as much as you think (because of the effort required to make a comprehensive list of replacements).

> Then I could just add a custom table, and ö would be replaced with Cyrillic о.

The umlaut would (along with all other accents) be filtered out by Unicode normalization in the very early stages of the filter, leaving only 'o' (which would then be subject to replacement rules).

While it would, in theory, be possible to replace all 'o' lookalikes with all other 'o' lookalikes, it seems more efficient to use ASCII 'o' in place of Cyrillic 'о' within the profanity list. That's not because the filter couldn't match the Cyrillic 'о' but because the filter is already engineered to replace tens or hundreds of 'o' lookalikes with ASCII 'o'.
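
To illustrate why stripping the accent does not help a Cyrillic wordlist entry, here is a rough sketch. It is not rustrict's actual normalization step: `strip_combining_marks` and `is_cyrillic` are hypothetical helpers, and the input is assumed to be already decomposed (full NFD decomposition would need a Unicode normalization library, which the standard library does not provide).

```rust
/// Drops combining diacritical marks (U+0300..=U+036F).
/// Assumes the input is already in decomposed form.
fn strip_combining_marks(text: &str) -> String {
    text.chars()
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        .collect()
}

/// True for characters in the basic Cyrillic block (U+0400..=U+04FF).
fn is_cyrillic(c: char) -> bool {
    ('\u{0400}'..='\u{04FF}').contains(&c)
}
```

With a decomposed ö (`"o\u{0308}"`) as input, what remains is Latin `o` (U+006F). That is a different code point from Cyrillic `о` (U+043E), so a wordlist entry written in Cyrillic still would not match without a further replacement rule.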

> So I suggest adding some kind of mode option that can be switched to make rustrict work for a specific language (with Cyrillic support, for example).
> This is how it's implemented in py-censure (via the lang argument).
> I mean, if there were more localization options, I could then provide you with the respective profanity dictionaries (for Russian, for example, cuz it is spoken in many countries, not only in Russia itself).

It looks like py-censure has built-in wordlists for different languages (English and Russian at the moment). I do hope to add the option, in the future, to easily substitute out the wordlist (or compose multiple wordlists). The main obstacle is finding false-positives (e.g. "assassin" or "push it"), which takes about 2-3 minutes and requires the entire dictionary for the language (too long and too much data to do at runtime).
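
The kind of false-positive search described above might look roughly like the sketch below. This is an assumption about the general idea, not rustrict's actual tooling: `find_false_positives` is a hypothetical name, and it only catches single dictionary words such as "assassin", not phrases that span a word boundary such as "push it".

```rust
/// Hypothetical offline check: returns the dictionary words that contain
/// a wordlist entry as a substring (without being that entry) and would
/// therefore be flagged by mistake.
fn find_false_positives<'a>(dictionary: &[&'a str], wordlist: &[&str]) -> Vec<&'a str> {
    dictionary
        .iter()
        .copied()
        .filter(|&word| {
            wordlist
                .iter()
                .any(|&bad| word.contains(bad) && word != bad)
        })
        .collect()
}
```

The cost grows with the product of dictionary size and wordlist size, which fits the point that the whole dictionary is needed and that the search is too slow to run at runtime.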

@mkadirtan

Seems like https://github.com/alexzel/bad-words-next has an extendable word list structure for multiple languages. Maybe this library can adopt a similar data structure?

@finnbear
Owner

finnbear commented Apr 3, 2023

> Seems like https://github.com/alexzel/bad-words-next has an extendable word list structure for multiple languages. Maybe this library can adopt a similar data structure?

Can you explain the benefit of this over a single wordlist with profanity in multiple languages? (the current approach)

Are you trying to remove languages you don't care about to make the filter more efficient?

@mkadirtan

>> Seems like https://github.com/alexzel/bad-words-next has an extendable word list structure for multiple languages. Maybe this library can adopt a similar data structure?
>
> Can you explain the benefit of this over a single wordlist with profanity in multiple languages? (the current approach)
>
> Are you trying to remove languages you don't care about to make the filter more efficient?

My main takeaway was that the bad-words-next package uses per-language lookalikes (a.k.a. replacements) in its word lists. This allows for Cyrillic alphabet conversions. Also, you can explicitly censor in a single language with this approach. I didn't think about the efficiency, though.

@finnbear
Owner

finnbear commented Apr 7, 2023

> This allows for Cyrillic alphabet conversions.

One of the barriers between rustrict and better Cyrillic support is indeed alphabet conversions. Right now, most rustrict lookalike characters are targeted at ASCII letters. In other words, a Cyrillic А can be interpreted as a Latin A in a Latin profanity but a Latin A won't be interpreted as a Cyrillic А in a Cyrillic profanity.

If every character that looks like A had to reference every other character that looks like A (and the same for the other 52+ letters), the replacement list would take too much memory. I have a few ideas for fixing this but none of them are particularly appealing.
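
The difference between the two approaches can be sketched as follows. Everything here is illustrative: `canonical_table`, `canonicalize`, `one_way_entries` and `all_pairs_entries` are hypothetical names, and the three-entry table stands in for the real replacement list, whose format is not shown in this thread.

```rust
use std::collections::HashMap;

/// Hypothetical one-directional table: each lookalike maps to a single
/// ASCII letter, never the other way around.
fn canonical_table() -> HashMap<char, char> {
    HashMap::from([
        ('\u{0410}', 'A'), // Cyrillic capital A
        ('\u{0391}', 'A'), // Greek capital Alpha
        ('\u{FF21}', 'A'), // fullwidth Latin capital A
    ])
}

/// Replaces every known lookalike with its ASCII letter.
fn canonicalize(text: &str, table: &HashMap<char, char>) -> String {
    text.chars()
        .map(|c| table.get(&c).copied().unwrap_or(c))
        .collect()
}

/// Entries needed when each lookalike points at one canonical letter.
fn one_way_entries(lookalikes: usize) -> usize {
    lookalikes
}

/// Entries needed when each of the `lookalikes + 1` forms (the canonical
/// letter included) points at every other form.
fn all_pairs_entries(lookalikes: usize) -> usize {
    (lookalikes + 1) * lookalikes
}
```

With 100 lookalikes for a single letter, the one-way table needs 100 entries while the all-pairs table needs 10,100, which is the quadratic growth behind the memory concern. It also shows the asymmetry: a Cyrillic А in the input becomes Latin A, but a Latin A is left alone, so it can never match a wordlist entry spelled in Cyrillic.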

> Also, you can explicitly censor in a single language with this approach

Indeed 👌

@mkadirtan

> the replacement list would take too much memory.

So this is a memory problem. Maybe only convert the most common variations of the letter A? For example, only the variations that can be typed on a keyboard?


3 participants