Self-censoring & accents does not work with custom non English words #8

Priler · 2023-01-20T00:47:29Z

When adding a custom non-English word, everything works fine except self-censoring and accents.

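use rustrict::{add_word, CensorStr, Type};

// add_word requires the `customize` feature and must not run concurrently with other uses of the crate.
unsafe {
    add_word("плохоеслово", Type::PROFANE & Type::SEVERE);
    add_word("badword", Type::PROFANE & Type::SEVERE);
}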

assert!("b*d w***r-d тест".is(Type::INAPPROPRIATE)); // true
assert!("badwörd тест".is(Type::INAPPROPRIATE)); // true

assert!("плохоеслово тест".is(Type::INAPPROPRIATE)); // true
assert!("п л о х о е с   л о  в о тест".is(Type::INAPPROPRIATE)); // true
assert!("плоооохоооое слово тест".is(Type::INAPPROPRIATE)); // true
assert!("п__л--о о о о х_о_о_о_о-е слово тест".is(Type::INAPPROPRIATE)); // true

assert!("пл*х*есл*во тест".is(Type::INAPPROPRIATE)); // false
assert!("плöхöеслöвö тест".is(Type::INAPPROPRIATE)); // false

Also, is there a way to add custom confusable characters?
Or should we generate custom variants for each added word?

Context

I am using rustrict version 0.5.11 (latest version)

finnbear · 2023-01-20T05:51:43Z

Thanks for the issue!

except self-censoring
assert!("пл*х*есл*во тест".is(Type::INAPPROPRIATE)); // false

The main way self-censoring is currently implemented is by manually adding variations for common/likely shortenings. For example, the wordlist contains fuk, which should also cover fu*k. You could, for example, add плхеслво to your wordlist. The exception is ASCII vowels (a, e, i, etc.), which are handled automatically, e.g. fuck should cover f*ck.
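A minimal sketch of generating such variations for a custom word, assuming every combination of dropped vowels should be covered. `censored_variants` and `CYRILLIC_VOWELS` are hypothetical helpers, not part of rustrict, and the vowel list is an assumption about which letters people star out.

```rust
use std::collections::BTreeSet;

/// Lowercase Russian vowels (an assumption about which letters get starred out).
const CYRILLIC_VOWELS: &[char] = &['а', 'е', 'ё', 'и', 'о', 'у', 'ы', 'э', 'ю', 'я'];

/// Hypothetical helper: returns `word` plus every variant with some subset of
/// its vowels removed, so that a self-censored spelling minus its stars is covered.
fn censored_variants(word: &str, vowels: &[char]) -> BTreeSet<String> {
    let chars: Vec<char> = word.chars().collect();
    let vowel_count = chars.iter().filter(|c| vowels.contains(*c)).count();
    // There are 2^n variants, so refuse words where that would explode.
    assert!(vowel_count <= 12, "too many vowels to enumerate");

    let mut variants = BTreeSet::new();
    for mask in 0u32..(1u32 << vowel_count) {
        let mut vowel_index = 0u32;
        let mut variant = String::new();
        for &c in &chars {
            if vowels.contains(&c) {
                // Bit `vowel_index` of `mask` set means "drop this vowel".
                let dropped = (mask >> vowel_index) & 1 == 1;
                vowel_index += 1;
                if dropped {
                    continue;
                }
            }
            variant.push(c);
        }
        variants.insert(variant);
    }
    variants
}
```

Each resulting string would still have to be registered with `add_word` inside the `unsafe` block, as in the original report. The set grows as 2^n in the number of vowels, so it may be more practical to add only the shortenings that actually show up.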

except accents
assert!("плöхöеслöвö тест".is(Type::INAPPROPRIATE)); // false

A Unicode inspector reveals that the о is Cyrillic but the ö is Latin. Making the filter consider the possibility of Cyrillic letters every time it sees Latin letters would make it much slower. I recommend trying to use ASCII in your wordlist if you want both self-censoring-rejection and accent-rejection to work better (e.g. use nnoxoecnobo instead of плохоеслово).
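A rough sketch of producing such an ASCII wordlist entry from a Cyrillic word. The table is hand-picked and hypothetical: only the п, л, о, х, е, с and в mappings follow from the nnoxoecnobo example, and whether rustrict's own replacement list maps each of these lookalikes back is an assumption you would need to verify.

```rust
/// Hypothetical table of lowercase Cyrillic letters and the ASCII letters they resemble.
/// Returns `None` for letters with no obvious ASCII lookalike.
fn ascii_lookalike(c: char) -> Option<char> {
    Some(match c {
        c if c.is_ascii() => c.to_ascii_lowercase(),
        'а' => 'a',
        'в' => 'b',
        'е' => 'e',
        'к' => 'k',
        'л' | 'п' => 'n',
        'м' => 'm',
        'о' => 'o',
        'р' => 'p',
        'с' => 'c',
        'т' => 't',
        'у' => 'y',
        'х' => 'x',
        _ => return None,
    })
}

/// Converts a whole word, or returns `None` if any letter has no lookalike,
/// in which case the word cannot be expressed this way.
fn to_ascii_lookalikes(word: &str) -> Option<String> {
    word.chars().map(ascii_lookalike).collect()
}
```

The `Some(..)` result is what would be passed to `add_word` in place of the Cyrillic spelling. Words containing letters such as ж or щ fall through to `None` and need another approach.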

Priler · 2023-01-20T21:09:56Z

Yeah, I see how self-censoring is implemented.
Then I should add N variations of the same word in order to include such cases.

As for accents, I wanted to say that there should be some way to extend replacements, for example.
Then I could just add a custom table and ö would be replaced with Cyrillic о.

And yes, I do understand that the same ö can be replaced with ASCII o.
So I suggest adding some kind of mode option that can be switched to make rustrict work for a specific language (e.g. with Cyrillic support).
This is how it's implemented in py-censure (via the lang argument).

Because, AFAIK, the current implementation of rustrict is closely tied to ASCII/English profanity filtering.
It lacks localization options.

P.S. These are just my thoughts and suggestions on how rustrict could be improved.
I mean, if there were more localization options, I could provide you with the respective profanity dictionaries (for Russian, for example, since it is spoken in many countries, not only in Russia itself).

finnbear · 2023-01-20T22:45:43Z

I am open to expanding rustrict to additional languages, to the extent that it doesn't add too much complexity or overhead*.

*adding more words/replacements is probably never too much overhead, but adding more filter steps/features might be.

As for accents, I wanted to say that there should be some way to extend replacements

I could add that in a future update, but it likely wouldn't help as much as you think (because of the effort required to make a comprehensive list of replacements).

Then I could just add a custom table and ö would be replaced with Cyrillic о.

The umlaut would (along with all other accents) be filtered out by Unicode normalization in the very early stages of the filter, leaving only 'o' (which would then be subject to replacement rules).
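To illustrate why the accent is gone before any replacement rule runs, here is a simplified sketch of accent stripping. It is not rustrict's actual code: `decompose` is a hypothetical three-entry stand-in for full Unicode decomposition tables, and only the combining marks in U+0300..U+036F are removed.

```rust
/// Hypothetical, tiny stand-in for canonical decomposition (a real filter
/// would use complete Unicode normalization data).
fn decompose(c: char) -> char {
    match c {
        '\u{00F6}' => 'o',        // ö = Latin o + combining diaeresis
        '\u{00E9}' => 'e',        // é = Latin e + combining acute
        '\u{0451}' => '\u{0435}', // Cyrillic ё = Cyrillic е + combining diaeresis
        _ => c,
    }
}

/// True for characters in the Combining Diacritical Marks block.
fn is_combining_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Reduces precomposed letters to their base letter and drops standalone marks.
fn strip_accents(text: &str) -> String {
    text.chars()
        .map(decompose)
        .filter(|c| !is_combining_mark(*c))
        .collect()
}
```

Under these assumptions, both the precomposed `ö` (U+00F6) and `o` followed by U+0308 come out as Latin `o` (U+006F), never as Cyrillic `о` (U+043E). That is why a word stored with Cyrillic letters still fails to match afterwards.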

While it would, in theory, be possible to replace all 'o' lookalikes with all other 'o' lookalikes, it seems more efficient to use ASCII 'o' in place of Cyrillic 'о' within the profanity list. That's not because the filter couldn't match the Cyrillic 'о' but because the filter is already engineered to replace tens or hundreds of 'o' lookalikes with ASCII 'o'.

So I suggest adding some kind of mode option that can be switched to make rustrict work for a specific language (e.g. with Cyrillic support).
This is how it's implemented in py-censure (via the lang argument).
I mean, if there were more localization options, I could provide you with the respective profanity dictionaries (for Russian, for example, since it is spoken in many countries, not only in Russia itself).

It looks like py-censure has built-in wordlists for different languages (English and Russian at the moment). I do hope to add the option, in the future, to easily substitute out the wordlist (or compose multiple wordlists). The main obstacle is finding false-positives (e.g. "assassin" or "push it"), which takes about 2-3 minutes and requires the entire dictionary for the language (too long and too much data to do at runtime).

mkadirtan · 2023-04-03T10:55:01Z

Seems like https://github.com/alexzel/bad-words-next has an extendable word list structure for multiple languages. Maybe this library can adopt a similar data structure?

finnbear · 2023-04-03T15:03:15Z

Seems like https://github.com/alexzel/bad-words-next has an extendable word list structure for multiple languages. Maybe this library can adopt a similar data structure?

Can you explain the benefit of this over a single wordlist with profanity in multiple languages? (the current approach)

Are you trying to remove languages you don't care about to make the filter more efficient?

mkadirtan · 2023-04-06T10:56:20Z

Seems like https://github.com/alexzel/bad-words-next has an extendable word list structure for multiple languages. Maybe this library can adopt a similar data structure?

Can you explain the benefit of this over a single wordlist with profanity in multiple languages? (the current approach)

Are you trying to remove languages you don't care about to make the filter more efficient?

My main takeaway was that the bad-words-next package uses per-language lookalikes (a.k.a. replacements) in its word lists. This allows for Cyrillic alphabet conversions. Also, you can explicitly censor in a single language with this approach. I didn't think about the efficiency, though.

finnbear · 2023-04-07T05:32:36Z

This allows for Cyrillic alphabet conversions.

One of the barriers between rustrict and better Cyrillic support is indeed alphabet conversions. Right now, most rustrict lookalike characters are targeted at ASCII letters. In other words, a Cyrillic А can be interpreted as a Latin A in a Latin profanity but a Latin A won't be interpreted as a Cyrillic А in a Cyrillic profanity.

If every character that looks like A had to reference every other character that looks like A (and same thing for the other 52+ letters), the replacement list would take too much memory. I have a few ideas for fixing this but none of them are particularly appealing.
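A small sketch of the asymmetry and of why an all-pairs table grows so quickly. The five-entry table and both helper functions are hypothetical illustrations, not rustrict's data or API, and the figure of 100 lookalikes is only an example.

```rust
use std::collections::HashMap;

/// Hypothetical one-directional table: each lookalike points at one ASCII letter.
fn lookalikes_to_ascii() -> HashMap<char, char> {
    HashMap::from([
        ('\u{0410}', 'a'), // Cyrillic А
        ('\u{0430}', 'a'), // Cyrillic а
        ('\u{0391}', 'a'), // Greek Α
        ('\u{041E}', 'o'), // Cyrillic О
        ('\u{043E}', 'o'), // Cyrillic о
    ])
}

/// Maps lookalikes to ASCII; characters without an entry (including plain
/// ASCII letters) are kept, so Latin input never becomes Cyrillic.
fn canonicalize(text: &str, table: &HashMap<char, char>) -> String {
    text.chars()
        .map(|c| table.get(&c).copied().unwrap_or(c))
        .map(|c| c.to_ascii_lowercase())
        .collect()
}

/// Table entries needed if each of `n` lookalikes of one letter had to
/// reference every other lookalike, versus `n` entries for a single target.
fn all_pairs_entries(n: u64) -> u64 {
    n * n.saturating_sub(1)
}
```

With this table, an input spelled with Cyrillic а canonicalizes to the ASCII spelling and can match a Latin wordlist entry, while a Latin spelling is left unchanged and so cannot match a Cyrillic entry. For an illustrative 100 lookalikes of a single letter, the all-pairs scheme needs 9,900 entries instead of 100.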

Also, you can explicitly censor in a single language with this approach

Indeed 👌

mkadirtan · 2023-04-07T07:52:34Z

the replacement list would take too much memory.

So this is a memory problem. Maybe only convert the most common variations of the letter A, i.e. only variations that can be typed on a keyboard?

Priler added the bug (Something isn't working) label Jan 20, 2023

Priler assigned finnbear Jan 20, 2023
