
Allow inverse existence_checks (re: up-goer-five check) #1334

Closed
patcon opened this issue Dec 9, 2022 · 7 comments · Fixed by #1370
Labels: cat: new-rule · priority: null · status: needs-help · type: feat · version: minor

Comments


patcon commented Dec 9, 2022

Hi there! This tool is neat. Thanks!

I was hoping to create a check that raises an issue when any word outside a given set is used (e.g. the top 1000 English words):
https://splasho.com/upgoer5/phpspellcheck/dictionaries/1000.dicin

Can you suggest the simplest way to go about this? I'm happy to submit a PR when I get around to it :)

Inspiration: https://splasho.com/upgoer5/

@patcon patcon changed the title Allow inverse existence_checks Allow inverse existence_checks (re: up-goer-five check) Dec 9, 2022
vqle23 (Contributor) commented Apr 5, 2023

Hey, this sounds interesting. I can probably work on this within the next week if you haven't.

@Nytelife26 Nytelife26 self-assigned this Jul 27, 2023
@Nytelife26 Nytelife26 added labels type: feat, priority: null, version: minor, status: needs-help, cat: new-rule on Jul 27, 2023
Nytelife26 (Member) commented:

I can see the potential utility in this, but it feels out of scope for proselint. Can you think of any scenario where this may be useful in checking prose besides in writing tasks that involve special restrictions? I fear this check would be computationally expensive, and unnecessary for the typical use of the program.

patcon (Author) commented Aug 25, 2023

I feel this sort of processing is most likely to be helpful to people writing linters for ESL (English-as-a-second-language) readers, or for scientific communication with very young readers or those with minimal or no high-school education.

I don't want to take anyone's time for granted (we should only work on what we enjoy!), but I will admit it feels arbitrary to put these categories of use outside the realm of worth serving. I totally understand if you do not wish to invest your own time, but please reconsider making a blanket judgement that this is too computationally expensive 🙏🏻🙏🏻🙏🏻🙂

Nytelife26 (Member) commented:

Hi, @patcon. I have reconsidered, indeed. Many other linters provide configurations for special environments, and it can't hurt to broaden proselint's utility. Maybe this could be implemented as a plugin, when such a system is available? Alternatively, if you have suggestions for how this could work in a similarly performant manner to what we have now, I would be open to including this in proselint's core.

ferdnyc (Contributor) commented May 8, 2024

So, with respect to the fine initial work done by @vqle23 in #1351, I have a different implementation of this that I think solves the major issues that came up in that PR's review process.

I started from these basic premises:

  1. We have a relatively small universe of words in the "pass" list
  2. We'll eventually need to find the position of every non-matching word
  3. No matter how long the input text is, we're going to end up searching through it all the way to the end.
  4. (Tangentially and as a caveat/addendum: At least in the first pass, words shorter than 3 characters aren't worth bothering with.)1

So, even on a super long text, because we eventually need to regular-expression-search the whole thing anyway, rather than building searches for any specific word(s), it makes more sense to just search it for every word, using re.compile(r"[\w'-]{3,}").finditer(text).2

Then, we can walk the iterator of matched results (non-overlapping word spans in the input text), and examine each one to (a) exclude any strings containing numerals, then (b) determine whether it appears on the supplied list of permitted words.

Anytime matched in list isn't True, that word can immediately be added to the errors list without any additional processing, because we already have the necessary position data in the .finditer() Match objects we're iterating over.

So, that implementation looks something like this:

import re

def reverse_existence_check(
        text, list, err, msg, ignore_case=True, str=False, offset=0
        ):
    """Find all words in ``text`` that aren't on the ``list``."""
    if ignore_case:
        list = [word.lower() for word in list]

        def allowed_word(match):
            matched = match.string[match.start():match.end()].lower()
            return matched in list
    else:
        def allowed_word(match):
            matched = match.string[match.start():match.end()]
            return matched in list

    # Match all 3+ character words
    tokenizer = re.compile(r"[\w'-]{3,}")
    # Remove any that contain numerals
    exclusions = re.compile(r'[0-9]')

    errors = [(
        m.start() + 1 + offset,
        m.end() + offset,
        err,
        msg.format(m.string[m.start():m.end()]),
        None)
        for m in tokenizer.finditer(text)
        if not exclusions.search(m.string[m.start():m.end()])
        and not allowed_word(m)
    ]
    return errors

(A potential further optimization: the function could actually be written as a generator, using yield instead of building the full list. Consuming its output as an iterator would then let it operate incrementally over the length of the input text.)
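A minimal sketch of that generator variant, assuming a simplified signature (the name and parameter set here are hypothetical) that folds in the same tokenizer and numeral exclusion:

```python
import re

def reverse_existence_gen(text, permitted, err, msg, offset=0):
    """Lazily yield an error tuple for each 3+ character word in
    ``text`` that is not in ``permitted``."""
    tokenizer = re.compile(r"[\w'-]{3,}")   # match all 3+ character words
    exclusions = re.compile(r"[0-9]")       # skip tokens containing numerals
    permitted = {w.lower() for w in permitted}
    for m in tokenizer.finditer(text):
        token = m.group(0)
        if exclusions.search(token) or token.lower() in permitted:
            continue
        # Position data comes straight from the Match object,
        # so flagging a word needs no extra searching.
        yield (m.start() + 1 + offset, m.end() + offset,
               err, msg.format(token), None)
```

A caller that only wants the first few errors (say, to bail out early on a huge document) could then stop consuming the iterator without paying for the rest of the scan.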

Iterative processing may not even be necessary, though. I happen to have a text version of "The Hitchhiker's Guide to the Galaxy" (a transcript of the radio plays, I think, not the book version); anyway, it's 5019 lines and just shy of 47,000 words.

I imported it as a Python string, and ran the whole thing through the reverse-existence implementations. They produce wildly different results, so it's not an entirely balanced comparison, but running some time trials on how they processed the text at least let me confirm that the basic concept I was working with is sound.

>>> hhgttg = <the text>
>>> # (Note: I shuffled some things around to make the
>>> #  word lists from #1351 accessible to other code)
>>> t1k = proselint.checks.reverse_existence.top1000.WORDS
>>> oldfn = 'proselint.tools.reverse_existence_check'
>>> newfn = 'proselint.tools.my_revex'
>>>
>>> timeit.repeat(
...     f"{oldfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
...     number=1, repeat=2, globals=locals())
[59.088628315133974, 58.57274700002745]
>>>
>>> timeit.repeat(
...     f"{newfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
...     number=1, repeat=2, globals=locals())
[0.6903484750073403, 0.6852877610363066]
>>>
>>> orig = proselint.tools.reverse_existence_check(
...     hhgttg, t1k, 'error', 'Bad word {}')
>>> mine = proselint.tools.my_revex(
...     hhgttg, t1k, 'error', 'Bad word {}')
>>> len(orig)
10990
>>> len(mine)
13755

So, scanning ~47,000 words in a slightly noisy prose string, it flags 30% of them in a little over 2/3 of a second, up from flagging 23% in just shy of 60 seconds.

The increase in the number of results is a little surprising, especially when you look at what's actually flagged on either end:

>>> from pprint import pp
>>> pp([x[3] for x in orig[:50]])
['Bad word Douglas',
 'Bad word Hitch',
 'Bad word Guide',
 'Bad word Galaxy',
 'Bad word Fantazy.',
 'Bad word 1990.',
 'Bad word Based',
 'Bad word famous',
 'Bad word series',
 'Bad word Douglas',
 'Bad word N.',
 'Bad word Adams',
 'Bad word born',
 'Bad word Cambridge',
 'Bad word 1952.',
 'Bad word Brentwood',
 'Bad word Essex',
 'Bad word St.',
 'Bad word Cambridge',
 'Bad word English.',
 'Bad word graduation',
 'Bad word years',
 'Bad word material',
 'Bad word radio',
 'Bad word television',
 'Bad word shows',
 'Bad word writing,',
 'Bad word performing',
 'Bad word directing',
 'Bad word London,',
 'Bad word Cambridge',
 'Bad word Edinburgh',
 'Bad word worked',
 'Bad word various',
 'Bad word times',
 'Bad word porter,',
 'Bad word chicken',
 'Bad word cleaner,',
 'Bad word bodyguard,',
 'Bad word radio',
 'Bad word producer',
 'Bad word script',
 'Bad word married,',
 'Bad word Surrey.',
 'Bad word Jonny',
 'Bad word Clare',
 'Bad word Arlingtonians',
 'Bad word tea,',
 'Bad word sympathy,',
 'Bad word sofa']
>>>
>>>
>>> pp([x[3] for x in mine[:50]])
['Bad word Douglas',
 'Bad word Adams',
 'Bad word Hitch',
 'Bad word Hikers',
 'Bad word Guide',
 'Bad word Galaxy',
 'Bad word Fantazy',
 'Bad word Based',
 'Bad word famous',
 'Bad word Radio',
 'Bad word series',
 'Bad word Douglas',
 'Bad word Adams',
 'Bad word born',
 'Bad word Cambridge',
 'Bad word educated',
 'Bad word Brentwood',
 'Bad word Essex',
 "Bad word John's",
 'Bad word Cambridge',
 'Bad word English',
 'Bad word graduation',
 'Bad word years',
 'Bad word contributing',
 'Bad word material',
 'Bad word radio',
 'Bad word television',
 'Bad word shows',
 'Bad word writing',
 'Bad word performing',
 'Bad word directing',
 'Bad word stage',
 'Bad word revues',
 'Bad word London',
 'Bad word Cambridge',
 'Bad word Edinburgh',
 'Bad word Fringe',
 'Bad word worked',
 'Bad word various',
 'Bad word times',
 'Bad word porter',
 'Bad word barn',
 'Bad word builder',
 'Bad word chicken',
 'Bad word shed',
 'Bad word cleaner',
 'Bad word bodyguard',
 'Bad word radio',
 'Bad word producer',
 'Bad word script']

The original implementation flags words like English. (note the period captured at the end), 1990., N., and St., all of which except for English my version excludes. But it also seems to miss some words, like Adams and Radio early on; I haven't investigated why.3 And it excludes John's, due to the apostrophe.

The code still needs additional testing and cleanup, but I think it's an approach worth getting in shape to submit as a PR, if @Nytelife26 is still open to incorporating the feature?

...Oh, and after writing this, I realized I could massively further speed up the permitted-word lookups by transforming the input list into a set up front. That reduced the processing time to...

>>> timeit.repeat(
...     f"{newfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
...     number=1, repeat=2, globals=locals())
[0.1017304239794612, 0.09134247805923223]

...around a tenth of a second.
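That change amounts to converting the permitted-word list into a set once, up front. A minimal sketch of the idea (the helper name here is hypothetical):

```python
def build_lookup(word_list, ignore_case=True):
    """Convert the permitted-word list into a set, making each
    membership test O(1) on average instead of O(n) over the list."""
    if ignore_case:
        return {w.lower() for w in word_list}
    return set(word_list)

# Built once per check; every subsequent `token in permitted` lookup
# is then a hash probe rather than a linear scan of ~1000 entries.
permitted = build_lookup(["The", "Cat", "Sat"])
```

Since the check tests membership once per word of input, on a ~47,000-word text that trades tens of millions of list comparisons for one pass of hashing, which lines up with the measured speedup.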

Notes

  1. That can be revisited, but I think it's a reasonable initial constraint. All of the 1- and 2-letter pronouns, prepositions, and so on are on the Top-1000 list anyway. But obscure stuff like "Mr." or "OK" or "IV" becomes annoying to deal with, and beyond that, the amount of textual "noise" that happens to have 2 adjacent letters but isn't word content at all is much higher than if you set the lower limit at 3 letters.

  2. You need the hyphen to treat hyphenated terms as one word, and you need the apostrophe to handle possessives and contractions. (Otherwise — especially if words of less than 3 characters are included — you end up checking endless piles of 's', 'ed', 'nt', etc...)

  3. For the record, here's the first 500 characters of the input text:

    >>> print('\n'.join(
    ... [x.strip() for x in hhgttg[:500].splitlines() if x]))
    Douglas Adams. The Hitch Hikers Guide to Galaxy
    Fantazy. 1990.
    Based on the famous Radio series
    Douglas  N.  Adams was born in Cambridge in 1952. He was
    educated at Brentwood School, Essex  and  St.  John's  College,
    Cambridge where he read English. After graduation he spent several
    years contributing material to radio and television  shows
    as  well  as  writing, performing and sometimes directing stage
    revues in London, Cambridge and on the Edinburgh Fringe. He
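To make note 2 above concrete, here's a quick comparison (a hypothetical snippet, not part of the implementation) of tokenizing with and without the apostrophe and hyphen in the character class:

```python
import re

text = "John's well-known co-op wasn't"

# With apostrophe and hyphen in the class, possessives,
# contractions, and hyphenated terms stay whole:
with_marks = re.compile(r"[\w'-]{3,}")
print(with_marks.findall(text))
# ["John's", 'well-known', 'co-op', "wasn't"]

# Without them, the same text shatters into fragments, and
# 'co-op' disappears entirely since no piece reaches 3 characters:
plain = re.compile(r"\w{3,}")
print(plain.findall(text))
# ['John', 'well', 'known', 'wasn']
```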

Nytelife26 (Member) commented:

@ferdnyc i must say, that's impressive work. the kind of science i know and love.

your performance improvements are beyond what i would've considered acceptable for an implementation. feel free to open a PR, and i'll review it within a few days.

ultimately i am not sure we can make good use of it until the general program and configuration is refactored, but that's for down the road. thank you for your contribution here.

Nytelife26 (Member) commented May 14, 2024

@patcon i have good news for you. a year and a half later, thanks to the efforts of @ferdnyc and @vqle23, we have a solid implementation of this and two new optional checks to go with it.
