
Allow inverse existence_checks (re: up-goer-five check) #1334

Closed
patcon opened this issue Dec 9, 2022 · 7 comments · Fixed by #1370
Labels: cat: new-rule · priority: null · status: needs-help · type: feat · version: minor

Comments


patcon commented Dec 9, 2022

Hi there! This tool is neat. Thanks!

I was hoping to create a check that raises an issue when any word outside a given set is used (e.g. the top 1000 English words):
https://splasho.com/upgoer5/phpspellcheck/dictionaries/1000.dicin

Can you suggest the simplest way to go about this? I'm happy to submit a PR when I get around to it :)

Inspiration: https://splasho.com/upgoer5/

@patcon patcon changed the title Allow inverse existence_checks Allow inverse existence_checks (re: up-goer-five check) Dec 9, 2022
vqle23 (Contributor) commented Apr 5, 2023

Hey, this sounds interesting. I can probably work on this within the next week if you haven't.

@Nytelife26 Nytelife26 self-assigned this Jul 27, 2023
@Nytelife26 Nytelife26 added labels type: feat, priority: null, version: minor, status: needs-help, cat: new-rule on Jul 27, 2023
Nytelife26 (Member) commented:

I can see the potential utility in this, but it feels out of scope for proselint. Can you think of any scenario where this may be useful in checking prose besides in writing tasks that involve special restrictions? I fear this check would be computationally expensive, and unnecessary for the typical use of the program.

patcon (Author) commented Aug 25, 2023

I feel this sort of processing is most likely to be helpful to people writing linters for ESL (English-as-a-second-language) readers, or for scientific communication with very young readers or those with minimal or no high-school education.

I don't want to take anyone's time for granted (we should only work on what we enjoy!), but I will admit it feels arbitrary to put these categories of use outside the realm of worth serving. I totally understand if you do not wish to invest your own time, but please reconsider making a blanket judgement that this is too computationally expensive 🙏🏻🙏🏻🙏🏻🙂

Nytelife26 (Member) commented:

Hi, @patcon. I have reconsidered, indeed. Many other linters provide configurations for special environments, and it can't hurt to broaden proselint's utility. Maybe this could be implemented as a plugin, when such a system is available? Alternatively, if you have suggestions for how this could work in a similarly performant manner to what we have now, I would be open to including this in proselint's core.

ferdnyc (Contributor) commented May 8, 2024

So, with respect to the fine initial work done by @vqle23 in #1351, I have a different implementation of this that I think solves the major issues that came up in that PR's review process.

I started from these basic premises:

  1. We have a relatively small universe of words in the "pass" list
  2. We'll eventually need to find the position of every non-matching word
  3. No matter how long the input text is, we're going to end up searching through it all the way to the end.
  4. (Tangentially and as a caveat/addendum: At least in the first pass, words shorter than 3 characters aren't worth bothering with.)1

So, even on a super long text, because we eventually need to regular-expression-search the whole thing anyway, rather than building searches for any specific word(s), it makes more sense to just search it for every word, using re.compile(r"[\w'-]{3,}").finditer(text).2

Then, we can walk the iterator of matched results (non-overlapping word spans in the input text), and examine each one to (a) exclude any strings containing numerals, then (b) determine whether it appears on the supplied list of permitted words.

Anytime matched in list isn't True, that word can immediately be added to the errors list without any additional processing, because we already have the necessary position data in the .finditer() Match objects we're iterating over.

So, that implementation looks something like this:

import re

def reverse_existence_check(
        text, list, err, msg, ignore_case=True, str=False, offset=0
        ):
    """Find all words in ``text`` that aren't on the ``list``."""
    if ignore_case:
        list = [word.lower() for word in list]

        def allowed_word(match):
            matched = match.string[match.start():match.end()].lower()
            return matched in list
    else:
        def allowed_word(match):
            matched = match.string[match.start():match.end()]
            return matched in list

    # Match all 3+ character words
    tokenizer = re.compile(r"[\w'-]{3,}")
    # Remove any that contain numerals
    exclusions = re.compile(r'[0-9]')

    errors = [(
        m.start() + 1 + offset,
        m.end() + offset,
        err,
        msg.format(m.string[m.start():m.end()]),
        None)
        for m in tokenizer.finditer(text)
        if not exclusions.search(m.string[m.start():m.end()])
        and not allowed_word(m)
    ]
    return errors

(A potential further optimization: the function could actually be written as a generator, using yield instead of building the full list. Consuming its output as an iterator would then let it operate incrementally over the length of the input text.)
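A minimal sketch of that generator variant, assuming a simplified signature (the name and parameter set here are hypothetical) that folds in the same tokenizer and numeral exclusion:

```python
import re

def reverse_existence_gen(text, permitted, err, msg, offset=0):
    """Lazily yield an error tuple for each 3+ character word in
    ``text`` that is not in ``permitted``."""
    tokenizer = re.compile(r"[\w'-]{3,}")   # match all 3+ character words
    exclusions = re.compile(r"[0-9]")       # skip tokens containing numerals
    permitted = {w.lower() for w in permitted}
    for m in tokenizer.finditer(text):
        token = m.group(0)
        if exclusions.search(token) or token.lower() in permitted:
            continue
        # Position data comes straight from the Match object,
        # so flagging a word needs no extra searching.
        yield (m.start() + 1 + offset, m.end() + offset,
               err, msg.format(token), None)
```

A caller that only wants the first few errors (say, to bail out early on a huge document) could then stop consuming the iterator without paying for the rest of the scan.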

Iterative processing may not even be necessary, though. I happen to have a text version of "The Hitchhiker's Guide to the Galaxy" (a transcript of the radio plays, I think, not the book version); anyway, it's 5019 lines and just shy of 47,000 words.

I imported it as a Python string, and ran the whole thing through the reverse-existence implementations. They produce wildly different results, so it's not an entirely balanced comparison, but running some time trials on how they processed the text at least let me confirm that the basic concept I was working with is sound.

>>> hhgttg = <the text>
>>> # (Note: I shuffled some things around to make the
>>> #  word lists from #1351 accessible to other code)
>>> t1k = proselint.checks.reverse_existence.top1000.WORDS
>>> oldfn = 'proselint.tools.reverse_existence_check'
>>> newfn = 'proselint.tools.my_revex'
>>>
>>> timeit.repeat(
...     f"{oldfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
...     number=1, repeat=2, globals=locals())
[59.088628315133974, 58.57274700002745]
>>>
>>> timeit.repeat(
...     f"{newfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
...     number=1, repeat=2, globals=locals())
[0.6903484750073403, 0.6852877610363066]
>>>
>>> orig = proselint.tools.reverse_existence_check(
...     hhgttg, t1k, 'error', 'Bad word {}')
>>> mine = proselint.tools.my_revex(
...     hhgttg, t1k, 'error', 'Bad word {}')
>>> len(orig)
10990
>>> len(mine)
13755

So, scanning ~47,000 words in a slightly noisy prose string, it flags 30% of them in a little over 2/3 of a second, up from flagging 23% in just shy of 60 seconds.

The increase in the number of results is a little surprising, especially when you look at what's actually flagged on either end:

>>> from pprint import pp
>>> pp([x[3] for x in orig[:50]])
['Bad word Douglas',
 'Bad word Hitch',
 'Bad word Guide',
 'Bad word Galaxy',
 'Bad word Fantazy.',
 'Bad word 1990.',
 'Bad word Based',
 'Bad word famous',
 'Bad word series',
 'Bad word Douglas',
 'Bad word N.',
 'Bad word Adams',
 'Bad word born',
 'Bad word Cambridge',
 'Bad word 1952.',
 'Bad word Brentwood',
 'Bad word Essex',
 'Bad word St.',
 'Bad word Cambridge',
 'Bad word English.',
 'Bad word graduation',
 'Bad word years',
 'Bad word material',
 'Bad word radio',
 'Bad word television',
 'Bad word shows',
 'Bad word writing,',
 'Bad word performing',
 'Bad word directing',
 'Bad word London,',
 'Bad word Cambridge',
 'Bad word Edinburgh',
 'Bad word worked',
 'Bad word various',
 'Bad word times',
 'Bad word porter,',
 'Bad word chicken',
 'Bad word cleaner,',
 'Bad word bodyguard,',
 'Bad word radio',
 'Bad word producer',
 'Bad word script',
 'Bad word married,',
 'Bad word Surrey.',
 'Bad word Jonny',
 'Bad word Clare',
 'Bad word Arlingtonians',
 'Bad word tea,',
 'Bad word sympathy,',
 'Bad word sofa']
>>>
>>>
>>> pp([x[3] for x in mine[:50]])
['Bad word Douglas',
 'Bad word Adams',
 'Bad word Hitch',
 'Bad word Hikers',
 'Bad word Guide',
 'Bad word Galaxy',
 'Bad word Fantazy',
 'Bad word Based',
 'Bad word famous',
 'Bad word Radio',
 'Bad word series',
 'Bad word Douglas',
 'Bad word Adams',
 'Bad word born',
 'Bad word Cambridge',
 'Bad word educated',
 'Bad word Brentwood',
 'Bad word Essex',
 "Bad word John's",
 'Bad word Cambridge',
 'Bad word English',
 'Bad word graduation',
 'Bad word years',
 'Bad word contributing',
 'Bad word material',
 'Bad word radio',
 'Bad word television',
 'Bad word shows',
 'Bad word writing',
 'Bad word performing',
 'Bad word directing',
 'Bad word stage',
 'Bad word revues',
 'Bad word London',
 'Bad word Cambridge',
 'Bad word Edinburgh',
 'Bad word Fringe',
 'Bad word worked',
 'Bad word various',
 'Bad word times',
 'Bad word porter',
 'Bad word barn',
 'Bad word builder',
 'Bad word chicken',
 'Bad word shed',
 'Bad word cleaner',
 'Bad word bodyguard',
 'Bad word radio',
 'Bad word producer',
 'Bad word script']

The original implementation flags words like English. (note the period captured at the end), 1990., N., and St., all of which except for English my version excludes. But it also seems to miss some words, like Adams and Radio early on; I haven't investigated why.3 And it excludes John's, due to the apostrophe.

The code still needs additional testing and cleanup, but I think it's an approach worth getting in shape to submit as a PR, if @Nytelife26 is still open to incorporating the feature?

...Oh, and after writing this, I realized I could massively further speed up the permitted-word lookups by transforming the input list into a set up front. That reduced the processing time to...

>>> timeit.repeat(
...     f"{newfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
...     number=1, repeat=2, globals=locals())
[0.1017304239794612, 0.09134247805923223]

...around a tenth of a second.
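That change amounts to converting the permitted-word list into a set once, up front. A minimal sketch of the idea (the helper name here is hypothetical):

```python
def build_lookup(word_list, ignore_case=True):
    """Convert the permitted-word list into a set, making each
    membership test O(1) on average instead of O(n) over the list."""
    if ignore_case:
        return {w.lower() for w in word_list}
    return set(word_list)

# Built once per check; every subsequent `token in permitted` lookup
# is then a hash probe rather than a linear scan of ~1000 entries.
permitted = build_lookup(["The", "Cat", "Sat"])
```

Since the check tests membership once per word of input, on a ~47,000-word text that trades tens of millions of list comparisons for one pass of hashing, which lines up with the measured speedup.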

Notes

  1. That can be revisited, but I think it's a reasonable initial constraint. All of the 1- and 2-letter pronouns, prepositions, and so on are on the Top-1000 list anyway. But obscure stuff like "Mr." or "OK" or "IV" becomes annoying to deal with, and beyond that, the amount of textual "noise" that happens to have 2 adjacent letters but isn't word content at all is much higher than if you set the lower limit at 3 letters.

  2. You need the hyphen to treat hyphenated terms as one word, and you need the apostrophe to handle possessives and contractions. (Otherwise — especially if words of less than 3 characters are included — you end up checking endless piles of 's', 'ed', 'nt', etc...)

  3. For the record, here's the first 500 characters of the input text:

    >>> print('\n'.join(
    ... [x.strip() for x in hhgttg[:500].splitlines() if x]))
    Douglas Adams. The Hitch Hikers Guide to Galaxy
    Fantazy. 1990.
    Based on the famous Radio series
    Douglas  N.  Adams was born in Cambridge in 1952. He was
    educated at Brentwood School, Essex  and  St.  John's  College,
    Cambridge where he read English. After graduation he spent several
    years contributing material to radio and television  shows
    as  well  as  writing, performing and sometimes directing stage
    revues in London, Cambridge and on the Edinburgh Fringe. He
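To make note 2 above concrete, here's a quick comparison (a hypothetical snippet, not part of the implementation) of tokenizing with and without the apostrophe and hyphen in the character class:

```python
import re

text = "John's well-known co-op wasn't"

# With apostrophe and hyphen in the class, possessives,
# contractions, and hyphenated terms stay whole:
with_marks = re.compile(r"[\w'-]{3,}")
print(with_marks.findall(text))
# ["John's", 'well-known', 'co-op', "wasn't"]

# Without them, the same text shatters into fragments, and
# 'co-op' disappears entirely since no piece reaches 3 characters:
plain = re.compile(r"\w{3,}")
print(plain.findall(text))
# ['John', 'well', 'known', 'wasn']
```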

Nytelife26 (Member) commented:

@ferdnyc i must say, that's impressive work. the kind of science i know and love.

your performance improvements are beyond what i would've considered acceptable for an implementation. feel free to open a PR, and i'll review it within a few days.

ultimately i am not sure we can make good use of it until the general program and configuration is refactored, but that's for down the road. thank you for your contribution here.

Nytelife26 (Member) commented May 14, 2024

@patcon i have good news for you. a year and a half later, thanks to the efforts of @ferdnyc and @vqle23, we have a solid implementation of this and two new optional checks to go with it.
