Allow inverse existence_checks (re: up-goer-five check) #1334
Hey, this sounds interesting. I can probably work on this within the next week if you haven't.
I can see the potential utility in this, but it feels out of scope for proselint. Can you think of any scenario where this may be useful in checking prose besides in writing tasks that involve special restrictions? I fear this check would be computationally expensive, and unnecessary for the typical use of the program.
I feel this sort of processing is most likely to be helpful to people writing linters for audiences like ESL (English-as-a-second-language) readers, or for scientific communication aimed at very young readers or those with minimal or no high-school education. I don't want to take anyone's time for granted (we should only work on what we enjoy!), but I will admit it feels arbitrary to put these categories of use outside the realm of worth serving. I totally understand if you do not wish to invest your own time, but please reconsider making a blanket judgement that this is too computationally expensive 🙏🏻🙏🏻🙏🏻🙂
Hi, @patcon. I have reconsidered, indeed. Many other linters provide configurations for special environments, and it can't hurt to broaden proselint's utility. Maybe this could be implemented as a plugin, when such a system is available? Alternatively, if you have suggestions for how this could work in a similarly performant manner to what we have now, I would be open to including this in proselint's core.
So, with respect to the fine initial work done by @vqle23 in #1351, I have a different implementation of this that I think solves the major issues that came up in that PR's review process. I started from these basic premises:
So, even on a super long text, because we eventually need to regular-expression-search the whole thing anyway, rather than building searches for any specific word(s), it makes more sense to just search it for every word, using `re.finditer()`. Then, we can walk the iterator of match objects it returns. Anytime a matched word isn't on the allowed list, we report it as an error.

So, that implementation looks something like this:

```python
def reverse_existence_check(
    text, list, err, msg, ignore_case=True, str=False, offset=0
):
    """Find all words in ``text`` that aren't on the ``list``."""
    if ignore_case:
        list = [word.lower() for word in list]

        def allowed_word(match):
            matched = match.string[match.start():match.end()].lower()
            return matched in list
    else:
        def allowed_word(match):
            matched = match.string[match.start():match.end()]
            return matched in list

    # Match all 3+ character words
    tokenizer = re.compile(r"[\w'-]{3,}")
    # Remove any that contain numerals
    exclusions = re.compile(r'[0-9]')

    errors = [(
        m.start() + 1 + offset,
        m.end() + offset,
        err,
        msg.format(m.string[m.start():m.end()]),
        None)
        for m in tokenizer.finditer(text)
        if not exclusions.search(m.string[m.start():m.end()])
        and not allowed_word(m)
    ]
    return errors
```

(A potential further optimization: the function could actually be written as a generator, by using `yield` instead of building up the full `errors` list.)

Iterative processing may not even be necessary, though. I happen to have a text version of "The Hitchhiker's Guide to the Galaxy" (a transcript of the radio plays, I think, not the book version); anyway, it's 5019 lines and just shy of 47,000 words. I imported it as a Python string, and ran the whole thing through both reverse-existence implementations. They produce wildly different results, so it's not an entirely balanced comparison, but running some time trials on how they processed the text at least let me confirm that the basic concept I was working with is sound.

>>> hhgttg = <the text>
>>> # (Note: I shuffled some things around to make the
>>> # word lists from #1351 accessible to other code)
>>> t1k = proselint.checks.reverse_existence.top1000.WORDS
>>> oldfn = 'proselint.tools.reverse_existence_check'
>>> newfn = 'proselint.tools.my_revex'
>>>
>>> timeit.repeat(
... f"{oldfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
... number=1, repeat=2, globals=locals())
[59.088628315133974, 58.57274700002745]
>>>
>>> timeit.repeat(
... f"{newfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
... number=1, repeat=2, globals=locals())
[0.6903484750073403, 0.6852877610363066]
>>>
>>> orig = proselint.tools.reverse_existence_check(
... hhgttg, t1k, 'error', 'Bad word {}')
>>> mine = proselint.tools.my_revex(
... hhgttg, t1k, 'error', 'Bad word {}')
>>> len(orig)
10990
>>> len(mine)
13755

So, scanning ~47,000 words in a slightly noisy prose string, it flags 30% of them in a little over 2/3 of a second, up from flagging 23% in just shy of 60 seconds. The increase in the number of results is a little surprising, especially when you look at what's actually flagged on either end:

>>> from pprint import pp
>>> pp([x[3] for x in orig[:50]])
['Bad word Douglas',
'Bad word Hitch',
'Bad word Guide',
'Bad word Galaxy',
'Bad word Fantazy.',
'Bad word 1990.',
'Bad word Based',
'Bad word famous',
'Bad word series',
'Bad word Douglas',
'Bad word N.',
'Bad word Adams',
'Bad word born',
'Bad word Cambridge',
'Bad word 1952.',
'Bad word Brentwood',
'Bad word Essex',
'Bad word St.',
'Bad word Cambridge',
'Bad word English.',
'Bad word graduation',
'Bad word years',
'Bad word material',
'Bad word radio',
'Bad word television',
'Bad word shows',
'Bad word writing,',
'Bad word performing',
'Bad word directing',
'Bad word London,',
'Bad word Cambridge',
'Bad word Edinburgh',
'Bad word worked',
'Bad word various',
'Bad word times',
'Bad word porter,',
'Bad word chicken',
'Bad word cleaner,',
'Bad word bodyguard,',
'Bad word radio',
'Bad word producer',
'Bad word script',
'Bad word married,',
'Bad word Surrey.',
'Bad word Jonny',
'Bad word Clare',
'Bad word Arlingtonians',
'Bad word tea,',
'Bad word sympathy,',
'Bad word sofa']
>>>
>>>
>>> pp([x[3] for x in mine[:50]])
['Bad word Douglas',
'Bad word Adams',
'Bad word Hitch',
'Bad word Hikers',
'Bad word Guide',
'Bad word Galaxy',
'Bad word Fantazy',
'Bad word Based',
'Bad word famous',
'Bad word Radio',
'Bad word series',
'Bad word Douglas',
'Bad word Adams',
'Bad word born',
'Bad word Cambridge',
'Bad word educated',
'Bad word Brentwood',
'Bad word Essex',
"Bad word John's",
'Bad word Cambridge',
'Bad word English',
'Bad word graduation',
'Bad word years',
'Bad word contributing',
'Bad word material',
'Bad word radio',
'Bad word television',
'Bad word shows',
'Bad word writing',
'Bad word performing',
'Bad word directing',
'Bad word stage',
'Bad word revues',
'Bad word London',
'Bad word Cambridge',
'Bad word Edinburgh',
'Bad word Fringe',
'Bad word worked',
'Bad word various',
'Bad word times',
'Bad word porter',
'Bad word barn',
'Bad word builder',
'Bad word chicken',
'Bad word shed',
'Bad word cleaner',
'Bad word bodyguard',
'Bad word radio',
'Bad word producer',
'Bad word script'] The original implementation is flagging words like The code still needs additional testing and cleanup, but I think it's an approach worth getting in shape to submit as a PR, if @Nytelife26 is still open to incorporating the feature? ...Oh, and after writing this, I realized I could massively further speed up the permitted-word lookups by transforming the input >>> timeit.repeat(
... f"{newfn}(hhgttg, t1k, 'error', 'Bad word {{}}')",
... number=1, repeat=2, globals=locals())
[0.1017304239794612, 0.09134247805923223]

...around a tenth of a second.
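For completeness, here's a standalone sketch of that set-based variant. It's simplified so it runs on its own, not the exact code I'd submit in the PR: the parameter is renamed `allowed` (standing in for proselint's `list` argument, and avoiding the builtin shadowing), and `m.group(0)` replaces the `m.string[m.start():m.end()]` slicing.

```python
import re


def reverse_existence_check(text, allowed, err, msg, ignore_case=True, offset=0):
    """Find all words in ``text`` that aren't in the ``allowed`` collection.

    Converting ``allowed`` to a set makes each membership test O(1)
    instead of an O(n) scan of a list.
    """
    allowed = {w.lower() for w in allowed} if ignore_case else set(allowed)

    # Match all 3+ character words (apostrophes and hyphens included).
    tokenizer = re.compile(r"[\w'-]{3,}")
    # Skip any token that contains numerals.
    exclusions = re.compile(r'[0-9]')

    return [
        (m.start() + 1 + offset, m.end() + offset, err,
         msg.format(m.group(0)), None)
        for m in tokenizer.finditer(text)
        if not exclusions.search(m.group(0))
        and (m.group(0).lower() if ignore_case else m.group(0)) not in allowed
    ]


# Toy example with a made-up allowed list:
sample = "The small ship moved quietly, avoiding the Vogon fleet."
allowed = {"the", "small", "ship", "moved", "avoiding", "fleet"}
flagged = [e[3] for e in reverse_existence_check(sample, allowed, "err", "Bad word {}")]
print(flagged)  # ['Bad word quietly', 'Bad word Vogon']
```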
@ferdnyc i must say, that's impressive work. the kind of science i know and love. your performance improvements are beyond what i would've considered acceptable for an implementation. feel free to open a PR, and i'll review it within a few days. ultimately i am not sure we can make good use of it until the general program and configuration are refactored, but that's for down the road. thank you for your contribution here.
Hi there! This tool is neat. Thanks!
I was hoping to create a check that raises an issue when any word outside a given set (the top 1000 English words) is used:
https://splasho.com/upgoer5/phpspellcheck/dictionaries/1000.dicin
Can you suggest the simplest way to go about this? I'm happy to submit a PR when I get around to it :)
Inspiration: https://splasho.com/upgoer5/
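Conceptually, something like this toy sketch (standalone and hypothetical; `TOP_WORDS` here is a tiny stand-in for the ~1000-word .dic list linked above):

```python
import re

# Stand-in for the up-goer-five vocabulary; a real check would load
# the full word list from the linked .dic file instead.
TOP_WORDS = {"the", "and", "you", "that", "was", "for", "are", "with"}


def flag_uncommon(text):
    """Return every 3+ letter word in ``text`` that isn't in TOP_WORDS."""
    return [w for w in re.findall(r"[a-zA-Z']{3,}", text)
            if w.lower() not in TOP_WORDS]


print(flag_uncommon("You are the reason that computers seem hard"))
# ['reason', 'computers', 'seem', 'hard']
```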