
New reverse-existence implementation #1370

Merged
merged 14 commits into amperser:main on May 14, 2024
Conversation

@ferdnyc (Contributor) commented May 11, 2024

This PR formally submits the implementation of proselint.tools.reverse_existence_check which I describe in #1334 (comment). It still builds on the great work @vqle23 did in #1351 (in fact, it builds on the branch from that PR, rebased to current main and with my changes added as follow-up commits).

A few things have changed from what I described in that issue comment:

  1. While the utility function in proselint.tools is still named reverse_existence_check(), I renamed the actual checkers to restricted.top1000 and restricted.elementary, as I felt that was a less unwieldy title/category name that doesn't sacrifice any accuracy/descriptiveness.
  2. I tweaked the word regex slightly, to avoid capturing hyphens or apostrophes on either extreme of the word. (So, my arbitrarily chosen 3-character minimum length turned out to be prescient, as the regex ended up looking like this: re.compile(r"\w[\w'-]+\w").)
  3. Both versions of the allowed_word helper function (case-sensitive and not) are now defined statically in the tools.py file. The proper one is selected and bound with a functools.partial() at the start of each reverse_existence_check() call.

(...And I just noticed, I left some type annotations on the arguments to the helpers, that I'll probably have to take out to appease older Pythons. Bother.)
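To illustrate item 2, here is a quick demonstration of that pattern in isolation (the sample text is mine, not from the test suite):

```python
import re

# The word pattern from the PR: 3+ characters, with hyphens and
# apostrophes allowed only in the interior of a word.
WORD_RE = re.compile(r"\w[\w'-]+\w")

sample = "A well-known word 'tis isn't an ox"
print(WORD_RE.findall(sample))
# → ['well-known', 'word', 'tis', "isn't"]
```

Note how the leading apostrophe of `'tis` is dropped, and the one- and two-character words never match at all.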

Fixes: #1334

vqle23 and others added 4 commits May 10, 2024 19:45
Also, factor out word lists into module-level variables, for
access from elsewhere.
Searches for all (3+-character) words using a simple regexp, then
walks the `finditer()` of non-overlapping matches, does some
sanity-checking on the candidate string (no digits), and queues
an error unless it appears on the list of permitted words.
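The loop described in that commit message looks roughly like this (a simplified sketch; the helper name and the error tuple format are illustrative, not proselint's actual API):

```python
import re

WORD_RE = re.compile(r"\w[\w'-]+\w")  # 3+ character words

def reverse_existence_sketch(text, permitted):
    """Queue an error for every word NOT on the permitted list."""
    errors = []
    for match in WORD_RE.finditer(text):  # non-overlapping matches
        word = match.group(0)
        if any(ch.isdigit() for ch in word):  # sanity check: no digits
            continue
        if word.lower() not in permitted:
            errors.append((match.start(), match.end(), word))
    return errors
```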
@Nytelife26 (Member) left a comment

other than this, it looks good. this could be used later as a model for refactoring the way existence_check itself works, actually.

Review thread on proselint/tools.py (outdated; resolved)
Comment on lines 384 to 392
if ignore_case:
permitted = set([word.lower() for word in list])
allowed_word = functools.partial(
_case_insensitive_allowed_word, permitted)
else:
permitted = set(list)
allowed_word = functools.partial(
_case_sensitive_allowed_word, permitted
)
@Nytelife26 (Member):

if my previous comment is followed, this can also be simplified.

Suggested change
if ignore_case:
permitted = set([word.lower() for word in list])
allowed_word = functools.partial(
_case_insensitive_allowed_word, permitted)
else:
permitted = set(list)
allowed_word = functools.partial(
_case_sensitive_allowed_word, permitted
)
permitted = set([word.lower() for word in list] if ignore_case else list)
allowed_word = functools.partial(
_allowed_word, permitted, ignore_case=ignore_case
)
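For reference, the unified helper that suggestion assumes might look something like this (my sketch; `_allowed_word`'s exact signature isn't shown in the diff):

```python
import functools

def _allowed_word(permitted, word, ignore_case=True):
    # Normalize the candidate only when the check is case-insensitive;
    # `permitted` is assumed to be pre-lowercased in that case.
    candidate = word.lower() if ignore_case else word
    return candidate in permitted

# Binding it the way the suggested change does:
word_list = ["The", "Cat"]
ignore_case = True
permitted = set([w.lower() for w in word_list] if ignore_case else word_list)
allowed_word = functools.partial(_allowed_word, permitted, ignore_case=ignore_case)
```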

@ferdnyc (Contributor Author):

Oh, actually I missed your set-conversion tweak... hm, that feels a bit "fun" to parse (for humans), compared to:

    if ignore_case:
        permitted = set([word.lower() for word in list])
    else:
        permitted = set(list)

I'm open to it, just not sure it's necessary to make future code readers work that hard. Thoughts?

@Nytelife26 (Member):

i have never really considered the readability of inline if statements for Python programmers, i just know ternaries are a popular construct in other languages i use, like Rust and TypeScript. additionally, ruff has a rule (derived from flake8-simplify) that prefers ternaries, so i would imagine they're not uncommon in Python as well. without any kind of data on how readable people find them, i can only go by convention.

@ferdnyc (Contributor Author):

I tend to reserve them for situations where there's no other choice, like in list comprehensions. (Which can be downright squirrely to read under any circumstances; see my last commit that adds some indentation to try and make my whopper at least semi-readable.) I guess it all comes down to personal comfort; I know people who dislike code like this:

if something:
    return one thing
return the other

...and always want the explicit else: in there. And then OTOH there are linters that will flag the explicit else: as unnecessary (which of course it is, technically).

¯\_(ツ)_/¯

@Nytelife26 (Member):

in some cases i might even use a ternary on the return for the example you just gave, actually. however, we won't do that here. i think we should use a ternary for the highlighted block here, and then when i sort out the refactor i'll see what ruff thinks.

@ferdnyc (Contributor Author):

The fact that it baaarely fits in 80 columns makes me indifferent to it, so sure. Done.

@ferdnyc (Contributor Author) commented May 13, 2024

@Nytelife26

Oh, one thing I didn't address from the previous PR is your comment about the size of the word lists in the code.

I suppose we could move them to a JSON file, and json.load() the list in when the check runs. Possibly even a compressed JSON file.

To retain compatibility with zipfile packaging and the like, it'd probably be advisable to use importlib.resources to access the data (and make the backport a dependency for older Pythons) rather than relying on there being a filesystem to access. But that's not that big a deal, it's a solved problem at this point.
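A loader along those lines could look like this (a sketch; the package/resource names passed in are placeholders, not the final layout):

```python
import json
import sys

# files() landed in importlib.resources in Python 3.9; older Pythons
# need the importlib-resources backport package instead.
if sys.version_info >= (3, 9):
    from importlib.resources import files
else:
    from importlib_resources import files

def load_wordlist(package, resource):
    # Works whether the package lives on a real filesystem or inside a
    # zipfile, which is the whole point of going through importlib.
    with files(package).joinpath(resource).open("r", encoding="utf-8") as data:
        return json.load(data)
```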

@ferdnyc (Contributor Author) commented May 13, 2024

I just sketched out a quick attempt at offloading the top1000 word list into a JSON file. Uncompressed, top1000.json would be 8.9k, and reduce the size of top1000.py from 14k to 850 bytes. (That's including the code to load the word list from the file.)

@ferdnyc (Contributor Author) commented May 13, 2024

elementary.json would be 7.5k, and reduce elementary.py from 12k to around the same 850 bytes.

@Nytelife26 (Member):

i appreciate your continued efforts to go the extra mile. since it's really just a list, wouldn't a csv or similar be more efficient than json?

@Nytelife26 (Member):

it'd probably be advisable to use importlib.resources to access the data (and make the backport a dependency for older Pythons)

since importlib.resources was added in Python 3.7, and the oldest supported version of Python is 3.8, we don't need to worry about backports. i would be comfortable with using importlib if you think it'd be more compatible for packaging.

@ferdnyc (Contributor Author) commented May 13, 2024

@Nytelife26 Unfortunately, the modern interface to importlib.resources (importlib.resources.files()) didn't come along until Python 3.9. But there's the importlib-resources backport package that can pull it in for Python 3.8.

It's just a matter of adding a dependency (I can never remember the syntax..):

importlib-resources = "^6.0"; python_version < "3.9"
# Wait, actually I think it's...
importlib-resources = "^6.0; python_version < 3.9"

@ferdnyc (Contributor Author) commented May 13, 2024

Hmm, turning the list into a .csv file with zero structure whatsoever makes top1000.csv 5.9k, as opposed to the JSON file's 8.9k.

That's for a file that looks like this, which I hate (but I don't have to read it, Python does):

a,able,about,above,accept,across,act,actually,add,admit,afraid,after,afternoon,again,against,age,ago,agree,ah,ahead,air,all,allow,almost,alone,along,already,alright,also,although,always,am,amaze,an,an

The JSON file looks like this; technically I could squeeze it tighter by telling it to lose the spaces between each item:

["a", "able", "about", "above", "accept", "across", "act", "actually", "add", "admit", "afraid", "after", "afternoon", "again", "against", "age", "ago", "agree", "ah", "ahead", "air", "all", "allow",

Going with CSV also means dealing with csv.py and the csv.reader(), which as standard library modules go is abominable. json.load() is simplicity itself. But, it's just 2-4 lines of code either way, so NBD. And the file is a lot smaller.
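The "2-4 lines of code either way" comparison, with in-memory stand-ins for the actual data files:

```python
import csv
import io
import json

json_blob = '["a", "able", "about"]'
csv_blob = "a,able,about"

# json.load() hands back the list directly...
words_from_json = json.load(io.StringIO(json_blob))

# ...while csv.reader() yields rows of fields that need flattening.
words_from_csv = [word for row in csv.reader(io.StringIO(csv_blob)) for word in row]

assert words_from_json == words_from_csv == ["a", "able", "about"]
```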

@ferdnyc (Contributor Author) commented May 13, 2024

It's just a matter of adding a dependency (I can never remember the syntax..):

importlib-resources = "^6.0"; python_version < "3.9"
# Wait, actually I think it's...
importlib-resources = "^6.0; python_version < 3.9"

I was doubly wrong, with poetry it's apparently:

importlib-resources = { version = "^6.0", python = "<3.9" }

@Nytelife26 (Member) commented May 13, 2024

That's for a file that looks like this, which I hate (but I don't have to read it, Python does)

using newlines as field delimiters instead of commas would improve readability and make diffs easier, without changing file size, if that helps.

you could argue with the validity of this, because it's comma separated values not newline separated values, but i would argue it makes sense to imagine them as records of one "words" field.

@ferdnyc (Contributor Author) commented May 13, 2024

Hmm, I suppose that'd make csv.reader() happier too, right now it reads the whole row in as a single list, which it then returns inside a one-item list (because it's a list of rows and there's only one row). That way the result would just be the list.

@Nytelife26 (Member):

sort of. it still produces a list of lists (on my system at least), which should be trivial to flatten, but it's worth noting.

@ferdnyc (Contributor Author) commented May 13, 2024

sort of. it still produces a list of lists (on my system at least), which should be trivial to flatten, but it's worth noting.

Yeah, I ended up having to do this nonsense which is... whatever. I hate csv.py with a passion.

import csv
import proselint
from importlib.resources import files  # importlib_resources backport on 3.8

with files(proselint).joinpath(_CSV_PATH).open('r') as data:
    reader = csv.reader(data)
    wordlist = []
    for row in reader:
        wordlist.extend(row)

EDIT: (You should've seen me trying to write the files, I kept ending up with output that looked like this...)

"a"
"a","b","l","e"
"a","b","o","u","t"

...And then cursing. A lot.

@ferdnyc (Contributor Author) commented May 13, 2024

@Nytelife26
I just realized I have no idea if those CSV files will actually be packaged into the source and/or binary distributions. I'll defer to your expertise on that one if that's OK. I tried to run poetry locally, it tried to get me to authenticate for something, then crashed my ssh-agent completely. So now I'm stuck typing my passphrase for every git transaction until I fix its mess. 🙄

@Nytelife26 (Member) commented May 13, 2024

it would be nice if python had a flatten function, like pretty much every other language i use, that's for sure.
my latest commits should ease your pain somewhat.
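(For the record, the closest standard-library spelling of a flatten is `itertools.chain.from_iterable`, which would collapse the csv rows in one call:)

```python
from itertools import chain

rows = [["a", "able"], ["about", "above"]]  # e.g. what csv.reader produces
wordlist = list(chain.from_iterable(rows))
print(wordlist)  # → ['a', 'able', 'about', 'above']
```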

also, having just tested the build system locally, and reviewing the configuration, all files in proselint/ including the new CSVs are included in the build.

@Nytelife26 (Member):

i'm ready to merge this if you have no further comments, suggestions or changes to make.

@ferdnyc (Contributor Author) commented May 14, 2024

@Nytelife26 All looks good to me, thanks for all your help with this!

@Nytelife26 Nytelife26 merged commit 94a1eea into amperser:main May 14, 2024
1 of 17 checks passed

Successfully merging this pull request may close these issues.

Allow inverse existence_checks (re: up-goer-five check)
3 participants