
New reverse-existence implementation #1370

Merged
merged 14 commits into amperser:main on May 14, 2024
Conversation

@ferdnyc (Contributor) commented May 11, 2024

This PR formally submits the implementation of proselint.tools.reverse_existence_check which I describe in #1334 (comment). It still builds on the great work @vqle23 did in #1351 (in fact, it builds on the branch from that PR, rebased to current main and with my changes added as follow-up commits).

A few things have changed from what I described in that issue comment:

  1. While the utility function in proselint.tools is still named reverse_existence_check(), I renamed the actual checkers to restricted.top1000 and restricted.elementary, as I felt that was a less unwieldy title/category name that doesn't sacrifice any accuracy/descriptiveness.
  2. I tweaked the word regex slightly, to avoid capturing hyphens or apostrophes on either extreme of the word. (So, my arbitrarily chosen 3-character minimum length turned out to be prescient, as the regex ended up looking like this: re.compile(r"\w[\w'-]+\w").)
  3. Both versions of the allowed_word helper function (case-sensitive and not) are now defined statically in the tools.py file. The proper one is selected and bound with a functools.partial() at the start of each reverse_existence_check() call.

(...And I just noticed, I left some type annotations on the arguments to the helpers, that I'll probably have to take out to appease older Pythons. Bother.)
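To illustrate item 2, here is a quick demonstration of that pattern in isolation (the sample text is mine, not from the test suite):

```python
import re

# The word pattern from the PR: 3+ characters, with hyphens and
# apostrophes allowed only in the interior of a word.
WORD_RE = re.compile(r"\w[\w'-]+\w")

sample = "A well-known word 'tis isn't an ox"
print(WORD_RE.findall(sample))
# → ['well-known', 'word', 'tis', "isn't"]
```

Note how the leading apostrophe of `'tis` is dropped, and the one- and two-character words never match at all.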

Fixes: #1334

vqle23 and others added 4 commits May 10, 2024 19:45
Also, factor out word lists into module-level variables, for
access from elsewhere.
Searches for all (3+-character) words using a simple regexp, then
walks the `finditer()` of non-overlapping matches, does some
sanity-checking on the candidate string (no digits), and queues
an error unless it appears on the list of permitted words.
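The loop described in that commit message looks roughly like this (a simplified sketch; the helper name and the error tuple format are illustrative, not proselint's actual API):

```python
import re

WORD_RE = re.compile(r"\w[\w'-]+\w")  # 3+ character words

def reverse_existence_sketch(text, permitted):
    """Queue an error for every word NOT on the permitted list."""
    errors = []
    for match in WORD_RE.finditer(text):  # non-overlapping matches
        word = match.group(0)
        if any(ch.isdigit() for ch in word):  # sanity check: no digits
            continue
        if word.lower() not in permitted:
            errors.append((match.start(), match.end(), word))
    return errors
```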
@Nytelife26 (Member) left a comment

other than this, it looks good. this could be used later as a model for refactoring the way existence_check itself works, actually.

Review thread on proselint/tools.py (outdated; resolved)
Comment on lines 384 to 392
if ignore_case:
permitted = set([word.lower() for word in list])
allowed_word = functools.partial(
_case_insensitive_allowed_word, permitted)
else:
permitted = set(list)
allowed_word = functools.partial(
_case_sensitive_allowed_word, permitted
)
@Nytelife26 (Member):

if my previous comment is followed, this can also be simplified.

Suggested change
if ignore_case:
permitted = set([word.lower() for word in list])
allowed_word = functools.partial(
_case_insensitive_allowed_word, permitted)
else:
permitted = set(list)
allowed_word = functools.partial(
_case_sensitive_allowed_word, permitted
)
permitted = set([word.lower() for word in list] if ignore_case else list)
allowed_word = functools.partial(
_allowed_word, permitted, ignore_case=ignore_case
)
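For reference, the unified helper that suggestion assumes might look something like this (my sketch; `_allowed_word`'s exact signature isn't shown in the diff):

```python
import functools

def _allowed_word(permitted, word, ignore_case=True):
    # Normalize the candidate only when the check is case-insensitive;
    # `permitted` is assumed to be pre-lowercased in that case.
    candidate = word.lower() if ignore_case else word
    return candidate in permitted

# Binding it the way the suggested change does:
word_list = ["The", "Cat"]
ignore_case = True
permitted = set([w.lower() for w in word_list] if ignore_case else word_list)
allowed_word = functools.partial(_allowed_word, permitted, ignore_case=ignore_case)
```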

@ferdnyc (Contributor Author):

Oh, actually I missed your set-conversion tweak... hm, that feels a bit "fun" to parse (for humans), compared to:

    if ignore_case:
        permitted = set([word.lower() for word in list])
    else:
        permitted = set(list)

I'm open to it, just not sure it's necessary to make future code readers work that hard. Thoughts?

@Nytelife26 (Member):

i have never really considered the readability of inline if statements for Python programmers, i just know ternaries are a popular construct in other languages i use, like Rust and TypeScript. additionally, ruff has a rule (derived from flake8-simplify) that prefers ternaries, so i would imagine they're not uncommon in Python as well. without any kind of data on how readable people find them, i can only go by convention.

@ferdnyc (Contributor Author):

I tend to reserve them for situations where there's no other choice, like in list comprehensions. (Which can be downright squirrely to read under any circumstances; see my last commit that adds some indentation to try and make my whopper at least semi-readable.) I guess it all comes down to personal comfort; I know people who dislike code like this:

if something:
    return one thing
return the other

...and always want the explicit else: in there. And then OTOH there are linters that will flag the explicit else: as unnecessary (which of course it is, technically).

¯\_(ツ)_/¯

@Nytelife26 (Member):

in some cases i might even use a ternary on the return for the example you just gave, actually. however, we won't do that here. i think we should use a ternary for the highlighted block here, and then when i sort out the refactor i'll see what ruff thinks.

@ferdnyc (Contributor Author):

The fact that it baaarely fits in 80 columns makes me indifferent to it, so sure. Done.

@ferdnyc (Contributor Author) commented May 13, 2024

@Nytelife26

Oh, one thing I didn't address from the previous PR is your comment about the size of the word lists in the code.

I suppose we could move them to a JSON file, and json.load() the list in when the check runs. Possibly even a compressed JSON file.

To retain compatibility with zipfile packaging and the like, it'd probably be advisable to use importlib.resources to access the data (and make the backport a dependency for older Pythons) rather than relying on there being a filesystem to access. But that's not that big a deal, it's a solved problem at this point.
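A loader along those lines could look like this (a sketch; the package/resource names passed in are placeholders, not the final layout):

```python
import json
import sys

# files() landed in importlib.resources in Python 3.9; older Pythons
# need the importlib-resources backport package instead.
if sys.version_info >= (3, 9):
    from importlib.resources import files
else:
    from importlib_resources import files

def load_wordlist(package, resource):
    # Works whether the package lives on a real filesystem or inside a
    # zipfile, which is the whole point of going through importlib.
    with files(package).joinpath(resource).open("r", encoding="utf-8") as data:
        return json.load(data)
```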

@ferdnyc (Contributor Author) commented May 13, 2024

I just sketched out a quick attempt at offloading the top1000 word list into a JSON file. Uncompressed, top1000.json would be 8.9k, and reduce the size of top1000.py from 14k to 850 bytes. (That's including the code to load the word list from the file.)

@ferdnyc (Contributor Author) commented May 13, 2024

elementary.json would be 7.5k, and reduce elementary.py from 12k to around the same 850 bytes.

@Nytelife26 (Member):

i appreciate your continued efforts to go the extra mile. since it's really just a list, wouldn't a csv or similar be more efficient than json?

@Nytelife26 (Member):

it'd probably be advisable to use importlib.resources to access the data (and make the backport a dependency for older Pythons)

since importlib.resources was added in Python 3.7, and the oldest supported version of Python is 3.8, we don't need to worry about backports. i would be comfortable with using importlib if you think it'd be more compatible for packaging.

@ferdnyc (Contributor Author) commented May 13, 2024

@Nytelife26 Unfortunately, the modern interface to importlib.resources (importlib.resources.files()) didn't come along until Python 3.9. But there's the importlib-resources backport package that can pull it in for Python 3.8.

It's just a matter of adding a dependency (I can never remember the syntax..):

importlib-resources = "^6.0"; python_version < "3.9"
# Wait, actually I think it's...
importlib-resources = "^6.0; python_version < 3.9"

@ferdnyc (Contributor Author) commented May 13, 2024

Hmm, turning the list into a .csv file with zero structure whatsoever makes top1000.csv 5.9k, as opposed to the JSON file's 8.9k.

That's for a file that looks like this, which I hate (but I don't have to read it, Python does):

a,able,about,above,accept,across,act,actually,add,admit,afraid,after,afternoon,again,against,age,ago,agree,ah,ahead,air,all,allow,almost,alone,along,already,alright,also,although,always,am,amaze,an,an

The JSON file looks like this; technically I could squeeze it tighter by telling it to lose the spaces between each item:

["a", "able", "about", "above", "accept", "across", "act", "actually", "add", "admit", "afraid", "after", "afternoon", "again", "against", "age", "ago", "agree", "ah", "ahead", "air", "all", "allow",

Going with CSV also means dealing with csv.py and the csv.reader(), which as standard library modules go is abominable. json.load() is simplicity itself. But, it's just 2-4 lines of code either way, so NBD. And the file is a lot smaller.
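The "2-4 lines of code either way" comparison, with in-memory stand-ins for the actual data files:

```python
import csv
import io
import json

json_blob = '["a", "able", "about"]'
csv_blob = "a,able,about"

# json.load() hands back the list directly...
words_from_json = json.load(io.StringIO(json_blob))

# ...while csv.reader() yields rows of fields that need flattening.
words_from_csv = [word for row in csv.reader(io.StringIO(csv_blob)) for word in row]

assert words_from_json == words_from_csv == ["a", "able", "about"]
```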

@ferdnyc (Contributor Author) commented May 13, 2024

It's just a matter of adding a dependency (I can never remember the syntax..):

importlib-resources = "^6.0"; python_version < "3.9"
# Wait, actually I think it's...
importlib-resources = "^6.0; python_version < 3.9"

I was doubly wrong, with poetry it's apparently:

importlib-resources = { version = "^6.0", python = "<3.9" }

@Nytelife26 (Member) commented May 13, 2024

That's for a file that looks like this, which I hate (but I don't have to read it, Python does)

using newlines as field delimiters instead of commas would improve readability and make diffs easier, without changing file size, if that helps.

you could argue with the validity of this, because it's comma separated values not newline separated values, but i would argue it makes sense to imagine them as records of one "words" field.

@ferdnyc (Contributor Author) commented May 13, 2024

Hmm, I suppose that'd make csv.reader() happier too, right now it reads the whole row in as a single list, which it then returns inside a one-item list (because it's a list of rows and there's only one row). That way the result would just be the list.

@Nytelife26 (Member):

sort of. it still produces a list of lists (on my system at least), which should be trivial to flatten, but it's worth noting.

@ferdnyc (Contributor Author) commented May 13, 2024

sort of. it still produces a list of lists (on my system at least), which should be trivial to flatten, but it's worth noting.

Yeah, I ended up having to do this nonsense which is... whatever. I hate csv.py with a passion.

import csv
import proselint
from importlib.resources import files  # importlib_resources backport on 3.8

with files(proselint).joinpath(_CSV_PATH).open('r') as data:
    reader = csv.reader(data)
    wordlist = []
    for row in reader:
        wordlist.extend(row)

EDIT: (You should've seen me trying to write the files, I kept ending up with output that looked like this...)

"a"
"a","b","l","e"
"a","b","o","u","t"

...And then cursing. A lot.

@ferdnyc (Contributor Author) commented May 13, 2024

@Nytelife26
I just realized I have no idea if those CSV files will actually be packaged into the source and/or binary distributions. I'll defer to your expertise on that one if that's OK. I tried to run poetry locally, it tried to get me to authenticate for something, then crashed my ssh-agent completely. So now I'm stuck typing my passphrase for every git transaction until I fix its mess. 🙄

@Nytelife26 (Member) commented May 13, 2024

it would be nice if python had a flatten function, like pretty much every other language i use, that's for sure.
my latest commits should ease your pain somewhat.
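(For the record, the closest standard-library spelling of a flatten is `itertools.chain.from_iterable`, which would collapse the csv rows in one call:)

```python
from itertools import chain

rows = [["a", "able"], ["about", "above"]]  # e.g. what csv.reader produces
wordlist = list(chain.from_iterable(rows))
print(wordlist)  # → ['a', 'able', 'about', 'above']
```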

also, having just tested the build system locally, and reviewing the configuration, all files in proselint/ including the new CSVs are included in the build.

@Nytelife26 (Member):

i'm ready to merge this if you have no further comments, suggestions or changes to make.

@ferdnyc (Contributor Author) commented May 14, 2024

@Nytelife26 All looks good to me, thanks for all your help with this!

@Nytelife26 Nytelife26 merged commit 94a1eea into amperser:main May 14, 2024
1 of 17 checks passed

Successfully merging this pull request may close these issues.

Allow inverse existence_checks (re: up-goer-five check)
3 participants