Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

checks: Remove duplicate items from lists #1369

Merged
merged 1 commit into from May 13, 2024

Conversation

ferdnyc
Copy link
Contributor

@ferdnyc ferdnyc commented May 9, 2024

While reading through one of the proselint/checks/*/misc.py files, I happened to notice a duplicate item on its list of matching terms.

So I ran a quick script to look for duplicate items in all of the misc.py files, spotted a couple more that way, and removed those as well.

Redundancies NOT fixed

Both cliches/misc.py and redundancy/misc.py contain several more duplicates, by virtue of the same term appearing on multiple different lists of cliches/redundancies — those, I didn't touch. This is only about duplicates on the same list.

(Plus, the irony of having redundant terms in the redundancies check is inherently entertaining. Enough to justify leaving those alone anyway.)

Copy link
Member

@Nytelife26 Nytelife26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good finds, thank you for the contribution.

i like your sense of humour, but i will indeed go through and remove duplicates in other lists as part of this.

@ferdnyc
Copy link
Contributor Author

ferdnyc commented May 10, 2024

@Nytelife26

i will indeed go through and remove duplicates in other lists as part of this.

Well, my thinking in leaving them was: Where the lists come from different external sources (as is the case in all of those left-over duplicates), how do you choose which one keeps the term, and which one doesn't? (And then either way, one list is no longer in sync with its source.)

@ferdnyc
Copy link
Contributor Author

ferdnyc commented May 10, 2024

@Nytelife26

But if you want the lists of remaining, cross-list duplicates, they are:

cliches/misc.py:

        "between a rock and a hard place",
        "conspicuous by its absence",
        "crystal clear",
        "easier said than done",
        "gilding the lily",
        "horns of a dilemma",
        "last but not least",
        "moment of truth",
        "par for the course",
        "thick as thieves",

redundancy/misc.py:

        ["blend",             ["blend together"]],
        ["Challah",           ["Challah bread"]],
        ["collaborate",       ["collaborate together"]],
        ["combine",           ["combine together"]],
        ["cooperation",       ["mutual cooperation"]],
        ["fundamentals",      ["basic fundamentals"]],
        ["gift",              ["free gift"]],
        ["innovation",        ["new innovation"]],
        ["merge",             ["merge together"]],
        ["mix",               ["mix together"]],
        ["mutual respect",    ["mutual respect for each other"]],
        ["plans",             ["future plans"]],
        ["potable water",     ["potable drinking water"]],
        ["refer",             ["refer back"]],
        ["retreat",           ["retreat back"]],
        ["surrounded",        ["surrounded on all sides"]],
        ["veteran",           ["former veteran"]],
        ["visible",           ["visible to the eye"]],

@Nytelife26
Copy link
Member

how do you choose which one keeps the term, and which one doesn't?

i am currently waiting to accrue the funds to obtain the sources myself (my university library does not have them unfortunately), which would make things easier, but you're right.

i was considering aggregating the lists so they run by type of check rather than by source, and using comments to keep track, but that might get messy.

i'll come up with something.

@ferdnyc
Copy link
Contributor Author

ferdnyc commented May 10, 2024

@Nytelife26

i was considering aggregating the lists so they run by type of check rather than by source, and using comments to keep track, but that might get messy.

That's certainly an option. I mentioned in #1334 that in testing my implementation of reverse-existence, I'd factored the lists of terms out of the check functions so that I could access them externally, specifically in the REPL when I was testing/debugging/benchmarking interactively.

That's useful enough that I'd suggest it as an across-the-board change for all of the checking functions — instead of holding their lists internally, move them out to module-level variables which can be accessed from other code.

So, where now the "template" is:

def some_check_fn(args):
    terms = [
      "foo",
      "bar",
      "baz",
    ]
    return some_tool_function(text, terms, blah, blah)

My suggestion would be to go with:

SOME_CHECK_TERMS = [
  "foo",
  "bar",
  "baz",
]

def some_check_fn(args):
    terms = SOME_CHECK_TERMS
    return some_tool_function(text, terms, blah, blah)

The major advantage there is that you can use from proselint.checks.checker.misc import SOME_CHECK_TERMS to get at that list of terms from anywhere in the codebase — from any codebase, really, even outside of proselint.

But it would also permit combining multiple lists together1 and de-duplicating them, e.g.:

FIRST_SOURCE_TERMS = [...]
SECOND_SOURCE_TERMS = [...]

def some_check_fn():
    terms = list(set(FIRST_SOURCE_TERMS + SECOND_SOURCE_TERMS))

   return some_tool_function(text, terms, blah, blah)

Do that, it doesn't matter if the source lists contain redundancies, they'll be dropped in building the final terms list.

Notes

  1. (I like to think I used "combine... together" here only because it was on the duplicate-redundancies list I'd just posted, and seeing it there meant it was on my mind as I was writing this. Because the alternative is that I write tedious, redundant prose.)

@Nytelife26
Copy link
Member

i like that solution, actually. it could simplify testing, too. i think the goal should eventually be to refactor checks to merely return a list of regex-enabled sequences along with their options, and run the existence check once so it can search the source for everything in as few passes as possible, like what you did with reverse_existence_check in #1370.

@Nytelife26 Nytelife26 merged commit f01742d into amperser:main May 13, 2024
1 of 17 checks passed
@ferdnyc ferdnyc deleted the remove-dupes branch May 13, 2024 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants