Add more end-user tools and scripts #202

Open
kristian-clausal opened this issue Jan 20, 2023 · 1 comment

Comments

@kristian-clausal
Collaborator

I've added a usertools/ folder with three starting scripts. Two of them sort our .json output files so that the ordering is consistent between runs; the third is a word search that trawls through a wiktextract .json file, with a toggle for regex matching, filtering by language(s), and a maximum output count.
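For anyone wondering what such a script might look like, here is a minimal sketch of the two ideas, not the actual files in usertools/. It assumes the input is one JSON object per line and that each object has "word", "lang" and "pos" fields; the function names `sort_jsonl` and `search_words` and their parameters are made up for illustration.

```python
import json
import re


def sort_jsonl(lines):
    """Return the JSON lines sorted by (word, lang, pos), with keys in sorted order."""
    entries = [json.loads(line) for line in lines if line.strip()]
    # Assumed sort key; Python's sort is stable, so ties keep their input order.
    entries.sort(key=lambda e: (e.get("word", ""), e.get("lang", ""), e.get("pos", "")))
    return [json.dumps(e, sort_keys=True, ensure_ascii=False) for e in entries]


def search_words(lines, pattern, use_regex=False, langs=None, max_count=None):
    """Return entries whose "word" matches pattern, optionally filtered by language."""
    matcher = re.compile(pattern) if use_regex else None
    results = []
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        # Skip entries outside the requested languages, if any were given.
        if langs and entry.get("lang") not in langs:
            continue
        word = entry.get("word", "")
        hit = matcher.search(word) if matcher else pattern in word
        if hit:
            results.append(entry)
            # Stop early once the maximum output count is reached.
            if max_count is not None and len(results) >= max_count:
                break
    return results
```

With a real file you would pass an open file object as `lines`, since iterating over it yields one line at a time.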

If you have anything you would like to add there, just put up a pull request with a new file in usertools/; if you have specific requests, post them here.

These are meant to be small command-line scripts for the most part, but even that's not a must as long as the script could be helpful to someone somewhere down the line. It would be very nice if they were simple and easy to understand even for people new to programming, so that they can be edited (and resubmitted as new variant scripts).

@kristian-clausal kristian-clausal added enhancement New feature or request good first issue Good for newcomers labels Jan 20, 2023
@kristian-clausal kristian-clausal self-assigned this Jan 20, 2023
@kristian-clausal
Collaborator Author

kristian-clausal commented Jan 26, 2023

The sorting scripts aren't nearly enough to get a working diff. I've committed json-compare-samples.py, which takes two files. It indexes one of them, trying to give each JSON object its own key (which doesn't work when there isn't enough distinguishing info and there are many "Noun" "Noun" "Noun" sections inside the same etymology...). It then steps through the other file, where each line has a one-in-N chance (--one-in-a) of being chosen as a sample. The sample is run through the same process to craft a key that should correspond to one in the index of the first file. If the two lines differ, they are compared using difflib, which outputs something like a diff for each pair of objects, comparing them as lines of strings.
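Roughly, the approach looks like the sketch below. This is a simplified illustration and not the code in json-compare-samples.py: the key fields ("word", "lang", "pos"), the function names, and the `one_in` parameter are assumptions, and colliding keys simply keep the first object seen, which is exactly the weakness mentioned above.

```python
import difflib
import json
import random


def make_key(entry):
    """Build a lookup key from a few fields (assumed; collisions are possible)."""
    return (entry.get("word"), entry.get("lang"), entry.get("pos"))


def build_index(lines):
    """Index the first file by key; on a collision the first object wins."""
    index = {}
    for line in lines:
        if line.strip():
            entry = json.loads(line)
            index.setdefault(make_key(entry), entry)
    return index


def compare_samples(lines_a, lines_b, one_in=10, rng=None):
    """Sample roughly one in `one_in` lines of the second file and diff each
    sample against the object with the same key in the first file."""
    rng = rng or random.Random()
    index = build_index(lines_a)
    reports = []
    for line in lines_b:
        # Skip blank lines, then keep each remaining line with probability 1/one_in.
        if not line.strip() or rng.randrange(one_in) != 0:
            continue
        sample = json.loads(line)
        key = make_key(sample)
        other = index.get(key)
        # Nothing to report if there is no counterpart or the objects are equal.
        if other is None or other == sample:
            continue
        # Pretty-print both objects so difflib can compare them line by line.
        a_text = json.dumps(other, indent=2, sort_keys=True, ensure_ascii=False).splitlines()
        b_text = json.dumps(sample, indent=2, sort_keys=True, ensure_ascii=False).splitlines()
        diff = list(difflib.unified_diff(a_text, b_text, fromfile="first",
                                         tofile="second", lineterm=""))
        reports.append((key, diff))
    return reports
```

Passing `rng=random.Random(0)` (or any fixed seed) makes the sampling repeatable, which helps when comparing the same pair of files more than once.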
