Decrease size of "small" word list #3

Open
MrXyfir opened this issue Apr 25, 2019 · 5 comments

@MrXyfir
Member

MrXyfir commented Apr 25, 2019

Given the introduction of "small" and "big" word lists in v3.1.0, I'd like to decrease the size of the "small" word list from ~129k words to 100k at most, and possibly even down to 50k.

The small list should be acceptable for general use, and as lightweight as possible. We should remove any stop words, and any super rare words that you can hardly find a definition for.
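As a rough illustration of the filtering described above, here is a minimal Python sketch. The stop-word set, the frequency table, and the `min_count` threshold are all hypothetical placeholders chosen for the example; none of them come from this library or its word lists.

```python
def filter_words(words, stop_words, freq, min_count):
    """Keep words that are not stop words and occur at least min_count times.

    words      -- iterable of candidate words
    stop_words -- set of lowercase words to drop (hypothetical data)
    freq       -- dict mapping lowercase word -> occurrence count (hypothetical data)
    min_count  -- words with a lower count are treated as too rare
    """
    kept = []
    for word in words:
        w = word.strip().lower()
        if not w or w in stop_words:
            continue  # skip blanks and stop words
        if freq.get(w, 0) < min_count:
            continue  # skip rare words and words missing from the table
        kept.append(w)
    return kept


# Made-up sample data, only to show the shape of the inputs.
STOP_WORDS = {"the", "and", "of"}
FREQ = {"apple": 5200, "river": 3100, "the": 900000, "zyzzyva": 1}

print(filter_words(["The", "apple", "river", "zyzzyva"], STOP_WORDS, FREQ, min_count=10))
```

Raising `min_count` is one way to move the list toward the 50k to 100k target; the right threshold would depend on whatever frequency data is actually available.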

@fredspivock

Would this be useful?
https://www.wordfrequency.info/samples.asp

I'm looking to use this library to generate a 3- or 4-word passphrase, but the words the library currently produces are too obscure. I think this list of 60k words could be useful.
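For context, the passphrase use case can be sketched in a few lines of Python. This is not this library's API; `make_passphrase` and the sample word list are hypothetical, and the sketch only assumes a list of common words is already loaded in memory.

```python
import secrets


def make_passphrase(words, count=4, separator="-"):
    """Join `count` words picked with a cryptographically secure RNG.

    words -- non-empty sequence of candidate words (hypothetical input)
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    if not words:
        raise ValueError("words must not be empty")
    # secrets.choice draws from the OS-level secure random source.
    return separator.join(secrets.choice(words) for _ in range(count))


# Made-up sample list; a real list would hold tens of thousands of words.
sample_words = ["apple", "river", "candle", "orbit", "meadow", "lantern"]
print(make_passphrase(sample_words, count=4))
```

The fewer obscure words the list contains, the easier the result is to remember, which is why the choice of source list matters here.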

@MrXyfir
Member Author

MrXyfir commented Nov 13, 2020

@fredspivock If you can find or compile a list that is freely available, without purchase or licensing, I'd be happy to update the library with it! It looks like that one requires payment, but maybe it would be allowed if you only took the words themselves and not the data attached to them? Might be worth asking.

@fredspivock

fredspivock commented Nov 13, 2020

@MrXyfir I got excited! You are right, it is paid.
This might be more promising:
https://github.com/first20hours/google-10000-english

He even includes a link to a much larger list, but it seems like he did some deduping on it. I also noticed it contains proper names, so that could be a deal breaker for some.
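If a list like that were used as a starting point, the deduping and name removal could look roughly like the Python sketch below. It assumes the input is ordered most-common-first and that a separate set of names to exclude is available; `clean_list`, the `names` set, and the sample data are all hypothetical.

```python
def clean_list(words, names, max_size):
    """Deduplicate, drop known names, and cap the list at max_size entries.

    words    -- iterable of words, assumed ordered most common first
    names    -- set of lowercase proper names to exclude (hypothetical data)
    max_size -- maximum number of words to keep
    """
    if max_size <= 0:
        return []
    seen = set()
    result = []
    for word in words:
        w = word.strip().lower()
        if not w or w in seen or w in names:
            continue  # skip blanks, duplicates, and names
        seen.add(w)
        result.append(w)
        if len(result) == max_size:
            break  # input is frequency-ordered, so the tail is the rarest
    return result


# Made-up sample data, only to show the behavior.
raw = ["house", "John", "water", "house", "paris", "garden", "Water"]
print(clean_list(raw, names={"john", "paris"}, max_size=3))
```

Because names are matched against an explicit set rather than by capitalization, this also works on source lists that are already lowercased.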

@MrXyfir
Member Author

MrXyfir commented Nov 14, 2020

@fredspivock I think 10,000 is too small, unless maybe we rename our current "small" list to "medium" and use the 10,000 for "small"? That could work. Ideally I'd like our primary list to be around 40-60k words. If you could find one, that'd be great! And yeah, preferably without names.

@MrXyfir
Member Author

MrXyfir commented Nov 26, 2020

I was able to bring the size down by roughly 5,000 words by removing bad/offensive words that shouldn't have been in there anyway. See #5
