ScriptSmith/topwords
Top English words

A comprehensive list of the top 3 million+ English words in Project Gutenberg. Data is sourced from Allison Parrish's awesome gutenberg-dammit project.

Usage

Use the word list:

$ head words.txt
the
of
and
to
a
in
that
i
he

Use the word count list:

$ head counts.txt
169852828 the
92493412 of
83626800 and
69017783 to
54796935 a
47554786 in
30598554 that
30324861 i
27900933 he

Download

Clone this repo:

git clone https://github.com/scriptsmith/topwords.git
cd topwords

Recreating

Tools used:

  • jq
  • parallel
  • grep
  • sed
  • GNU coreutils
    • tr
    • sort
    • uniq
    • cut

The following pattern was used to find words in the corpus:

[A-Za-z]+('[A-Za-z]+)?(?<!('s))
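
To see what this pattern keeps and drops, here is a minimal sketch that runs it over one made-up sentence. The sample text and the temporary pattern file are assumptions for illustration only; the grep flags mirror the ones used in the extraction step, and a grep built with PCRE support (`-P`) is assumed.

```shell
# Store the pattern in a temporary file so the repo's own pattern.txt is untouched.
pattern_file=$(mktemp)
cat > "$pattern_file" <<'EOF'
[A-Za-z]+('[A-Za-z]+)?(?<!('s))
EOF

# Made-up sample sentence (not taken from the corpus).
sample="The dog's owner didn't see Mary's well-read cat"

# -o prints each match on its own line, -P enables the lookbehind syntax.
matches=$(printf '%s\n' "$sample" | grep -oP -f "$pattern_file")
printf '%s\n' "$matches"

rm -f "$pattern_file"
```

On this sample the contraction "didn't" is kept whole, the possessives "dog's" and "Mary's" are reduced to "dog" and "Mary", and the hyphenated "well-read" is split into two words. The trailing lookbehind is also what stops a stray "s" from being emitted after a possessive.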

Clone this repo

git clone https://github.com/scriptsmith/topwords.git
cd topwords

Get the data

Download and extract the gutenberg-dammit data. This is a free resource, so don't abuse it.

Extract the words

Find the words in the 40,000+ books whose only listed language is English:

jq -r '.[] | select((.Language | length) == 1 and .Language[0] == "English") | "gutenberg-dammit-files/" + ."gd-path"' gutenberg-dammit-files/gutenberg-metadata.json |
  parallel "grep -ohPf pattern.txt {}" |
  tr '[:upper:]' '[:lower:]' > allwords.txt
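
The jq filter is the part of this pipeline that decides which books are read. As a rough sketch, the same filter can be tried on a tiny hand-written metadata sample. The three entries and their `gd-path` values below are hypothetical and carry only the two fields the filter reads; the real metadata file has more.

```shell
# Hypothetical metadata sample: one English-only book, one French book, one bilingual book.
sample_metadata='[
  {"Language": ["English"], "gd-path": "000/00001.txt"},
  {"Language": ["French"], "gd-path": "000/00002.txt"},
  {"Language": ["English", "French"], "gd-path": "000/00003.txt"}
]'

# Same filter as in the extraction command: keep entries whose Language list
# has exactly one element and that element is "English".
selected=$(printf '%s\n' "$sample_metadata" |
  jq -r '.[] | select((.Language | length) == 1 and .Language[0] == "English") | "gutenberg-dammit-files/" + ."gd-path"')

printf '%s\n' "$selected"
```

Only the first entry passes, so books that list English alongside another language are skipped as well.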

Sort and count words

If your temporary directory can't store more than 60 GiB, set TMP_DIR to a directory that can:

TMP_DIR=/tmp
sort -T "$TMP_DIR" allwords.txt | uniq -c | sed 's/^\s*//' | sort -nr > counts.txt
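
For a feel of what each stage does, the same pipeline can be run on a handful of words. This is only a sketch: the six-word input and the throwaway working directory are made up for illustration, and GNU sort and sed are assumed, as in the tools list.

```shell
# Throwaway working directory so real allwords.txt and counts.txt are untouched.
work_dir=$(mktemp -d)
TMP_DIR="$work_dir"

# Made-up input: one lowercased word per line, as the extraction step produces.
printf '%s\n' the of the and of the > "$work_dir/allwords.txt"

# sort groups identical words, uniq -c counts each group, sed strips the
# leading padding uniq adds, and sort -nr orders by count, highest first.
sort -T "$TMP_DIR" "$work_dir/allwords.txt" | uniq -c | sed 's/^\s*//' | sort -nr > "$work_dir/counts.txt"

counts=$(cat "$work_dir/counts.txt")
printf '%s\n' "$counts"
# 3 the
# 2 of
# 1 and

rm -rf "$work_dir"
```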

Remove word counts

cut -d ' ' -f2 counts.txt > words.txt
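
This cut call relies on the sed step in the counting command having removed the padding that uniq -c puts in front of each count. A small sketch with made-up input shows the difference; the variable names are illustrative only.

```shell
# Made-up three-word input, counted the same way as in the counting step.
raw=$(printf '%s\n' of the the | sort | uniq -c)

# With the padding left in place, each line starts with spaces,
# so the second space-delimited field is empty.
unstripped=$(printf '%s\n' "$raw" | cut -d ' ' -f2)

# With the padding stripped first, the second field is the word itself.
stripped=$(printf '%s\n' "$raw" | sed 's/^\s*//' | cut -d ' ' -f2)

printf '%s\n' "$stripped"
```

If words.txt ever comes out as blank lines, a counts.txt that still has leading spaces is the first thing to check.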
