An implementation of traditional readability measures based on simple surface characteristics. These measures are basically linear regressions based on the number of words, syllables, and sentences.
The functionality is modeled after the UNIX style(1) command. Compared to the implementation that is part of GNU diction, this version supports UTF-8 encoded text, but expects sentence-segmented and word-tokenized text. The syllabification and word type recognition are based on simple heuristics and provide only a rough measure. The supported languages are English, German, and Dutch. Adding support for a new language involves adding heuristics for the aforementioned syllabification and word type recognition; see langdata.py.
NB: all readability formulas were developed for English, so the scales of the outcomes are only meaningful for English texts. The Dale-Chall measure uses the original word list for English; for Dutch and German, lists of frequent words are used that were not specifically selected for recognizability by schoolchildren.
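For illustration, the classic Dale-Chall formula combines the percentage of "difficult" words (those not on the familiar-word list) with average sentence length. A minimal sketch using the well-known 1948 constants; the package's DaleChallIndex may differ in details such as the exact word counts used:

```python
def dale_chall(num_words, num_sentences, num_difficult_words):
    """Classic Dale-Chall readability score (1948 formulation).

    'Difficult' words are those not on the familiar-word list.
    Sketch only: the package's DaleChallIndex may use slightly
    different constants or counting rules.
    """
    pct_difficult = 100.0 * num_difficult_words / num_words
    avg_sentence_len = num_words / num_sentences
    score = 0.1579 * pct_difficult + 0.0496 * avg_sentence_len
    if pct_difficult > 5:  # adjustment applied to harder texts
        score += 3.6365
    return score

print(dale_chall(100, 5, 10))
```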
For syntactic complexity measures, see udstyle.
Installation:
$ pip install https://github.com/andreasvc/readability/tarball/master
The following preprocessing is expected:
- Tokens (words or punctuation) separated by space
- One sentence per line; no line breaks within sentences
- Paragraphs separated by one empty line
The quality of preprocessing affects the validity of the results.
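The expected format can be checked programmatically. A small sketch that counts paragraphs, sentences, and tokens in text formatted as above (the function is illustrative, not part of the package):

```python
def count_units(text):
    """Count paragraphs, sentences, and tokens in pre-tokenized text:
    paragraphs are separated by an empty line, each non-empty line is
    one sentence, and tokens within a sentence are space-separated."""
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    sentences = [line for p in paragraphs
                 for line in p.splitlines() if line.strip()]
    tokens = [tok for sent in sentences for tok in sent.split()]
    return len(paragraphs), len(sentences), len(tokens)

sample = ("This is an example sentence .\n"
          "A second sentence .\n"
          "\n"
          "Next paragraph .")
print(count_units(sample))  # (2, 3, 13)
```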
From Python, using syntok for tokenization:
>>> import readability
>>> import syntok.segmenter as segmenter
>>> text = """
... This is an example sentence. Note that tokens will be separated by spaces
... and sentences by newlines.
... This is the second paragraph."""
>>> tokenized = '\n\n'.join(
...     '\n'.join(' '.join(token.value for token in sentence)
...         for sentence in paragraph)
...     for paragraph in segmenter.analyze(text))
>>> print(tokenized)
This is an example sentence .
Note that tokens will be separated by spaces and sentences by newlines .
This is the second paragraph .
>>> results = readability.getmeasures(tokenized, lang='en')
>>> print(results['readability grades']['FleschReadingEase'])
68.64621212121216
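getmeasures returns a nested mapping of category → measure → value (the categories shown in the command-line output below). This can be flattened, e.g. to build a table over many texts; a sketch using a toy mapping as a stand-in for real results:

```python
def flatten(results):
    """Flatten {category: {measure: value}} to {'category.measure': value}."""
    return {'%s.%s' % (cat, name): value
            for cat, measures in results.items()
            for name, value in measures.items()}

# Toy stand-in for readability.getmeasures(...) output:
toy = {'readability grades': {'Kincaid': 5.44, 'ARI': 6.39},
       'sentence info': {'words': 132211}}
print(flatten(toy))
```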
Command line usage:
$ readability --help
Simple readability measures.
Usage: readability [--lang=<x>] [FILE]
or: readability [--lang=<x>] --csv FILES...
By default, input is read from standard input.
Text should be encoded with UTF-8,
one sentence per line, tokens space-separated.
Options:
-L, --lang=<x> Set language (available: de, nl, en).
--csv Produce a table in comma separated value format on
standard output given one or more filenames.
--tokenizer=<x> Specify a tokenizer including options that will be given
each text on stdin and should return tokenized output on
stdout. Not applicable when reading from stdin.
Recommended tokenizers:
- For English and German, I recommend "tokenizer", cf. http://moin.delph-in.net/WeSearch/DocumentParsing
- For Dutch, I recommend the tokenizer that is part of the Alpino parser: http://www.let.rug.nl/vannoord/alp/Alpino/.
- ucto is a general multilingual tokenizer: http://ilk.uvt.nl/ucto
Example using ucto:
$ ucto -L en -n -s "''" "CONRAD, Joseph - Lord Jim.txt" | readability
[...]
readability grades:
Kincaid: 5.44
ARI: 6.39
Coleman-Liau: 6.91
FleschReadingEase: 85.17
GunningFogIndex: 9.86
LIX: 31.98
SMOGIndex: 9.39
RIX: 2.56
DaleChallIndex: 8.02
sentence info:
characters_per_word: 4.17
syll_per_word: 1.24
words_per_sentence: 16.35
sentences_per_paragraph: 11.5
type_token_ratio: 0.09
characters: 551335
syllables: 164205
words: 132211
wordtypes: 12071
sentences: 8087
paragraphs: 703
long_words: 20670
complex_words: 10990
complex_words_dc: 29908
word usage:
tobeverb: 3907
auxverb: 1630
conjunction: 4398
pronoun: 18092
preposition: 19290
nominalization: 1167
sentence beginnings:
pronoun: 2578
interrogative: 217
article: 629
subordination: 120
conjunction: 236
preposition: 397
The option --csv collects readability measures for a number of texts in a table. To tokenize documents on the fly when using this option, use the --tokenizer option. Example with the "tokenizer" tool:
$ readability --csv --tokenizer='tokenizer -L en-u8 -P -S -E "" -N' */*.txt >readabilitymeasures.csv
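The resulting CSV has one row per file and one column per measure, and can be loaded with the standard csv module. A sketch with made-up sample content, since the exact column names depend on the measures computed:

```python
import csv
import io

# Made-up sample resembling readabilitymeasures.csv; the real column
# names and order depend on the measures the tool reports.
sample = io.StringIO(
    "filename,Kincaid,ARI\n"
    "a.txt,5.44,6.39\n"
    "b.txt,7.10,8.02\n")

rows = list(csv.DictReader(sample))
# Find the text with the lowest (easiest) Kincaid grade level:
easiest = min(rows, key=lambda row: float(row['Kincaid']))
print(easiest['filename'])  # a.txt
```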
The following readability metrics are included:
- http://en.wikipedia.org/wiki/Automated_Readability_Index
- http://en.wikipedia.org/wiki/SMOG
- http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_Grade_Level#Flesch.E2.80.93Kincaid_Grade_Level
- http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_test#Flesch_Reading_Ease
- http://en.wikipedia.org/wiki/Coleman-Liau_Index
- http://en.wikipedia.org/wiki/Gunning-Fog_Index
- https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula
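As an illustration of how these metrics are simple linear formulas over surface counts, the Flesch Reading Ease score can be computed directly from word, sentence, and syllable counts (the constants below are the standard published ones):

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text
    (roughly 60-70 corresponds to plain English)."""
    return (206.835
            - 1.015 * (words / sentences)
            - 84.6 * (syllables / words))

print(flesch_reading_ease(100, 5, 150))
```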
For better readability measures, consider the following:
- Collins-Thompson & Callan (2004). A language modeling approach to predicting reading difficulty. In Proc. of HLT/NAACL, pp. 193-200. http://aclweb.org/anthology/N04-1025.pdf
- Schwarm & Ostendorf (2005). Reading level assessment using SVM and statistical language models. Proc. of ACL, pp. 523-530. http://www.aclweb.org/anthology/P05-1065.pdf
- The Lexile framework for reading. http://www.lexile.com
- Coh-Metrix. http://cohmetrix.memphis.edu/
- Stylene: http://www.clips.ua.ac.be/category/projects/stylene
- T-Scan: http://languagelink.let.uu.nl/tscan
The code is based on https://github.com/mmautner/readability, which in turn was based on https://github.com/nltk/nltk_contrib/tree/master/nltk_contrib/readability.