vocab_read_check

A set of scripts to download content from NYTimes and Fox News for vocabulary analysis.

First you will need to get the site content; I used wget to do this.

Here is an explanation of the wget options I used:

  • -r downloads recursively, following links from page to page
  • -l 3 limits the recursion to 3 levels deep
  • -I 2012 is specific to NYTimes: it only includes files under the 2012 directory, which is where the articles live; otherwise we get a bunch of unrelated website content
  • -w 2 waits 2 seconds between each retrieval

To run:

wget -r -l 3 -I 2012 -w 2 http://www.nytimes.com/most-popular
wget -r -l 3 -w 2 http://www.foxnews.com

You will now have www.foxnews.com and www.nytimes.com directories.

Execute iterate_nytimes.sh and iterate_fox.sh to process all the HTML content and turn it into plain text, which is much easier to work with. The scripts use lynx, a text-based web browser, for the conversion, so you may need to install it (e.g. yum install lynx).
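The iterate scripts are included in the repo; conceptually, each one just runs lynx -dump over every saved HTML file, something along the lines of (the find pattern here is illustrative, since wget's saved filenames vary):

find www.foxnews.com -name '*.html' -exec lynx -dump -nolist {} \;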

You need to redirect the output to a file:

$ ./iterate_fox.sh > FOX_OUT

word_check.lua (requires Lua) reads the words in and counts occurrences of each individual word, discarding 'words' that contain non-alphabetic characters. The output is in CSV format: the word, followed by a comma, followed by the number of occurrences. You can import this into a spreadsheet.

$ ./word_check.lua < FOX_OUT > FOX_OUT.csv
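word_check.lua itself ships in the repo; as a rough sketch of the approach described above (lower-casing and sorting by frequency are assumptions here, not necessarily what the real script does), the counting loop could look like this:

  local counts = {}
  for line in io.lines() do                  -- read the piped-in text from stdin
    for token in line:gmatch("%S+") do
      if token:match("^%a+$") then           -- discount tokens with non-alpha characters
        local word = token:lower()
        counts[word] = (counts[word] or 0) + 1
      end
    end
  end

  local words = {}
  for word in pairs(counts) do words[#words + 1] = word end
  table.sort(words, function(a, b) return counts[a] > counts[b] end)

  for _, word in ipairs(words) do            -- CSV: word,count, most frequent first
    print(word .. "," .. counts[word])
  end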

How much of an article can you read with only the most popular vocabulary words? Using ./read_check.lua, you pipe in a piece of text (e.g. a news article or blog post), give it the CSV file you just generated with word_check.lua, and tell it how many of the top-occurring words to use. For example, if you pass in 5, it will check the article's words against the 5 most frequent words in the CSV, which are probably 'the', 'an', 'a', etc.
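As another hedged sketch (the argument handling and output format are illustrative, and the exceptions-file mechanism described below is left out), the core coverage check amounts to loading the top N words into a set and counting hits:

  local n = tonumber(arg[1])                 -- e.g. 3000
  local known, loaded = {}, 0
  for line in io.lines(arg[2]) do            -- the CSV produced by word_check.lua
    local word = line:match("^(%a+),")
    if word then
      known[word:lower()] = true
      loaded = loaded + 1
      if loaded >= n then break end
    end
  end

  local hits, total = 0, 0
  for line in io.lines() do                  -- the article, piped in on stdin
    for token in line:gmatch("%a+") do
      total = total + 1
      if known[token:lower()] then hits = hits + 1 end
    end
  end

  local pct = total > 0 and 100 * hits / total or 0
  print(("readable: %d of %d words (%.1f%%)"):format(hits, total, pct))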

Most stories contain a lot of proper names, and it's not fair to count these as misses. Given the context, you usually know that in "Mr. Gropper gave the boy some roses", Mr. Gropper is a person.

To adjust for this, the script runs in two passes. On the first pass you pass in "--output_missed=true", and it outputs a CSV to stdout containing all the missed words and how often each one was missed.

For example: ./read_check.lua --output_missed=true 3000 ./data/foxnews.csv < ./data/article1.txt > ./data/article1_missed_words.txt

You can then edit the missed-words file and leave only the 'exceptions'. This effectively adds those words to the known dictionary without affecting the number of words used from the primary dictionary file.
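For instance, the missed-words file might contain something like this (hypothetical contents); here you would keep the proper names gropper and chrysler as exceptions and delete the line for unfazed, which is a genuine miss:

  gropper,3
  chrysler,2
  unfazed,1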

On the second pass, you set --output_missed=false and also pass in this file, e.g.: ./read_check.lua --output_missed=false 3000 ./data/foxnews.csv ./data/article1_missed_words.txt < ./data/article1.txt
