- Grab the new notebooks for this week. They are numbered W3-1 through W3-5.
- (Follow the instructions from last week)
- Grab the new data corpora for this week
- If you're using Tactic, you can get them from the Collections tab of the repository.
- If you're not using Tactic, get them from this repository and put them in your corpora folder.
- Look at our analyses of the Titanic corpus in notebooks 3-1 and 3-2. Try to improve on them.
- This dataset has been the focus of a Kaggle challenge. If you poke around on the internet you might be able to find some suggestions. (I haven't poked around myself so I'm not sure.)
- Try the tasks in notebook 3-5, which use a new non-text corpus.
- See if you can improve on our analysis of the spam (text) corpus in notebook 3-4. (If we get to this in class.)
- The NLTK book has a chapter on classifying text (Chapter 6, "Learning to Classify Text"). I think it's worth reading through. But you'll see that it uses, as its first example, the "gender classification" of names. You can decide what you think about this. (What I think: if we treat it as an attempt to understand what people are responding to when they see a name as male or female, then it is an appropriate and interesting endeavor. But we should tread more sensitively than the authors of this chapter do.)
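
If it helps to see the general shape of text classification before reading the chapter, here is a minimal sketch of the usual pattern: turn each text into features, count how often each feature appears under each label, then pick the most probable label for a new text. This is a from-scratch Naive Bayes with add-one smoothing, not the notebooks' code and not NLTK's own classifier. The function names (`word_features`, `train_naive_bayes`, `classify`) and the tiny training set are made up for illustration and do not come from the course's spam corpus.

```python
import math
from collections import Counter, defaultdict


def word_features(text):
    """Lowercased whitespace tokens; a deliberately crude feature extractor."""
    return text.lower().split()


def train_naive_bayes(labeled_texts):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_texts:
        label_counts[label] += 1
        for word in word_features(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability for this text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # prior for this label
        total_words = sum(word_counts[label].values())
        for word in word_features(text):
            if word not in vocabulary:
                continue  # skip words never seen in training
            count = word_counts[label][word]
            # Add-one (Laplace) smoothing so unseen word/label pairs are not zero.
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up examples (hypothetical; NOT from the course's spam corpus).
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting moved to monday", "ham"),
    ("notes from class attached", "ham"),
    ("lunch on monday after class", "ham"),
]

model = train_naive_bayes(TRAINING)
print(classify(model, "free prize inside"))
print(classify(model, "see you in class monday"))
```

Most of the room for improvement in a classifier like this tends to be in the feature extractor: swapping `word_features` for something smarter (word pairs, punctuation counts, message length) is one way to experiment with the spam analysis.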