
# Book Development


This page documents our plans for the development of the NLTK book, leading to a second edition.

  1. Language Processing and the Natural Language Toolkit

    0. Introduction
      • something to get attention
      • bring in some material from the preface (SB)
      • revisit the reviews of the old chapter 1

    1. NLP systems: translation (nearly done), dialogue (done), summarisation (SB), web search (EL), question answering (logic-rich, inferencing approach EK), recommendations (similarity metrics over document collections, EL), sentiment analysis (EK). Pointers to demonstrations online (links hosted at nltk.org to avoid link rot). Motivation. Architecture. Limitations. Discussion to highlight the non-trivial NLP involved. Help readers understand the breadth and limitations of NLP.
      • description, the pieces you need to solve it (architecture diagram if necessary)
      • the fact that there's overlap between these in terms of the required subtasks
      • some very different approaches exist for the above (favour popularity, reasonableness, coverage of approaches across the whole set)
    2. Sub-tasks: WSD (EK), pronoun resolution/coreference (EL), paraphrasing (EK), finding things in text (SB), language modeling (EL), collocations (SB), sentence segmentation (SB), lexicon (associating meaning with words, and learning those associations automatically), normalization (stemming, unicode, case, twitter-speak) (EK?), syntax (how do different words in the sentence relate to one another; e.g., agent of a verb) (EK?), named entity recognition (EL) -- (identify tasks and writers by mid June)
      • show that these are non-trivial tasks (requires an example), but also do-able (outline of how it works)
      • each block of the architecture diagrams is a candidate here
      • include some indication of what performance to expect on each task (state of the art performance?)
    3. Overview of NLTK (simple things you can do, intended to engage readers with the content, linguists and non-linguists alike) (SB) (see the code sketch below)
      • a little bit less on counting
      • tokenization, tagging, count more interesting things
      • parsing?
    4. Overview of book
    5. Summary (with forward pointers)
    6. Further Reading
    7. Exercises
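To make the Overview of NLTK section concrete, here is a minimal sketch of the kind of "simple things you can do" it could open with, using NLTK's documented word_tokenize, pos_tag and FreqDist. The sentence is illustrative only, and the sketch assumes the required data packages (e.g. 'punkt' and the tagger model) have been fetched with nltk.download().

```python
import nltk

# Illustrative only; assumes nltk.download() has fetched the tokenizer
# and tagger models (e.g. 'punkt').
raw = "NLTK makes it easy to tokenize, tag and count text."
tokens = nltk.word_tokenize(raw)      # split raw text into word tokens
tagged = nltk.pos_tag(tokens)         # attach part-of-speech tags

fd = nltk.FreqDist(tag for (word, tag) in tagged)
print(fd.most_common(3))              # the three most frequent POS tags
```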
  2. Accessing Text Corpora and Lexical Resources

    1. Accessing Text Corpora (mention NomBank, PropBank) [ewanklein]
    2. Conditional Frequency Distributions (see the code sketch below)
    3. Lexical Resources (add FrameNet, VerbNet, SentiWordNet) [edloper]
    4. WordNet (add Open Multilingual WordNet) [stevenbird]
    5. Summary
    6. Further Reading
    7. Exercises
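As a possible illustration for the Conditional Frequency Distributions section, a short sketch close to the existing chapter 2 material, conditioning word counts on Brown corpus genre. The choice of genres and modal verbs is arbitrary, and the sketch assumes the Brown corpus has been downloaded.

```python
import nltk
from nltk.corpus import brown

# Count words separately for each Brown genre (assumes nltk.download('brown')).
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

# Compare how often a few modal verbs occur in two genres.
for genre in ['news', 'romance']:
    print(genre, [(m, cfd[genre][m]) for m in ['can', 'could', 'must']])
```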
  3. Processing Raw Text

    1. Accessing Text from the Web and from Disk (add Twitter) [ewanklein]
    2. Strings: Text Processing at the Lowest Level
    3. Text Processing with Unicode (update for Python 3, including the bytes type – but this will already be done) [edloper]
    4. Regular Expressions for Detecting Word Patterns
    5. Useful Applications of Regular Expressions
    6. Normalizing Text (add Twitter) [ewanklein]
    7. Regular Expressions for Tokenizing Text (see the code sketch below)
    8. Segmentation
    9. Formatting: From Lists to Strings (update to use str.format – but this will already be done)
    10. Scaling up (incl how to use the Stanford tokenizer) [stevenbird]
    11. Summary
    12. Further Reading
    13. Exercises
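A candidate sketch for the Regular Expressions for Tokenizing Text section, using nltk.regexp_tokenize with a verbose pattern. The token classes shown are illustrative, not a finished tokenizer.

```python
import nltk

text = "That U.S.A. poster-print costs $12.40... and it's 10% off."
pattern = r'''(?x)            # verbose regex: whitespace and comments are ignored
    (?:[A-Z]\.)+              # abbreviations, e.g. U.S.A.
  | \$?\d+(?:\.\d+)?%?        # currency amounts and percentages, e.g. $12.40, 10%
  | \w+(?:[-']\w+)*           # words, with optional internal hyphens or apostrophes
  | \.\.\.                    # ellipsis
  | [.,;"?():`-]              # selected punctuation marks as separate tokens
'''
print(nltk.regexp_tokenize(text, pattern))
```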
  4. Language Modeling [edloper]

    1. n-gram models
    2. bins: forming equivalence classes
    3. independence assumptions
    4. sparse data problems
    5. statistical estimators (MLE, Laplace, held-out, etc.) (see the code sketch below)
    6. combining estimators
    7. word clusters and word similarity
    8. word embeddings
    9. Scaling Up -- show how to perform some task that we've already performed, but using an external tool and a larger data set. Should be a short section, but enough to show how to use the interface to the external tool, and to show the performance difference etc.
    10. Summary
    11. Further Reading
    12. Exercises
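For the statistical estimators section, a minimal sketch contrasting MLE with Laplace smoothing on bigram counts, using the documented MLEProbDist and LaplaceProbDist classes in nltk.probability. The choice of the Brown news category and the context word "the" is just for illustration.

```python
import nltk
from nltk.corpus import brown
from nltk.probability import MLEProbDist, LaplaceProbDist

# Bigram counts over Brown news (assumes nltk.download('brown')).
words = [w.lower() for w in brown.words(categories='news')]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))

fd = cfd['the']                                      # counts of words following "the"
mle = MLEProbDist(fd)                                # maximum likelihood estimate
laplace = LaplaceProbDist(fd, bins=len(set(words)))  # add-one smoothing over the vocabulary

print(mle.prob('jury'), laplace.prob('jury'))
# An unseen continuation gets probability 0 under MLE but not under Laplace:
print(mle.prob('zebra'), laplace.prob('zebra'))
```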

  5. Categorizing and Tagging Words

    1. Using a Tagger
    2. Tagged Corpora (mention MASC tagged corpus?) [stevenbird]
    3. Mapping Words to Properties Using Python Dictionaries
    4. Automatic Tagging
    5. N-Gram Tagging (see the code sketch below)
    6. Transformation-Based Tagging
    7. How to Determine the Category of a Word
    8. Scaling Up (incl how to use the Stanford tagger) [stevenbird]
    9. Summary
    10. Further Reading
    11. Exercises
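For the N-Gram Tagging section, a sketch of the familiar backoff-tagger combination from the current chapter, trained and evaluated on Brown news. The 90/10 split is arbitrary, and the Brown corpus is assumed to be downloaded.

```python
import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:size], tagged_sents[size:]

t0 = nltk.DefaultTagger('NN')                    # fall back to the most common tag
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
print(t2.evaluate(test_sents))                   # tagging accuracy on held-out sentences
```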
  6. Learning to Classify Text

    1. Supervised Classification
    2. Further Examples of Supervised Classification
    3. Evaluation
    4. Decision Trees
    5. Naive Bayes Classifiers (see the code sketch below)
    6. Maximum Entropy Classifiers
    7. Modeling Linguistic Patterns
    8. Sentiment Detection (incl SentiWordNet); here or in chapter 7 [ewanklein]
    9. Scaling Up [edloper]
    10. Summary
    11. Further Reading
    12. Exercises
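For the Naive Bayes Classifiers section, a sketch along the lines of the current gender-classification example, using nltk.NaiveBayesClassifier on the names corpus. The single-feature extractor is deliberately simple and the 500-item test split is arbitrary.

```python
import random
import nltk
from nltk.corpus import names

# Label each name with its gender (assumes nltk.download('names')).
labeled = ([(n, 'male') for n in names.words('male.txt')] +
           [(n, 'female') for n in names.words('female.txt')])
random.shuffle(labeled)

def gender_features(name):
    """A deliberately simple feature extractor: just the final letter."""
    return {'last_letter': name[-1].lower()}

featuresets = [(gender_features(n), g) for (n, g) in labeled]
train_set, test_set = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)
```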
  7. Extracting Information from Text

    1. Information Extraction
    2. Chunking (decide which interface to use: chunking, tagging, or parsing) (see the code sketch below)
    3. Developing and Evaluating Chunkers
    4. Recursion in Linguistic Structure
    5. Named Entity Recognition
    6. Relation Extraction
    7. Scaling Up (incl how to use the Stanford chunker and named entity recognizer) [edloper]
    8. Summary
    9. Further Reading
    10. Exercises
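For the Chunking and Named Entity Recognition sections, a sketch that combines a regular-expression NP chunker with NLTK's ne_chunk. The sentence and the chunk grammar are illustrative, and the tagger and NE chunker models (plus the 'words' corpus) are assumed to be downloaded.

```python
import nltk

sentence = "The little yellow dog barked at the cat in Washington."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Chunk sequences of determiner + adjectives + nouns into NPs.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
cp = nltk.RegexpParser(grammar)
print(cp.parse(tagged))

# Named entity recognition over the same tagged sentence.
print(nltk.ne_chunk(tagged))
```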
  8. Analyzing Sentence Structure

    1. Some Grammatical Dilemmas
    2. What's the Use of Syntax?
    3. Context Free Grammar
    4. Parsing With Context Free Grammar (see the code sketch below)
    5. Combinatory Categorial Grammar [ewanklein]
    6. Dependencies and Dependency Grammar [edloper]
      • split into two sections:
      • (a) heads, arguments, and roles (mentions FrameNet, VerbNet, NomBank, PropBank)
      • (b) dependency grammar and dependency parsing
    7. Grammar Development (could this work in chapter 9?)
    8. Scaling Up (incl how to use the Stanford parser) [edloper]
    9. Summary
    10. Further Reading
    11. Exercises
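For the Parsing With Context Free Grammar section, a self-contained sketch with a toy grammar and a chart parser, in the style of the current chapter. The grammar and sentence are made up for illustration.

```python
import nltk

grammar = nltk.CFG.fromstring("""
  S -> NP VP
  NP -> Det N | Det N PP
  VP -> V NP | VP PP
  PP -> P NP
  Det -> 'the' | 'a'
  N -> 'dog' | 'cat' | 'park'
  V -> 'saw' | 'chased'
  P -> 'in'
""")

parser = nltk.ChartParser(grammar)
# The PP "in the park" can attach to the VP or to the object NP,
# so the parser returns more than one tree.
for tree in parser.parse("the dog saw a cat in the park".split()):
    print(tree)
```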
  9. Building Feature Based Grammars

    1. Grammatical Features
    2. Processing Feature Structures (see the code sketch below)
    3. Extending a Feature based Grammar
    4. Summary
    5. Further Reading
    6. Exercises
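For the Processing Feature Structures section, a small sketch of unification with nltk.FeatStruct, matching the kind of example already used in this chapter. The feature names and values are illustrative.

```python
import nltk

fs1 = nltk.FeatStruct(NUM='sg', PER=3)
fs2 = nltk.FeatStruct(GND='fem', NUM='sg')
print(fs1.unify(fs2))     # the two structures merge: [GND='fem', NUM='sg', PER=3]

fs3 = nltk.FeatStruct(NUM='pl')
print(fs1.unify(fs3))     # None: the values of NUM conflict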
  10. Analyzing the Meaning of Sentences

    1. Natural Language Understanding
    2. Propositional Logic
    3. First-Order Logic (see the code sketch below)
    4. Logic-based Semantics
    5. The Semantics of English Sentences
    6. Discourse Semantics
    7. Learning to build logical representations [tbd, depends on new implementation]
    8. Summary
    9. Further Reading
    10. Exercises
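For the First-Order Logic material, a sketch of model evaluation with nltk.sem, along the lines of the current chapter's Valuation/Model example. The individuals and predicates form a toy model invented for illustration.

```python
import nltk

# A tiny model: three individuals and a few predicates.
v = """
bertie => b
olive => o
cyril => c
girl => {o}
dog => {c}
walk => {o, c}
see => {(b, o), (c, b), (o, c)}
"""
val = nltk.Valuation.fromstring(v)
dom = val.domain
m = nltk.Model(dom, val)
g = nltk.Assignment(dom)

print(m.evaluate('see(bertie, olive)', g))           # True
print(m.evaluate('all x.(girl(x) -> walk(x))', g))   # True: every girl walks
```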
  11. Managing Linguistic Data

    1. Corpus Structure: a Case Study
    2. The Life-Cycle of a Corpus
    3. Acquiring Data
    4. Working with XML
    5. Working with Toolbox Data (update to: Working with FLEx Data) [stevenbird] (see the code sketch below)
    6. Describing Language Resources using OLAC Metadata
    7. Summary
    8. Further Reading
    9. Exercises
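For the Toolbox/FLEx section, a sketch of the existing Toolbox interface already covered in the current chapter (the FLEx counterpart depends on the planned new material). It assumes the sample Toolbox data has been downloaded.

```python
from nltk.corpus import toolbox

# Parse a Toolbox lexicon file into an ElementTree (assumes nltk.download('toolbox')).
lexicon = toolbox.xml('rotokas.dic')
entry = lexicon[3]
for field in entry:
    print(field.tag, field.text)     # fields such as \lx (lexeme), \ps, \ge
```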
  12. Further Topics

    1. Machine Translation [stevenbird]
      • Sentence Alignment (incl Gale-Church algorithm)
      • Word Alignment (IBM Model 1; mention existence of other models; see the code sketch below)
      • Aligned Corpora
      • Decoding [depends on new implementation]
      • Evaluation
    2. Twitter Processing
    3. Sentiment Analysis
    4. Design Patterns for NLP systems??
    5. Further Reading
    6. Exercises
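For the Word Alignment bullet, a sketch based on NLTK's documented IBMModel1 interface in nltk.translate. The toy German-English bitext below is made up for illustration.

```python
from nltk.translate import AlignedSent, IBMModel1

# A toy bitext: target-language words first, source-language words second.
bitext = [
    AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']),
    AlignedSent(['das', 'haus', 'ist', 'ja', 'gross'], ['the', 'house', 'is', 'big']),
    AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']),
    AlignedSent(['das', 'haus'], ['the', 'house']),
    AlignedSent(['das', 'buch'], ['the', 'book']),
    AlignedSent(['ein', 'buch'], ['a', 'book']),
]

ibm1 = IBMModel1(bitext, 5)    # 5 iterations of EM

# The learned lexical translation probability should strongly pair haus/house.
print(ibm1.translation_table['haus']['house'])

# Training also writes Viterbi word alignments back onto each AlignedSent.
print(bitext[0].alignment)
```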
  13. Appendix: Enough Python for this Book (incorporating material from old chapters 1 and 4)

    1. Getting started with Python (from 1.1)
    2. Texts as lists of words (lists, variables, strings, from 1.2)
    3. Making decisions and taking control (conditionals, comprehensions, nesting, from 1.4)
    4. Sequences (includes tuples, from 4.2)
    5. Functions and modules (from 2.3 and 4.4)
    6. Doing more with functions (from 4.5, plus module structure from 4.6)
    7. Getting started with NLTK (from 1.1 and 1.3)
    8. Exercises
  14. Appendix: Useful Python libraries for NLP (from 4.8)

    1. matplotlib
    2. networkx
    3. csv
    4. numpy
    5. scikit-learn
    6. gensim

# Workplan

We are working on groups of chapters as indicated in the following diagram:

[Diagram: Book development plan]

# Notes

Images are created using Helvetica and exported at 100 dpi.