
Class 11: Regular Expressions

Activities

  • Code review implementations of higher-order Markov chains
  • Review the data structures used to build Markov chains and discuss their scalability
  • Lecture and discussion following the regular expressions slides
  • Build and test regular expressions with RegExr and visualize them with RegExper
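
A pattern prototyped in RegExr can be pasted directly into Python's `re` module to test it against your own corpus. Here is a minimal sketch, assuming Python; the chapter-heading pattern and the sample text are hypothetical, not part of the course materials.

```python
import re

# Hypothetical pattern prototyped in RegExr: matches chapter headings such as
# "CHAPTER IV." that appear on their own line (Roman numerals, optional period).
CHAPTER_HEADING = re.compile(r"^CHAPTER\s+[IVXLC]+\.?\s*$", re.MULTILINE)

sample = "CHAPTER I.\nIt was the best of times.\nCHAPTER II.\nIt was the worst of times."
print(CHAPTER_HEADING.findall(sample))  # ['CHAPTER I.', 'CHAPTER II.']
```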

Objectives

After completing this class session and the associated tutorial challenges, students will be able to ...

  • Use regular expressions to clean up and remove junk text from a corpus
  • Use regular expressions to create a more intelligent word tokenizer

Resources

Challenges

These challenges are the baseline required to complete the project and course. Be sure to complete them before the next class session and before starting on the stretch challenges below.

  • Page 13: Parsing Text and Clean Up (see the cleanup sketch after this list)
    • Remove unwanted junk text (e.g., chapter titles in books, character names in scripts)
    • Remove unwanted punctuation (e.g., _ or * characters around words)
    • Convert HTML character codes to their plain-character equivalents (e.g., the entity &mdash; to a dash)
    • Normalize punctuation characters (e.g., convert curly quotes ‘’ and “” to straight quotes '' and "")
  • Page 14: Tokenization (see the tokenizer sketch after this list)
    • Handle special characters (e.g., underscores, dashes, brackets, $, %, etc.)
    • Handle punctuation and hyphens (e.g., Dr., U.S., can't, on-demand, etc.)
    • Handle letter casing and capitalization (e.g., turkey and Turkey)
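
As a starting point for the Page 13 challenges above, here is a minimal cleanup sketch, assuming Python and the standard library's `re` and `html` modules; the junk-text pattern and the function name `clean_text` are illustrative choices, not requirements of the tutorial.

```python
import html
import re

def clean_text(text):
    """Apply a series of substitutions to raw corpus text (illustrative only)."""
    # Convert HTML character codes such as &mdash; and &quot; to plain characters.
    text = html.unescape(text)
    # Remove junk lines, e.g. chapter headings (hypothetical pattern; adjust per corpus).
    text = re.sub(r"^CHAPTER\s+[IVXLC]+\.?\s*$", "", text, flags=re.MULTILINE)
    # Strip emphasis markers such as _word_ or *word*, keeping the word itself.
    text = re.sub(r"[_*](\w[^_*]*)[_*]", r"\1", text)
    # Normalize curly quotes to straight quotes.
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return text
```

For the Page 14 tokenization challenges, one possible strategy (a sketch, not the required approach) is a single pattern that keeps internal apostrophes, periods, and hyphens inside a token while treating symbols like $ and % as tokens of their own:

```python
import re

# A token is a run of letters/digits that may contain internal apostrophes,
# periods, or hyphens (can't, U.S., on-demand), or a lone $ or % symbol.
# Classic trade-off: a sentence-final period is treated as punctuation and
# dropped, so "U.S." at the end of a sentence comes out as "u.s".
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['.\-][A-Za-z0-9]+)*|[$%]")

def tokenize(text, lowercase=True):
    """Split text into tokens; optionally fold case so 'Turkey' and 'turkey' collapse."""
    tokens = TOKEN_PATTERN.findall(text)
    return [token.lower() for token in tokens] if lowercase else tokens

print(tokenize("Dr. Smith can't pay $100 for on-demand service in the U.S."))
# ['dr', 'smith', "can't", 'pay', '$', '100', 'for', 'on-demand', 'service', 'in', 'the', 'u.s']
```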

Stretch Challenges

These challenges are more difficult and help you push your skills and understanding to the next level.

  • Page 13: Parsing Text and Clean Up
    • Make your parser code readable, then improve its organization and modularity so that it's easy to modify in the future
    • Modify your parser so that it can be used both as a module (imported by another script) and as a stand-alone script: when invoked from the command line with a file argument, it should print the cleaned-up text so the output can be redirected into a file (a minimal sketch of this pattern follows this list)
  • Page 14: Tokenization
    • Make your tokenizer code readable, then improve its organization and modularity so that it's easy to modify in the future
    • Write tests to verify that you're getting the results you designed for, and run them against controlled input data (a small test sketch follows this list)
    • Come up with at least one other tokenization strategy and compare performance against your original strategy, then find ways to make your tokenizer more efficient
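
A common way to satisfy the module-plus-script stretch challenge is the `if __name__ == '__main__'` guard with the corpus path taken from the command line. A minimal sketch, assuming the file is saved as, say, `cleanup.py` and that your real cleanup logic lives in a function like the hypothetical `clean_text` above:

```python
import sys

def clean_text(text):
    """Placeholder for your actual regex-based cleanup (see the sketch above)."""
    return text

def main(path):
    # Read the corpus file, clean it, and print the result to standard output
    # so it can be redirected into a file.
    with open(path) as file:
        print(clean_text(file.read()))

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('Usage: python cleanup.py <corpus-file>', file=sys.stderr)
        sys.exit(1)
    main(sys.argv[1])
```

Run as `python cleanup.py corpus.txt > cleaned.txt`, it prints the cleaned text for redirection into a file; imported with `from cleanup import clean_text`, the `__main__` guard keeps it from touching any file. For the testing stretch challenge, the same idea applies to the tokenizer; here is a tiny sketch using the standard library's `unittest`, where the module name `tokenizer` is hypothetical:

```python
import unittest

from tokenizer import tokenize  # hypothetical module containing your tokenize()

class TokenizeTests(unittest.TestCase):
    def test_contraction_stays_together(self):
        self.assertIn("can't", tokenize("She can't go."))

    def test_case_folding(self):
        self.assertEqual(tokenize("Turkey turkey"), ['turkey', 'turkey'])

if __name__ == '__main__':
    unittest.main()
```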