Use NLTK sent_tokenize and word_tokenize #4

achyudh · 2019-03-22T17:05:03Z

We should replace our primitive regex-based tokenization with NLTK's tokenize module (`sent_tokenize` and `word_tokenize`) in the dataset pre-processing classes. This should happen after we create a snapshot release of this repository for the camera-ready.

Code duplication can be reduced if the pre-processing methods are moved to a shared util module rather than being repeated in each dataset class.
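
A rough sketch of what such a shared helper could look like. The names `tokenize_document` and `regex_word_tokenize` are hypothetical and do not exist in this repository; only `nltk.tokenize.sent_tokenize` and `nltk.tokenize.word_tokenize` are real NLTK functions. The tokenizers are passed in as arguments so each dataset class can call the same helper, and the NLTK import is deferred so the module can still be loaded where NLTK is not installed.

```python
import re
from typing import Callable, List, Optional

SentTokenizer = Callable[[str], List[str]]
WordTokenizer = Callable[[str], List[str]]


def regex_word_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for the kind of regex tokenizer being replaced."""
    # Runs of word characters, or any single non-space punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_document(
    text: str,
    sent_tokenizer: Optional[SentTokenizer] = None,
    word_tokenizer: Optional[WordTokenizer] = None,
) -> List[List[str]]:
    """Split text into sentences, then split each sentence into word tokens."""
    if sent_tokenizer is None or word_tokenizer is None:
        # Imported lazily: NLTK is only required when a default is needed.
        from nltk.tokenize import sent_tokenize, word_tokenize

        sent_tokenizer = sent_tokenizer or sent_tokenize
        word_tokenizer = word_tokenizer or word_tokenize
    return [word_tokenizer(sentence) for sentence in sent_tokenizer(text)]
```

With no tokenizer arguments, `tokenize_document(text)` falls back to NLTK, which needs its Punkt sentence model to be downloaded first (for example via `nltk.download("punkt")`; the resource name may differ between NLTK releases). Passing explicit tokenizers keeps the old regex behaviour available for comparison.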

achyudh mentioned this issue Mar 22, 2019

Use NLTK sent_tokenize and word_tokenize karkaroff/hedwig#6

Closed

achyudh added the enhancement New feature or request label Apr 19, 2019

achyudh mentioned this issue Apr 19, 2019

Tokenization for common models karkaroff/hedwig#16

Closed
