
# Tokenization

Tokenization is the task of splitting a sentence into tokens such that each token represents a word or a punctuation mark. ELIT features a rule-based English tokenizer that offers both tokenization and sentence segmentation.

## Tokenization

The `EnglishTokenizer` class handles common abbreviations, apostrophes, concatenated words, hyphens, network protocols, emojis, emails, HTML entities, and list items with expert-crafted rules. For example:

```python
from elit.components.tokenizer import EnglishTokenizer

tokenizer = EnglishTokenizer()
text = "Emory NLP is a research lab in Atlanta, GA. It is founded by Jinho D. Choi in 2014. Dr. Choi is a professor at Emory University."
print(tokenizer.tokenize(text))
```

Output:

```
['Emory', 'NLP', 'is', 'a', 'research', 'lab', 'in', 'Atlanta', ',', 'GA', '.', 'It', 'is', 'founded', 'by', 'Jinho', 'D.', 'Choi', 'in', '2014', '.', 'Dr.', 'Choi', 'is', 'a', 'professor', 'at', 'Emory', 'University', '.']
```
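
The rules also cover the other categories listed above, such as emails, network protocols, and hyphenated words. The snippet below is a minimal sketch that exercises a few of them; the input sentences are illustrative, and the exact token boundaries depend on ELIT's rule set, so no expected output is shown.

```python
from elit.components.tokenizer import EnglishTokenizer

tokenizer = EnglishTokenizer()

# Illustrative inputs only; the precise tokenization depends on ELIT's rules.
print(tokenizer.tokenize("Contact choi@emory.edu or visit https://www.emory.edu for details."))
print(tokenizer.tokenize("It's a well-known, state-of-the-art tokenizer :)"))
```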

## Sentence Segmentation

The resulting tokens can be fed into `tokenizer.segment` for sentence segmentation. For example:

```python
from elit.components.tokenizer import EnglishTokenizer

tokenizer = EnglishTokenizer()
print(tokenizer.segment(
    ['Emory', 'NLP', 'is', 'a', 'research', 'lab', 'in', 'Atlanta', ',', 'GA', '.', 'It', 'is', 'founded', 'by',
     'Jinho', 'D.', 'Choi', 'in', '2014', '.', 'Dr.', 'Choi', 'is', 'a', 'professor', 'at', 'Emory',
     'University', '.']))
```

Output:

```
[['Emory', 'NLP', 'is', 'a', 'research', 'lab', 'in', 'Atlanta', ',', 'GA', '.'], ['It', 'is', 'founded', 'by', 'Jinho', 'D.', 'Choi', 'in', '2014', '.'], ['Dr.', 'Choi', 'is', 'a', 'professor', 'at', 'Emory', 'University', '.']]
```
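
Tokenization and sentence segmentation can be chained to go from raw text to a list of sentences. The helper below is a minimal sketch using only the `tokenize` and `segment` methods shown above; the helper name `split_sentences` is an assumption for illustration, not part of ELIT's API.

```python
from elit.components.tokenizer import EnglishTokenizer

tokenizer = EnglishTokenizer()

def split_sentences(text):
    """Tokenize raw text, then group the tokens into sentences (sketch)."""
    tokens = tokenizer.tokenize(text)  # flat list of tokens
    return tokenizer.segment(tokens)   # list of token lists, one per sentence

for sentence in split_sentences("Emory NLP is a research lab in Atlanta, GA. Dr. Choi is a professor at Emory University."):
    print(sentence)
```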