
# Tokenization

Tokenization is the task of splitting a sentence into tokens such that each token represents a word or a punctuation mark. ELIT features a rule-based English tokenizer that offers both tokenization and sentence segmentation.

## Tokenization

The `EnglishTokenizer` class handles common abbreviations, apostrophes, concatenated words, hyphens, network protocols, emojis, emails, HTML entities, and list items with expert-crafted rules. For example:

```python
from elit.components.tokenizer import EnglishTokenizer

tokenizer = EnglishTokenizer()
text = "Emory NLP is a research lab in Atlanta, GA. It is founded by Jinho D. Choi in 2014. Dr. Choi is a professor at Emory University."
print(tokenizer.tokenize(text))
```

Output:

```
['Emory', 'NLP', 'is', 'a', 'research', 'lab', 'in', 'Atlanta', ',', 'GA', '.', 'It', 'is', 'founded', 'by', 'Jinho', 'D.', 'Choi', 'in', '2014', '.', 'Dr.', 'Choi', 'is', 'a', 'professor', 'at', 'Emory', 'University', '.']
```
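
The rules also cover the other categories listed above, such as emails, network protocols, and hyphenated words. The snippet below is a minimal sketch that exercises a few of them; the input sentences are illustrative, and the exact token boundaries depend on ELIT's rule set, so no expected output is shown.

```python
from elit.components.tokenizer import EnglishTokenizer

tokenizer = EnglishTokenizer()

# Illustrative inputs only; the precise tokenization depends on ELIT's rules.
print(tokenizer.tokenize("Contact choi@emory.edu or visit https://www.emory.edu for details."))
print(tokenizer.tokenize("It's a well-known, state-of-the-art tokenizer :)"))
```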

## Sentence Segmentation

The resulting tokens can be fed into `tokenizer.segment` for sentence segmentation. For example:

```python
from elit.components.tokenizer import EnglishTokenizer

tokenizer = EnglishTokenizer()
print(tokenizer.segment(
    ['Emory', 'NLP', 'is', 'a', 'research', 'lab', 'in', 'Atlanta', ',', 'GA', '.', 'It', 'is', 'founded', 'by',
     'Jinho', 'D.', 'Choi', 'in', '2014', '.', 'Dr.', 'Choi', 'is', 'a', 'professor', 'at', 'Emory',
     'University', '.']))
```

Output:

```
[['Emory', 'NLP', 'is', 'a', 'research', 'lab', 'in', 'Atlanta', ',', 'GA', '.'], ['It', 'is', 'founded', 'by', 'Jinho', 'D.', 'Choi', 'in', '2014', '.'], ['Dr.', 'Choi', 'is', 'a', 'professor', 'at', 'Emory', 'University', '.']]
```
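
Tokenization and sentence segmentation can be chained to go from raw text to a list of sentences. The helper below is a minimal sketch using only the `tokenize` and `segment` methods shown above; the helper name `split_sentences` is an assumption for illustration, not part of ELIT's API.

```python
from elit.components.tokenizer import EnglishTokenizer

tokenizer = EnglishTokenizer()

def split_sentences(text):
    """Tokenize raw text, then group the tokens into sentences (sketch)."""
    tokens = tokenizer.tokenize(text)  # flat list of tokens
    return tokenizer.segment(tokens)   # list of token lists, one per sentence

for sentence in split_sentences("Emory NLP is a research lab in Atlanta, GA. Dr. Choi is a professor at Emory University."):
    print(sentence)
```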