Tokenizer_NLP

What is Tokenization in NLP?

Tokenization is one of the most common tasks when working with text data. But what does the term ‘tokenization’ actually mean? Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

Why is Tokenization required in NLP?

I want you to think about the English language here. Pick any sentence you can think of and hold it in your mind as you read this section; it will make the importance of tokenization much easier to understand. Before processing a natural language, we need to identify the words that constitute a string of characters. That is why tokenization is the most basic step in working with text data: the meaning of a text can be interpreted by analyzing the words present in it.

Let’s take an example. Consider the string: “This is a cat.” What do you think will happen after we perform tokenization on this string? We get [‘This’, ‘is’, ‘a’, ‘cat’]. There are numerous uses for this tokenized form. We can use it to:

1. Count the number of words in the text
2. Count the frequency of a word, that is, the number of times a particular word is present

Both uses are sketched in the snippet below.
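A minimal sketch of both uses, assuming the NLTK library is installed (note that its word_tokenize also separates punctuation into its own token):

```python
from collections import Counter

import nltk
from nltk.tokenize import word_tokenize

# One-time download of the tokenizer models
# (newer NLTK releases may call this resource "punkt_tab").
nltk.download("punkt", quiet=True)

text = "This is a cat. This cat is small."

tokens = word_tokenize(text)
print(tokens)
# ['This', 'is', 'a', 'cat', '.', 'This', 'cat', 'is', 'small', '.']

# 1. Number of tokens in the text
print(len(tokens))  # 10

# 2. Frequency of each token
print(Counter(tokens).most_common(3))
# [('This', 2), ('is', 2), ('cat', 2)]
```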

The True Reasons behind Tokenization

As tokens are the building blocks of Natural Language, the most common way of processing the raw text happens at the token level.

For example, Transformer-based models, the state-of-the-art (SOTA) deep learning architectures in NLP, process raw text at the token level. Similarly, the most popular earlier deep learning architectures for NLP, such as RNNs, GRUs, and LSTMs, also process raw text at the token level.

In an RNN, for instance, each token is received and processed at a particular timestep.
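A minimal PyTorch sketch of this token-per-timestep idea; the toy vocabulary, embedding size, and hidden size are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary: each token maps to an integer id.
vocab = {"This": 0, "is": 1, "a": 2, "cat": 3}
tokens = ["This", "is", "a", "cat"]
token_ids = torch.tensor([vocab[t] for t in tokens])

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
rnn = nn.RNN(input_size=8, hidden_size=16)

# Shape (seq_len=4, batch=1, embedding_dim=8): one token per timestep.
inputs = embedding(token_ids).unsqueeze(1)
outputs, hidden = rnn(inputs)

print(outputs.shape)  # torch.Size([4, 1, 16]) -- one hidden state per token
```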

My Kaggle notebook: https://www.kaggle.com/lykin22/tokenizer-nlp

If you liked my analysis, please upvote my notebook!

About

Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.
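A quick sketch of the three levels; the subword example assumes the Hugging Face transformers library and its pretrained bert-base-uncased WordPiece tokenizer:

```python
from transformers import AutoTokenizer

text = "tokenization"

# Word level: split on whitespace (or use a word tokenizer).
print("a simple word tokenizer".split())  # ['a', 'simple', 'word', 'tokenizer']

# Character level: every character is a token.
print(list(text))  # ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']

# Subword level: learned units such as WordPiece/BPE.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize(text))  # e.g. ['token', '##ization']
```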
