Tokenization in Natural Language Processing (NLP) — Data Science
If you read this article to the end, I assure you that the next time someone asks you what tokenization is, you will be able to explain it without any hesitation.
Data Science is an emerging field, and Natural Language Processing (NLP) is a subfield of it. NLP plays a very important role in Data Science, but in order to work with NLP we need to follow some steps, and tokenization is one of those steps.
What is tokenization?
A wall is built of bricks. If we separate the bricks from a wall, there is no longer a wall; we are left with just bricks.
A single brick cannot build a wall, but a single wall contains many bricks. Breaking the wall down and extracting each brick from it is essentially tokenization. The only difference is that in tokenization, the "wall" can be a phrase, a sentence, a paragraph, or a huge document. A phrase, sentence, paragraph, or document is not built from one word; it is built when multiple words are written together. So words are the bricks of our tokenization wall, and extracting those words from the wall individually is called tokenization.
Now that we have covered the theoretical part of tokenization, how can we implement it in Python?
So here we go:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
NLTK (the Natural Language Toolkit) is like a workshop where we find the many tools that help us build NLP applications.
From nltk.tokenize we imported two functions, i.e. word_tokenize and sent_tokenize. With word_tokenize, the words are the bricks of our wall, whereas with sent_tokenize, each sentence is a brick.
So we know that a sentence is completed by a full stop '.' and that words are separated by spaces.
Let's say I have the following variable of type string that holds a phrase:
sentence = "Hello my name is Umair khan. I am a data science practitioner, I love to develop applications for IOS and I love martial arts"
So if I use word_tokenize on this phrase, it gives me the following result:
word_tokenize() is the function we call; inside the parentheses we pass our sentence (or any other variable) that needs to be tokenized, and we store the tokenization result in a variable called 'a'.
So word_tokenize shows us that each word is a brick: it returned a list containing each and every word in the phrase. We can also say the returned list is nothing but a list of tokens. So tokens = bricks.
Everything you see printed inside single quotes is one of our tokens.
Now let's implement sent_tokenize:
After the above explanation, I don't think I need to explain the code again. Let's see what got printed. As I said earlier, when we use sent_tokenize each sentence counts as a token, i.e. a brick of the wall, so anything printed inside single quotes is a brick, i.e. a token.
We got only two tokens from the provided phrase. This is because a sentence is completed when a full stop occurs, and in our phrase there is only one full stop.
So let's check what happens if we remove all the full stops from our variable 'sentence'. It will now look like this:
sentence = "Hello my name is Umair khan I am a data science practitioner, I love to develop applications for IOS and I love martial arts"
Now let's use sent_tokenize() again.
See, we are returned only one token. Because we removed all the full stops, the tokenizer finds no sentence boundary: no full stop means no sentence break, so the whole phrase is treated as a single sentence, i.e. one brick only.
So now we know how to tokenize, but what if thousands of tokens are returned? How can we count them easily? Well, here is the trick:
Here 'a' is the variable that holds the result of word_tokenize(), and that result is a list of tokens. Calling len(a) returns the length of the list. The list does not necessarily need to be returned by word_tokenize(); it could be any list.
The len() function tells us we have 26 tokens inside the list 'a'.
Now let's see how we can check the frequency of a single token, i.e. the number of times the same word occurs.
To check the frequency, we first need to import the following class and create an instance of it:
from nltk.probability import FreqDist
# Create an instance of FreqDist
frequency = FreqDist()
Next we will write a for loop to iterate over the list 'a'. What is a loop? A loop executes some code again and again until a specific condition is met. In this for loop, we ask it to visit each element of the list 'a'; once it has visited every item, the condition is met and it stops looping.
for i in a:
    # i represents each element of list a
    frequency[i] = frequency[i] + 1
So why 'frequency[i] + 1'? Take an item of list 'a' such as 'I', which occurs three times. The first time the loop reaches it, frequency[i] is 0 (FreqDist returns 0 for keys it has not seen), and we set it to 0 + 1 = 1. Every subsequent time the loop encounters the same item, we increment its count by 1, so 'I' ends up with a count of 3.
You can compare the printed result with the token list and count the occurrences yourself.
That's it for this article. I hope I was able to give you a good understanding and implementation of tokenization.
If you liked the way I explained things and I helped you understand tokenization, kindly follow and subscribe to my email list. I promise to make this a series on NLP: in my next article I will be writing about stemming, and by the time I finish writing on NLP you will have gone from zero to hero.