Stop Words In Natural Language Processing — NLP
Implement Stop Words with Python
What are stop words, and how do they help us in NLP? In this article I am going to answer both of these questions.
Stop words are nothing but words that are used very often yet have a low impact on our NLP process. The text we provide for NLP training contains many words that contribute nothing to training but still waste extra computation power. For example, the word 'a' is used very often; in some documents it might appear 100 times, but it won't help us train a better model or gain better accuracy, so we filter out such words before training.
In short, stop words are words that occur frequently but have a low impact on training our model.
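The idea can be sketched without any libraries. Here is a minimal example using a tiny hand-picked stop word set (an illustration only; the real NLTK list is much longer):

```python
# A tiny hand-picked stop word set, for illustration only;
# NLTK's English list contains far more entries.
stop_words = {"a", "an", "the", "is", "in", "and"}

text = "a cat and a dog in the house"
tokens = text.split()  # naive whitespace tokenization

# Keep only the tokens that are not stop words
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['cat', 'dog', 'house']
```

Eight tokens go in, three content words come out; everything else was noise for training purposes.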
To import 'stopwords', use the following code:
from nltk.corpus import stopwords
(If this is your first time using the corpus, run nltk.download('stopwords') once to fetch it.)
Now, how do we find out exactly which words are the stop words? Use the following code:
stop_words = set(stopwords.words('english'))
In the above code, 'stop_words' is a variable holding the set of stop words for the English language.
A 'set' is an abstract data type that stores unique values, which means it won't store the same value twice. 'stopwords.words()' is a function provided by the stopwords corpus; pass in the language you want stop words for and you will get them.
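To see the uniqueness property of a set in action, here is a quick illustration:

```python
# Duplicates are silently dropped when a set is built from a list
words = ["the", "the", "a", "a", "an"]
unique = set(words)
print(len(words))       # 5
print(len(unique))      # 3
print("the" in unique)  # True — membership tests on a set are fast
```

This fast membership test is exactly why we convert the stop word list to a set before filtering tokens.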
Printing the variable 'stop_words' will print the set of stop words.
Our process will look like this: first we provide the text, then we tokenize it, and finally we check each token against the set of stop words. If the set contains the token, we do not append it to our new list 'filtered_list'; instead we append it to another list, 'filtered_words'. This extra step lets us see which words were filtered out.
sentence = "I am Umair Khan an IOS developer and Data Science practitioner, I also love to teach martial arts during my free time"
We are going to use the above sentence for this demo.
First, we tokenize the sentence:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(sentence)  # list of tokens (requires nltk.download('punkt') the first time)
Now create two lists: 'filtered_list', which will receive the tokens that are not stop words, and 'filtered_words', which will receive the tokens that are.
filtered_list = []
filtered_words = []
Now iterate over the list of tokens.
for i in tokens:
    if i not in stop_words:
        filtered_list.append(i)
    else:
        filtered_words.append(i)
Now you can see the words that got filtered out: we had a sentence of 23 tokens, which is now reduced to 17. This is how filtering stop words can save time and computation power when working with large documents.
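Putting all the steps together, here is a self-contained sketch of the whole pipeline. To keep it runnable without downloading the NLTK corpora, it uses a small stand-in stop word set and a plain whitespace tokenizer; in practice you would use set(stopwords.words('english')) and word_tokenize instead:

```python
# Stand-in stop word set for this sketch;
# in practice use set(stopwords.words('english')) from NLTK.
stop_words = {"i", "am", "an", "and", "to", "also", "during", "my"}

sentence = ("I am Umair Khan an IOS developer and Data Science practitioner, "
            "I also love to teach martial arts during my free time")

# Plain whitespace tokenizer for this sketch; word_tokenize would also
# split off punctuation. Lowercasing matters: the stop word list is lowercase.
tokens = sentence.lower().split()

filtered_list = []   # tokens that survive the filter
filtered_words = []  # tokens that were filtered out
for token in tokens:
    if token not in stop_words:
        filtered_list.append(token)
    else:
        filtered_words.append(token)

print(filtered_list)
print(filtered_words)
```

Note the lowercasing step: without it, capitalized tokens like "I" would slip past a lowercase stop word set.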
I don't think I could find a better way of explaining stop words. If you agree, do follow me and share this article with your friends. It really motivates me to write more tutorials.