Stop Words In Natural Language Processing — NLP

Photo by Branden Tate on Unsplash

What are stop words, and how does removing them help us in NLP? In this article I am going to answer both of these questions.

Stop words are simply words that are used very often but have little impact on our NLP process. The text we provide for NLP training contains many words that contribute nothing to training and only waste computation power. Take ‘a’ for example: it is used very often, and in some documents it might appear 100 times, but that won’t help us train the model or gain better accuracy. Hence we filter out these words before training.

In short, stop words are words that occur very often but have a low impact on training our model.
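To see how a word like ‘a’ can dominate a text, here is a minimal sketch that counts word frequencies with Python's built-in `collections.Counter`. The sample text is made up for this illustration and is not taken from any real dataset.

```python
from collections import Counter

# Hypothetical sample text, made up for this illustration.
text = (
    "a cat sat on a mat and a dog sat on a rug "
    "while a bird sang a song to a cat"
)

# Split on whitespace and count how often each word appears.
counts = Counter(text.split())
most_common_word, frequency = counts.most_common(1)[0]

print(counts.most_common(3))
print(most_common_word, frequency)
```

The most frequent word here is ‘a’, yet it tells us nothing about what the text is about. That is the kind of word we want to filter out.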

To import ‘stopwords’, use the following code:

from nltk.corpus import stopwords
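On a fresh installation the import works, but reading the words can fail with a `LookupError` because the stop word corpus has not been downloaded yet. Below is a minimal sketch of one way to handle that. The helper name `load_english_stop_words` is my own and is not part of NLTK; `nltk.download` and `stopwords.words` are the real NLTK calls.

```python
def load_english_stop_words():
    """Return NLTK's English stop words as a set, downloading them if missing."""
    # Imported inside the function so this sketch can be defined
    # even on a machine where nltk is not installed yet.
    import nltk
    from nltk.corpus import stopwords

    try:
        return set(stopwords.words("english"))
    except LookupError:
        # NLTK raises LookupError when the corpus files are not on disk.
        nltk.download("stopwords")
        return set(stopwords.words("english"))
```

Calling `load_english_stop_words()` once at the start of a script should be enough; later calls read the corpus from disk.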

Now, how do I find out exactly which words are stop words? To do that, use the following code:

stop_word = set(stopwords.words('english'))

In the above code, ‘stop_word’ is a variable that holds the set of stop words for the English language.

‘set’ is a data type that stores only unique values, which means it won’t store the same value twice. ‘stopwords.words(language)’ is a function provided by the stopwords corpus: just pass in the language you want stop words for and you will get them.
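Here is a small sketch of what the set gives us, using a short hand-written list instead of the real NLTK words (the list is an assumption for illustration only).

```python
# A list may contain duplicates; converting it to a set keeps one copy of each.
words = ["the", "a", "the", "an", "a"]
unique_words = set(words)

print(len(words))         # number of items in the list, duplicates included
print(len(unique_words))  # number of distinct items in the set

# Membership tests with `in` are what we need for filtering.
# A set uses hashing, so the check stays fast as the collection grows.
print("the" in unique_words)
print("python" in unique_words)
```

Since we will check every token against the stop words, the fast membership test is the main reason to wrap the result of `stopwords.words` in a set.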


Printing the variable ‘stop_word’ will print the set of stopwords.

Practical Implementation:


Our process will look like this: first we provide the text, then we tokenize it, and as the last step we check each token against the set of stop words. Tokens that are not in the set go into our new list, ‘filtered_list’. If the set contains the token, we leave it out of ‘filtered_list’ and append it to another list, ‘filtered_words’, instead. This is just an extra step so that we can see which words were filtered out.

sentence = "I am Umair Khan an IOS developer and Data Science practitioner, I also love to teach martial arts during my free time"

This is the sentence we are going to use for this demo.

First, we tokenize the above sentence:

from nltk.tokenize import word_tokenize

tokens = word_tokenize(sentence)  # list of tokens
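Note that `word_tokenize` needs NLTK's tokenizer data, which may require a one-time download (`punkt` or `punkt_tab`, depending on the NLTK version). To show what tokenizing does without that dependency, here is a rough stand-in built on a regular expression. It is not NLTK's algorithm, and `simple_tokenize` is a hypothetical helper; it only mimics the idea that words and punctuation marks become separate tokens.

```python
import re

sentence = (
    "I am Umair Khan an IOS developer and Data Science practitioner, "
    "I also love to teach martial arts during my free time"
)


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize(sentence)
print(tokens)
print(len(tokens))
```

For this sentence the comma after ‘practitioner’ becomes its own token, so 22 words turn into 23 tokens.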

Now create two lists: ‘filtered_list’, which collects the tokens that are not stop words, and ‘filtered_words’, which collects the tokens that are stop words.

filtered_list = []
filtered_words = []

Now iterate over the list of tokens.

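for token in tokens:
    if token not in stop_word:
        # keep the token: it is not a stop word
        filtered_list.append(token)
    else:
        # remember the stop word so we can see what was removed
        filtered_words.append(token)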
print(filtered_words)
print(filtered_list)
print(tokens)

Now you can see the words that got filtered out. We started with 23 tokens (22 words plus the comma), which are now reduced to 17. This is how filtering stop words can save you time and computation power when working with large documents.
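One detail worth knowing: the check is case-sensitive, and as far as I know NLTK's English list is stored in lowercase, so the capital ‘I’ in our sentence is not filtered out. The sketch below shows the difference. The tokens are written out by hand, `sample_stop_words` is a small hand-picked stand-in for the NLTK list, and `remove_stop_words` is a hypothetical helper, so treat it as an illustration rather than the definitive way to do it.

```python
# Tokens for the demo sentence, written out by hand (23 items including the comma).
tokens = [
    "I", "am", "Umair", "Khan", "an", "IOS", "developer", "and", "Data",
    "Science", "practitioner", ",", "I", "also", "love", "to", "teach",
    "martial", "arts", "during", "my", "free", "time",
]

# Small hand-picked stand-in for the NLTK list (an assumption for this sketch).
sample_stop_words = {"i", "am", "an", "and", "to", "during", "my", "a", "the"}


def remove_stop_words(tokens, stop_set, ignore_case=False):
    """Return the tokens that are not in stop_set."""
    if ignore_case:
        # Compare the lowercased token, but keep the original spelling in the result.
        return [t for t in tokens if t.lower() not in stop_set]
    return [t for t in tokens if t not in stop_set]


case_sensitive = remove_stop_words(tokens, sample_stop_words)
case_insensitive = remove_stop_words(tokens, sample_stop_words, ignore_case=True)

print(len(tokens), len(case_sensitive), len(case_insensitive))
```

With the case-sensitive check 17 tokens remain, and with `ignore_case=True` both occurrences of ‘I’ are removed as well, leaving 15.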

I don't think I could find a better way of explaining stop words. If you think the same, do follow me and share my article with your friends. This really motivates me to write more and more tutorials.

Umair Ishrat Khan


An energetic and motivated iOS developer and Data Science practitioner. Apart from computer science, martial arts interests me the most.