Bag of Words in NLP: Natural Language Processing

The bag we are talking about is not a common bag; it’s a bag of words. But what is bag-of-words (BoW)? Going by the name, we can assume it’s a bag carrying words.

Well, that’s somewhat true: this bag does carry words. But what is it used for, and how does it work? That is what I am going to explain now.

  • He wants to go USA
  • She wants to go Germany
  • Bodybuilders like good diet

Above are three sentences. If I give this text directly to my model as input, will it understand? No! It won’t, because a model works on numbers, not on raw text. So in order to make the computer understand these sentences, we need to create an equivalent vector for each sentence.

A vector, in easy words, is nothing but a numerical form of the text: a list of numbers.

This whole process of creating an equivalent vector is called bag of words. Now let’s explore the process of BoW.

Process of BoW:

Assume that we have only these three statements and that they are the entire data we need to work on.

Unique Words:

First, BoW finds all the unique words in the three statements; a repeated word is counted only once.

In the given data we find 11 unique words: He, wants, to, go, USA, She, Germany, Bodybuilders, like, good, diet.
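The unique-word step can be sketched in a few lines of Python. This is only a minimal illustration, assuming that words are separated by whitespace and compared case-sensitively; the helper name `build_vocabulary` is hypothetical and not part of any library.

```python
# The three statements used throughout this article.
sentences = [
    "He wants to go USA",
    "She wants to go Germany",
    "Bodybuilders like good diet",
]

def build_vocabulary(sentences):
    """Return the unique words in order of first appearance.

    Assumption: words are split on whitespace and compared case-sensitively.
    """
    vocabulary = []
    seen = set()
    for sentence in sentences:
        for word in sentence.split():
            if word not in seen:  # skip words we have already recorded
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

vocabulary = build_vocabulary(sentences)
print(vocabulary)
print(len(vocabulary))  # 11
```

Keeping the words in order of first appearance is just one possible choice; any fixed order works as long as it stays the same for every statement.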

Creating Table:

As soon as BoW has found all the unique words, it creates a column for each word and a row for each statement.

So in the current scenario we have 11 words and 3 statements, which means our table will contain 11 columns and 3 rows.
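To make the shape of the table concrete, here is a small sketch that builds the empty table as a list of lists. The variable names are assumptions chosen for illustration, and the word list is written out by hand in order of first appearance.

```python
# The 11 unique words, written out by hand in order of first appearance.
vocabulary = ["He", "wants", "to", "go", "USA", "She",
              "Germany", "Bodybuilders", "like", "good", "diet"]

sentences = [
    "He wants to go USA",
    "She wants to go Germany",
    "Bodybuilders like good diet",
]

# One row per statement, one column per unique word, every cell starting at 0.
table = [[0] * len(vocabulary) for _ in sentences]

n_rows = len(table)
n_cols = len(table[0])
print(n_rows, n_cols)  # 3 11
```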

Creating Vectors:

Now we are going to use numbers to fill out our table and create an equivalent vector for each statement.

If a word is in the sentence, we write the number of times it is present; if it is not present, we write 0.

Our first statement is ‘He wants to go USA’. ‘He’ is in the statement only once, which is why we put 1 in that column for the first statement’s row. ‘wants’ is there once, so we put 1. ‘to’ is there once, so again we put 1. ‘go’ is there once, so we put 1. ‘USA’ is there once, so we put 1. The remaining words aren’t in our first sentence, i.e. the first row, which is why we put 0 for them.

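Assuming the columns are taken in order of first appearance (He, wants, to, go, USA, She, Germany, Bodybuilders, like, good, diet), the vector form of each statement looks like this:

  • He wants to go USA: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
  • She wants to go Germany: [0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
  • Bodybuilders like good diet: [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]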

Note one thing: the 1 in a row doesn’t only represent presence; it is the number of times that word occurs in that specific statement.
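The filling-in step can be sketched with `collections.Counter` from Python’s standard library. This is a minimal sketch, not a definitive implementation: it assumes whitespace tokenization, case-sensitive matching, and columns in order of first appearance, and the helper name `vectorize` is hypothetical.

```python
from collections import Counter

# Assumed column order: unique words in order of first appearance.
vocabulary = ["He", "wants", "to", "go", "USA", "She",
              "Germany", "Bodybuilders", "like", "good", "diet"]

sentences = [
    "He wants to go USA",
    "She wants to go Germany",
    "Bodybuilders like good diet",
]

def vectorize(sentence, vocabulary):
    """Return how many times each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())  # word -> number of occurrences
    # A Counter returns 0 for words it has not seen, which gives us the zeros.
    return [counts[word] for word in vocabulary]

vectors = [vectorize(sentence, vocabulary) for sentence in sentences]
for sentence, vector in zip(sentences, vectors):
    print(vector, sentence)
```

Each row has exactly one number per unique word, so every statement becomes a vector of the same length no matter how many words it contains.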

Example:

Now suppose our statement is ‘He He wants to go USA’.

‘He’ occurs twice, which is why this time we put 2 in the ‘He’ column for this statement.
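The repeated-word case can be checked with a short sketch. As before, whitespace splitting, case-sensitive matching, and the column order are assumptions made for illustration.

```python
# Assumed column order: unique words in order of first appearance.
vocabulary = ["He", "wants", "to", "go", "USA", "She",
              "Germany", "Bodybuilders", "like", "good", "diet"]

statement = "He He wants to go USA"
words = statement.split()

# list.count returns how many times a word occurs, and 0 if it is absent.
vector = [words.count(word) for word in vocabulary]
print(vector)  # [2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```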

After converting our text into vectors, we can pass them as input to our model.

If this article has helped you understand BoW, kindly follow me and share it with your friends. In my next article I’ll be writing about TF-IDF.

Umair Ishrat Khan


An energetic and motivated iOS developer and data science practitioner. Apart from computer science, martial arts interest me the most.