Term Frequency & Inverse Document Frequency : TF-IDF in NLP


Before we talk about TF-IDF, I hope you've read my last article on Bag-of-Words; if not, please read it first and then come back here.

In Bag-of-Words, the vector we create for any statement contains whole numbers, i.e. non-decimal counts. Take the example of the following statement:

We have 11 columns and 3 rows, and we are talking about the first statement, i.e. 'He wants to go USA'. The vector generated for this statement is '1–1–1–1–1–0–0–0–0–0–0': for every word present in the sentence, we got the number of times it appears. But this approach doesn't give us the importance/weight of each word in the statement; it only tells us how many times a word is present.
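The counting step can be sketched in a few lines of Python. The article's full 11-word table isn't reproduced here, so this sketch uses only the first statement and an assumed five-word slice of the vocabulary:

```python
from collections import Counter

statement = "He wants to go USA"
vocabulary = ["he", "wants", "to", "go", "usa"]  # assumed slice of the 11 columns

# Bag-of-Words: each entry is just a raw count of the word in the statement
counts = Counter(statement.lower().split())
vector = [counts[word] for word in vocabulary]
print(vector)  # [1, 1, 1, 1, 1] -- counts only, no notion of importance
```

Every non-zero entry is 1 here because each word appears exactly once; nothing in the vector says which word matters more.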

To overcome this problem, we turn to TF-IDF. It does the same thing as BoW, converting statements to vectors, but it also provides a weight for each term, i.e. each word.

Let's split the topic: first we discuss TF, then IDF, and at the end we multiply them together.

Term Frequency (TF):

A simple definition of Term Frequency is "how often a word occurs in a document". Mathematically, TF is the number of times a word occurs in the document divided by the total number of words in the document.

TF(word) = (occurrences of word in statement) / (total words in statement)

But unlike BoW, TF won't create columns of unique words; instead it creates a column per statement and a row per unique word.


Assume we have the following three statements: 'John loves to play poker', 'John is good boy', and 'John loves horses'. The table for these will have 3 columns and 9 rows.


We've found 9 unique words across the three statements. To find the TF for each statement, let's take the example of the first statement, i.e. 'John loves to play poker'.

The total number of words in the statement is 5, so our denominator will remain the same for all the words in this statement's column.

  • ‘John’ is in our statement exactly once, so 1/5 (1 = number of times this word occurs in the statement, 5 = total number of words in the statement)
  • ‘loves’ is in our statement exactly once, so 1/5
  • ‘to’ is in our statement exactly once, so 1/5
  • ‘play’ is in our statement exactly once, so 1/5
  • ‘poker’ is in our statement exactly once, so 1/5

The rest of the words are not in our statement, which is why all of them will be 0/5.
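The per-word arithmetic above can be sketched directly in Python, using the statements from the article's example (everything is lower-cased before counting):

```python
statements = [
    "John loves to play poker",
    "John is good boy",
    "John loves horses",
]

# The 9 unique words across all statements (the rows of the TF table)
vocabulary = sorted({w.lower() for s in statements for w in s.split()})

def term_frequency(word, statement):
    words = statement.lower().split()
    return words.count(word) / len(words)  # count in statement / total words

# TF column for the first statement: 1/5 for its five words, 0 elsewhere
tf_first = {w: term_frequency(w, statements[0]) for w in vocabulary}
print(tf_first["john"])    # 0.2 (i.e. 1/5)
print(tf_first["horses"])  # 0.0 (i.e. 0/5)
```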

Now let’s move forward with IDF.

Inverse Document Frequency (IDF):


Unlike TF, where our denominator was constant within each statement, in IDF our numerator remains constant whereas the denominator changes for each word.


Remember the IDF formula: number of sentences divided by number of sentences containing that word. In this case we have three sentences, so the numerator for IDF will be 3, and the denominator will be the number of sentences that the word (the word for that row) is present in.

Take 'John': all three of our sentences contain this word, so the IDF for 'John' is log(3/3). 'loves' appears in two sentences, so its IDF is log(3/2). 'to' appears in one sentence, hence its IDF is log(3/1), and so on.
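A minimal sketch of the IDF computation, assuming the natural logarithm (the article doesn't fix a log base):

```python
import math

statements = [
    "John loves to play poker",
    "John is good boy",
    "John loves horses",
]

def inverse_document_frequency(word, statements):
    # log(number of sentences / number of sentences containing the word)
    containing = sum(1 for s in statements if word in s.lower().split())
    return math.log(len(statements) / containing)

print(inverse_document_frequency("john", statements))   # log(3/3) = 0.0
print(inverse_document_frequency("loves", statements))  # log(3/2) ≈ 0.405
print(inverse_document_frequency("to", statements))     # log(3/1) ≈ 1.099
```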

Now, to generate the final vector for each statement, we need to multiply TF by IDF. And this time our words will generate the columns whereas statements will generate rows.

The table will look like this:


Now, to fill this table, we take the column of each sentence and multiply it by the IDF column. This gives us the values for each statement against each word column.

The final vector for this sentence will be:

And the final TF-IDF vector for all three would be:


'John' is 0 for all three even though it was present in all three sentences. That's because TF-IDF returns a vector based on weights, not mere presence: a word that appears in every document has IDF log(3/3) = 0, so it carries no weight.
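Putting the two together, the whole standard TF-IDF table can be computed by hand; this sketch again assumes the natural logarithm:

```python
import math

statements = [
    "John loves to play poker",
    "John is good boy",
    "John loves horses",
]
vocabulary = sorted({w.lower() for s in statements for w in s.split()})

def tf(word, statement):
    words = statement.lower().split()
    return words.count(word) / len(words)

def idf(word):
    containing = sum(1 for s in statements if word in s.lower().split())
    return math.log(len(statements) / containing)

# One row per statement, one column per unique word
tfidf = [{w: tf(w, s) * idf(w) for w in vocabulary} for s in statements]

print(tfidf[0]["john"])             # 0.0 -- appears in every statement
print(round(tfidf[0]["poker"], 3))  # (1/5) * log(3/1) ≈ 0.22
```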

Practical Implementation in Python:

To start, I will first show you the entire code and then explain it.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

stat1 = "John loves to play poker"
stat2 = "John is good boy"
stat3 = "John loves horses"

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([stat1, stat2, stat3])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn versions
dense = vectors.todense()
df = pd.DataFrame(dense, columns=feature_names)
print(df)
print(vectorizer.idf_)
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In the first two lines we import two libraries: 'TfidfVectorizer' from 'sklearn.feature_extraction.text', which helps us create TF-IDF vectors from our text, and 'pandas', which helps us with CSV files and data frames.

stat1= "John loves to play poker"
stat2 = "John is good boy"
stat3 = "John loves horses"

Next we created the three statements we took as an example.

vectorizer = TfidfVectorizer() #object of TfidfVectorizer()
vectors = vectorizer.fit_transform([stat1,stat2,stat3])

Then we created an object of 'TfidfVectorizer()', provided the data to be vectorized, and stored the vectorized data in a variable named vectors.

feature_names = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn versions
dense = vectors.todense()  # returns the data in matrix form

'feature_names' is self-explanatory: we are just getting our feature names, i.e. the unique words, from the vectorizer. 'dense' is the matrix form of our vectors; we will pass it as rows to our data frame.

df = pd.DataFrame(dense, columns=feature_names)

We created a data frame, passing 'dense' as rows and 'feature_names' as columns.

print(df)
print(vectorizer.idf_)

Then we printed the data frame, and also the IDF values for our features ('vectorizer.idf_'), because the data frame only shows the TF-IDF values.


Now you must be wondering why the values we calculated manually and the values returned by the vectorizer are different. Yes, they are different, but they carry the same meaning.

The difference occurs because we computed the standard TF-IDF, while 'TfidfVectorizer' applies some extra steps by default. In my next article I'll describe the differences, and you will understand why our values differ from the vectorizer's.
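For reference, scikit-learn's defaults differ from the standard formula in three ways: TF is the raw count (not divided by the statement length), the IDF is smoothed as ln((1 + n) / (1 + df)) + 1, and each row vector is L2-normalized. This sketch reproduces those steps by hand:

```python
import math

statements = [
    "John loves to play poker",
    "John is good boy",
    "John loves horses",
]
vocabulary = sorted({w.lower() for s in statements for w in s.split()})
n = len(statements)

def smooth_idf(word):
    # scikit-learn's smoothed IDF: ln((1 + n) / (1 + df)) + 1
    df = sum(1 for s in statements if word in s.lower().split())
    return math.log((1 + n) / (1 + df)) + 1

rows = []
for s in statements:
    words = s.lower().split()
    raw = [words.count(w) * smooth_idf(w) for w in vocabulary]  # raw count * idf
    norm = math.sqrt(sum(v * v for v in raw))
    rows.append([v / norm for v in raw])  # L2-normalize each row, as sklearn does

# 'john' now gets a non-zero weight because of the +1 terms in the smoothed IDF
print(round(rows[0][vocabulary.index("john")], 3))  # ≈ 0.298
```

This matches what `TfidfVectorizer()` with default parameters produces for these three statements, which is why the printed data frame above has non-zero entries for 'john'.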

If I was able to explain this topic well and you liked my article, do share it with your friends and follow me; this really motivates me to write more tutorials.


An energetic and motivated iOS developer / Data Science practitioner. Apart from computer science, martial arts interests me the most.

Umair Ishrat Khan
