Naive Bayes Classifier



What is Naive Bayes

Say you get an email like so,

From : njlotterries1234@gmail.com
Subject : You won Lottery
Body : Congratulations !!! You won a lottery of 5 Million dollars. Click here to claim..

What do you think of this? Is it a spam e-mail or not? In all probability it is spam. How do you know? Well, you look at the keywords: words like “lottery”, “viagra”, “free”, “money back”. When you see these words, you tend to classify the message as spam. This is essentially how Naive Bayes works. Let’s formalize our understanding by going a bit deeper.


Bayes Theorem & Conditional Probability

Before we get into “Naive” Bayes, we have to first understand Bayes theorem. To understand Bayes theorem, we have to first understand something called Conditional Probability. What exactly is it ?

Say there is a standard deck of cards and you draw a card at random.

  • What is the probability that it is a red card ?
  • What is the probability that it is a face card, given that it is a red card ?

The second question asks for a conditional probability: the probability of one event, given that another event has already occurred. Bayes theorem is an alternate way to compute the same thing.
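For reference, Bayes theorem for two events can be written as below. The letters A and B are generic labels chosen here for illustration; in the card example, A would be "face card" and B would be "red card".

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```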

Now, let’s calculate each one of these probabilities.

  • Probability of a face card, P(face) = 12/52
  • Probability of a red card, P(red) = 26/52
  • Probability of a red card, given that it is a face card, P(red | face) = 6/12
  • And finally, the probability of a face card, given that it is a red card, P(face | red) = P(red | face) × P(face) / P(red) = 6/26 ≈ 0.23
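The card probabilities above can be checked with a few lines of code. This is only a sketch: the helper names (`prob`, `is_red`, `is_face`) are made up for this illustration. It counts cards directly and then compares the result with what Bayes theorem gives.

```python
from fractions import Fraction

# Build a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]


def is_red(card):
    return card[1] in ("hearts", "diamonds")


def is_face(card):
    return card[0] in ("J", "Q", "K")


def prob(event, given=lambda card: True):
    """P(event | given), computed by counting cards in the deck."""
    pool = [card for card in deck if given(card)]
    hits = [card for card in pool if event(card)]
    return Fraction(len(hits), len(pool))


p_face = prob(is_face)                        # 12/52
p_red = prob(is_red)                          # 26/52
p_red_given_face = prob(is_red, given=is_face)  # 6/12

# Conditional probability by direct counting ...
direct = prob(is_face, given=is_red)
# ... and the same quantity through Bayes theorem.
via_bayes = p_red_given_face * p_face / p_red

print(direct, via_bayes)
```

Both routes give the same fraction, which is the whole point of the theorem: it lets you get P(face | red) from quantities that may be easier to measure.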

What did we achieve here? It looks like we have made things more complicated, right? I agree with you. In fact, in this raw form the formula is not all that useful in machine learning. But there is an assumption that makes it extraordinarily useful in ML. Let’s go back to the email example.
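Applied to the email example, Bayes theorem would read roughly as follows. This is a reconstruction from the surrounding text; "spam" stands for the event that the message is spam.

```latex
P(\text{spam} \mid \text{``You won lottery''})
  = \frac{P(\text{``You won lottery''} \mid \text{spam})\; P(\text{spam})}
         {P(\text{``You won lottery''})}
```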

Again, not very useful. Estimating the probability of the exact phrase “You won lottery” is impractical, because you cannot calculate the probability of occurrence of every phrase or combination of words. The next time around, the subject line might say “Congratulations!! You won lottery”, which is slightly different from “You won lottery”. The point is that you cannot possibly calculate the probability of every combination of words that could be formed from the English dictionary.

Naive Bayes

This is where Bayes theorem becomes Naive. Let’s revisit the formula again.

The assumption is that the probability of the word “You” occurring in the email is independent of the word “Won” occurring. For example,

  • Do you have the paper with you ?
  • We have won the contract

These sentences are completely independent of each other. When we break the event down into its independent component events, the probability can be simplified as follows.
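Under that independence assumption, the probability of the whole phrase factors into a product of per-word probabilities. The following is a reconstruction consistent with the text above.

```latex
P(\text{``You won lottery''}) = P(\text{You}) \cdot P(\text{won}) \cdot P(\text{lottery})
```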

This is actually a “Naive” assumption, because in reality there is some level of overlap. Meaning, when you mention the word “lottery”, you almost always use the word “win” or some variant like “won” or “winning”. However, this is where ML is lucky. Even with the naive assumption, results are pretty good with text classification in real life. Let’s apply the simplification to Bayes theorem once again.
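With the simplification applied, Bayes theorem for the email example takes roughly the form below. This reconstruction assumes the word-level factorization is applied both to the phrase probability within spam (the numerator) and to the overall phrase probability (the denominator).

```latex
P(\text{spam} \mid \text{``You won lottery''})
  = \frac{P(\text{You} \mid \text{spam})\, P(\text{won} \mid \text{spam})\, P(\text{lottery} \mid \text{spam})\; P(\text{spam})}
         {P(\text{You})\, P(\text{won})\, P(\text{lottery})}
```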

With a bit of naivety, this formula became so much more useful. In fact, it becomes so useful that Naive Bayes is one of the most common choices for text classification tasks. Let’s explore this example with some rough data, just believable, made-up data.

  • Probability of “You won lottery” being spam.
  • Probability of “You won lottery” being NOT spam.

So, the probability of this phrase not being spam is 1.51 %.

Pretty effective, right? Especially given the simplification. Calculating the probability of the individual words is easy. The heart of this algorithm is that, given any sentence, it can break the sentence down into its components (words), and based on the “spamminess” of each of the words, the entire sentence can be classified as spam or not.
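The idea of scoring a sentence by the "spamminess" of its words can be sketched as follows. All the probabilities in this snippet are hypothetical, made up purely for illustration, and the function names are not from any library. Instead of estimating the denominator, the sketch normalises the spam and non-spam scores so that they add up to 1, which is a common shortcut.

```python
from math import prod

# Hypothetical priors and per-word probabilities (made-up numbers).
p_spam, p_ham = 0.2, 0.8
p_word_given_spam = {"you": 0.6, "won": 0.3, "lottery": 0.25}
p_word_given_ham = {"you": 0.5, "won": 0.02, "lottery": 0.005}


def class_score(words, prior, likelihoods):
    """Prior times the product of per-word probabilities (the naive assumption)."""
    return prior * prod(likelihoods[word] for word in words)


def spam_probability(message):
    """Probability that the message is spam, normalised over the two classes."""
    words = message.lower().split()
    spam = class_score(words, p_spam, p_word_given_spam)
    ham = class_score(words, p_ham, p_word_given_ham)
    return spam / (spam + ham)


p = spam_probability("You won lottery")
print(f"P(spam) = {p:.4f}, P(not spam) = {1 - p:.4f}")
```

With these invented numbers the phrase comes out as almost certainly spam, mostly because "won" and "lottery" are far more likely in spam than in normal mail.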

All we are trying to do in Naive Bayes, is to break down a complicated problem into its components. Once the component is classified, essentially the bigger piece is classified as well.

It is like solving a jigsaw puzzle. How do you typically solve one? You look for smaller puzzles to solve. Say this is a picture of a car: you start to look for smaller components of the car, like a tire or a windshield, and solve for each of these separately. Once you have the pieces figured out, all you have to do is put them in order. Naive Bayes works more or less like this.

Classify fruits based on Characteristics

Now that we understand the basics of Naive Bayes, let’s create a simple dataset and solve it in Excel. The purpose of this exercise is to get familiar with the Naive Bayes calculation using a small dataset. This will solidify our understanding a bit further before we dive into more complicated examples.

Solve the fruits dataset in Excel

The probability of each of the characteristics – round, large, small etc, can be calculated as below.

Now, let’s move on to the individual conditional probabilities. For example, what is the probability that a fruit is round, given that it is an apple ? In all the cases of Apple, the fruit is always round.

However, what is the probability that a fruit is red, given that it is an apple? One out of three apples is red.

Like that, we keep calculating the conditional probabilities of all the individual characteristics. Think of this like calculating the probability of each individual word being spam or not.
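The same conditional probabilities can be computed programmatically. The sketch below assumes the nine-row fruits table printed in the R section of this article; the names `FEATURES`, `ROWS` and `conditional` are chosen for this illustration only.

```python
from fractions import Fraction

FEATURES = ["round", "large", "small", "red", "green", "black",
            "golden", "yellow", "smooth", "rough"]

# One (fruit, yes/no values) pair per row of the fruits table.
ROWS = [
    ("apple", "yes yes no yes no no no no yes no"),
    ("apple", "yes yes no no yes no no no yes no"),
    ("apple", "yes yes no no no no yes no yes no"),
    ("grape", "yes no yes yes no no no no yes no"),
    ("grape", "yes no yes no yes no no no yes no"),
    ("grape", "yes no yes no no yes no no yes no"),
    ("melon", "yes yes no no yes no no no yes no"),
    ("melon", "yes yes no no no no yes no no yes"),
    ("melon", "yes yes no no no no no yes no yes"),
]
fruits = [(label, dict(zip(FEATURES, values.split()))) for label, values in ROWS]


def conditional(feature, label):
    """P(feature = yes | fruit = label), estimated by counting rows."""
    rows = [features for fruit, features in fruits if fruit == label]
    hits = sum(1 for features in rows if features[feature] == "yes")
    return Fraction(hits, len(rows))


print("P(round | apple) =", conditional("round", "apple"))
print("P(red   | apple) =", conditional("red", "apple"))
print("P(large | grape) =", conditional("large", "grape"))
```

This mirrors what the spreadsheet does cell by cell: filter the rows for one fruit, then count how often the characteristic is "yes".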

Time to test our data. Let’s say, we want to calculate the probability of a fruit being an Apple, if it is round and large. All we have to do is plug the numbers.

What is the probability that a fruit is an apple, if it is round, large and smooth ?
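As a rough cross-check, the question above can be answered in code. This sketch assumes the nine-row fruits table shown in the R section and normalises the per-class scores so they sum to 1. The spreadsheet figures quoted in this section may come from a slightly different table or calculation, so they need not match exactly.

```python
from fractions import Fraction

# (fruit, round, large, smooth) for the nine rows; 1 = yes, 0 = no.
ROWS = (
    [("apple", 1, 1, 1)] * 3
    + [("grape", 1, 0, 1)] * 3
    + [("melon", 1, 1, 1), ("melon", 1, 1, 0), ("melon", 1, 1, 0)]
)


def class_score(label, observed):
    """Prior of the class times the product of its per-feature probabilities."""
    rows = [row[1:] for row in ROWS if row[0] == label]
    score = Fraction(len(rows), len(ROWS))  # prior P(label)
    for position, value in enumerate(observed):
        matches = sum(1 for row in rows if row[position] == value)
        score *= Fraction(matches, len(rows))  # P(feature value | label)
    return score


def posterior(label, observed):
    """Class score normalised over all classes."""
    labels = sorted({row[0] for row in ROWS})
    total = sum(class_score(other, observed) for other in labels)
    return class_score(label, observed) / total


# Round, large and smooth.
p_apple = posterior("apple", (1, 1, 1))
print("P(apple | round, large, smooth) =", p_apple)
```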

Based on our little dataset, we are not doing too badly. Let’s do the opposite now. What is the probability of a fruit being a grape, given that it is round, large and smooth?

Makes sense, right? A grape is never “large”. Hence the probability of a fruit being a grape if it is “large” is relatively small: 16 %.

Solve the fruits dataset in R

library(e1071)

fruits = read.csv("./data/fruits.csv")

fruits
fruit	round	large	small	red	green	black	golden	yellow	smooth	rough
<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>
apple	yes	yes	no	yes	no	no	no	no	yes	no
apple	yes	yes	no	no	yes	no	no	no	yes	no
apple	yes	yes	no	no	no	no	yes	no	yes	no
grape	yes	no	yes	yes	no	no	no	no	yes	no
grape	yes	no	yes	no	yes	no	no	no	yes	no
grape	yes	no	yes	no	no	yes	no	no	yes	no
melon	yes	yes	no	no	yes	no	no	no	yes	no
melon	yes	yes	no	no	no	no	yes	no	no	yes
melon	yes	yes	no	no	no	no	no	yes	no	yes
model = naiveBayes(fruit ~ . , data = fruits)

pred = predict ( model , fruits[,2:11 ])
pred
apple
apple
apple
grape
grape
grape
apple
melon
melon
table(pred, fruits[,1])
pred    apple grape melon
  apple     3     0     1
  grape     0     3     0
  melon     0     0     2

That’s not bad, given such a small set of characteristics. Let’s actually get the confusion matrix to get the accuracy percentage.

library(caret)

cm = confusionMatrix(pred,as.factor(fruits[,1]))
cm

Confusion Matrix and Statistics

          Reference
Prediction apple grape melon
     apple     3     0     1
     grape     0     3     0
     melon     0     0     2

Overall Statistics
                                          
               Accuracy : 0.8889          
                 95% CI : (0.5175, 0.9972)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 0.0009653       
                                          
                  Kappa : 0.8333          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: apple Class: grape Class: melon
Sensitivity                1.0000       1.0000       0.6667
Specificity                0.8333       1.0000       1.0000
Pos Pred Value             0.7500       1.0000       1.0000
Neg Pred Value             1.0000       1.0000       0.8571
Prevalence                 0.3333       0.3333       0.3333
Detection Rate             0.3333       0.3333       0.2222
Detection Prevalence       0.4444       0.3333       0.2222
Balanced Accuracy          0.9167       1.0000       0.8333
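The headline numbers in this output can be reproduced by hand from the 3×3 table. The sketch below copies the table from the output above (rows are predictions, columns are the true classes) and recomputes the accuracy and the per-class sensitivity; the variable and function names are made up for this illustration.

```python
# Rows are predictions, columns are the true (reference) classes.
classes = ["apple", "grape", "melon"]
matrix = [
    [3, 0, 1],  # predicted apple
    [0, 3, 0],  # predicted grape
    [0, 0, 2],  # predicted melon
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(classes)))
accuracy = correct / total


def sensitivity(k):
    """Share of the true members of class k that were predicted as class k."""
    actual_k = sum(row[k] for row in matrix)  # column sum
    return matrix[k][k] / actual_k


print("Accuracy:", round(accuracy, 4))
for k, name in enumerate(classes):
    print("Sensitivity", name, round(sensitivity(k), 4))
```

Eight of the nine fruits sit on the diagonal, which is where the 0.8889 comes from.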

That’s an accuracy of almost 90%. We are not very far off, given that our dataset is pretty small. The one place where we went wrong is in classifying a melon as an apple. If we compare the predictions with the actuals, we can see that we went wrong with the 7th entry (a melon being misclassified as an apple).

predict = pred
actual = fruits[,1]

data.frame(predict,actual)

predict	actual
<fct>	<fct>
apple	apple
apple	apple
apple	apple
grape	grape
grape	grape
grape	grape
apple	melon
melon	melon
melon	melon

Let’s check out the actual entry.

As you can see, the entry for the melon (watermelon) coincides in its data points with the green apple. How could this happen? This is because of an oversimplification with regard to size. We only have 2 sizes, small and large. However, both the apple and the watermelon are large (and round and smooth). And that’s why the NB algorithm got it wrong. If we had an extra size characteristic (say XL), that would have solved this problem.


Classify messages as Spam

Now that we understand the basics of Naive Bayes, along with a simple example in Excel and R, we can proceed to solve the problem that we started with: classifying a message as spam or not.


Step 1 – Get the dataset

There is a simple SMS (text message) dataset available at Kaggle or in the UCI ML datasets. You can also download the file from Ajay Tech’s GitHub page. Download the zip file and open it in Excel in tab-delimited format, giving the columns names if they are not available already. Each of these messages has been classified as either spam or ham (ham is just a technical word for “non-spam”).

Step 2 – Read the dataset into R

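# fileEncoding tells R how the file is encoded on disk;
# stringsAsFactors = TRUE keeps the class column a factor (the default changed in R 4.0)
data = read.csv("./data/spam.csv", fileEncoding = "ISO-8859-1", stringsAsFactors = TRUE)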
head(data)
class	message	X	X.1	X.2
<fct>	<fct>	<fct>	<fct>	<fct>
1	ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...			
2	ham	Ok lar... Joking wif u oni...			
3	spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's			
4	ham	U dun say so early hor... U c already then say...			
5	ham	Nah I don't think he goes to usf, he lives around here though			
6	spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv			
data = data[,c(1,2)]
head(data)
class	message
<fct>	<fct>
1	ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
2	ham	Ok lar... Joking wif u oni...
3	spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
4	ham	U dun say so early hor... U c already then say...
5	ham	Nah I don't think he goes to usf, he lives around here though
6	spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

Step 3 – Simple EDA

  • How many messages are there in the dataset ?
nrow(data)
5572
  • Out of them, count the occurrences of spam vs ham (non-spam)
summary(data$class)

ham
4825
spam
747
  • What percentage of this is spam ?
round(summary(data$class)["spam"] / nrow(data) * 100, 2)
spam: 13.41

About 13.4 % of the messages are spam (747 out of 5572).


Step 4 – Feature Engineering

Naive Bayes (or, for that matter, most ML algorithms) cannot work on raw text. The messages first have to be converted into a table of features, with one column per word. In order to do that, we have to use some techniques from natural language processing.

  • Tokenize the message (into words) and create a sparse matrix

This process basically splits the sentence (message) into its individual words. Let’s see a sample before we tokenize the entire dataset.
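A tiny sample makes the idea concrete. The three messages below are hypothetical, and the sketch uses plain whitespace splitting rather than the tm package, so it only illustrates what a document-term matrix contains: one row per message, one column per word, and a count in each cell.

```python
from collections import Counter

# Hypothetical sample messages, for illustration only.
messages = [
    "You won a lottery",
    "Click here to claim the lottery",
    "Are you free for lunch",
]

# Tokenize: lower-case each message and split it on whitespace.
tokenized = [message.lower().split() for message in messages]

# The vocabulary is the sorted set of all distinct words.
vocabulary = sorted({word for doc in tokenized for word in doc})

# Document-term matrix: one row per message, one column per vocabulary word.
counts = [Counter(doc) for doc in tokenized]
dtm = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]

print(vocabulary)
for row in dtm:
    print(row)
```

Most cells are zero, because any single message uses only a handful of the words in the vocabulary. That is why such matrices are described as sparse.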

Now, let’s do the same on our real messages dataset.

library(tm)
message_corpus = Corpus(VectorSource(data$message))
print ( message_corpus)
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 5572

message_dtm <- DocumentTermMatrix(message_corpus)

The document term matrix (DTM) is stored in a sparse format, so we can’t just print it out using indices. Instead, we use the inspect() function.

inspect(message_dtm[1:10,1:20])

<<DocumentTermMatrix (documents: 10, terms: 20)>>
Non-/sparse entries: 21/179
Sparsity           : 90%
Maximal term length: 19
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs amore available buffet... bugis cine crazy.. got great jurong there
  1      1         1         1     1    1       1   1     1      1     1
  10     0         0         0     0    0       0   0     0      0     0
  2      0         0         0     0    0       0   0     0      0     0
  3      0         0         0     0    0       0   0     0      0     0
  4      0         0         0     0    0       0   0     0      0     0
  5      0         0         0     0    0       0   0     0      0     0
  6      0         0         0     0    0       0   0     0      0     1
  7      0         0         0     0    0       0   0     0      0     0
  8      0         0         0     0    0       0   0     0      0     0
  9      0         0         0     0    0       0   0     0      0     0

Step 5 – Train/Test data split

Before we use the DTM as-is, we have to convert the counts to factors, like a Yes and a No. This is because Naive Bayes works well with factors. Let’s write a small function that converts all values greater than 0 to a Yes and everything else to a No.

counts_to_factor = function(x){
  x = ifelse(x > 0, 1, 0)
  x = factor(x, levels = c(0,1), labels = c("No", "Yes"))
  return (x)
}


Before we apply this function to the DTM, let’s split the data into training and test datasets.

# Pick 80% of the row numbers at random for training; the remaining rows are for testing
index = sample(1:nrow(data), nrow(data) * .8)
train = index
test  = setdiff(1:nrow(data), index)

msg_cor_train      = Corpus(VectorSource(data[train,]$message))
msg_train_dtm      = DocumentTermMatrix(msg_cor_train)
msg_train_dtm      = apply(msg_train_dtm, MARGIN = 2, counts_to_factor)
msg_class_train    = data$class[train]


msg_cor_test       = Corpus(VectorSource(data[test,]$message))
msg_test_dtm       = DocumentTermMatrix(msg_cor_test)
msg_test_dtm       = apply(msg_test_dtm, MARGIN = 2, counts_to_factor)
msg_class_test     = data$class[test]
head(msg_train_dtm[,1:5])

2wks	87077	87077:	club	free
1	Yes	Yes	Yes	Yes	Yes
2	No	No	No	No	No
3	No	No	No	No	No
4	No	No	No	No	No
5	No	No	No	No	No
6	No	No	No	No	No
msg_train_df = as.data.frame(as.matrix(msg_train_dtm))
msg_test_df  = as.data.frame(as.matrix(msg_test_dtm))

head(msg_train_df)
been	curtsey?	have	practising	you	your	off.	pissed	pretty	whatever,	...	not..tel	clearer..	sections	above	da..al	coins	factory	chart	heroes,	tips
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	...	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	1	1	1	1	1	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	1	1	1	1	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
5	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
6	0	0	0	0	3	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

Step 6 – Model the data

library(e1071)
model = naiveBayes(msg_train_dtm, msg_class_train)
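
To see what the training step does conceptually, here is a from-scratch sketch on Yes/No word features. It is not the e1071 implementation: the tiny training set is hypothetical, the function names are made up, and the add-one (Laplace) smoothing controlled by `alpha` is an assumption added so that unseen words never produce a zero probability.

```python
from math import log


def train(docs, labels, alpha=1.0):
    """docs is a list of word sets; returns priors and per-class word probabilities."""
    vocab = set().union(*docs)
    model = {"vocab": vocab, "prior": {}, "present": {}}
    for label in sorted(set(labels)):
        class_docs = [doc for doc, doc_label in zip(docs, labels) if doc_label == label]
        model["prior"][label] = len(class_docs) / len(docs)
        # Smoothed P(word is present | class); the 2 reflects the two outcomes Yes/No.
        model["present"][label] = {
            word: (sum(word in doc for doc in class_docs) + alpha)
            / (len(class_docs) + 2 * alpha)
            for word in vocab
        }
    return model


def predict(model, doc):
    """Return the class with the highest log-score for a set of words."""
    best_label, best_score = None, float("-inf")
    for label, prior in model["prior"].items():
        score = log(prior)
        for word in model["vocab"]:
            p = model["present"][label][word]
            score += log(p if word in doc else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training messages, already tokenized into word sets.
train_docs = [
    {"win", "lottery", "now"},
    {"free", "lottery", "claim"},
    {"lunch", "today"},
    {"see", "you", "today"},
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train(train_docs, train_labels)
print(predict(model, {"free", "lottery"}))
print(predict(model, {"lunch", "you"}))
```

Working with logarithms instead of multiplying many small probabilities is a design choice that avoids numerical underflow once the vocabulary runs to thousands of words.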

Step 7 – Evaluate the model.

pred = predict(model, msg_test_dtm)
table(msg_class_test, pred)

pred
msg_class_test ham spam
          ham  950   13
          spam  18  134 


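Measure the accuracy using the confusion matrix from the caret library. The output below is from a run with a different random split, so its counts differ from the table above.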

library(caret)

cm = confusionMatrix(pred,msg_class_test)
cm

Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  2385   46
      spam   12  343
                                          
               Accuracy : 0.9792          
                 95% CI : (0.9732, 0.9842)
    No Information Rate : 0.8604          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9101          
                                          
 Mcnemar's Test P-Value : 1.47e-05        
                                          
            Sensitivity : 0.9950          
            Specificity : 0.8817          
         Pos Pred Value : 0.9811          
         Neg Pred Value : 0.9662          
             Prevalence : 0.8604          
         Detection Rate : 0.8561          
   Detection Prevalence : 0.8726          
      Balanced Accuracy : 0.9384          
                                          
       'Positive' Class : ham             
                                   

There is scope for a ton of optimization here like

  • convert all characters to lower case
  • remove punctuation
  • remove stop words etc

But that is a subject for another day. Here we will just focus on learning the Naive Bayes algorithm.
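
Purely as a pointer, the three clean-up steps listed above could look like the sketch below. The stop-word list is a tiny hypothetical one, and the function name `clean` is made up; a real project would use a fuller list from an NLP library.

```python
import string

# A tiny, hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "is", "you", "and"}


def clean(message):
    """Lower-case the text, strip punctuation, and drop stop words."""
    lowered = message.lower()
    no_punctuation = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punctuation.split() if word not in STOP_WORDS]


tokens = clean("Congratulations !!! You won a lottery of 5 Million dollars.")
print(tokens)
```

Steps like these shrink the vocabulary, so “Lottery”, “lottery” and “lottery.” all end up in the same column of the document term matrix.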

Challenge

Let’s solve another problem with Naive Bayes. Load up a dataset called house-votes-84.csv from the data folder (the same data also ships with the mlbench package as HouseVotes84). The data set should look like this.

These are the results from Congressmen in the US, voting Yes (for) or No (against) on 16 different issues. Instead of names, the class column identifies each congressman as either a Republican or a Democrat.

Task – Identify each congressman as either a Democrat or a Republican based on their voting pattern.

Solution – This problem is almost identical to the fruits data we started with at the beginning of learning Naive Bayes.

# 1. Import the dataset
library(mlbench)
library(e1071)   # provides naiveBayes()

data(HouseVotes84, package = "mlbench")

data = HouseVotes84
head(data)

# 2. train/test split
index = sample(1:nrow(data),nrow(data)*.8)
train = data[index,]
test = data[-index,]

# 3. model the data
model = naiveBayes(Class ~ ., data = train)

# 4. predict the data
pred = predict(model, test)

# 5. Accuracy
table(pred, test$Class)

library(caret)

cm = confusionMatrix(pred,test$Class)
print (cm)
Class	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16
<fct&gt;	<fct&gt;	<fct&gt;	<fct&gt;	<fct&gt;	<fct&gt;	<fct&gt;	<fct&gt;	<fct&gt;	<fct&gt;	<fct&gt;	<fct&gt;	<fct&gt;	<fct&gt;	<fct&gt;	<fct&gt;	<fct&gt;
1	republican	n	y	n	y	y	y	n	n	n	y	NA	y	y	y	n	y
2	republican	n	y	n	y	y	y	n	n	n	n	n	y	y	y	n	NA
3	democrat	NA	y	y	NA	y	y	n	n	n	n	y	n	y	y	n	n
4	democrat	n	y	y	n	NA	y	n	n	n	n	y	n	y	n	n	y
5	democrat	y	y	y	n	y	y	n	n	n	n	y	NA	y	y	y	y
6	democrat	n	y	y	n	y	y	n	n	n	n	n	n	y	y	y	y
pred         democrat republican
  democrat        121          5
  republican       21         71
Confusion Matrix and Statistics

            Reference
Prediction   democrat republican
  democrat        121          5
  republican       21         71
                                          
               Accuracy : 0.8807          
                 95% CI : (0.8301, 0.9206)
    No Information Rate : 0.6514          
    P-Value [Acc > NIR] : 1.002e-14       
                                          
                  Kappa : 0.7496          
                                          
 Mcnemar's Test P-Value : 0.003264        
                                          
            Sensitivity : 0.8521          
            Specificity : 0.9342          
         Pos Pred Value : 0.9603          
         Neg Pred Value : 0.7717          
             Prevalence : 0.6514          
         Detection Rate : 0.5550          
   Detection Prevalence : 0.5780          
      Balanced Accuracy : 0.8932          
                                          
       'Positive' Class : democrat    

Challenge – IMDB review Sentiment Analysis

Similar to the SPAM/HAM problem, we can also predict if an IMDB review is positive or negative based on the words in it.

# step 1 - Read the data file
library("xlsx")
data = read.xlsx("./data/imdb-reviews-sentiment.xlsx", sheetIndex = 1,  header=TRUE)

# step 2 - Create a DTM based on the text data
library(tm)
message_corpus = Corpus(VectorSource(data$review))
message_dtm <- DocumentTermMatrix(message_corpus)

# step 3 - function to convert the integers to "Yes" or "No" factors in the DTM
counts_to_factor = function(x){
  x = ifelse(x > 0, 1, 0)
  x = factor(x, levels = c(0,1), labels = c("No", "Yes"))
  return (x)
}

# step 4 - Split the DTMs to Train and test data and convert the integers to factors for "Yes" and "No"
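# train and test hold row numbers, so they can be used to index both the reviews and the labels
index = sample(1:nrow(data), nrow(data) * .8)
train = index
test  = setdiff(1:nrow(data), index)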

msg_cor_train      = Corpus(VectorSource(data[train,]$review))
msg_train_dtm      = DocumentTermMatrix(msg_cor_train)
msg_train_dtm      = apply(msg_train_dtm, MARGIN = 2, counts_to_factor)
msg_class_train    = data$sentiment[train]


msg_cor_test       = Corpus(VectorSource(data[test,]$review))
msg_test_dtm       = DocumentTermMatrix(msg_cor_test)
msg_test_dtm       = apply(msg_test_dtm, MARGIN = 2, counts_to_factor)
msg_class_test     = data$sentiment[test]

# step 5 - model the data using Naive Bayes
library(e1071)
model = naiveBayes(msg_train_dtm, msg_class_train)

# step 6 - predict the results from the model using the test data
pred = predict(model, msg_test_dtm)

# step 7 - get the accuracy from confusion matrix.
library(caret)
cm = confusionMatrix(pred,data$sentiment[test])
print (cm)

Confusion Matrix and Statistics

          Reference
Prediction negative positive
  negative        0        0
  positive        0     2000
                                     
               Accuracy : 1          
                 95% CI : (0.9982, 1)
    No Information Rate : 1          
    P-Value [Acc > NIR] : 1          
                                     
                  Kappa : NaN        
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : NA         
            Specificity :  1         
         Pos Pred Value : NA         
         Neg Pred Value : NA         
             Prevalence :  0         
         Detection Rate :  0         
   Detection Prevalence :  0         
      Balanced Accuracy : NA         
                                     
       'Positive' Class : negative   

Note that in this run every review in the test split is labelled positive (the prevalence of the negative class is 0), so the accuracy of 1 is not meaningful. Make sure the test split contains both classes before trusting the score.
                                     

Naive Bayes on continuous variables

So far, we have seen Naive Bayes work on factor variables. Does NB ever work on continuous variables? Yes, it does, of course with a discretized version of those variables (think of binning a normal distribution). The key assumption there would be that the variable has a normal distribution. For example, think of the iris dataset: is the “Sepal length” of the setosa species normally distributed? Let’s find out.

from sklearn import datasets

iris = datasets.load_iris()

iris_data    = iris.data
iris_target  = iris.target

# matplotlib does not have a built-in kernel density plot,
import matplotlib.pyplot as plt
# so we are using seaborn instead
import seaborn as sns
%matplotlib inline

# You can check from these curves that the sepal data is roughly normally distributed, but
# the petal data is not. Try them one by one.
# (histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions)

sns.histplot(iris_data[:,0], kde=True)
sns.histplot(iris_data[:,1], kde=True)
sns.histplot(iris_data[:,2], kde=True)
sns.histplot(iris_data[:,3], kde=True)

Only the sepal data is normally distributed. Ideally, we should just be using the sepal data (Sepal Length and Sepal Width). However, let’s just use all four features and see what happens. As an exercise, try using just the sepal data and check the accuracy.

# 1. train/test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris_data , iris_target, test_size=0.2)  

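# 2. Naive Bayes modeling
# MultinomialNB is designed for count-like features; GaussianNB (same module) is the
# variant that models each feature with a normal distribution. Try both and compare.
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB().fit(X_train, y_train)  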

# 3. Predict data
y_predict = model.predict(X_test)

# 4. Create a confusion matrix to check accuracy
import pandas as pd  # needed for pd.crosstab

print ( pd.crosstab(y_test, y_predict,rownames=['Actual'], colnames=['Predicted'],  margins=True) )

# 5. Print the accuracy score
from sklearn.metrics import confusion_matrix, accuracy_score

print ( confusion_matrix(y_test, y_predict) )
print ( accuracy_score(y_test,y_predict))
Predicted  0   1   2  All
Actual                   
0          6   0   0    6
1          0  11   1   12
2          0   1  11   12
All        6  12  12   30
[[ 6  0  0]
 [ 0 11  1]
 [ 0  1 11]]
0.9333333333333333


That’s pretty accurate as well, isn’t it? In fact, even though one of the assumptions (that all the variables are independent of each other) does not hold here, Naive Bayes still does well and is often competitive with other classification algorithms.

  • The priors (the probability of a “Setosa” or a “Virginica” occurring) are each 0.33 (a third), which we know.
  • How about the conditional probabilities? This is where it gets tricky for continuous variables. You cannot have a conditional probability for each of the values (as the number of values can be infinite). So, in the case of a normal distribution, an approximation is applied based on the following formula.
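
The formula in question is the normal (Gaussian) density, written here in its standard form, where x is the observed feature value and c is the class:

```latex
P(x \mid c) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,
              \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)
```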

where μ is the mean and σ² is the variance (σ being the standard deviation) of the feature within the class
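
In code, the density can be evaluated as in the sketch below. The sample values are illustrative sepal lengths, not read from the dataset, and the function name `gaussian_pdf` is made up for this example. The mean and variance are estimated from the samples of one class, and the density at a new value plays the role of the conditional probability.

```python
from math import exp, pi, sqrt
from statistics import mean, pvariance


def gaussian_pdf(x, mu, var):
    """Normal density with mean mu and variance var, evaluated at x."""
    return exp(-((x - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)


# Illustrative sepal-length samples (cm) for a single species.
samples = [5.1, 4.9, 4.7, 5.0, 5.4]
mu = mean(samples)
var = pvariance(samples)  # population variance of the samples

# Density of a new observation under this class.
likelihood = gaussian_pdf(5.0, mu, var)
print(mu, var, likelihood)
```

A Gaussian Naive Bayes model repeats this for every feature and every class, multiplies the densities with the class prior, and picks the class with the largest product.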
