# Naive Bayes Classifier

## Contents

- What is Naive Bayes
- Bayes Theorem & Conditional Probability
- Naive Bayes Theorem
- Example – Classify Fruits based on characteristics
- Example – Classify Messages as Spam or Ham
- Challenge – Classify Congressman as Democrat or Republican
- Naive Bayes on Continuous variable.

### What is Naive Bayes

Say you get an email like so,

**From** : njlotterries1234@gmail.com**Subject** : You won Lottery**Body** : Congratulations !!! You won a lottery of 5 Million dollars. Click here to claim..

What do you think of this ? Is this a spam e-mail or not ? In all probability this is spam. How do you know it ? Well, you look at the index words – words like “lottery” , “viagra” , “free”, “money back”. When you see these words, generally you tend to classify that message as spam. This is exactly how Naive Bayes works. Let’s formalize our understanding a bit by going a bit deeper.

### Bayes Theorem & Conditional Probability

Before we get into “Naive” Bayes, we have to first understand **Bayes** theorem. To understand Bayes theorem, we have to first understand something called *Conditional Probability*. What exactly is it ?

Say there is a standard deck of cards and you draw a card at random.

- What is the probability that it is a red card ?

- What is the probability that it is a face card, given that it is a red card ?

This is called conditional probability. Bayes theorem is an alternate way to compute the same thing.

Now, let’s calculate each one of these probabilities.

- Probability of face card P(A)

- Probability of a red card

- Probability of a red card , given it is a face card.

- And finally, we calculate the probability of a face card, given its a red card P ( face | red )

What did we achieve here ? Looks like we have made things more complicated, right ? I agree with you. In fact, this formula is not all that useful in machine learning. But there is an assumption that makes this formula extraordinarily useful in ML. Let’s go back to the email example.

Again, not very useful. To calculate the probability of “You won lottery” is very arbitrary. You cannot calculate the probability of occurrence of all different phrases or combination of words. The next time around / the subject line might say “Congratulations!! You won lottery” -which is slightly different from ‘ ‘You won lottery” . Point being, you cannot possibly Calculate all different combination of words that could result from the use of all different words in the English dictionary.

### Naive Bayes

This is where the Bayes theorem becomes **Naive** . Let’s revisit the formula again.

The probability of the word “You” occurring in the email is independent of the Lord ‘ “Won” occurring. eg.,

- Do you have the paper with
**you**? - we have won the
**contract**

These Sentences are completely independent. When we break down the event into the respective independent events, probability can be Simplified as follows.

This is actually a “Naive” assumption – because in reality, there is some level of overlap. Meaning, when you mention the word “lottery”, you almost always use the word “win” or some variant-like ”won'” or “winning” . However, this is where ML is lucky. Even with the naive assumption, results are pretty good with text classification in real life. Let’s apply the simplification to the Bayes theorem once again.

With a bit of naivety, this formula became so much more useful. In fact, it makes it so useful that Naive Bayes is almost exclusively used for most text classification tasks. Let’s explore this example with some rough data – just believable, made-up data.

- Probability of “You won lottery” being spam.

- Probability of “You won spam” as NOT spam.

So, the probability of this phrase not being spam is 1.51.

Pretty effective, right? Especially given the simplification. Calculating the probability of the individual words is easy. The heart of this algorithm is, given any sentence, this algorithm can break it down into it’s components (words) and based on the “spamminess” of each of the words, the entire sentence can be classified as spam or not.

All we are trying to do in Naive Bayes, is to break down a complicated problem into its components. Once the component is classified, essentially the bigger piece is classified as well.

It is like solving a jigsaw puzzle. How do you solve one typically ? You look for smaller puzzles to solve. Say this is a picture of a car – you start to look for smaller components of the car, like a tire, a windshield and solve for each of these separately. Once you got the pieces figured out, all you have to do is to put them in order. Naive Bayes works more or less like this.

### Classify fruits based on Characteristics

Now that we understand the basics of Naive Bayes, let’s create a simple dataset and solve it in excel. The purpose behind this exercise is to get familiar with Naive Bayes calculation using a smaller dataset. This is going to solidify our understanding a bit further, before we dive into more complicated examples.

Solve the fruits dataset in excel

The probability of each of the characteristics – round, large, small etc, can be calculated as below.

Now, let’s move on to the individual conditional probabilities. For example, what is the probability that a fruit is round, given that it is an apple ? In all the cases of Apple, the fruit is always round.

However, what is the probability that a fruit is red, given that its an apple ? one out of three apples are red.

Like that, we keep calculating the conditional probabilities of all the individual characteristics. Think of this like calculating the probability of each individual word being spam or not.

Time to test our data. Let’s say, we want to calculate the probability of a fruit being an Apple, if it is round and large. All we have to do is plug the numbers.

What is the probability that a fruit is an apple, if it is round, large and smooth ?

Based on our little dataset, we are not doing too bad. let’s do the opposite now. What is the probability of a fruit being a grape, given that it is round, large and smooth ?

Makes sense, right ? grape is never “large”. Hence the probability of a fruit being a grape if it is “large” is relatively small – 16 %.

#### Solve the fruits dataset in Python

library(e1071) fruits = read.csv("./data/fruits.csv")

fruits

fruit round large small red green black golden yellow smooth rough <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> apple yes yes no yes no no no no yes no apple yes yes no no yes no no no yes no apple yes yes no no no no yes no yes no grape yes no yes yes no no no no yes no grape yes no yes no yes no no no yes no grape yes no yes no no yes no no yes no melon yes yes no no yes no no no yes no melon yes yes no no no no yes no no yes melon yes yes no no no no no yes no yes

model = naiveBayes(fruit ~ . , data = fruits)

pred = predict ( model , fruits[,2:11 ])

pred

apple apple apple grape grape grape apple melon melon

table(pred, fruits[,1])

pred apple grape melon apple 3 0 1 grape 0 3 0 melon 0 0 2

That’s not bad, given such a small set of characteristics. Let’s actually get the confusion matrix to get the accuracy percentage.

library(caret) cm = confusionMatrix(pred,as.factor(fruits[,1])) cm

Confusion Matrix and Statistics Reference Prediction apple grape melon apple 3 0 1 grape 0 3 0 melon 0 0 2 Overall Statistics Accuracy : 0.8889 95% CI : (0.5175, 0.9972) No Information Rate : 0.3333 P-Value [Acc > NIR] : 0.0009653 Kappa : 0.8333 Mcnemar's Test P-Value : NA Statistics by Class: Class: apple Class: grape Class: melon Sensitivity 1.0000 1.0000 0.6667 Specificity 0.8333 1.0000 1.0000 Pos Pred Value 0.7500 1.0000 1.0000 Neg Pred Value 1.0000 1.0000 0.8571 Prevalence 0.3333 0.3333 0.3333 Detection Rate 0.3333 0.3333 0.2222 Detection Prevalence 0.4444 0.3333 0.2222 Balanced Accuracy 0.9167 1.0000 0.8333

That’s an accuracy of almost 90%. We are not very far off, given our dataset is pretty small. The one place where we went wrong is in classify a melon wrongly as an apple. If we compared the predictions vs the actuals, we can see that we went wrong with the 7th entry ( a melon being mis-classified as an apple ).

predict = pred actual = fruits[,1] data.frame(predict,actual)

predict actual <fct> <fct> apple apple apple apple apple apple grape grape grape grape grape grape apple melon melon melon melon melon

Let’s check out the actual entry.

As you can see, the entry for melon ( watermelon ) coincides in its data points to the green apple. How could this happen ? This is because of an oversimplification with regards to size. We only have 2 sizes – small and large. However, both the apple and water melon are large ( and round and smooth ). And that’s why the NB algorithm got it wrong. If we had an extra size characteristic ( say XL ), that would have solved this problem.

### Classify messages as Spam

Now that we understood the basics of Naive Bayes along with a simple example in excel and R, we can proceed to solve the problem that we started with – To classify a message as spam or not.

#### **Step 1** – Get the dataset

There is a simple SMS ( text message ) dataset available at kaggle or at the UCI ML datesets. You can also download the file from Ajay Tech’s github page. Download the zip file and open it in excel as a tab delimited format. Each of these messages have been classified as either spam or ham ( ham is just a technical word for “non-spam” ). Open the dataset in excel as a tab-delimited format and give column names ( if not available already ).

**Step 2** – Read the dataset into R

data = read.csv("./data/spam.csv", encoding='ISO-8859-1')

head(data)

class message X X.1 X.2 <fct> <fct> <fct> <fct> <fct> 1 ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... 2 ham Ok lar... Joking wif u oni... 3 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's 4 ham U dun say so early hor... U c already then say... 5 ham Nah I don't think he goes to usf, he lives around here though 6 spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

data = data[,c(1,2)]

head(data)

class message <fct> <fct> 1 ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... 2 ham Ok lar... Joking wif u oni... 3 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's 4 ham U dun say so early hor... U c already then say... 5 ham Nah I don't think he goes to usf, he lives around here though 6 spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

**Step 3** – Simple EDA

- How many messages are there in the dataset ?

nrow(data)

5572

summary(data$class)

ham 4825 spam 747

- Out of them, count the occurances of spam vs ham(non-spam)

- What percentage of this is spam ?

summary(data$class)["spam"] / summary(data$class)["ham"] * 100

spam: 15.4818652849741

15 % of the messages are spam.

**Step 4** – Feature Engineering

Just like we converted the fruits dataset’s feature values from “yes” or “no” to a 1 or 0 , Naive Bayes (or for that matter most ML algorithms) need the feature data to be numeric in nature. In order to do it, we have to use some techniques from Natural language processing.

- Tokenize the message (into words) and create a sparse matrix

This process basically splits the sentence (message) to it’s individual words. Let’s see a sample before we tokenize the entire dataset.

Now, let’s do the same on our real messages dataset.

library(tm) message_corpus = Corpus(VectorSource(data$message)) print ( message_corpus)

<<SimpleCorpus>> Metadata: corpus specific: 1, document level (indexed): 0 Content: documents: 5572

message_dtm <- DocumentTermMatrix(message_corpus)

Document term matrix (DTM) is in a binary format. So, we can’t just print it out using indices. Instead, we use the **inspect ( )** function.

inspect(message_dtm[1:10,1:20])

<<DocumentTermMatrix (documents: 10, terms: 20)>> Non-/sparse entries: 21/179 Sparsity : 90% Maximal term length: 19 Weighting : term frequency (tf) Sample : Terms Docs amore available buffet... bugis cine crazy.. got great jurong there 1 1 1 1 1 1 1 1 1 1 1 10 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 1 7 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0

**Step 5** – Train/Test data split

Before we use the DTM as-is, we have to convert the 0,1’s to Factors – like a Yes and No. This is becuase Naive Bayes works well with Factors. Let’s write a small functiont that converts all values greater than 0 to a Yes and otherwise to No.

counts_to_factor = function(x){ x = ifelse(x > 0, 1, 0) x = factor(x, levels = c(0,1), labels = c("No", "Yes")) return (x) }

Before we apply this function to the DTM, let’s split the data into training and test datasets.

head(msg_train_dtm[,1:5])

index = sample(1:nrow(data),nrow(data)*.8) train = data[index,2] test = data[-index,2] msg_cor_train = Corpus(VectorSource(data[train,]$message)) msg_train_dtm = DocumentTermMatrix(msg_cor_train) msg_train_dtm = apply(msg_train_dtm, MARGIN = 2, counts_to_factor) msg_class_train = data$class[train] msg_cor_test = Corpus(VectorSource(data[test,]$message)) msg_test_dtm = DocumentTermMatrix(msg_cor_test) msg_test_dtm = apply(msg_test_dtm, MARGIN = 2, counts_to_factor) msg_class_test = data$class[test]

head(msg_train_dtm[,1:5])

2wks 87077 87077: club free 1 Yes Yes Yes Yes Yes 2 No No No No No 3 No No No No No 4 No No No No No 5 No No No No No 6 No No No No No

msg_train_df = as.data.frame(as.matrix(msg_train_dtm)) msg_test_df = as.data.frame(as.matrix(msg_test_dtm))

head(msg_train_df)

been curtsey? have practising you your off. pissed pretty whatever, ... not..tel clearer.. sections above da..al coins factory chart heroes, tips <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ... <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1 1 1 1 1 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 3 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

**Step 6** – Model the data

library(e1071) model = naiveBayes(msg_train_dtm, msg_class_train)

**Step 7** – Evaluate the model.

pred = predict(model, msg_test_dtm)

table(msg_class_test, pred)

pred msg_class_test ham spam ham 950 13 spam 18 134

Measure the accuracy using the confusion matrix from the caret library.

library(caret) cm = confusionMatrix(pred,msg_class_test) cm

Confusion Matrix and Statistics Reference Prediction ham spam ham 2385 46 spam 12 343 Accuracy : 0.9792 95% CI : (0.9732, 0.9842) No Information Rate : 0.8604 P-Value [Acc > NIR] : < 2.2e-16 Kappa : 0.9101 Mcnemar's Test P-Value : 1.47e-05 Sensitivity : 0.9950 Specificity : 0.8817 Pos Pred Value : 0.9811 Neg Pred Value : 0.9662 Prevalence : 0.8604 Detection Rate : 0.8561 Detection Prevalence : 0.8726 Balanced Accuracy : 0.9384 'Positive' Class : ham

There is scope for a ton of optimization here like

- convert all characters to lower case
- remove punctuation
- remove stop words etc

But that is a subject for another day. Here we will just focus on learning the Naive Bayes algorithm.

### Challenge

Let’s solve another problem in Naive Bayes. Load up a dataset called house-votes-84.csv from the data folder. The data set should look like this.

These are the results from Congressmen in the US, voting a Yes ( for ) or No (Against ) on 16 different issues. Instead of putting names, the class column identifies the congressmen as either a Republican or a Democrat.

**Task** – Identify the congressmen as either a Democrat or Republican based on his voting pattern.

**solution** – This problem is almost exactly similar to the fruits data we started with at the beginning of leaning Naive Bayes.

# 1. Import the dataset library(mlbench) data(HouseVotes84, package = "mlbench") data = HouseVotes84 head(data) # 2. train/test split index = sample(1:nrow(data),nrow(data)*.8) train = data[index,] test = data[-index,] # 3. model the data model = naiveBayes(Class ~ ., data = train) # 4. predict the data pred = predict(model, test) # 5. Accuracy table(pred, test$Class) library(caret) cm = confusionMatrix(pred,test$Class) print (cm)

Class V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> 1 republican n y n y y y n n n y NA y y y n y 2 republican n y n y y y n n n n n y y y n NA 3 democrat NA y y NA y y n n n n y n y y n n 4 democrat n y y n NA y n n n n y n y n n y 5 democrat y y y n y y n n n n y NA y y y y 6 democrat n y y n y y n n n n n n y y y y

pred democrat republican democrat 121 5 republican 21 71

Confusion Matrix and Statistics Reference Prediction democrat republican democrat 121 5 republican 21 71 Accuracy : 0.8807 95% CI : (0.8301, 0.9206) No Information Rate : 0.6514 P-Value [Acc > NIR] : 1.002e-14 Kappa : 0.7496 Mcnemar's Test P-Value : 0.003264 Sensitivity : 0.8521 Specificity : 0.9342 Pos Pred Value : 0.9603 Neg Pred Value : 0.7717 Prevalence : 0.6514 Detection Rate : 0.5550 Detection Prevalence : 0.5780 Balanced Accuracy : 0.8932 'Positive' Class : democrat

### Challenge – IMDB review Sentiment Analysis

Similar to the SPAM/HAM problem, we can also predict if an IMDB review is positive or negative based on the words in it.

# step 1 - Read the data file library("xlsx") data = read.xlsx("./data/imdb-reviews-sentiment.xlsx", sheetIndex = 1, header=TRUE) # step 2 - Create a DTM based on the text data library(tm) message_corpus = Corpus(VectorSource(data$review)) message_dtm <- DocumentTermMatrix(message_corpus) # step 3 - function to convert the integers to "Yes" or "No" factors in the DTM counts_to_factor = function(x){ x = ifelse(x > 0, 1, 0) x = factor(x, levels = c(0,1), labels = c("No", "Yes")) return (x) }

# step 4 - Split the DTMs to Train and test data and convert the integers to factors for "Yes" and "No" index = sample(1:nrow(data),nrow(data)*.8) train = data[index,2] test = data[-index,2] msg_cor_train = Corpus(VectorSource(data[train,]$review)) msg_train_dtm = DocumentTermMatrix(msg_cor_train) msg_train_dtm = apply(msg_train_dtm, MARGIN = 2, counts_to_factor) msg_class_train = data$sentiment[train] msg_cor_test = Corpus(VectorSource(data[test,]$review)) msg_test_dtm = DocumentTermMatrix(msg_cor_test) msg_test_dtm = apply(msg_test_dtm, MARGIN = 2, counts_to_factor) msg_class_test = data$sentiment[test]

# step 4 - model the data using Naive Bayes library(e1071) model = naiveBayes(msg_train_dtm, msg_class_train) #step 4- predict the results from the model using the test data pred = predict(model, msg_test_dtm)

# step 6 - get the accuracy from confusion matrix. library(caret) cm = confusionMatrix(pred,data$sentiment[test]) print (cm)

Confusion Matrix and Statistics Reference Prediction negative positive negative 0 0 positive 0 2000 Accuracy : 1 95% CI : (0.9982, 1) No Information Rate : 1 P-Value [Acc > NIR] : 1 Kappa : NaN Mcnemar's Test P-Value : NA Sensitivity : NA Specificity : 1 Pos Pred Value : NA Neg Pred Value : NA Prevalence : 0 Detection Rate : 0 Detection Prevalence : 0 Balanced Accuracy : NA 'Positive' Class : negative

### Naive Bayes on continuous variables

So far, we have seen Naive Bayes work on factor variables. Does NB ever work on continous variables ? Yes, it does – ofcourse with discretized version of those variables ( Think of binning a normal distribution ). The key assumption there would be that the variable has a normal distribution. For example, think of the iris dataset – is the “Sepal length” of setosa species normally distributed ? Let’s find out.

from sklearn import datasets iris = datasets.load_iris() iris_data = iris.data iris_target = iris.target

# matplotlib does not have the ability to plot the kernel density function import matplotlib.pyplot as plt # So, we are using seaborn instead import seaborn as sns %matplotlib inline # You can check from these curves that Sepal data is normally distributed, but # the petal data is not. Try them on one by one. sns.distplot(iris_data[:,0], hist=True, kde=True) sns.distplot(iris_data[:,1], hist=True, kde=True) sns.distplot(iris_data[:,2], hist=True, kde=True) sns.distplot(iris_data[:,3], hist=True, kde=True)

Only the Sepal data is normally distributed. Ideally, we should just be using the sepal data ( Sepal Length and Sepal Width ). However, let’s just use all of these and see what happens. As an exercise, try using just the sepal data and check for the accuracy.

# 1. train/test split from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(iris_data , iris_target, test_size=0.2) # 2. Naive Bayes modeling from sklearn.naive_bayes import MultinomialNB model = MultinomialNB().fit(X_train, y_train) # 3. Predict data y_predict = model.predict(X_test) # 4. Create a confusion matrix to check accuracy print ( pd.crosstab(y_test, y_predict,rownames=['Actual'], colnames=['Predicted'], margins=True) ) # 5. Print the accuracy score from sklearn.metrics import confusion_matrix, accuracy_score print ( confusion_matrix(y_test, y_predict) ) print ( accuracy_score(y_test,y_predict))

Predicted 0 1 2 All Actual 0 6 0 0 6 1 0 11 1 12 2 0 1 11 12 All 6 12 12 30 [[ 6 0 0] [ 0 11 1] [ 0 1 11]] 0.9333333333333333

That’s pretty accurate as well , isn’t it ? In fact even though one of the assumptions (all the variables should be independent of each other ) is wrong, Naive Bayes still outperforms some other classification algorithms.

- The priors ( Probability of a “Setosa” occuring or a “Virginica” occuring .. ) is 0.33 ( a third ) – which we know.
- How about the conditional probabilities ? This is where it gets tricky for continuous variables. You cannot have conditional probabilities for each of the values ( as the number can get infinite ). So, in case of a normal distribution, an approximation is applied based on the following formula.

where μ is the mean and σ is the variance