Perceptron from scratch
- Linearly Separable data
- Activation function
The perceptron is the most fundamental element of a neural network. We have already seen what a perceptron is in the basics of neural networks section, but we only scratched the surface. In this section, we will explore a perceptron in detail, work through forward propagation, back propagation and the math behind it, and solve a couple of simple problems.
Linearly Separable data
By definition, a perceptron can only solve linearly separable problems. What is a linearly separable problem? It is one where the classes can be split by a hyperplane. Here are a couple of examples. Two of the iris species are linearly separable by a hyperplane (in this case, a single line). Similarly, an OR gate is an example of a linearly separable dataset.
# Visualize an OR gate
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# OR gate data
x = np.array([[1,0],
              [0,1],
              [0,0],
              [1,1]])
y = np.array([1,1,0,1])
plt.scatter(x[:,0],x[:,1],c=y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("OR gate")
# Visualize just 2 species (setosa, versicolor) that are linearly separable
# using the predictors (sepal length, sepal width)
import matplotlib.pyplot as plt
%matplotlib inline
# iris data is readily available as a sklearn dataset.
from sklearn import datasets
iris = datasets.load_iris()
data = iris.data
# visualize just the first 100 rows (so that it contains only the species setosa and versicolor)
# We are specifically not plotting the third species (virginica), because it is not
# linearly separable.
plt.scatter(data[0:100,0],data[0:100,1],c=iris.target[0:100])
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.title("iris species - Setosa, Versicolor")
plt.savefig("iris.png")
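To make "linearly separable" concrete, the sketch below checks whether one hand-picked line classifies every OR gate point correctly. The line x1 + x2 - 0.5 = 0 and the helper name `separates` are illustrative assumptions, not part of the original walkthrough. XOR labels are included only as a contrast: this line fails on them, and no other line would do better.

```python
import numpy as np

def separates(w, b, X, y):
    """Return True if the line w.x + b = 0 puts every point on its labelled side."""
    predictions = (X @ w + b > 0).astype(int)
    return bool(np.array_equal(predictions, y))

# the four input combinations of a two-input gate
X = np.array([[1, 0], [0, 1], [0, 0], [1, 1]])
y_or = np.array([1, 1, 0, 1])    # OR gate labels
y_xor = np.array([1, 1, 0, 0])   # XOR labels, used only as a contrast

# hand-picked line x1 + x2 - 0.5 = 0 (an assumption for illustration)
w_line = np.array([1.0, 1.0])
b_line = -0.5

or_ok = separates(w_line, b_line, X, y_or)
xor_ok = separates(w_line, b_line, X, y_xor)
print(or_ok, xor_ok)  # True False
```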
Now that we have an understanding of the data, let’s use gradient descent to solve for the weights.
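What we are essentially trying to do is find values for the weights and bias such that the output of the perceptron matches the target y for every row of the data. The perceptron computes a weighted sum of its inputs plus the bias and passes the result through an activation function.
Activation function
A simple perceptron uses a binary step function as its activation: the output is 0 while the weighted sum is below zero and switches to 1 once it crosses zero. Here is what this function looks like.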
# Show what a binary step function looks like.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(-5,5,100)
y = np.zeros(len(x))
y[x>=0] = 1
y[x<0] = 0
plt.scatter(x,y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Activation function - Binary step function ")
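The same activation can be written as a small reusable function. This is a minimal sketch; the name `binary_step` is an assumption for illustration. It follows the `> 0` convention used by the forward propagation code in this section, whereas the plot above assigns 1 at exactly zero, so the two differ only at that single point.

```python
import numpy as np

def binary_step(z):
    """Binary step activation: 1 where z > 0, else 0 (z may be a scalar or an array)."""
    return np.where(np.asarray(z) > 0, 1, 0)

# weighted sums for three hypothetical inputs
z = np.array([-2.0, 0.0, 3.5])
activated = binary_step(z)
print(activated)  # [0 0 1]
```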
Let’s solve the OR gate problem (or any other linearly separable problem) using a simple, single-layer perceptron. Let’s start with the data first. Here is a quick high-level overview of the steps involved.
# OR gate data
x = np.array([[1,0],
              [0,1],
              [0,0],
              [1,1]])
y = np.array([1,1,0,1])
What about back propagation? This is where gradient descent comes in (along with its paraphernalia of partial derivatives, learning rate, cost function, etc.). There are a couple of options for the cost function (residual sum of squares and cross entropy), but for now, let’s just use the residual sum of squares (RSS) cost function. We have already seen this in the gradient descent section.
Now, we want to find the derivative of the cost function with respect to each of the variables (weights and bias), starting with weight 1 (w1).
Similarly, we need the partial derivatives with respect to weight 2 (w2) and the bias (b).
Once we have the partial derivatives, we can update the weights and biases.
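The derivatives and update rules described above can be sketched as follows. Two assumptions are made here: the cost for a single row is taken as half the squared residual (the factor 1/2 is a convenience that cancels the 2 produced by differentiation), and for the purpose of differentiation the prediction is treated as the plain weighted sum, since the binary step function has no useful derivative.

```latex
\begin{align*}
C &= \tfrac{1}{2}\,(y - \hat{y})^2, \qquad \hat{y} = w_1 x_1 + w_2 x_2 + b \\[4pt]
\frac{\partial C}{\partial w_1} &= -(y - \hat{y})\,x_1, \qquad
\frac{\partial C}{\partial w_2} = -(y - \hat{y})\,x_2, \qquad
\frac{\partial C}{\partial b} = -(y - \hat{y}) \\[4pt]
w_1 &\leftarrow w_1 - \alpha\,\frac{\partial C}{\partial w_1} = w_1 + \alpha\,(y - \hat{y})\,x_1 \\
w_2 &\leftarrow w_2 - \alpha\,\frac{\partial C}{\partial w_2} = w_2 + \alpha\,(y - \hat{y})\,x_2 \\
b &\leftarrow b - \alpha\,\frac{\partial C}{\partial b} = b + \alpha\,(y - \hat{y})
\end{align*}
```

Here α is the learning rate. The back propagation function in this section applies updates of the same form, using the thresholded prediction in place of the raw weighted sum.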
These equations can rattle any ML engineer, but remember, all of this is left to the library (TensorFlow or any other deep learning library) to compute. The only reason we are learning the math and hand-coding it in Python is to get an in-depth understanding of back propagation. This is absolutely essential to being a good ML engineer.
Forward propagation is a relatively easy step. Let’s write a quick function for it.
def forward_prop(row) :
    y_hat = np.dot(x[row],w) + b
    if y_hat > 0 :
        return 1
    else :
        return 0
Let’s now write a function for back propagation using all the geeky stuff above in the “update rules” section.
def backward_prop(y_hat, row) :
    global b,w
    # update both weights at once (w and x[row] are arrays), then the bias
    w = w + alpha * (y[row] - y_hat) * x[row]
    b = b + alpha * (y[row] - y_hat)
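A single update is easy to trace by hand. The sketch below is a self-contained variant that takes the weights as arguments instead of using globals; the function name `perceptron_update` and the all-zero starting point are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np

def perceptron_update(w, b, x_row, y_true, alpha):
    """Apply one perceptron update for a single row and return the new (w, b)."""
    # forward propagation: weighted sum followed by the binary step
    y_hat = 1 if np.dot(x_row, w) + b > 0 else 0
    # backward propagation: move w and b in proportion to the error
    error = y_true - y_hat
    return w + alpha * error * x_row, b + alpha * error

# hypothetical starting point: all-zero weights and bias
w0 = np.zeros(2)
b0 = 0.0
# the row [1, 0] has target 1 but is predicted as 0, so the error is +1
w1, b1 = perceptron_update(w0, b0, np.array([1, 0]), y_true=1, alpha=0.01)
print(w1, b1)
```

With an error of +1, only the weight attached to the active input moves (by alpha times 1), and the bias moves by alpha as well.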
Initialize weights and biases
Initialize the weights and bias.
w = np.random.normal(size=2)
b = np.random.normal()
# learning rate. This is exactly the same term that we have already learnt in gradient descent.
alpha = 0.01
As of now, we are working with global variables to keep things simple. Later, we will make a class out of all this to make things easier going forward. Just one more function to go before we set this in motion. Assuming the model is in place (we are going to write it in a minute), we also need a function to predict a y value, right? Just like any machine learning algorithm, we need a predict() method. Once the model has fit the data with the right set of weights, this one is very easy. All we have to do is run the data through one forward propagation cycle.
# return the predicted y_hat, for the test data set.
def predict(x) :
    y = []
    # the user could be sending multiple rows. compute y_hat for each of the rows in the test dataset.
    for row in x :
        # weighted sum
        y_pred = np.dot(row,w) + b
        # run the weighted sum through the activation function.
        if y_pred > 0 :
            y_pred = 1
        else :
            y_pred = 0
        # append the predicted y (y_hat) to the list
        y.append(y_pred)
    # return the list of predicted y_hat values for the corresponding test data (x)
    return y
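The row-by-row loop can also be expressed with array operations. This is a minimal sketch rather than a replacement for the function above; the name `predict_vectorized` and the demo weights are assumptions for illustration.

```python
import numpy as np

def predict_vectorized(X, w, b):
    """Predict 0/1 labels for every row of X in one shot."""
    # X @ w computes all weighted sums at once; the comparison applies the binary step
    return (X @ w + b > 0).astype(int)

# hypothetical weights that implement an OR gate
w_demo = np.array([1.0, 1.0])
b_demo = -0.5
X_demo = np.array([[1, 0], [0, 1], [0, 0], [1, 1]])
labels = predict_vectorized(X_demo, w_demo, b_demo)
print(labels)  # [1 1 0 1]
```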
The individual pieces of the simple perceptron have been coded. Now, we need to write the logic to
- take the input data. For each row
- do one cycle of forward propagation
- do one cycle of backward propagation and update the weights and bias.
This exhausts one cycle of the input data. In deep learning, this is called an epoch. We need to repeat the entire process for a whole bunch of epochs.
Let’s write the logic for this.
# number of epochs
for epoch in range(1000) :
    # for each row in x (cycle through the dataset)
    for row in range(x.shape[0]) :
        # for each row in x, predict y_hat
        y_hat = forward_prop(row)
        # for each row, update the weights and bias
        backward_prop(y_hat,row)
print ( w, b)
[0.01363271 0.25196752] -0.009751486705392132
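The walkthrough mentions turning the global-variable code into a class later. A minimal sketch of what that could look like is shown below. The class name `Perceptron`, its method names, and the zero initialisation are all assumptions made for illustration; zero starting values are used only so the result is repeatable, whereas the code above draws them at random.

```python
import numpy as np

class Perceptron:
    """Single-layer perceptron with a binary step activation (illustrative sketch)."""

    def __init__(self, alpha=0.01, epochs=1000):
        self.alpha = alpha      # learning rate
        self.epochs = epochs    # number of passes over the data
        self.w = None
        self.b = 0.0

    def _forward(self, row):
        # weighted sum followed by the binary step
        return 1 if np.dot(row, self.w) + self.b > 0 else 0

    def fit(self, X, y):
        # zero initialisation: an assumption made here for repeatability
        self.w = np.zeros(X.shape[1])
        self.b = 0.0
        for _ in range(self.epochs):
            for row, target in zip(X, y):
                error = target - self._forward(row)
                self.w = self.w + self.alpha * error * row
                self.b = self.b + self.alpha * error
        return self

    def predict(self, X):
        return [self._forward(row) for row in X]

X_or = np.array([[1, 0], [0, 1], [0, 0], [1, 1]])
y_or = np.array([1, 1, 0, 1])
model = Perceptron().fit(X_or, y_or)
or_predictions = model.predict(X_or)
print(or_predictions)  # [1, 1, 0, 1]
```

Keeping the weights on the object removes the need for `global` and lets several models be trained side by side.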
It is time to test our network. Let’s quickly print out x and y.
array([[1, 0], [0, 1], [0, 0], [1, 1]])
array([1, 1, 0, 1])
Since this is a small dataset, we don’t need a confusion matrix to calculate the accuracy. Let’s just use the predict function on the x array to predict y.
[1, 1, 0, 1]
That’s a perfect match. This is a small dataset, though. Let’s look at a slightly larger dataset and see if the perceptron is good enough to do linear separation. Let’s pick up the iris dataset that ships with scikit-learn.
from sklearn import datasets
iris = datasets.load_iris()
data = iris.data
Not all of this data is linearly separable. For example, if you plot the species against sepal length and width, versicolor and virginica are muddled together. Only the first species (setosa) is clearly separated.
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(data[:,0],data[:,1],c=iris.target)
So, let’s just use the first two species. Let’s plot it again.
# visualize just the first 100 rows (so that it contains only the species setosa and versicolor)
# We are specifically not plotting the third species (virginica), because it is not
# linearly separable.
plt.scatter(data[0:100,0],data[0:100,1],c=iris.target[0:100])
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.title("iris species - Setosa, Versicolor")
plt.savefig("iris.png")
Now, we have a clear, linear separation. Let’s train our perceptron on this data and see if it works.
x = data[0:100,0:2]    # iris sepal data ( sepal length and width )
y = iris.target[0:100] # iris species data (only setosa and versicolor)
w = np.random.normal(size=2)
b = np.random.normal()
# learning rate
alpha = 0.01
# number of epochs
for epoch in range(1000) :
    # for each row in x
    for row in range(x.shape[0]) :
        # for each row in x, predict y_hat
        y_hat = forward_prop(row)
        # for each row, update the weights and bias
        backward_prop(y_hat,row)
print ( w, b)
[ 0.80437979 -1.08684544] -1.0479456545593953
We can use a confusion matrix to check the accuracy.
y_pred = predict(x)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
print ( confusion_matrix(y,y_pred) )
print ( accuracy_score(y,y_pred))
[[49  1]
 [ 0 50]]
0.99
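The accuracy score can be read straight off the confusion matrix: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total count. A small sketch using the numbers printed above (the variable names are assumptions for illustration):

```python
import numpy as np

# confusion matrix from the run above (rows: actual class, columns: predicted class)
cm = np.array([[49, 1],
               [0, 50]])
# correct predictions are on the diagonal
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.99
```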
That’s pretty good accuracy: 99%. It is a bit more interesting to see this visually, using NumPy’s meshgrid to build a grid of points and matplotlib to plot the prediction for each one.
import numpy as np
x_all = np.linspace(0,10,100).reshape(-1,1)
y_all = np.linspace(0,10,100).reshape(-1,1)
xx,yy = np.meshgrid(x_all,y_all)
x_grid = np.concatenate((xx.ravel().reshape(-1,1),yy.ravel().reshape(-1,1)),axis=1)
array([[ 0. , 0. ], [ 0.1010101, 0. ], [ 0.2020202, 0. ], ..., [ 9.7979798, 10. ], [ 9.8989899, 10. ], [10. , 10. ]])
y_grid = predict(x_grid)
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200
plt.scatter(x_grid[:,0],x_grid[:,1],c=y_grid,alpha=0.1)
plt.scatter(data[0:100,0],data[0:100,1],c=iris.target[0:100])
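The edge between the two shaded regions in the plot is the line where the weighted sum equals zero, w1*x1 + w2*x2 + b = 0, which can be solved for x2 directly. The sketch below does this with the weights and bias printed by the iris training run above (a fresh run starts from different random values and will give a different line). The helper name `boundary_x2` is an assumption for illustration.

```python
import numpy as np

def boundary_x2(x1, w, b):
    """Return x2 on the line w[0]*x1 + w[1]*x2 + b = 0 (requires w[1] != 0)."""
    return -(w[0] * x1 + b) / w[1]

# weights and bias printed by the iris training run above
w_iris = np.array([0.80437979, -1.08684544])
b_iris = -1.0479456545593953

x1_line = np.linspace(4, 7.5, 50)   # roughly the sepal length range
x2_line = boundary_x2(x1_line, w_iris, b_iris)

# every point on the line should give a weighted sum of (numerically) zero
residual = w_iris[0] * x1_line + w_iris[1] * x2_line + b_iris
print(np.abs(residual).max())
```

Calling `plt.plot(x1_line, x2_line)` after the scatter plots would draw this line on top of the shaded grid.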