Perceptron from scratch
This is the most critical component of neural networks. In this section, we will learn about both forward propagation and backward propagation, as well as the math behind backpropagation.
Perceptron
This is the most fundamental type of element in a neural network. We have already seen what a perceptron is in the basics of neural networks section. However, we just scratched the surface. In this section, we will explore a perceptron in detail and explore a couple of simple problems it can solve.

Linearly Separable data
By definition, a perceptron can only solve linearly separable problems. What is a linearly separable problem? Here are a couple of examples of linearly separable data. For instance, two of the iris species are linearly separable by a hyperplane (in this case, a single line). Similarly, an OR gate is also an example of a linearly separable dataset.

# Visualize an OR gate
import numpy as np
# OR gate data
x = np.array([[1, 0],
              [0, 1],
              [0, 0],
              [1, 1]])
y = np.array([1, 1, 0, 1])
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x[:, 0], x[:, 1], c=y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("OR gate")

# Visualize just 2 species (setosa, versicolor) that are linearly separable
# using the predictors (Sepal Length, Sepal Width)
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
# iris data is readily available as a sklearn dataset.
from sklearn import datasets
iris = datasets.load_iris()
data = iris.data
# visualize just the first 100 rows (so that it contains only the species setosa and versicolor)
# We are specifically not plotting the third species (virginica), because it is not
# linearly separable.
plt.scatter(data[0:100,0],data[0:100,1],c=iris.target[0:100])
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.title("iris species - Setosa, Versicolor")
plt.savefig("iris.png")

Now that we have an understanding of the data, let’s use gradient descent to solve for the weights.
Activation function

What we are essentially trying to do is find values for the weights and bias such that, for every row of the training data, the weighted sum of the inputs passed through the activation function reproduces the target output y.
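One way to write this condition (using x1, x2 for the inputs, w1, w2 for the weights, b for the bias, and f for the binary step activation shown next) is:

$$\hat{y} = f(w_1 x_1 + w_2 x_2 + b) = y,
\qquad\text{where}\qquad
f(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}$$

for every row of the training data.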

Here is what this function looks like.
# Show how a binary step function looks like.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace (-5,5,100)
y = np.zeros(len(x))
y[x>=0] = 1
y[x<0] = 0
plt.scatter(x,y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Activation function - Binary step function ")

Steps
Let's solve the OR gate problem (or any other linearly separable problem) using a simple, single-layer perceptron. Here is a quick high-level overview of the steps involved.
- define a cost function (residual sum of squares)
- work out the partial derivatives and the update rules
- write the forward propagation and backward propagation functions
- initialize the weights and bias
- train over many epochs and then predict
Let's start with the data first.

# OR gate data
x = np.array([[1, 0],
              [0, 1],
              [0, 0],
              [1, 1]])
y = np.array([1, 1, 0, 1])
Cost function
What about back propagation? This is where gradient descent comes in (along with its paraphernalia of partial derivatives, learning rate, cost function, etc.). There are a couple of common choices for the cost function (residual sum of squares and cross entropy), but for now, let's just use the residual sum of squares (RSS) cost function. We have already seen this in the gradient descent section.
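For reference, the RSS cost over the whole dataset can be written as below. The factor of 1/2 is just a convention that keeps the derivative clean, and since the code below updates the weights one row at a time, the derivatives in the next section use the single-row term of this sum.

$$J = \frac{1}{2}\sum_{i}\left(y_i - \hat{y}_i\right)^2$$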

Partial Derivatives
Now, we want to work out the derivative of the cost function with respect to each of the variables (the weights and the bias).
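For a single training row, and treating the step activation as a pass-through when differentiating (the classic perceptron learning rule, which is also what the backward_prop function below implements), the partial derivative with respect to the first weight (w1) works out to:

$$\frac{\partial J}{\partial w_1} = -(y - \hat{y})\,x_1$$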

Similarly, the partial derivatives with respect to the second weight (w2) and the bias (b) are
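$$\frac{\partial J}{\partial w_2} = -(y - \hat{y})\,x_2
\qquad\qquad
\frac{\partial J}{\partial b} = -(y - \hat{y})$$

(again written per row, with the same pass-through treatment of the step activation).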

Update Rules
Once we have the partial derivatives, we can update the weights and the bias.
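Moving each parameter a small step against its own gradient, with a learning rate α, gives the following update rules (these are exactly what the backward_prop function below implements):

$$w_1 \leftarrow w_1 + \alpha\,(y - \hat{y})\,x_1
\qquad
w_2 \leftarrow w_2 + \alpha\,(y - \hat{y})\,x_2
\qquad
b \leftarrow b + \alpha\,(y - \hat{y})$$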

These equations can rattle any ML engineer, but remember, all of this is left to the library (TensorFlow or whichever deep learning library you use) to compute. The only reason we are learning the math and hand-coding this in Python is to get an in-depth understanding of back propagation. This is absolutely essential to being a good ML engineer.
Forward Propagation
Forward propagation is a relatively easy step. Let's write a quick function for it.
def forward_prop(row):
    # weighted sum of the inputs plus the bias
    y_hat = np.dot(x[row], w) + b
    # binary step activation
    if y_hat > 0:
        return 1
    else:
        return 0

Backward Propagation
Let's now write a function for back propagation using the update rules derived above.
def backward_prop(y_hat, row):
    global b, w
    # apply the update rules to both weights and the bias
    w[0] = w[0] + alpha * (y[row] - y_hat) * x[row][0]
    w[1] = w[1] + alpha * (y[row] - y_hat) * x[row][1]
    b = b + alpha * (y[row] - y_hat)
Initialize weights and biases
Initialize the weights and bias.

w = np.random.normal(size=2)
b = np.random.normal()
# learning rate. This is exactly the same term that we have already learnt in gradient descent.
alpha = 0.01
Predict Function
As of now, we are working with global variables to keep things simple. Later, we will make a class out of all this to make things easier going forward. Just one more function to go before we set this in motion. Assuming the model is in place (which we are going to write in a minute), we also need a function to predict a y value, right? Just like any machine learning algorithm, we need a predict() method. Once the model has fit the data to the right set of weights, this one is very easy. All we have to do is run the data through one forward propagation cycle.

# return the predicted y_hat for the test data set.
def predict(x):
    y = []
    # the user could be sending multiple rows. compute y_hat for each row in the test dataset.
    for row in x:
        # weighted sum
        y_pred = np.dot(row, w) + b
        # run the weighted sum through the activation function.
        if y_pred > 0:
            y_pred = 1
        else:
            y_pred = 0
        # append the predicted y (y_hat) to the list
        y.append(y_pred)
    # return the predicted y_hat values for the corresponding test data (x)
    return y
Training
The individual pieces of the simple perceptron have been coded. Now, we need to write the logic to
- take the input data, and for each row
- do one cycle of forward propagation
- do one cycle of backward propagation and update the weights and bias
One full pass through the input data is called an epoch in deep learning. We need to repeat the entire process for a whole bunch of epochs.

Let’s write the logic for this.
# number of epochs
for epoch in range(1000):
    # for each row in x (cycle through the dataset)
    for row in range(x.shape[0]):
        # for each row in x, predict y_hat
        y_hat = forward_prop(row)
        # for each row, update the weights and bias
        backward_prop(y_hat, row)
print(w, b)
[0.01363271 0.25196752] -0.009751486705392132
Predict
It is time to test our network. Let’s quickly print out x and y.
x
array([[1, 0],
[0, 1],
[0, 0],
[1, 1]])
y
array([1, 1, 0, 1])
Since this is a small dataset, we don’t need a confusion matrix to calculate the accuracy. Let’s just use the predict function on the x array to predict y.
predict(x)
[1, 1, 0, 1]
That's a perfect match. This is a small dataset, though. Let's look at a slightly larger dataset and see if the perceptron is good enough to do linear separation. Let's pick up the iris dataset from Scikit-Learn's built-in datasets.
from sklearn import datasets
iris = datasets.load_iris()
data = iris.data
Not all of this data is linearly separable. For example, if you plot the species against the sepal length and width, the species versicolor and virginica are muddled together. Only the first species (setosa) is clearly separated.
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(data[:,0],data[:,1],c=iris.target)

So, let’s just use the first two species. Let’s plot it again.
# visualize just the first 100 rows (so that it contains only the species setosa and versicolor)
# We are specifically not plotting the third species (virginica), because it is not
# linearly separable.
plt.scatter(data[0:100,0],data[0:100,1],c=iris.target[0:100])
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.title("iris species - Setosa, Versicolor")
plt.savefig("iris.png")

Now, we have a clear, linear separation. Let’s train our perceptron on this data and see if it works.
x = data[0:100,0:2] # iris sepal data ( sepal length and width )
y = iris.target[0:100] # iris species data (only setosa and versicolor)
w = np.random.normal(size=2)
b = np.random.normal()
# learning rate
alpha = 0.01
# number of epochs
for epoch in range(1000):
    # for each row in x
    for row in range(x.shape[0]):
        # for each row in x, predict y_hat
        y_hat = forward_prop(row)
        # for each row, update the weights and bias
        backward_prop(y_hat, row)
print(w, b)
[ 0.80437979 -1.08684544] -1.0479456545593953
We can also compute a confusion matrix and an accuracy score to check how well the perceptron has fit this data.
y_pred = predict(x)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
print ( confusion_matrix(y,y_pred) )
print ( accuracy_score(y,y_pred))
[[49 1]
[ 0 50]]
0.99
That is pretty good accuracy, 99%. It is a bit more interesting to see this visually, using NumPy's meshgrid together with a matplotlib scatter plot.
import numpy as np
x_all = np.linspace(0,10,100).reshape(-1,1)
y_all = np.linspace(0,10,100).reshape(-1,1)
xx,yy = np.meshgrid(x_all,y_all)
x_grid = np.concatenate((xx.ravel().reshape(-1,1),yy.ravel().reshape(-1,1)),axis=1)
x_grid
array([[ 0. , 0. ],
[ 0.1010101, 0. ],
[ 0.2020202, 0. ],
...,
[ 9.7979798, 10. ],
[ 9.8989899, 10. ],
[10. , 10. ]])
y_grid = predict(x_grid)
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200
plt.scatter(x_grid[:,0],x_grid[:,1],c=y_grid,alpha=0.1)
plt.scatter(data[0:100,0],data[0:100,1],c=iris.target[0:100])
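As mentioned earlier, these global-variable functions can later be wrapped into a class to make them easier to reuse. Here is a minimal sketch of what such a class might look like, assuming the same binary step activation and update rule used above; the class name, constructor parameters, and method names are illustrative choices, not code from this section.

# A minimal sketch of a Perceptron class (illustrative names and signatures)
import numpy as np

class Perceptron:
    def __init__(self, n_features, alpha=0.01, epochs=1000):
        # initialize the weights and bias, just like before
        self.w = np.random.normal(size=n_features)
        self.b = np.random.normal()
        self.alpha = alpha
        self.epochs = epochs

    def forward(self, xi):
        # weighted sum followed by the binary step activation
        return 1 if np.dot(xi, self.w) + self.b > 0 else 0

    def fit(self, x, y):
        # one forward + backward pass per row, repeated for every epoch
        for epoch in range(self.epochs):
            for xi, yi in zip(x, y):
                y_hat = self.forward(xi)
                self.w = self.w + self.alpha * (yi - y_hat) * xi
                self.b = self.b + self.alpha * (yi - y_hat)
        return self

    def predict(self, x):
        # run each row through one forward propagation cycle
        return [self.forward(xi) for xi in x]

# usage, for example on the OR gate data:
# p = Perceptron(n_features=2).fit(x, y)
# p.predict(x)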