A bar chart represents data values in the form of vertical bars. Each vertical bar in the graph represents an individual category. The bars are used to compare values across categories. In a bar chart, the length of a bar is proportional to the value it represents, and the width remains the same for all bars. One axis of the chart represents the categories and the other axis represents the value scale.
Below we see a bar chart plotted using the bar function defined in the pyplot module. The bar chart displays car sales over a ten-year period for an automobile company. The first argument to the bar function indicates the position of the bar on the x-axis, with the center at the x-tick position. The second argument indicates the height. The width of each bar is 0.8, which is the default setting; this can be changed using the 'width' parameter.
from matplotlib import pyplot as plt
import numpy as np
year = [2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018]
toyota_sales = [1843669,1496211,1488588,1396837,1764802,1893874,2004373,2098545,2106332,2129177,2128201]
x_pos = np.arange(len(year))
plt.bar(x_pos,toyota_sales,color='#623aa2',alpha=0.25,edgecolor='k',label='Toyota')
plt.xticks(x_pos,year,rotation=30)
plt.title('Toyota Car sales')
plt.xlabel('Year')
plt.ylabel('No of units sold')
plt.show()
Clustered Bar Chart
A clustered (or grouped) bar chart is used to compare multiple data sets side by side. If you want to compare the values of several datasets that fall under the same categories, a clustered bar chart comes in handy. The previous example can be extended to display the car sales of different automobile companies.
In the previous example, we used a bar chart to display the sales of an automobile company for a ten year period. Now we would like to compare the sales of three different companies for the same period. So we are going to have three vertical bars under each category, each bar representing a company. In order to differentiate the three datasets we use different colors for the bars.
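Here is a minimal sketch of such a grouped chart. Only the Toyota numbers come from the earlier example; the Honda and Ford figures are made-up placeholder values used purely for illustration, and each company's bars are shifted by the bar width so that three bars sit under each year.
import numpy as np
from matplotlib import pyplot as plt
year = [2016, 2017, 2018]
toyota = [2106332, 2129177, 2128201]   # from the earlier example
honda  = [1476582, 1641429, 1604828]   # placeholder values, for illustration only
ford   = [2599193, 2475556, 2497318]   # placeholder values, for illustration only
x_pos = np.arange(len(year))
width = 0.25                           # each company's bar takes a quarter of the group width
plt.bar(x_pos - width, toyota, width=width, label='Toyota')
plt.bar(x_pos,         honda,  width=width, label='Honda')
plt.bar(x_pos + width, ford,   width=width, label='Ford')
plt.xticks(x_pos, year)
plt.title('Car sales by company')
plt.xlabel('Year')
plt.ylabel('No of units sold')
plt.legend()
plt.show()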
Horizontal Bar Charts represent data in the form of horizontal bars,
each bar representing an individual category. The data categories are
shown on the y-axis and the data values are shown on the x-axis. The
length of a bar is proportional to the value it represents.
The example below demonstrates how to plot a bar chart when the input data is available in a csv file. We will import the built-in csv module to work with csv files.
import csv
with open (r'C:\Users\Ajay Tech\Desktop\air_pollution_index.csv') as input_file:
    csv_file = csv.reader(input_file,delimiter = ',')
    Header = next(csv_file)
    country = []
    index = []
    for row in csv_file:
        country.append(row[0])
        index.append(float(row[1]))   # convert the index values from strings to numbers
plt.bar(country,index,color='#ff753e')
plt.title('Air Pollution Index')
plt.ylabel('Index')
plt.show()
The above graph is plotted using the bar function. As can be observed from the figure, the x-axis labels overlap with each other because they are too long. This problem can be solved using a horizontal bar chart, which makes better use of the available space. If the data labels are long, or if you have too many bars to plot, horizontal bar charts are a good choice, as shown below.
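Here is a minimal sketch of the same chart drawn with barh, assuming the country and index lists built from the CSV file above.
plt.barh(country,index,color='#ff753e')   # horizontal bars: categories on the y-axis, values on the x-axis
plt.title('Air Pollution Index')
plt.xlabel('Index')
plt.ylabel('Country')
plt.show()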
A pie chart is a circular chart divided into segments. The segments of a pie chart are called wedges. Each wedge represents an individual category. Pie charts display the contribution of each wedge to the total value, and the sum of the values of all wedges makes up the whole circle. Pie charts are useful when there are few categories to compare (5 or fewer); beyond that it becomes difficult to interpret the data.
Below we see a pie chart plotted using the pie function defined in pyplot module. The pie chart displays the percentage of marks obtained by a student in four subjects. The circle is divided into four wedges and the area of each wedge is proportionate to the value it represents.
from matplotlib import pyplot as plt
plt.style.use('default')
x = [22,18,13,10]
labels = ['maths','physics','chemistry','english']
colors = ['m','r','y','c']
plt.pie(x,labels=labels,colors=colors,autopct='%.1f%%')
plt.title('Marks obtained in an exam')
plt.show()
The pie chart below displays the market share of mobile phone vendors worldwide.
x — The first argument passed to the pie function is an array or a list denoting the values for the categories to be compared.
Labels — The labels argument is a list of strings used for labelling each wedge.
Colors — You can define an array/list of colors and then pass it to
the pie function that will be applied to each wedge in pie chart in the
order specified in the array.
Explode — If you want to highlight or emphasize key data in a pie
chart use the explode parameter. The explode parameter explodes/expands a
wedge, so the wedge is moved slightly outward from its center. This
parameter accepts an array and each element in the array specifies by
what fraction of the radius the wedge needs to be exploded. The value
has to be defined for all wedges in the pie chart, so the length of the
array should be equal to the number of wedges in the pie chart.
Autopct — If you want to label the wedges with their numeric values, use the autopct parameter. This parameter lets us display the percent value using string formatting. Say, for example, the percent value calculated for a wedge is 34.678666 and you want to display it rounded to 1 decimal place; then the autopct parameter should be assigned the format string '%1.1f', and the wedge will be labelled 34.7. If you want to add a percent sign (%) to the label, use two percent signs (%%) in the format string so that the special character '%' escapes itself.
Startangle — By default the Startangle is zero, which means starting
from the positive x-axis the wedges are arranged in the counter clock
wise direction. If you specify a different Startangle then the start of
the pie chart is shifted by this angle in degrees and then the wedges
are arranged in counter clock wise direction from this position.
Counterclock — Specifies the direction in which the wedges are
arranged, clockwise or counter clockwise. The default value is True.
Shadow — A shadow effect can be added to the pie chart using the
shadow parameter of the pie() function, passing boolean value – True
will make a shadow appear below the pie chart. By default shadow is
turned off.
Wedgeprops — The wedges of the pie chart can be customized using the wedgeprops parameter. A dictionary with property names and values as its key-value pairs can be passed as the wedgeprops argument. Wedge properties like edgecolor, linestyle and linewidth can be specified.
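Putting these parameters together, here is a minimal sketch that reuses the marks example above; the explode fractions, start angle, and wedge properties are illustrative values, not fixed requirements.
from matplotlib import pyplot as plt
x = [22,18,13,10]
labels = ['maths','physics','chemistry','english']
explode = [0.1, 0, 0, 0]   # push the first wedge out by 10% of the radius
plt.pie(x, labels=labels, explode=explode, autopct='%1.1f%%',
        startangle=90, counterclock=False, shadow=True,
        wedgeprops={'edgecolor': 'k', 'linewidth': 1, 'linestyle': '-'})
plt.title('Marks obtained in an exam')
plt.show()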
The pyplot module in Matplotlib supports a variety of plots, such as line plots, pie charts, bar charts, histograms and scatter plots. The module defines the functions used to render these plots. In this tutorial, we will discuss line plots.
A line plot is created by connecting the values in the input data with straight lines. Line plots are used to determine the relation between two datasets. A dataset is a collection of values. Each dataset is plotted along an axis, i.e., the x and y axes. In order to draw a line plot, we call the plot function defined in the pyplot module. We pass two arguments (arrays or lists) to the plot function: the first argument denotes the x-coordinates and the second argument denotes the y-coordinates. The plot function plots the data points (x1,y1), (x2,y2) and so on defined in the input datasets and, by default, draws a line between these data points. Before drawing a plot, let us see the components that make up a basic plot.
Components of a basic plot
A basic plot is made up of the following components:
Title – Title describes the information that we want to convey using the graph.
Label – Label is a short description of the datasets being plotted.
Scales – Scales determine the reference points for data displayed on the graph.
Points – Points in a graph represent the input data in the form of x-coordinate and y-coordinate (x1,y1).
Lines – Lines are used to connect points to highlight the change in values.
Plotting a graph
Matplotlib makes extensive use of the Numpy library which contains a number of mathematical functions which can be used to perform various mathematical operations. We need to import Matplotlib and Numpy libraries before making any calls to the routines defined in them. The below example demonstrates creation of a line plot by passing two numpy arrays x and y as arguments to the plot function.
from matplotlib import pyplot as plt
import numpy as np
x = np.linspace(0,10,10)
y = np.linspace(0,10,10)
plt.title('First Plot')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.plot(x,y,marker='o')
plt.show()
In order to render the above plot, we simply passed two arrays x,y to the plot function. We can see that the plot function has:
Drawn the x and y axis.
Marked evenly spaced scales(tick marks) on both the axes.
Plotted the data points.
Connected the data points with lines.
Added the title, xlabel, ylabel.
Before executing the above mentioned steps, the plot function first creates a figure object. Figure objects are the individual windows on the screen in which Matplotlib displays the graphical output; a figure is a container for the graphical output. In a Jupyter Notebook, figures rendered by Matplotlib are included inline. The plot function implicitly creates a figure object and then plots the graph, so we do not have to call any other function to instantiate a figure object when using the plot function. The default size of a figure is 6.4 inches wide by 4.8 inches high (older Matplotlib versions used 8 by 6 inches).
Say, for example, we have a requirement to create a figure with a specified size (4 inches wide, 4 inches high). For this, we need to call the figure() function defined in the pyplot module explicitly. Its 'figsize' parameter allows us to specify the width and height of the figure in inches, and a new figure of that size is created. In order to render a plot on it, call the plot function as usual.
The savefig() method saves the figure to a data file with a name specified by the string argument. The filename can be a full path and can also include a file extension if needed.
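Here is a minimal sketch that combines the two ideas; the 4 x 4 inch size and the file name first_plot.png are arbitrary choices.
from matplotlib import pyplot as plt
import numpy as np
x = np.linspace(0,10,10)
plt.figure(figsize=(4,4))        # width and height in inches
plt.plot(x,x,marker='o')
plt.savefig('first_plot.png')    # file name, optionally a full path with an extension
plt.show()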
We can also pass a single dataset (array or list) to the plot function, as shown in the cell below. In that case the plot function uses the values 0, 1, …, N-1 as the x-coordinates, where N is the size of the y array.
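A minimal sketch of such a call (the y values are arbitrary):
from matplotlib import pyplot as plt
y = [2,4,6,8,10]
plt.plot(y)    # x-coordinates default to 0, 1, ..., N-1
plt.show()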
We can plot multiple plots on the same graph by calling the plot function for each dataset pair, this is useful to compare the plots. Each plot is rendered on top of another plot. Notice how Matplotlib applies a different color to each plot. The plots share the figure, x and y axis.
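A quick sketch of this, plotting a few arbitrary functions of x on the same axes:
from matplotlib import pyplot as plt
import numpy as np
x = np.linspace(0,5,20)
plt.plot(x,x)        # first plot
plt.plot(x,x**2)     # second plot, drawn on the same axes in a new color
plt.plot(x,x**3)     # third plot
plt.show()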
A line drawn on a graph has several properties such as color, width, style and transparency. These properties can be customized as per our requirement when we call the plot function, as shown in the example after the Markers section below.
Colors in data visualization are used to enhance the look of the
graphs, communicate the information clearly and to distinguish one set
of data from another. The following basic colors are defined in
Matplotlib.
Matplotlib also supports HEX colors. Web designers and developers use
HEX colors in web designing. A HEX color is represented as a six-digit
combination of numbers and letters defined by the amount of red, green
and blue (RGB) that makes up the color.
Linestyle
Linestyle specifies whether the line is solid, dashed etc.,
Markers
Markers on a line plot are used to highlight particular data points.
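The sketch below pulls these line properties together in a single plot call; the specific color, style and marker choices are arbitrary.
from matplotlib import pyplot as plt
import numpy as np
x = np.linspace(0,10,10)
plt.plot(x,x,color='#623aa2',linewidth=2,linestyle='--',marker='o',alpha=0.7)
plt.show()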
Legend
A legend is a box associating labels(text) with lines on a graph. The
legend() method is used to add a legend to the plot. This method can be
called in multiple ways:
plt.legend() – When no arguments are passed to the legend() method,
the plots to be added in the legend are automatically detected, and the
corresponding labels are used in the legend.
plt.legend(['label1', 'label2', 'label3']) – The legend method can also be called by passing a list of string labels, where each string is used as a label for the plots in the order they were created. This form can be used to create a legend for plots that already exist on the axes.
Note: This way of using the legend method is often discouraged, because you need to remember the order in which the plots were created, which can be confusing.
plt.legend([plot1, plot2, plot3], ['label1', 'label2', 'label3']) – We can explicitly specify the plots and labels by passing the list of plots followed by the list of string labels, arranged in order, to the legend method.
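A short sketch of the first (and simplest) form, with arbitrary data and labels supplied in the plot calls:
from matplotlib import pyplot as plt
import numpy as np
x = np.linspace(0,5,20)
plt.plot(x,x,label='linear')
plt.plot(x,x**2,label='square')
plt.legend()    # picks up the labels supplied in the plot calls
plt.show()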
Data visualization is the representation of data in a graphical format. It helps put data in a visual form. Information, relationships and patterns that might go unnoticed in a text-based format can be easily recognized with data visualization software. This is because the human brain can understand and process visuals such as images, graphs or charts more easily than data in spreadsheets or reports. Data visualizations can turn both large and small datasets into visuals.
The table below displays an independent variable 'x' and three functions a, b and c, each of which depends on the variable 'x'. Let us use the data in the table below to plot these functions.
From the plot, we can observe the relation between the variable and each of the functions, and can infer that the cubic function (c) grows much faster than the identity function (a) and the square function (b).
In this example, we have a small dataset, so analysing data from the table is easy. But what if we had a dataset with millions of entries, or a complex function to be analysed? In that case, having a graphical representation of the data would be useful. There are various types of graphs or charts to represent data in a visual form. The type of data given, and what we want to convey to the user, determine the appropriate graph to be used. Line plots, pie charts, bar charts, histograms and scatter plots are a few examples of graphs.
Matplotlib
This course will take an in-depth look at the Matplotlib tool for visualization in Python.
Matplotlib is a Python package that is widely used throughout the
scientific Python community to create high-quality and publication-ready
graphics.
Matplotlib is Python’s alternative to MATLAB and it has the advantage
of being free and open-source, whereas MATLAB is expensive and closed
source.
The Matplotlib library provides the pyplot module, which contains
functions which closely resemble the MATLAB plotting syntax and
functionality.
Matplotlib is built on Numpy arrays. It supports a wide range of
export formats suitable for both web and print publishing. It supports
high-quality output formats such as PNG, PDF, SVG, EPS and PGF.
Installing Matplotlib
Install Matplotlib with pip
Matplotlib can be installed using the Python package manager, pip. To
install Matplotlib with pip, open a terminal window and type:
$ pip install matplotlib
This command installs Matplotlib in the current working Python environment.
Install Matplotlib with the Anaconda distribution of Python
The easiest way to install Matplotlib is to download and install the
Anaconda distribution of Python. The Anaconda distribution of Python
comes with Matplotlib included and no further installation steps are
required. You can download the latest version of Anaconda by following
this link – https://www.anaconda.com/download/.
Backend
Matplotlib uses a backend to render the plots. A backend is the layer that actually draws the graphs. There are two types of backends, interactive and non-interactive. Interactive backends display the figure in a graphical user interface, which allows us to pan and zoom the figure. Non-interactive backends are used to produce image files. Matplotlib supports several backends; the two most relevant ones when working in a Jupyter Notebook are described below.
The Jupyter Notebook supports the 'inline' backend. With this backend, the output of plotting commands is displayed inline, that is, directly below the code cell that produced it. The inline backend renders a static, standalone plot. The resulting plots are stored in the notebook document.
The ‘inline’ backend can be invoked using the following command: %matplotlib inline
The Jupyter notebook also supports the ‘notebook’ backend which
renders an interactive plot. Just below the plot, we can find a toolbar
to switch views, pan, zoom and download options.
The ‘notebook’ backend can be invoked using the following command: %matplotlib notebook
Basic Plotting with Matplotlib
%matplotlib inline
from matplotlib import pyplot as plt
plt.plot([1,2,3,4,5],[1,2,3,4,5])
plt.show()
%matplotlib notebook
from matplotlib import pyplot as plt
plt.plot([1,2,3,4,5],[5,4,3,2,1])
plt.show()
This is the most critical component of neural networks. In this section, we will learn about both forward propagation and backward propagation, and the math behind back propagation.
Perceptron
This is the most fundamental type of element in a neural network. We
have already seen what a perceptron is in the basics of neural networks
section. However, we just scratched the surface. In this section, we
will explore a perceptron in detail and explore a couple of simple
problems it can solve.
Linearly Separable data
By definition, a perceptron can only solve linearly separable problems. What is a linearly separable problem? Here are a couple of examples that show linearly separable data. For example, two of the iris species are linearly separable by a hyperplane (in this case a single line). Similarly, an OR gate is also an example of a linearly separable dataset.
# Visualize an OR gate
import numpy as np
# OR gate data
x = np.array([[1,0],
              [0,1],
              [0,0],
              [1,1]])
y = np.array([1,1,0,1])
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x[:,0],x[:,1],c=y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("OR gate")
Text(0.5, 1.0, 'OR gate')
# Visualize just 2 species (setosa, versicolor) that are linearly separable
# using the predictors (Sepel Length, Sepal, Width)
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
# iris data is readily available as a sklearn dataset.
from sklearn import datasets
iris = datasets.load_iris()
data = iris.data
# visualize just the first 100 rows (so that it contains only the species setosa and versicolor)
# We are specifically not plotting the third species (virginica), because it is not
# linearly separable.
plt.scatter(data[0:100,0],data[0:100,1],c=iris.target[0:100])
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.title("iris species - Setosa, Versicolor")
plt.savefig("iris.png")
Now that we have an understanding of the data, let’s use gradient descent to solve for the weights.
Activation function
What we are essentially trying to do is to find out values for weights and bias in such a way that
Here is how this function would look like.
# Show how a binary step function looks like.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace (-5,5,100)
y = np.zeros(len(x))
y[x>=0] = 1
y[x<0] = 0
plt.scatter(x,y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Activation function - Binary step function ")
Text(0.5, 1.0, 'Activation function - Binary step function ')
Steps
Let's solve the OR gate problem (or any other linearly separable problem) using a simple, single-layer perceptron. Let's start with the data first. Here is a quick high-level overview of the steps involved: initialize the weights and bias randomly; for each row, run one cycle of forward propagation to get a prediction; run one cycle of backward propagation to update the weights and bias; and repeat this over the whole dataset for a number of epochs.
# OR gate data
x = np.array([[1,0],
              [0,1],
              [0,0],
              [1,1]])
y = np.array([1,1,0,1])
Cost function
What about back propagation ? This is where gradient descent comes in
(along with its paraphernelia of partial derivatives, learning rate,
cost function etc). There are a couple of options to calculate the cost
function (residual sum of squares & cross entropy), but for now,
let’s just use the residual sum of squares (RSS) cost function. We have
already seen this in the gradient descent section.
Partial Derivatives
Now, we want to work out the derivative of the cost function with respect to each of the variables (the weights and the bias). For a single row, the partial derivative of the RSS cost with respect to weight 1 (w1) is -(y - y_hat) * x1; similarly, the partial derivatives with respect to weight 2 (w2) and the bias (b) are -(y - y_hat) * x2 and -(y - y_hat) respectively.
Update Rules
Once we have the partial derivatives, we can update the weights and bias: each weight moves by w = w - alpha * (dC/dw), and the bias by b = b - alpha * (dC/db). These equations can rattle any ML engineer, but remember, all of this is left to the library (TensorFlow or any other deep learning library) to compute. The only reason we are learning the math and hand-coding this in Python is to get an in-depth understanding of back propagation. This is absolutely essential to be a good ML engineer.
Forward Propagation
Forward propagation is a relatively easy step. Let’s write a quick function for it .
def forward_prop(row) :
    y_hat = np.dot(x[row],w) + b
    if y_hat > 0 :
        return 1
    else :
        return 0
Backward Propagation
Let’s now write a function for back propagation using all the geeky stuff above in “update rules” section.
w = np.random.normal(size=2)
b = np.random.normal()
# learning rate. This is exactly the same term that we have already learnt in gradient descent.
alpha = 0.01
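The back propagation function itself is not shown in the text, so here is a minimal sketch consistent with the update rules described above; it works on the global x, y, w, b and alpha defined so far.
def backward_prop(y_hat,row) :
    global w, b
    # error for this row
    error = y[row] - y_hat
    # partial derivatives of the RSS cost w.r.t the weights and the bias
    w_der = - error * x[row]
    b_der = - error
    # update rules
    w = w - alpha * w_der
    b = b - alpha * b_der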
Predict Function
As of now, we are working on global variables to make things simple.
Later, we will make a class out of all this to make things easy going
forward. Just one more function to go, before we set this in motion.
Assuming the model is in place (which we are going to write in a
minute), we also need a function to predict a y
value, right ? Just like any Machine Learning algorithm, we need a predict ( )
method. Once the model fits the data to the right set of weights, this
one is very easy. All we have to do is run the data through one forward
propagation cycle.
# return the predicted y_hat, for the test data set.
def predict(x) :
    y = []
    # the user could be sending multiple rows. compute y_hat for each of the rows in the test dataset.
    for row in x :
        # weighted sum
        y_pred = np.dot(row,w) + b
        # run the weighted sum through the activation function.
        if y_pred > 0 :
            y_pred = 1
        else :
            y_pred = 0
        # append the predicted y (y_hat) to an array
        y.append(y_pred)
    # return the predicted array of y_hat values for the corresponding test data (x)
    return y
Training
The individual pieces of the simple perceptron have been coded. Now we need to write the logic to take the input data and, for each row:
do one cycle of forward propagation
do one cycle of backward propagation and update the weights and bias
This completes one pass over the input data. In deep learning, this is called an epoch. We need to repeat the entire process for a whole bunch of epochs.
Let’s write the logic for this.
# number of epochs
for epoch in range(1000) :
    # for each row in x (cycle through the dataset)
    for row in range(x.shape[0]) :
        # for each row in x, predict y_hat
        y_hat = forward_prop(row)
        # for each row calculate weights
        backward_prop(y_hat,row)
print ( w, b)
[0.01363271 0.25196752] -0.009751486705392132
Predict
It is time to test our network. Let’s quickly print out x and y.
x
array([[1, 0],
[0, 1],
[0, 0],
[1, 1]])
y
array([1, 1, 0, 1])
Since this is a small dataset, we don’t need a confusion matrix to
calculate the accuracy. Let’s just use the predict function on the x
array to predict y.
predict(x)
[1, 1, 0, 1]
That's a perfect match. This is a small dataset, though. Let's look at a slightly larger dataset and see if the perceptron is good enough to do linear separation. Let's pick up the iris dataset from scikit-learn's built-in datasets.
from sklearn import datasets
iris = datasets.load_iris()
data = iris.data
All of this data is not linearly separable. For example, if you plot
the species against the sepal length and width, the species – versicolor
and virginica are muddled. Only the first species (setosa) is clearly
separated.
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(data[:,0],data[:,1],c=iris.target)
So, let’s just use the first two species. Let’s plot it again.
# visualize just the first 100 rows (so that it contains only the species setosa and versicolor)
# We are specifically not plotting the third species (virginica), because it is not
# linearly separable.
plt.scatter(data[0:100,0],data[0:100,1],c=iris.target[0:100])
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.title("iris species - Setosa, Versicolor")
plt.savefig("iris.png")
Now, we have a clear, linear separation. Let’s train our perceptron on this data and see if it works.
x = data[0:100,0:2] # iris sepal data ( sepal length and width )
y = iris.target[0:100] # iris species data (only setosa and versicolor)
w = np.random.normal(size=2)
b = np.random.normal()
# learning rate
alpha = 0.01
# number of epochs
for epoch in range(1000) :
    # for each row in x
    for row in range(x.shape[0]) :
        # for each row in x, predict y_hat
        y_hat = forward_prop(row)
        # for each row calculate weights
        backward_prop(y_hat,row)
print ( w, b)
[ 0.80437979 -1.08684544] -1.0479456545593953
We can very well do a confusion matrix to check for accuracy.
y_pred = predict(x)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
print ( confusion_matrix(y,y_pred) )
print ( accuracy_score(y,y_pred))
[[49 1]
[ 0 50]]
0.99
That's a pretty good accuracy – 99%. It is a bit more interesting to see this visually, using Matplotlib's meshgrid.
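Here is a rough sketch of such a visualization, assuming the x, y, w, b and predict() defined above; the grid ranges are arbitrary choices that cover the sepal measurements.
import numpy as np
import matplotlib.pyplot as plt
# evaluate the perceptron on a grid of (sepal length, sepal width) points
xx, yy = np.meshgrid(np.arange(4.0, 7.5, 0.02), np.arange(1.5, 5.0, 0.02))
grid = np.c_[xx.ravel(), yy.ravel()]
zz = np.array(predict(grid)).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.3)     # decision regions
plt.scatter(x[:,0], x[:,1], c=y)        # the training points
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.show()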
In simple terms, Gradient Descent is an algorithm to compute the minimum of a function.
OK – so, what is the big deal? Well, most machine learning algorithms have a cost function that needs to be minimized, and a good algorithm usually comes with a simple, inclusive cost function. Once a cost function is defined, it is just a matter of solving for its minimum to arrive at the solution. That is why Gradient Descent is extremely useful in the context of machine learning. Let's see an example.
Since the problem is simple enough to be solved mathematically, we have used the OLS (Ordinary Least Squares) technique to fit a straight line to the Linear Regression problem. You can view the equation for Ordinary Least Square to solve linear regression here. What is the cost function in this case?
Cost function = Sum of Squares of Residuals
The mathematical solution that minimizes this cost function, as derived by OLS, is m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b = ȳ − m·x̄, where x̄ represents the average of x and ȳ represents the average of y.
However, when the number of independent variables increases, OLS is not a good solution. That is where Gradient Descent shines. While OLS is an analytical solution, Gradient Descent is a numerical solution. However, to understand Gradient Descent, we have to be conversant with the following concepts in math.
Derivatives
Partial Derivatives
Math
Derivatives
A derivative is the slope of a function. Let's take a simple straight line, say y = 2x. A simple dataset for this could be
x = Number of DNA samples
y = Number of DNA pairs
Let's plot a sample dataset and try to compute the slope.
Let’s plot a sample dataset and try to compute the slope.
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
x = np.array([1,2,3,4,5,6,7,8,9,10])
y = x * 2
print ( "x = ",x)
print ( "y = ",y)
plt.plot(x,y)
plt.plot(x[1], y[1], marker='o', markersize=10, color="red")
plt.plot(x[3], y[3], marker='o', markersize=10, color="red")
plt.hlines(y=y[1], xmin=x[1], xmax=x[3], color='b')
plt.vlines(x=x[3], ymin=y[1], ymax=y[3], color='b')
plt.text(4.2,5,(y[3] - y[1]))
plt.text(3,3,(x[3] - x[1]))
For a curved function, however – say f(x) = x² – the slope is not constant when measured with the same rise-over-run metric we used above; the slope changes with x. A more correct way to define the slope (or derivative) is to take an infinitesimally small increase in x, take the corresponding change in y, and divide them as before. Mathematically, if f(x) is a function of x, the derivative is the limit of ( f(x + Δ) − f(x) ) / Δ as Δ goes to 0.
For example, if x = 4, increase x by a very small amount, say Δ = 0.0001, compute the corresponding change in y, and plug both into the equation above.
x = 4
dx = 0.0001
dy = (x + dx) ** 2 - x ** 2
print ( dy / dx )    # ~8.0, which is 2 * x
So, the derivative of f(x) = x² is 2x. We have not derived this mathematically – instead, we are trying to understand, with numbers, how a derivative works.
Derivative represents the change in the value of a function with respect to the variable (with which the derivative is being applied)
Partial Derivatives
Partial derivatives are almost the same as regular derivatives, except that a partial derivative is taken with respect to one particular variable while the others are held constant. For example, say the speed of a car depends on
engine RPM
slope of the road
You can write this as z = f(x, y), where z is the speed, x is the engine RPM and y is the road slope. Now, how does the speed (z) of the car vary with a unit increase in the engine RPM? Suppose, in this example, the answer is 8 – pretty straightforward. That is represented mathematically as the partial derivative ∂z/∂x = 8.
Let's take another example – the equation of a 2-d plane in 3-d space, which can be written in the general form z = a·x + b·y + c. In the plane pictured here, the plane intersects the z-axis at 2 (the constant term, where the values of x and y are 0). Now, how far does the function vary with a unit variation in x? Once again, take the intuitive meaning out of this: for a unit change in x, the function changes by the coefficient of x in the direction of x – that is a partial derivative.
A plane is simple to understand. However, the interpretation would be
the same even if it were a complicated curve in a 3-d space – Like a
hill.
Gradient Descent for Linear Regression
Now that we understand derivatives (both regular and partial), we are ready to graduate to Gradient Descent. Imagine a set of data points like so.
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
x = np.array([1,3,4,5,6,7,8,10])
y = np.array([4,3,7,7,8,10,8,11])
plt.scatter(x,y)
Say we want to fit a straight line to this dataset using Linear Regression. How would you do it? Very simple – fit scikit-learn's LinearRegression model and draw the fitted line.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x.reshape(-1,1), y)
# end points of the fitted line, used to draw it across the plot
point_1 = model.predict([[0]])[0]
point_2 = model.predict([[15]])[0]
plt.scatter( x,y,alpha=0.5)
plt.plot([0,15], [point_1,point_2],color="red")
y_actual = y
y_predicted = model.predict(x.reshape(-1,1))
for index,x_count in enumerate(x) :
    if y_actual[index] > y_predicted[index] :
        plt.vlines(x=x_count, ymin=y_predicted[index], ymax=y_actual[index], color='b')
    if y_actual[index] <= y_predicted[index] :
        plt.vlines(x=x_count, ymin=y_actual[index], ymax=y_predicted[index], color='b')
The blue lines represent the residuals (or errors). We can calculate
the slope (and intercept) of the fit using OLS (Ordinary Least Squares)
or using Gradient Descent. We already know how OLS works in Linear Regression.
We will see how Gradient Descent works. The equation for a straight line that would fit all the data points is some variation of y = m·x + b
where
m = slope
b = intercept
Either way, we are minimizing the Sum of Squares of Errors. We started out with the definition of this at the beginning of the chapter.
Cost Function
Just to make things simple, assume the value of the intercept (b) to be fixed at 2.6 (b = 2.6, as we have previously solved for it). Imagine that we chart the cost function, J(m) = (1/2n)·Σ (yᵢ − m·xᵢ − 2.6)², for different values of the slope (m).
n = len(y)
cost_function = []
m = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0]
for slope in m :
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y_actual[i] - slope * x[i] - 2.6) ** 2 )
    cost_function.append(cost)
plt.scatter(m,cost_function)
Visually, we can eyeball the minimum value of the cost function to be somewhere around 0.8. This is in line with the scikit-learn LinearRegression model that we fit above.
But how to mathematically solve for this (without using Ordinary Least Squares) ? That is where Gradient Descent comes in.
Gradient Descent is a technique to find out the minimum of a function numerically.
Imagine a ball put at a random location on the cost curve shown above.
If you were to let the ball go, it would roll down to the bottom. Why
does this happen ? Gravity moves the ball from a higher energy state to
a lower energy state. You already know this. What is more interesting
is the path it takes. The path should always be from a position of
higher slope to a position of lower slope.
If you look at the slope of the ball at each of the four positions highlighted above, it is clear that the slope (dashed line) decreases with every move down the curve. The slope represents the derivative of the curve – in this case, the derivative of the cost function with respect to the slope m (the x-axis).
How much do you move by ?
The amount and the direction you move is controlled by how much the cost function changes.
How much we move is based on how fast the cost function changes with
the slope (or intercept). The way to learn that is by finding out the
derivative of the cost function with respect to the slope (and
intercept). For now, just to make things simple and to be able to view
things in 2D, we are only keeping the slope as the variable (and the
intercept as constant). We will see in the next section how to work on
minimizing the cost function for both slope and intercept.
x = x.astype(float)
y = y.astype(float)
print ( x )
print ( y )
steps = 5
m = 0
n = len(x)
l_rate = 0.0001
# Start Gradient Descent
for step in range(steps) :
    y_pred = m * x + 2.6
    # Derivative of the cost function w.r.t m
    m_der = (-1/n) * sum( (y - y_pred) * x)
    # move m (note: the learning rate is deliberately not applied yet)
    m = m - m_der
    print ( m)
The value of m oscillates hugely. That is because the derivative gives us a general sense of direction, but taking the full step doesn't tell us how far we need to go. We can't travel an arbitrarily large distance in that direction; we need to take baby steps.
Take a small step, evaluate slope, take another small
step in the direction of the least slope. This is the essence of
Gradient Descent.
It is like a baby learning to take steps. So, there is a concept called learning rate that controls how far we move in the direction of the most descent. Let’s rewrite the program with learning rate.
import time
l_rate = 0.001
steps = 1000
m = 0
m_array = []
# Start Gradient Descent
for step in range(steps) :
    y_pred = m * x + 2.6
    # Derivative of the cost function w.r.t m
    m_der = (-1/n) * sum( (y - y_pred) * x)
    # move m
    m = m - l_rate * m_der
    m_array.append(m)
print ( "optimum slope (m) = ", m)
optimum slope (m) = 0.8453333333333318
Let’s plot the journey of the ball down the cost curve.
# Cost Function
n = len(y)
y_actual = y
cost_function = []
m = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0]
for slope in m :
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y_actual[i] - slope * x[i] - 2.6) ** 2 )
    cost_function.append(cost)

# Steps taken
n = len(y)
cost_function_m = []
m_steps = m_array
for slope in m_steps :
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y_actual[i] - slope * x[i] - 2.6) ** 2 )
    cost_function_m.append(cost)

plt.scatter(m_steps,cost_function_m) # steps taken.
plt.scatter(m,cost_function) # cost function
plt.xlabel("m - slope")
plt.ylabel("cost function")
Text(0, 0.5, 'cost function')
As you can see, the ball takes a large number of small steps to make its way down to the minimum. That is because of the small learning rate of 0.001.
Learning Rate
How fast do you go down the path ? It depends on how fast you want to
converge (without overshooting). An arbitrary parameter called learning
rate( α ) would determine how fast you go down the path. If you want to
converge fast, can you increase the learning rate ? Probably not. Here
is why.
If you set a learning rate = 0.1, that is roughly how fast you move
along the x-axis. However, if you set the learning rate to 0.7 (thinking
you could move down the curve faster), here is what would happen – You
essentially miss the minimum.
Here is a quick plot of how the ball moves with a learning rate of
0.05 within just 100 iterations. The ball is going back and forth
because it is overshooting. However, it finally settles down at the
minimum.
Optimize Gradient Descent for both Slope & Intercept
So far, we have optimized Gradient Descent for just the slope. How about the intercept? The moment we introduce the second parameter – the intercept – the cost function becomes 3-d.
x-axis = slope
y-axis = intercept
z-axis = cost function
Matplotlib provides a rudimentary 3-d scatter plot. We will be using it to plot the cost function in 3-d.
import matplotlib.pyplot as plt
import numpy as np
x = np.array([1,3,4,5,6,7,8,10])
y = np.array([4,3,7,7,8,10,8,11])
# This import registers the 3D projection, but is otherwise unused.
from mpl_toolkits.mplot3d import Axes3D # noqa: F401 unused import
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
slope_values = np.arange(start=0,stop=5,step=0.05)
intercept_values = np.arange(start=0,stop=5,step=0.05)
# y_pred = slope * x + intercept
n = len(y)
cost_function = []
for index, slope in enumerate(slope_values) :
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y[i] - slope_values[index] * x[i] - intercept_values[index]) ** 2 )
    cost_function.append(cost)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(slope_values, intercept_values, cost_function,marker='o')
plt.show()
The Jupyter Notebook's inline backend doesn't allow 3-d rotation, but if you try this in your standard IDE (say VS Code), you would be able to rotate the figure and get a proper 3-d look at the plot.
Now, let's optimize the cost function for both the slope and the intercept. The partial derivatives of the cost function with respect to the slope (m) and the intercept (b) are ∂J/∂m = −(1/n)·Σ (yᵢ − ŷᵢ)·xᵢ and ∂J/∂b = −(1/n)·Σ (yᵢ − ŷᵢ), which is exactly what the code below computes.
x = np.array([1,3,4,5,6,7,8,10])
y = np.array([4,3,7,7,8,10,8,11])
l_rate = 0.01 # Learning rate
steps = 4000 # number of iterations ( steps )
m = 0 # initial slope
b = 0 # initial intercept
n = float(len(x))
m_array = []
b_array = []
# Start Gradient Descent
for step in range(steps) :
    y_pred = m * x + b
    # Derivative of the cost function w.r.t slope (m)
    m_der = (-1/n) * sum( (y - y_pred) * x)
    # Derivative of the cost function w.r.t intercept (b)
    b_der = (-1/n) * sum( y-y_pred )
    # move m and b
    m = m - l_rate * m_der
    b = b - l_rate * b_der
    # gather the slope and intercept in an array to plot later
    m_array.append(m)
    b_array.append(b)
print ( "optimum slope (m) = ", m)
print ( "optimum intercept (b) = ", b)
slope_values = np.arange(start=0,stop=3,step=0.05)
intercept_values = np.arange(start=0,stop=3,step=0.05)
n = len(y)
cost_function = []
for index, slope in enumerate(slope_values) :
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y[i] - slope_values[index] * x[i] - intercept_values[index]) ** 2 )
    cost_function.append(cost)

slope_values_new = m_array
intercept_values_new = b_array
cost_function_new = []
for index, slope in enumerate(slope_values_new) :
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y[i] - slope_values_new[index] * x[i] - intercept_values_new[index]) ** 2 )
    cost_function_new.append(cost)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(slope_values, intercept_values, cost_function,marker='o')
ax.scatter(slope_values_new, intercept_values_new, cost_function_new,marker='o')
plt.show()
Stochastic Gradient Descent
For every iteration in Gradient Descent algo, the entire dataset is
used to calculate the derivative of the cost function. This is very
expensive if the dataset gets larger.
Imagine real world problems like image processing that has millions
of pixels in a single image. Gradient Descent becomes almost impossible
to compute if we don’t optimize.
One possible solution is to use Stochastic Gradient Descent. The word Stochastic stands for random.
Instead of using every observation (row in the dataset), just use one random observation each time the derivative is computed. In standard Gradient Descent, the derivative of the cost function w.r.t. the slope is computed over all n rows: ∂J/∂m = −(1/n)·Σ (yᵢ − ŷᵢ)·xᵢ. In Stochastic Gradient Descent it is approximated from a single row: ∂J/∂m ≈ −(yᵢ − ŷᵢ)·xᵢ, where i is the index of a random data row.
In Stochastic Gradient Descent, the cost function is not calculated over the entire training set. Instead, you pick a random row in the dataset and calculate the cost (and its derivative) for that particular row only. Surprisingly, this gives pretty good results (given the compromise), and it is computationally much more efficient than using the full dataset.
Let’s do this in python.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
l_rate = 0.001
steps = 1000
m = 0
x = np.array([1,3,4,5,6,7,8,10])
y = np.array([4,3,7,7,8,10,8,11])
n = len(x)
m_array = []
cost_function_m = []
# Start Stochastic Gradient Descent
for step in range(steps) :
    # CHANGE - At every step, pick a new random row
    random_index = np.random.randint(0,len(x))
    # CHANGE - only calculate the predicted "y" value for that particular data row
    y_pred = m * x[random_index] + 2.6
    # Derivative of the cost function w.r.t m
    # CHANGE - calculate the derivative using only that particular row of the data
    m_der = - (y[random_index] - y_pred) * x[random_index]
    # move m
    m = m - l_rate * m_der
    m_array.append(m)

m_steps = m_array
for slope in m_steps :
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y[i] - slope * x[i] - 2.6) ** 2 )
    cost_function_m.append(cost)
plt.scatter(m_steps,cost_function_m) # steps taken.
plt.xlabel("m - slope")
plt.ylabel("cost function")
print ( "optimum slope (m) = ", m)
optimum slope (m) = 0.8793576691371693
That's pretty close to the real value (as calculated by OLS), right? For just 8 observations this is not a big deal, but imagine the performance gain if the number of rows were extremely large, as happens in real datasets. However, there is a cost to this trade-off: the solution (the optimum slope in this case) varies with each run. For example, try running the code above 4 or 5 times – each time you get a slightly different answer. Although the difference is not large, stochastic gradient descent still results in a slightly different solution on every run. How do we counter this? A compromise between standard gradient descent and stochastic gradient descent is possible – it is called mini-batch gradient descent.
Mini-batch Gradient Descent
In practice, a technique called mini-batch Gradient Descent is what is mostly used. It is a hybrid between standard gradient descent and stochastic gradient descent: instead of a single random row, a small random batch of rows is used at every step. The following picture highlights the difference between the standard, stochastic and mini-batch gradient descent methods.
Let's program mini-batch Gradient Descent in Python.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
l_rate = 0.001
steps = 1000
m = 0
x = np.array([1,3,4,5,6,7,8,10])
y = np.array([4,3,7,7,8,10,8,11])
n = len(x)
m_array = []
cost_function_m = []
x_range = np.arange(len(x))
# Start Mini-batch Gradient Descent
for step in range(steps) :
    # CHANGE - At every step, pick a new random batch of 3 rows
    random_index = np.random.choice(x_range, size=3, replace=False)
    # CHANGE - only calculate the predicted "y" values for the rows in the batch
    y_pred = m * x[random_index] + 2.6
    # Derivative of the cost function w.r.t m
    # CHANGE - calculate the derivative using only the rows in the batch
    m_der = (-1/n) * sum( (y[random_index] - y_pred) * x[random_index])
    # move m
    m = m - l_rate * m_der
    m_array.append(m)

m_steps = m_array
for slope in m_steps :
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y[i] - slope * x[i] - 2.6) ** 2 )
    cost_function_m.append(cost)
plt.scatter(m_steps,cost_function_m) # steps taken.
plt.xlabel("m - slope")
plt.ylabel("cost function")
print ( "optimum slope (m) = ", m)
optimum slope (m) = 0.8472922870088795
This time the slope value is fairly steady across runs. Mini-batch gradient descent achieves a compromise between the time-consuming but accurate standard Gradient Descent and the quick but slightly noisy Stochastic Gradient Descent.
Gradient Descent is a generic cost-minimization algorithm that works as long as the cost function is convex; if there are multiple minima, Gradient Descent may only arrive at a local minimum.
In order to do that, we have to first understand 3 things
the Logistic Regression equation
the cost function
the partial derivatives of the cost function w.r.t. the weights and bias
The equation for logistic regression is ŷ = σ(w·x + b) = 1 / (1 + e^−(w·x + b))
where
w is a vector of weights
b is a number (the bias)
x is a vector of predictors
We haven't seen w and b in the Linear Regression that we saw previously, right? Where have they sprung from? Well, we are just generalizing for the number of predictors a bit. For a single predictor, we can write the Logistic Regression as ŷ = 1 / (1 + e^−(m·x + b))
where
m = slope
b = intercept
The output of the Logistic Regression is actually a sigmoid curve.
You can get a visual like so.
from scipy.special import expit
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(-10,10)
y = expit (x)
plt.plot(x,y)
plt.grid()
The cost function for Logistic Regression could be formulated just like the one for Linear Regression (sum of squares of errors). However, that results in a non-convex curve for logistic regression. So, instead of the sum of squares of errors, Logistic Regression uses cross entropy for its cost function: J(w,b) = −(1/n)·Σ [ yᵢ·log(ŷᵢ) + (1 − yᵢ)·log(1 − ŷᵢ) ]. The partial derivatives of this cost function w.r.t. w and b are ∂J/∂w = −(1/n)·Σ (yᵢ − ŷᵢ)·xᵢ and ∂J/∂b = −(1/n)·Σ (yᵢ − ŷᵢ).
Now that we have the 3 required things, let's write our gradient descent learning algorithm: repeatedly update w = w − α·(∂J/∂w) and b = b − α·(∂J/∂b), where α is the learning rate. For a single predictor, thinking in terms of slope and intercept (like the Linear Regression above), this becomes m = m − α·(−1/n)·Σ (yᵢ − ŷᵢ)·xᵢ, where the sum runs over the i = 1…n rows in the dataset.
Now, let’s implement Logistic Regression in Python. Let’s take a simple dataset.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.array([1,3,4,5,6,7,8,10])
y = np.array([0,0,0,1,1,1,1,1])
plt.scatter(x,y,c=y)
Seems simple enough, right ? Let’s first get a baseline using Scikit Learn’s LogisticRegression model.
from sklearn import linear_model
from scipy.special import expit
model_lr = linear_model.LogisticRegression(C=1e5, solver='lbfgs')
model_lr.fit(x.reshape(-1,1), y)
x_test = np.linspace(1.0,10.0,100)
# predict dummy y_test data based on the logistic model
y_test = x_test * model_lr.coef_ + model_lr.intercept_
sigmoid = expit(y_test)
plt.scatter(x,y, c=y)
# ravel to convert the 2-d array to a flat array
plt.plot(x_test,sigmoid.ravel(),c="green", label = "logistic fit")
plt.yticks([0, 0.2, 0.4, 0.5, 0.6, 0.7, 1])
plt.axhline(.5, color="red", label="cutoff")
plt.legend(loc="lower right")
Let's try and fit this data using Logistic Regression based on Gradient Descent.
model_lr.intercept_
array([-67.57978497])
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

x = np.array([1,3,4,5,6,7,8,10])
y = np.array([0,0,0,1,1,1,1,1])
n = len(x)

l_rate = 1
steps = 1000
m = 0
m_array = []
# Start Gradient Descent
for step in range(steps) :
    # linear part, with the intercept fixed at the value found by scikit-learn (-67.57)
    y_pred_log = m * x + (-67.57)
    # sigmoid of the linear part
    y_pred = 1/(1 + np.exp(-y_pred_log))
    # Derivative of the cost function w.r.t m
    m_der = (-1/n) * sum( (y - y_pred) * x)
    # move m
    m = m - l_rate * m_der
    m_array.append(m)
print ( "optimum slope (m) = ", m)
optimum slope (m) = 15.017184591647284
Once again, we got pretty close to the slope as predicted by Scikit Learn’s LogisticRegression model.
model_lr.coef_
array([[15.03246322]])
Now that we understand how Gradient Descent works, let's move on to the next important topic in Neural Networks – Back Propagation.
A Neural network is an interconnected set of Neurons, arranged in layers. Input goes on one end and output the other end.
For example, the picture above is a neural network with 4 nodes in
the input layer and 3 nodes in the output layer. This is the exact
structure that we have used for iris classification that we have solved
in our Hello World example on Day 1. The layer in between is called the
hidden layer. This is what gives the name – Deep Learning – because the
network is deep (with not just the input and output layer, but one or
many hidden layers).
This is the basic structure of a neural network. The number of nodes
or layers could change, but this is the basic structure of a typical
neural network. To understand a neural network better, we have to get
started from the basics.
Biological Neuron
Neural Network was inspired by the brain. A human brain consists of
billions of neurons that are interconnected. Here is a quick picture
from wikipedia.
x1, x2, …, xn represent the inputs and y1, y2, …, yn are the outputs. So, essentially, a neuron transforms a set of inputs into a set of outputs. When many such neurons are connected, they form an intelligent system.
Perceptron
The simplest way to represent a neuron mathematically is with a perceptron.
A perceptron receives inputs, adds them up and produces an output. What is the big deal about it? It is just basic addition, right? True – and that is where the concept of weights comes in.
Each of the inputs is multiplied by a weight. So, instead of just summing up the inputs, you multiply each input by its weight and sum them up (a weighted sum of inputs). The weighted sum could be a number within a very large range, depending on the input range and the weights. What is the use of having a number that could be anywhere from −∞ to +∞? To turn this into a usable output, a bias or threshold is introduced: the perceptron outputs 1 if the weighted sum crosses the threshold and 0 otherwise.
What does a perceptron achieve
The calculation above seems simple enough, but what exactly does it achieve? Think of it like a decision-making machine: it weighs the input parameters and produces a Yes or No decision. For example, say you want to decide whether or not to learn Deep Learning – how do you go about it in your mind?
You weigh your inputs (multiply the inputs by the corresponding weightage) and arrive at a figure. In fact, each of these inputs is also given a number internally in your mind. However, the way a human brain functions is far more complicated. Like I said before, neural networks & deep learning are just "based on" how the human brain works; they are not an exact replica.
Sigmoid Neuron
While a perceptron is good enough for simple tasks, it has its limitations when building complex neural networks. That is where sigmoid neurons come in. If you have seen logistic regression in Machine Learning before, you will already have an idea of what a sigmoid function does: it essentially maps numbers in the range −∞ to +∞ to values between 0 and 1.
A perceptron outputs either a 0 or a 1 depending on the
weighted inputs & threshold. A sigmoid neuron outputs a value
between 0 and 1. This makes the sigmoid neuron much more useful in large
scale neural networks.
The weighted sum of inputs + bias is calculated just as above. Now, instead of outputting this value directly, a sigmoid neuron calculates the sigmoid of the weighted sum + bias and outputs a value between 0 and 1. You can have a visual of the sigmoid function as seen below.
from scipy.special import expit
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
x = np.linspace(-1000,1000)
y = expit (x)
plt.plot(x,y)
plt.grid()
This looks like a binary curve ( that can only take a value of 0 or
1), but if you closely observe the curve between a range of say -10 to
10, you can clearly observe the gradual progression.
x = np.linspace(-10,10)
y = expit (x)
plt.plot(x,y)
plt.grid()
This is the logistic regression curve. Only when the value of the (weighted sum + bias) stays very close to 0 do you observe the gradual logistic curve. For any extreme value, the output is pretty much either a 0 or a 1 (very much like a perceptron).
Advantages of Sigmoid Neuron over a Perceptron
Since the output range of a sigmoid neuron is smooth, small changes
in the inputs will result in small changes in the output. So, instead of
just doing a flip of the switch (0 or 1), sigmoid function acts more
like a slider. This feature of sigmoid functions output makes it very
useful for neural networks learning.
Changes to your output are essentially a function of
changes in the weights and biases. This is the basis of Neural Network
learning.
However, to understand this mathematically, we have to understand a little bit of derivatives and partial derivatives, and then the actual optimization algorithm behind back-propagation – Gradient Descent. These will be the topic of our next chapter.
Many times, the MNIST database (a database of handwritten digits) is typically the Hello World
application when introducing Neural Networks for the first time.
However, we are going to make it even simpler by taking the iris dataset
and create a keras based tensorflow Neural Network to classify species.
Please refer to Iris Data to understand more about the dataset we will be working on. You can also refer to Classification in Python to understand more about a non-neural network based classification approach to classifying the species in the iris dataset.
Once you understand how to solve the iris classification problem in
Neural Networks, we will move to image recognition. As you will see,
structurally there is not a lot of difference in the way we build the
neural net for both of these problems.
This is just a “Hello World” tutorial. It is not intended to teach you the internals of Neural Networks. With that background, we are now ready to say hello to Neural Networks using Tensorflow.
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Tensorflow is probably the most popular, open source library from
Google that is used to implement Deep Learning. You can build neural
networks of any complexity using Tensorflow. However, building a neural
net from scratch typically involves defining
Layers
Linking the layers
loss function
weight adjustments etc
Defining these manually is very time consuming and daunting for
newbies. What is needed is an abstract layer above Tensorflow, that
makes building neural nets much quicker and easier.
Keras is the answer. Keras is a high level Python based API that can
be used to build neural nets by leveraging Tensorflow. By the way,
Tensorflow is not the only deep learning package out there. Here is a
quick visual that shows you where Keras and Tensorflow stand in the
hierarchy.
# load iris dataset
from sklearn import datasets
iris = datasets.load_iris()
# preview the iris data
print ( iris.data[0:5,:] ) # data
print ( iris.target[0:5] ) # target species
# train/test split @ 20% test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data , iris.target, test_size=0.2)
The following 8 lines of code are all you need to solve the problem. Quickly execute it to see the output for yourself. However, there is quite a lot of explanation to be done here, so let's take it step by step.
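For reference, one possible version of those 8 lines is sketched below; the hidden-layer size (8 nodes) and the adam optimizer are assumptions rather than requirements, and each line is explained in the steps that follow.
model = keras.Sequential()
model.add(keras.layers.Dense(4,input_shape=(4,)))
model.add(keras.layers.Dense(8,activation='relu'))
model.add(keras.layers.Dense(3,activation='softmax'))
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])
model.fit(X_train,y_train,epochs=100)
y_pred = model.predict(X_test)
y_pred_class = np.argmax(y_pred,axis=1)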
Step 1 – What type of neural network are we building ?
model = keras.Sequential()
There are two types of Neural networks that can be build in Keras
Sequential
Functional
This classification is related to the structure of the neural network. Most of the time, however, you will be using the Sequential model; it can solve most problems. In a sequential neural net, neurons are arranged in layers and in sequence: the firing and wiring happen in sequence, hence the name. Later in the course, when we see an example of a functional neural net, the difference will become clear. Here is a quick visual of what we are building.
Finally, when the network is trained, the outputs corresponding to
the species (for the corresponding data points) will light-up. When we
look at the last step, we will understand what I meant by light-up.
Step 2 – How are the neurons connected ?
We are building a Dense neural network.
model.add(keras.layers.Dense(4,input_shape=(4,)))
A Dense neural network is one in which each neuron is connected to all other neurons in the previous and next layers.
You can see from the visual below that the arrows coming in to each
neuron are connected to all the neurons in the previous layer. This is
the most typical type of neural network.
Also, with this statement, we are just building the input layer. An
input layer with 4 nodes, one node for each of the inputs. Naturally,
the assumption at this point would be that there would be as many nodes
in the input layer as the number of inputs. So, why specify the input_shape
parameter again ? In later examples we will see that the input data
shape need not always match with the input nodes. We specify the
input_shape parameter as a tuple. In this case the input is a 1-d
vector. Once again, later in the course we will see examples of 2-d
data.
The parameter input_shape is only used when creating the first layer. The next set of steps (hidden layer and output layer) do not need this parameter.
Step 3 – Hidden Layers
This is where the magic happens. Let’s try it with just one hidden layer.
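A sketch of adding one hidden layer is shown below; the choice of 8 nodes is an assumption, not a requirement.
model.add(keras.layers.Dense(8,activation='relu'))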
Irrespective of the layer (input, hidden or output), the way to add layers is using the add function. That should make things easy for us. The new parameter that you see in the hidden layer is the activation parameter.
In this case, we are using a relu activation function. ReLU stands for Rectified Linear Unit. The mathematical definition of relu is relu(x) = max(0, x): positive values pass through unchanged and negative values are clipped to 0.
The output of the activation function looks like this.
Step 4 – Output Layer
After the hidden layer is added, we add the output layer. Since we are doing a multi-class classification, the preferred activation function is called softmax – more on this later. A softmax activation function outputs a probability for each class, and the class with the highest probability is the predicted output.
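A sketch of the output layer – 3 nodes, one per iris species:
model.add(keras.layers.Dense(3,activation='softmax'))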
So far, we have created the structure of the neural net – layer by
layer. At each step, we have defined the number of nodes and the
activation function to be used. Once we have completed it, we now have
to compile the model.
Step 5 – Compile the model
So far, we have just defined what the neural net should look like. With the compile ( )
method, Keras translates the parameters you have specified into an
optimized series of steps that can then be executed on the computer.
Without the compile step, you cannot fit (train) the model. We
will see how we can use metrics in a bit, but the optimizer and loss
parameters require quite a bit of explanation.
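For reference, here is the compile call (the same one used in the optimization step later in this section):
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',   # integer class labels, hence the sparse variant
              metrics=['accuracy'])                      # report accuracy during training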
Typically, a Machine Learning algorithm requires some kind of a loss
function to be minimized, and an optimizer, such as Gradient Descent (or
its adam variant that we use here), that does the minimizing. For classification problems, a common loss function is Cross Entropy, also called Log Loss. Mathematically, for 2 classes, the cross entropy for an actual label y and a predicted probability p is
loss = - ( y * log(p) + (1 - y) * log(1 - p) )
Let’s look at an example below. Say, we are just looking at 2 species of iris flowers.
0 – setosa
1 – virginica
Say the actual species is virginica (y = 1). If the model predicts virginica with a probability of only 0.2, the loss is high; the code below calculates the loss for a whole range of predicted probabilities.
import numpy as np
p = np.array([0.0001,0.001,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99])
# when y = 1, what is the loss function ?
y = 1
l = - (y * np.log10(p) + (1-y) * np.log10(1-p) )
# now plot it to see how the loss function decreases as the predicted value approaches the actual value (of y = 1)
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(p,l)
plt.xlabel("Different predicted values when the actual value is 1")
plt.ylabel("Loss function")
What this plot shows is that the more the predicted value deviates
from the actual value, the larger the loss. For example, as
the predicted value gets closer to the actual value (of 1 in this
case), the loss function gets closer and closer to 0.
At this point, you can see a quick summary of the model you have created so far
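A quick way to print that summary:
# prints each layer with its output shape and parameter count
model.summary()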
Step 6 – Train the model
This is where we train the model on the dataset. The word epoch
represents one complete iteration over the training dataset. With each
epoch (one pass over the entire dataset) the weights are adjusted and the
accuracy slowly increases. Since you have accuracy as a metric in step 5, it is shown at each training epoch. That way you see how the accuracy increases with each epoch.
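The training call itself is a single line (same as in the optimization code below):
# one epoch = one full pass over the training data; loss and accuracy are printed per epoch
model.fit(X_train, y_train, epochs=100)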
After finishing 100 epochs, the accuracy is around 69% – not bad for
our first attempt. We will enhance it shortly to 90% with just a small
change to the hidden layer.
Step 7 – Predict data
Now that the model is trained, you can start predicting on your test
data. This step is pretty straightforward if you have already used
sklearn to predict test data with any machine learning model.
Since this is a multi-class problem, what the neural net outputs are
probabilities; the class with the highest probability among the three is the
predicted value. So, we need to convert these probabilities back to class
indices.
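A minimal sketch of that conversion, consistent with the optimization code below:
y_pred = model.predict(X_test)             # one probability per class, shape (n_samples, 3)
y_pred_class = np.argmax(y_pred, axis=1)   # index of the highest probability = predicted class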
Step 8 – Evaluate the model
Since the output is categorical data, a quick confusion matrix will
show us how far off the model is. scikit-learn's confusion_matrix should do.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred_class)
print ( cm )
[[15 0 0]
[ 0 5 1]
[ 0 0 9]]
And of course the final number – accuracy.
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred_class)
0.9666666666666667
That said, the training accuracy above was only around 69%, which is pretty low by Machine Learning standards for this dataset. Let's optimize it.
Step 9 – Optimize Model
There are a couple of ways to optimize for higher accuracy. One way
is to increase the nodes in the hidden layer. Let’s try to increase the
number of nodes from 8 to 20 and see how the network performs.
model = keras.Sequential()
model.add(keras.layers.Dense(4,input_shape=(4,)))
# BEGIN change - increase the number of nodes from 8 to 20
model.add(keras.layers.Dense(20,activation="relu"))
# END change
model.add(keras.layers.Dense(3,activation="softmax"))
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100)
y_pred = model.predict(X_test)
y_pred_class = np.argmax(y_pred,axis=1)
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred_class)
We have now hit 90% accuracy. That's pretty close to what most
ML models would achieve on this dataset. Let's now go back to 8 nodes in the
hidden layer, but add one more hidden layer.
model = keras.Sequential()
model.add(keras.layers.Dense(4,input_shape=(4,)))
model.add(keras.layers.Dense(8,activation="relu"))
# BEGIN Change - add one more hidden layer
model.add(keras.layers.Dense(8,activation="relu"))
# END Change
model.add(keras.layers.Dense(3,activation="softmax"))
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100)
y_pred = model.predict(X_test)
y_pred_class = np.argmax(y_pred,axis=1)
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred_class)
That's 80% accuracy. The immediate question you might have is – how
should you choose the number of nodes or the number of hidden layers ?
Unfortunately, the learned weights and intermediate outputs are essentially a
black box to humans; we cannot make much sense out of them directly.
Choosing the size and complexity of a neural network
(like the number of nodes and the number of hidden layers) is more art
than science.
MNIST handwritten digits classification
If all we had to do with Neural Networks was classify iris data, we
wouldn't need Neural Networks to start with. We need a more
involved dataset to qualify as a "Hello World" program in Neural
Networks. Welcome the MNIST digits dataset. It is a dataset of
handwritten digit images that have been scanned, standardized and optimized for
machine learning. Tensorflow comes built-in with this dataset. Let's
quickly load it to see what these images look like.
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# load the data
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
# Show the first picture in the training dataset.
plt.figure()
plt.imshow(train_images[0])
It is a handwritten digit – 5. Although the image appears in color,
these are actually gray-scale images. All of the data is standardized
into 28×28 pixels, and each pixel has an intensity value between 0 and
255 (2^8 = 256 levels). Since this is a small image (just 28×28 pixels), we can write it to a csv file and view the numbers in Excel.
# write the first image to a csv file.
np.savetxt("image_1.csv",train_images[0],delimiter=",")
If you open the csv file in Excel, adjust the column width and
change the number format to zero decimals, you should see a picture like
this. Can you identify the digit 5 in there ?
Confirm that the image is in fact a 5.
# Print the first label
train_labels[0]
Prepare the model. The first layer in this case is slightly different
from the first layer in the iris example above. As you can see from the
input image data, it is a 28×28 numpy array. However, we are
going to be working with flat layers (a flat set of neurons in each layer). So,
the first layer will essentially be a 784-node layer (28 x 28 = 784)
that is created automatically by flattening the input array.
model = keras.Sequential([
keras.layers.Flatten(input_shape=(28, 28)),
keras.layers.Dense(128, activation=tf.nn.relu),
keras.layers.Dense(10, activation=tf.nn.softmax)
])
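Here is a minimal sketch of the compile/train/predict steps for this model. Scaling the pixel values to the 0-1 range and training for 5 epochs are assumptions on my part; the original run may have used different settings.
# scale pixel intensities from 0-255 down to 0-1 (assumed preprocessing)
train_images_scaled = train_images / 255.0
test_images_scaled = test_images / 255.0
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',   # labels are the integers 0-9
              metrics=['accuracy'])
model.fit(train_images_scaled, train_labels, epochs=5)  # epoch count is an assumption
predictions = model.predict(test_images_scaled)         # shape (10000, 10) - probabilities per digit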
The prediction for each image is a 10 x 1 vector of probabilities. The
value with the highest probability in the output array is the predicted
value. For example, let's check the label of the first image in the test set.
test_labels[0]
It is a 7. Let’s see how the probabilities are predicted.
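For instance, printing the first prediction should show the largest value at index 7 (the exact numbers will vary from run to run):
# probability the model assigns to each of the 10 digits for the first test image
print(predictions[0])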
You can see from the array above that the highest probability value
is at index 7. Let's apply numpy's argmax function to pick out the index
of the highest probability for each prediction ( argmax ( ) outputs the index of the highest element in a numpy array). For example, the first output array has an argmax of 7.
Let's apply argmax across the whole output array to get the predicted classes.
predicted_final = np.argmax(predictions,axis=1)
from sklearn.metrics import accuracy_score
accuracy_score(test_labels,predicted_final)
That’s a 95% accuracy on the test datasets.
Great !! Our "Hello World" of Neural Networks is complete. Over the next
couple of days we will focus on the moving parts of a neural network and
how Gradient Descent is used in Neural networks to optimize the
weights. This is how a neural network essentially learns.
SQL stands for Structured Query Language. It is the language of Relational databases. So, before we understand SQL, let's understand what a Relational database is. The next couple of pages introduce databases at a very high level.
What is a Database
A database is software that is used to store data. There are many
types of databases (relational, graph, document etc.) but we are going to
focus only on relational databases, as they are the most popular database format. Also, SQL is specifically used for relational databases.
Relational databases store data in tabular format. For example,
let's see how an employee's data could be structured in a relational
database.
This is how we humans would like to see data. However, storing large
amounts of data this way does not scale well, nor is it well suited
for fast reads and writes. That is where the concept of data
normalization comes in. Normalization is a subject in itself and we will
not be discussing it here. However, the process is pretty intuitive
overall. Here is how the same data would be structured in a good
database.
There are many databases in use today, for example MySQL, Oracle,
Microsoft SQL Server, PostgreSQL etc. In this chapter, we will be
focusing on the MySQL database. Here is a quick instruction set on How to install MySQL Database.
By default, a MySQL installation comes with very few data tables. MySQL provides a big employee
database (with 300K employee records) that you can download from the
web and load into your database. You can use it to practice data
extraction. The data above is taken from that employee database.
What is SQL
SQL is the language of relational databases. It can
be used to create, extract, manipulate and delete data in the database.
For example, here is a quick SQL to extract the first 10 rows of data
from a database table EMPLOYEES.
SELECT * FROM employees LIMIT 10;
The “*” indicates that all columns need to be selected. As you can
see, the syntax is pretty intuitive. It is deliberately designed to be
like English.
SQL Connectors
As discussed previously, there are a variety of databases, and each
database has a specific connector that you have to install.
Since we will be working with a MySQL database, we will install the MySQL
connector.
pip install mysql-connector
Typically, you will be given the database details like below.
Server Address : xx.xx.xx.xx
port : 33xx
schema : xxxx
user id : xxxx
password : xxxx
import mysql.connector
# connect to the MySQL server using the details provided above
db = mysql.connector.connect(
    host = "localhost",
    user = "ajaytech",
    password = "ajaytech"
)
db
Database interactions are typically handled using something called a cursor.
A cursor is just a pointer to a set of data retrieved from the
database. It is up to us to iterate over the retrieved data and get what
we want. Typically this is done using a loop. So, this is basically a 2
step process
Execute an SQL statement and get the result into a cursor
Iterate over the cursor to get the data
For example, let’s do these 2 steps to list all the databases. Each database is essentially a collection of tables.
Step 1 – Get the list of tables into a cursor
cur = db.cursor()
cur.execute("SHOW DATABASES")
Step 2 – Iterate over the cursor to get the list of databases
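The iteration is a plain loop over the cursor:
# each row returned by SHOW DATABASES is a 1-element tuple
for database in cur:
    print(database)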
Once we know the list of databases, we have to select one first. Once we do that, we can freely go about executing SELECT statements on that particular database.
cur.execute("use world")
List all the tables in the database.
cur.execute("show tables")
for table in cur:
print(table)
('city',)
('country',)
('countrylanguage',)
Let’s pick a table – say country. Now, let’s extract all the columns in that table. They will become the columns of our Pandas dataframe.
cur.execute("show columns from country")
column_names = []
for column in cur:
column_names.append(column[0])
column_names
Once we got the column names, let’s get the actual data from the table.
cur.execute("select * from country")
import pandas as pd
country_data = pd.DataFrame(columns = column_names)
rows = []
for data in cur:
rows.append(list(data))
country_data = pd.DataFrame(rows)
Great !!! We just need one last step before we finish turning this
table into a Pandas dataframe: set the column names that we already
extracted in a previous step.
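That last step is a one-liner:
# attach the column names extracted earlier to the dataframe
country_data.columns = column_names
country_data.head()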
Instead of going the cursor route, you can also
choose an ORM that abstracts away the metadata associated with the database
table. A detailed description of an ORM is once again a subject in
itself and we will not be discussing it today.
ORM stands for Object Relational Mapper. Instead of
writing SQL directly, an ORM lets us use an object-oriented paradigm to
extract data from a relational database. Just think of an ORM as an OO
wrapper around SQL. Here is a quick visual to show the same.
One such ORM is SQLAlchemy.
pip install SQLAlchemy
import pandas as pd
import mysql.connector
import sqlalchemy as db
engine = db.create_engine('mysql+pymysql://ajaytech:ajaytech@localhost:3306/world')
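The dataframe below was presumably produced with pandas' read_sql_table ( ) method; the exact call is an assumption here.
# read the country table through the SQLAlchemy engine; 'columns' restricts the result to two columns
df = pd.read_sql_table("country", engine, columns=["Code", "Name"])
df.head()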
Code Name
0 ABW Aruba
1 AFG Afghanistan
2 AGO Angola
3 AIA Anguilla
4 ALB Albania
If you want to go specifically with SQL, then you can use the read_sql_query ( ) method.
df = pd.read_sql_query("select code, name from country",engine)
df.head()
Code Name
0 ABW Aruba
1 AFG Afghanistan
2 AGO Angola
3 AIA Anguilla
4 ALB Albania
Essentially, the same result.
Loading Data
Instead of working with the sample tables provided as part of the MySQL
installation (like country, city etc.), let's load a bigger dataset
into the system. MySQL provides one such database right on their website
– and loading it is just as easy. Just search google for "Employee
database mysql" or click here. The link to download the actual data is available here. Once there, click on the Clone or Download button and then click on the Download Zip button as shown below.
Unzip the file and navigate into the directory.
In the address bar, type in cmd and that will open a command prompt in the current directory, like so..
Once inside, type in
mysql -u ajaytech -p -t < employees.sql
where “ajaytech” is the user name you have chosen while installing MySQL – like so..
If the mysql command is not added to your PATH, you have to use its full path, as shown above
"C:\Program Files\MySQL\MySQL Server 8.0\bin\mysql"
Once the data has been loaded, refresh your MySQL Workbench and you should be able to see the database (schema).
You should be able to drill-down to find the tables. For example, here is a snapshot of the EMPLOYEES table.
This one is a pretty decent sized database – 300K odd employees as you can see from the screenshot below.
Selecting Data
For the purpose of extracting data for Data Science & Machine Learning, we will mostly be SELECTing
data. So, our focus will mainly be on selecting data (as opposed to
UPDATEs, DELETEs, CREATEs or other administrative aspects of the
database).
SELECT Statement
This is probably the most used SQL statement in Data Science or
otherwise. We have seen an example of how to select data from the
country table above. In the following sections, we will focus on the
employees database and how to extract data from the tables inside it.
Here is a snapshot of the SELECT statement syntax in its most generic format. As you can see, it is pretty intense.
Luckily, we only have to deal with very few of these options. Let’s start with the basics.
df = pd.read_sql_query("select emp_no, first_name from employees.employees",engine)
df.head()
emp_no first_name
0 10001 Georgi
1 10002 Bezalel
2 10003 Parto
3 10004 Chirstian
4 10005 Kyoichi
As you can see, this one took some time. That is because we have
pulled in all the records. A quick look at the shape of the dataframe
should make this clear.
df.shape
(300024, 2)
Let’s try to just pull in the first 100 records.
df = pd.read_sql_query("SELECT emp_no, first_name FROM employees.employees LIMIT 100",engine)
df.head()
emp_no first_name
0 10001 Georgi
1 10002 Bezalel
2 10003 Parto
3 10004 Chirstian
4 10005 Kyoichi
This time, the select ran pretty quickly, right ? That is because we
limited the rows retrieved to just 100. You can quickly verify this with
the dataframe's shape.
df.shape
(100, 2)
Just to make the syntax easier to read, we have capitalized the SQL keywords in the select statement.
SELECT emp_no, first_name FROM employees.employees LIMIT 100
as opposed to
select emp_no, first_name from employees.employees limit 100
That way you know what the SQL statement is doing by quickly looking
for the capitalized keywords. Also, there is a quick way to beautify SQL
statements in the MySQL Workbench. Just type in the statement and click
on the “Beautify/Format” button.
Look at the way the SQL Workbench has formatted it. It took care of
capitalization, tabbing and formatting. This form of the SQL statement
is more useful in cases where the statement is pretty large.
So far, we have seen 3 keywords:
SELECT – The main keyword to select data from database tables.
FROM – used to specify the tables from which the data needs to be extracted.
LIMIT – used to limit the number of rows to extract.
Here is a quick visual
In case you are wondering why we are selecting the schema or database
every time we do a select, that is because there are multiple
schemas/databases in any database system. To use a specific database by
default, without having to specify it for every table, use the
following command.
USE employees
If you do this in the workbench, the corresponding database is highlighted.
In python, when you create the connection using the ORM, specify the database.
import pandas as pd
import mysql.connector
import sqlalchemy as db
engine = db.create_engine('mysql+pymysql://ajaytech:ajaytech@localhost:3306/employees')
And from that point on, you can just specify the table (without specifying the database).
df = pd.read_sql_query("SELECT emp_no, first_name FROM employees LIMIT 100",engine)
df.head()
emp_no first_name
0 10001 Georgi
1 10002 Bezalel
2 10003 Parto
3 10004 Chirstian
4 10005 Kyoichi
Great ! Let's move on to the next clause – WHERE. After FROM, this is probably the most important clause in the SELECT statement.
df = pd.read_sql_query("SELECT emp_no, first_name, birth_date FROM employees WHERE emp_no = 10002",engine)
df.head()
Let’s add some more filter conditions using the WHERE and AND clauses.
df = pd.read_sql_query("SELECT emp_no, first_name, gender \
FROM employees \
WHERE gender = 'M' AND \
emp_no <= 10003",engine)
df.head()
emp_no first_name gender
0 10001 Georgi M
1 10003 Parto M
You can use most of the comparison operators that you are used to from other programming languages
Less than (<) , Greater than (>)
Greater than or equal to (>=) , Less than or equal to (<=)
Not equal to ( != or <> )
BETWEEN clause
df = pd.read_sql_query("SELECT emp_no, first_name, gender \
FROM employees \
WHERE emp_no BETWEEN 10001 AND 10003",engine)
df.head()
emp_no first_name gender
0 10001 Georgi M
1 10002 Bezalel F
2 10003 Parto M
IN Clause
What if you wanted to select specific employee numbers ?
df = pd.read_sql_query("SELECT * \
FROM employees \
WHERE emp_no IN (10001, 10003,10004)",engine)
df.head()
emp_no birth_date first_name last_name gender hire_date
0 10001 1953-09-02 Georgi Facello M 1986-06-26
1 10003 1959-12-03 Parto Bamford M 1986-08-28
2 10004 1954-05-01 Chirstian Koblick M 1986-12-01
LIKE clause
df = pd.read_sql_query("SELECT * FROM employees WHERE first_name LIKE 'G%%' AND emp_no < 10010 ",engine)
df.head()
emp_no birth_date first_name last_name gender hire_date
0 10001 1953-09-02 Georgi Facello M 1986-06-26
ORDER BY clause
df = pd.read_sql_query("select * from employees where emp_no < 10005 order by emp_no desc ",engine)
df.head()
emp_no birth_date first_name last_name gender hire_date
0 10004 1954-05-01 Chirstian Koblick M 1986-12-01
1 10003 1959-12-03 Parto Bamford M 1986-08-28
2 10002 1964-06-02 Bezalel Simmel F 1985-11-21
3 10001 1953-09-02 Georgi Facello M 1986-06-26
AS clause
df = pd.read_sql_query("SELECT emp_no AS employee_number,first_name \
FROM employees WHERE emp_no < 10005 ORDER BY emp_no desc ",engine)
df.head()
employee_number first_name
0 10004 Chirstian
1 10003 Parto
2 10002 Bezalel
3 10001 Georgi
Aggregate Functions
MIN function
df = pd.read_sql_query("SELECT emp_no ,MIN(birth_date),first_name \
FROM employees WHERE emp_no < 10005 ORDER BY emp_no desc ",engine)
df.head()
emp_no MIN(birth_date) first_name
0 10001 1953-09-02 Georgi
The other aggregate functions supported by MySQL are
AVG () – Average
SUM () – Summation
COUNT () – Row count
MAX () – Maximum value
AVG aggregate function
df = pd.read_sql_query("SELECT AVG(salary) FROM salaries \
WHERE emp_no < 10005 AND \
to_date = '9999-01-01'",engine)
df.head()
AVG(salary)
0 69713.25
MAX aggregate function
df = pd.read_sql_query("SELECT MAX(salary) FROM salaries \
WHERE emp_no < 10005 AND \
to_date = '9999-01-01'",engine)
df.head()
MAX(salary)
0 88958
Table Joins
Suppose you wanted to find out the salary of employee 10001, and print the following information
emp_no
first_name, last_name
salary
How would you do it ? The data is available in 2 different tables
EMPLOYEES
SALARIES
You would have to join information from both of these tables to get the final output you need. Here is how you do it.
df = pd.read_sql_query("SELECT first_name, last_name, salary \
FROM employees JOIN salaries \
WHERE employees.emp_no = salaries.emp_no AND \
employees.emp_no < 10005 AND \
to_date = '9999-01-01'",engine)
df.head()
first_name last_name salary
0 Georgi Facello 88958
1 Bezalel Simmel 72527
2 Parto Bamford 43311
3 Chirstian Koblick 74057
Say, you wanted to output the employee number also – Try this.
df = pd.read_sql_query("SELECT emp_no, first_name, last_name, salary \
FROM employees JOIN salaries \
WHERE employees.emp_no = salaries.emp_no AND \
employees.emp_no < 10005 AND \
to_date = '9999-01-01'",engine)
df.head()
---------------------------------------------------------------------------
InternalError: (pymysql.err.InternalError) (1052, "Column 'emp_no' in field list is ambiguous")
SQL is complaining that column 'emp_no' in the field list is ambiguous.
That is because both tables (employees, salaries) have a column called
emp_no, and MySQL doesn't know which one we are referring to. Let's make this clear for the database.
df = pd.read_sql_query("SELECT employees.emp_no, first_name, last_name, salary \
FROM employees JOIN salaries \
WHERE employees.emp_no = salaries.emp_no AND \
employees.emp_no < 10005 AND \
to_date = '9999-01-01'",engine)
df.head()
The default JOIN used to join tables is an INNER JOIN, and most of the time this regular join is good enough.