Bar Charts

Bar Chart

A bar chart represents data values in the form of vertical bars. Each vertical bar in the graph represents an individual category. The bars are used to compare values across categories. In a bar chart, the length of a bar is proportional to the value it represents, while the width remains the same for all bars. One axis of the chart represents the categories and the other axis represents the value scale.

Below we see a bar chart plotted using the bar function defined in the pyplot module. The bar chart displays car sales over the period 2008–2018 for an automobile company. The first argument to the bar function indicates the position of each bar on the x-axis, with the bar centered at the x-tick position. The second argument indicates the height of the bar. The width of each bar is 0.8 by default, and this can be changed using the ‘width’ parameter.

from matplotlib import pyplot as plt
import numpy as np
year = [2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018]
toyota_sales = [1843669,1496211,1488588,1396837,1764802,1893874,2004373,2098545,2106332,2129177,2128201]
x_pos = np.arange(len(year))
plt.bar(x_pos,toyota_sales,color='#623aa2',alpha=0.25,edgecolor='k',label='Toyota')
plt.xticks(x_pos,year,rotation=30)
plt.title('Toyota Car sales')
plt.xlabel('Year')
plt.ylabel('No of units sold')
plt.show()

Clustered Bar Chart

A clustered or grouped bar chart is used to compare multiple data sets side by side. When you want to compare values of multiple datasets that fall under the same categories, a clustered bar chart comes in handy. The previous example can be extended to display the car sales of different automobile companies.

In the previous example, we used a bar chart to display the sales of one automobile company over the period 2008–2018. Now we would like to compare the sales of three different companies over the same period. So we are going to have three vertical bars under each category, each bar representing a company. To differentiate the three datasets, we use a different color for each company's bars.

year = [2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018]
hyundai_sales = [401742,435064,538228,645691,703007,720783,725718,761710,768057,664943,667634]
honda_sales = [1284261,1045061,1096874,1023986,1266569,1359876,1373029,1409386,1476582,1486827,1445894]
toyota_sales = [1843669,1496211,1488588,1396837,1764802,1893874,2004373,2098545,2106332,2129177,2128201]
x_pos = np.arange(len(year))
width = 0.25
plt.bar(x_pos-width, toyota_sales, edgecolor='k',color='#E59998',width=width,label='Toyota')
plt.bar(x_pos, honda_sales, edgecolor='k',color='#FAD8AA',width=width,label='Honda',alpha=0.75)
plt.bar(x_pos+width, hyundai_sales, edgecolor='k',color='#538790',width=width,label='Hyundai')
plt.xticks(x_pos,year,rotation=30)
plt.title('Car sales')
plt.xlabel('Year')
plt.ylabel('No of units sold')
plt.legend()
plt.show()

Horizontal Bar Chart

Horizontal Bar Charts represent data in the form of horizontal bars, each bar representing an individual category. The data categories are shown on the y-axis and the data values are shown on the x-axis. The length of a bar is proportional to the value it represents.

The example below demonstrates how to plot a bar chart when the input data is available in a csv file. We will import the built-in csv module to work with csv files.

import csv
with open (r'C:\Users\Ajay Tech\Desktop\air_pollution_index.csv') as input_file:
    csv_file = csv.reader(input_file,delimiter = ',')
    header = next(csv_file)            # skip the header row
    country = []
    index = []
    for row in csv_file:
        country.append(row[0])
        index.append(float(row[1]))    # convert the value from string to float

plt.bar(country,index,color='#ff753e')
plt.title('Air Pollution Index')
plt.ylabel('Index')
plt.show()

The above graph is plotted using the bar function. As can be observed from the figure, the x-axis labels overlap with each other because the labels are too long. This problem can be solved using a horizontal bar chart, which makes better use of the available space. If the data labels are long, or if you have too many data sets to plot, a horizontal bar chart is a good choice.

plt.barh(country,index,color='#ff753e') 
plt.title('Air Pollution Index')
plt.xlabel('Index')
plt.show()

Pie Chart


A pie chart is a circular chart divided into segments. The segments of a pie chart are called wedges. Each wedge represents an individual category. Pie charts display the contribution of each wedge to the total value, and the sum of the values of all wedges makes up the whole circle. Pie charts are useful when there are few categories to compare (5 or fewer); otherwise it becomes difficult to interpret the data.

Below we see a pie chart plotted using the pie function defined in the pyplot module. The pie chart displays the percentage of marks obtained by a student in four subjects. The circle is divided into four wedges and the area of each wedge is proportional to the value it represents.

from matplotlib import pyplot as plt
plt.style.use('default')
x = [22,18,13,10]
labels = ['maths','physics','chemistry','english']
colors = ['m','r','y','c']
plt.pie(x,labels=labels,colors=colors,autopct='%.1f%%')
plt.title('Marks obtained in an exam')
plt.show()

The pie chart below displays the market share of mobile phone vendors worldwide.

plt.style.use('default')
x = [33.71,19.37,4.85,3.82,7.42,30.83]
labels = ['Samsung','Apple','Huawei','LG','Unknown','Others']
colors = ['m','r','y','b','g','c']    
plt.pie(x,labels=labels,colors=colors, explode=[0,0,0,.2,0,0],autopct='%1.2f%%',startangle=45,counterclock =True,shadow=True,
        wedgeprops={"edgecolor":"0",'linewidth': 1,'linestyle': '-'})
plt.title ('Smartphone market share')
plt.axis('equal')
plt.legend(loc=2)
plt.show()

x — The first argument passed to the pie function is an array or a list denoting the values for the categories to be compared.

Labels — The labels argument is a list of strings used for labelling each wedge.

Colors — You can define an array/list of colors and pass it to the pie function; the colors are applied to the wedges in the order specified in the array.

Explode — If you want to highlight or emphasize key data in a pie chart use the explode parameter. The explode parameter explodes/expands a wedge, so the wedge is moved slightly outward from its center. This parameter accepts an array and each element in the array specifies by what fraction of the radius the wedge needs to be exploded. The value has to be defined for all wedges in the pie chart, so the length of the array should be equal to the number of wedges in the pie chart.

Autopct — If you want to label the wedges with their numeric value in a pie chart, use the autopct parameter. This parameter allows us to display the percent value using string formatting. Say, for example, the percent value calculated for a segment is 34.678666 and you want to display it rounded to 1 decimal place; then the autopct parameter should be assigned the format string ‘%1.1f’, and the wedge will be labelled with the numeric value 34.7. If you want to add a percent sign (%) to the label, use two percent signs (%%) in the format string so that the special character ‘%’ escapes itself.

Startangle — By default the startangle is zero, which means the wedges are arranged counterclockwise starting from the positive x-axis. If you specify a different startangle, the start of the pie chart is shifted by this angle in degrees, and the wedges are then arranged counterclockwise from that position.

Counterclock — Specifies the direction in which the wedges are arranged: counterclockwise when True (the default) and clockwise when False.

Shadow — A shadow effect can be added to the pie chart using the shadow parameter of the pie() function; passing the boolean value True makes a shadow appear below the pie chart. By default the shadow is turned off.

Wedgeprops — The wedges of the pie chart can be customized using the wedgeprops parameter. It accepts a dictionary of property name/value pairs. Wedge properties such as edgecolor, linestyle and linewidth can be specified.

Line Plots


What is a Line Plot?

The pyplot module in matplotlib supports a variety of plots such as line plots, pie charts, bar charts, histograms and scatter plots. The module defines the methods that are used to render these plots. In this tutorial, we will discuss line plots.

A line plot is created by connecting the values in the input data with straight lines. Line plots are used to determine the relation between two datasets. A dataset is a collection of values. Each dataset is plotted along an axis, i.e., the x and y axis. In order to draw a line plot, we call the plot function defined in the pyplot module. We pass two arguments (arrays or lists) to the plot function; the first argument denotes the x-coordinates and the second argument denotes the y-coordinates. The plot function plots the data points (x1,y1), (x2,y2) and so on defined in the input datasets and, by default, draws a line between these data points. Before drawing a plot, let us see the components that make up a basic plot.

Components of a basic plot

A basic plot is made up of the following components:

  1. Title – Title describes the information that we want to convey using the graph.
  2. Label – Label is a short description of the datasets being plotted.
  3. Scales – Scales determine the reference points for data displayed on the graph.
  4. Points – Points in a graph represent the input data in the form of x-coordinate and y-coordinate (x1,y1).
  5. Lines – Lines are used to connect points to highlight the change in values.

Plotting a graph

Matplotlib makes extensive use of the Numpy library, which contains a large number of mathematical functions. We need to import the Matplotlib and Numpy libraries before making any calls to the routines defined in them. The example below demonstrates the creation of a line plot by passing two numpy arrays x and y as arguments to the plot function.

from matplotlib import pyplot as plt
import numpy as np
x = np.linspace(0,10,10)
y = np.linspace(0,10,10)
plt.title('First Plot')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.plot(x,y,marker='o')
plt.show()

In order to render the above plot, we simply passed two arrays x,y to the plot function. We can see that the plot function has:

  1. Drawn the x and y axis.
  2. Marked evenly spaced scales(tick marks) on both the axes.
  3. Plotted the data points.
  4. Connected the data points with lines.
  5. Added the title, xlabel, ylabel.

Before executing the above mentioned steps, the plot function first creates a figure object. Figure objects are the individual windows on the screen in which Matplotlib displays the graphical output; a figure is a container for the graphical output. In a Jupyter notebook, figures rendered by the Matplotlib library are included inline. The plot function implicitly creates a figure object and then plots the graph, so we do not have to call any other function to instantiate a figure object when using the plot function. The default figure size is 6.4 inches wide by 4.8 inches high (older Matplotlib versions used 8 by 6 inches).

Say, for example, we have a requirement to create a figure with a specified size (4 inches wide, 4 inches high). For this, we need to call the figure() function defined in the pyplot module explicitly. The ‘figsize’ parameter of this function allows us to specify the width and height of the figure in inches, and a new figure will be created. In order to render a plot, call the plot function.

The savefig() method saves the figure to a data file with a name specified by the string argument. The filename can be a full path and can also include a file extension if needed.

plt.figure(num=1,figsize=(4,4),dpi=100)
plt.plot(x,y,marker='o')
plt.savefig('second_plot.png')
plt.close()

Creating a line plot by passing a single array

We can pass a single dataset or array to the plot function, as shown in the cell below. The plot function then uses the values 0, 1, …, N-1 as the x coordinates, where ‘N’ is the size of the y array.

plt.title('Second Plot')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.plot(y)
plt.show()

Multiple plots on the same graph

We can plot multiple plots on the same graph by calling the plot function once for each dataset pair; this is useful for comparing the plots. Each plot is rendered on top of the previous one. Notice how Matplotlib applies a different color to each plot. The plots share the figure and the x and y axes.

plt.plot(x,y)
plt.plot(x,x**2)
plt.plot(x,x**(1/2))
plt.title('Multiple Plots')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.show()

The above code can be re-written as follows, that is, by passing the dataset pairs one after another as arguments to a single plot call.

plt.plot(x,y,x,x**2,x,x**(1/2))
plt.title('Multiple Plots2')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.show()

Line Properties

A line drawn on a graph has several properties such as color, width, style and transparency. These properties can be customized as per our requirements when we call the plot function.

plt.plot(x,y,color='c',marker='o',linewidth=2,linestyle='--',alpha=0.5,label='line1')
plt.plot(x,x**2,color='#90BC38',marker='D',linewidth=2,alpha=0.5,label='line2')
plt.legend()
plt.show()

Color codes

Colors in data visualization are used to enhance the look of the graphs, communicate the information clearly and to distinguish one set of data from another. The following basic colors are defined in Matplotlib.

Matplotlib also supports HEX colors. Web designers and developers use HEX colors in web designing. A HEX color is represented as a six-digit combination of numbers and letters defined by the amount of red, green and blue (RGB) that makes up the color.
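As a quick reference (the color table from the original page is not reproduced here), the short sketch below cycles through the single-letter color codes – 'b' blue, 'g' green, 'r' red, 'c' cyan, 'm' magenta, 'y' yellow, 'k' black, 'w' white – and one HEX color; the HEX value is just an illustrative choice.

from matplotlib import pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
basic_colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']   # 'w' (white) is omitted as it is invisible on a white background
for offset, code in enumerate(basic_colors):
    plt.plot(x, x + offset, color=code, label=code)
plt.plot(x, x + len(basic_colors), color='#90BC38', label='#90BC38 (HEX)')  # a HEX color
plt.legend()
plt.show()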

Linestyle

The linestyle parameter specifies whether the line is solid, dashed, dash-dot or dotted.
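A minimal sketch of the four named line styles (solid, dashed, dash-dot and dotted):

from matplotlib import pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
for offset, style in enumerate(['-', '--', '-.', ':']):
    plt.plot(x, x + offset, linestyle=style, label=repr(style))
plt.legend()
plt.show()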

Markers

Markers on a line plot are used to highlight particular data points.
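A minimal sketch of a few commonly used marker codes ('o' circle, 's' square, 'D' diamond, '^' triangle, '*' star, '+' plus, 'x' cross):

from matplotlib import pyplot as plt
import numpy as np

x = np.arange(1, 11)
for offset, mark in enumerate(['o', 's', 'D', '^', '*', '+', 'x']):
    plt.plot(x, x + offset, marker=mark, label=mark)
plt.legend()
plt.show()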

Legend

A legend is a box associating labels (text) with the lines on a graph. The legend() method is used to add a legend to the plot. This method can be called in multiple ways (a short sketch of all three forms follows the list):

  1. plt.legend() – When no arguments are passed to the legend() method, the plots to be added in the legend are automatically detected, and the corresponding labels are used in the legend.
  2. plt.legend([‘label1’, ‘label2’, ‘label3’]) – The legend method can also be called by passing a list of string labels, where each string is used as a label for the plots in the order they were created. This can be used to create a legend for plots that already exist on the axes. Note: this way of using the legend method is often discouraged because you have to remember the order in which the plots were created, which can be confusing.
  3. plt.legend([plot1,plot2,plot3],[‘label1′,’label2′,’label3’]) – We can explicitly specify the plots and labels by passing the list of plots followed by the list of string labels arranged in order to the legend method.
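A short sketch of the three calling styles described above (the datasets and labels are just illustrative):

from matplotlib import pyplot as plt
import numpy as np

x = np.linspace(0, 10, 10)

# 1. labels attached to each plot, legend() called with no arguments
plt.plot(x, x, label='line1')
plt.plot(x, x**2, label='line2')
plt.legend()
plt.show()

# 2. a list of string labels, applied to the plots in the order they were created
plt.plot(x, x)
plt.plot(x, x**2)
plt.legend(['line1', 'line2'])
plt.show()

# 3. explicit plot handles followed by the list of labels
line1, = plt.plot(x, x)
line2, = plt.plot(x, x**2)
plt.legend([line1, line2], ['line1', 'line2'])
plt.show()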

Introduction to Matplotlib


Data visualization

Data visualization is the representation of data in a graphical format. It helps put data in a visual form. Information, relationships and patterns that might go unnoticed in a text-based format can be easily recognized with data visualization software. This is because the human brain can understand and process visuals such as images, graphs or charts more easily than data in spreadsheets or reports. Data visualizations can turn both large and small datasets into visuals.

Consider an independent variable ‘x’ and three functions a, b and c that depend on it: the identity function (a = x), the square function (b = x²) and the cubic function (c = x³). Let us plot these functions over a range of x values (a quick sketch that rebuilds this plot follows below).

From the plot, we can observe the relation between the variable and each of the functions, and can infer that the cubic function (c) grows much faster than the identity function (a) and the square function (b).
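The original table and plot are not reproduced here; the sketch below rebuilds them, assuming x runs from 1 to 10:

from matplotlib import pyplot as plt
import numpy as np

x = np.arange(1, 11)
a = x        # identity function
b = x ** 2   # square function
c = x ** 3   # cubic function

plt.plot(x, a, label='a = x')
plt.plot(x, b, label='b = x**2')
plt.plot(x, c, label='c = x**3')
plt.legend()
plt.show()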

In this example, we have a small dataset, so analysing the data from a table is easy. But what if we have a dataset with millions of entries, or a complex function to be analysed? In that case, having a graphical representation of the data is very useful. There are various types of graphs or charts to represent data in a visual form. The type of data given, and what we want to convey to the user, determine the appropriate graph to use. Line plots, pie charts, bar charts, histograms and scatter plots are a few examples.

Matplotlib

This course will take an in-depth look at the Matplotlib tool for visualization in Python.

Matplotlib is a Python package that is widely used throughout the scientific Python community to create high-quality and publication-ready graphics.

Matplotlib is Python’s alternative to MATLAB and it has the advantage of being free and open-source, whereas MATLAB is expensive and closed source.

The Matplotlib library provides the pyplot module, which contains functions which closely resemble the MATLAB plotting syntax and functionality.

Matplotlib is built on Numpy arrays. It supports a wide range of export formats suitable for both web and print publishing. It supports high-quality output formats such as PNG, PDF, SVG, EPS and PGF.

Installing Matplotlib

Install Matplotlib with pip

Matplotlib can be installed using the Python package manager, pip. To install Matplotlib with pip, open a terminal window and type:

$ pip install matplotlib

This command installs Matplotlib in the current working Python environment.

Install Matplotlib with the Anaconda distribution of Python

The easiest way to install Matplotlib is to download and install the Anaconda distribution of Python. The Anaconda distribution of Python comes with Matplotlib included and no further installation steps are required. You can download the latest version of Anaconda by following this link – https://www.anaconda.com/download/.

Backend

Matplotlib uses a backend to render the plots. A backend is the part of Matplotlib that does the actual work of drawing the figure and producing the output. There are two types of backends, interactive and non-interactive. Interactive backends display the figure in a graphical user interface, which allows us to pan and zoom the figure. Non-interactive backends are used to produce image files. Some of the backends Matplotlib supports are:

Backends: GTKAgg, GTK3Agg, GTK, GTKCairo, GTK3Cairo, WXAgg

The Jupyter Notebook supports the ‘inline’ backend. With this backend, the output of plotting commands is displayed inline, that is, directly below the code cell that produced it. The inline backend renders a static, standalone plot. The resulting plots are stored in the notebook document.

The ‘inline’ backend can be invoked using the following command: %matplotlib inline

The Jupyter notebook also supports the ‘notebook’ backend which renders an interactive plot. Just below the plot, we can find a toolbar to switch views, pan, zoom and download options.

The ‘notebook’ backend can be invoked using the following command: %matplotlib notebook

Basic Plotting with Matplotlib

%matplotlib inline
from matplotlib import pyplot as plt
plt.plot([1,2,3,4,5],[1,2,3,4,5])
plt.show()

%matplotlib notebook
from matplotlib import pyplot as plt
plt.plot([1,2,3,4,5],[5,4,3,2,1])
plt.show()

Perceptron from scratch


This is the most critical component of neural networks. In this section, we will learn about both forward propagation and backward propagation, and the math behind back propagation.

Perceptron

This is the most fundamental element of a neural network. We have already seen what a perceptron is in the basics of neural networks section; however, we only scratched the surface. In this section, we will explore the perceptron in detail and look at a couple of simple problems it can solve.

Linearly Separable data

By definition, a perceptron can only solve linearly separable problems. What is a linearly separable problem? Here are a couple of examples that show linearly separable data. For example, two of the iris species are linearly separable by a hyperplane (in this case a single line). Similarly, an OR gate is also an example of a linearly separable dataset.

# Visualize a OR gate
import numpy as np

# OR gate data
x = np.array([[1,0],
              [0,1],
              [0,0],
              [1,1]])
y   = np.array([1,1,0,1])

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(x[:,0],x[:,1],c=y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("OR gate")
Text(0.5, 1.0, 'OR gate')
# Visualize just 2 species (setosa, versicolor) that are linearly separable 
# using the predictors (Sepel Length, Sepal, Width)
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

# iris data is readily available as a sklearn dataset.
from sklearn import datasets
iris = datasets.load_iris()
data = iris.data

# visualize just the first 100 rows (so that it contains only the species setosa and versicolor)
# We are specifically not plotting the third species (virginica), because it is not 
# linearly separable.
plt.scatter(data[0:100,0],data[0:100,1],c=iris.target[0:100])
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.title("iris species - Setosa, Versicolor")
plt.savefig("iris.png")

Now that we have an understanding of the data, let’s use gradient descent to solve for the weights.

Activation function

What we are essentially trying to do is find values for the weights and bias such that the output is 1 when the weighted sum of the inputs plus the bias is greater than zero, and 0 otherwise – in other words, a binary step function applied to the weighted sum.

Here is what this function looks like.

# Show how a binary step function looks like.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

x = np.linspace (-5,5,100)
y = np.zeros(len(x))
y[x>=0] = 1
y[x<0] = 0

plt.scatter(x,y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Activation function - Binary step function ")

Text(0.5, 1.0, 'Activation function - Binary step function ')

Steps

Let’s solve the OR gate problem (or any other linearly separable problem) using a simple, single layer perceptron. Let’s start with the data first. Here is a quick high-level overview of the steps involved:

  • initialize the weights and the bias with small random values
  • for each row of the input data, do one forward propagation pass to compute the predicted output
  • compare the prediction with the actual output and update the weights and bias (backward propagation)
  • repeat this over many passes (epochs) through the dataset

# OR gate data
x = np.array([[1,0],
              [0,1],
              [0,0],
              [1,1]])
y   = np.array([1,1,0,1])

Cost function

What about back propagation? This is where gradient descent comes in (along with its paraphernalia of partial derivatives, learning rate, cost function etc.). There are a couple of options for the cost function (residual sum of squares & cross entropy), but for now, let’s just use the residual sum of squares (RSS) cost function. We have already seen this in the gradient descent section.
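For reference, a minimal sketch of the RSS cost used here (the ½ factor is a common convention that simplifies the derivative; the exact scaling does not change where the minimum is):

import numpy as np

def rss_cost(y, y_hat):
    # residual sum of squares over the whole dataset
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return 0.5 * np.sum((y - y_hat) ** 2)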

Partial Derivatives

Now, we want to see what the derivative of the cost function is with respect to each of the variables (weights and bias). With the RSS cost E = ½·(y − ŷ)² and the weighted sum ŷ = w1·x1 + w2·x2 + b, the partial derivative with respect to weight 1 (w1) is

∂E/∂w1 = −(y − ŷ)·x1

and similarly, the partial derivatives with respect to weight 2 (w2) and the bias (b) are

∂E/∂w2 = −(y − ŷ)·x2        ∂E/∂b = −(y − ŷ)

Update Rules

Once we have the partial derivatives, we can update the weights and bias by moving them a small step against the gradient:

w1 ← w1 + α·(y − ŷ)·x1
w2 ← w2 + α·(y − ŷ)·x2
b ← b + α·(y − ŷ)

where α is the learning rate. These equations can rattle any ML engineer, but remember, all of this is left to the library (TensorFlow or any other underlying deep learning library) to compute. The only reason we are learning the math and hand-coding this in Python is to get an in-depth understanding of back propagation. That is absolutely essential to be a good ML engineer.

Forward Propagation

Forward propagation is a relatively easy step. Let’s write a quick function for it .

def forward_prop(row) :
    # weighted sum of the inputs plus bias
    y_hat = np.dot(x[row],w) + b
    # binary step activation function
    if y_hat > 0 :
        return 1
    else :
        return 0

Backward Propagation

Let’s now write a function for back propagation using all the geeky stuff above in “update rules” section.

def backward_prop(y_hat, row) :
    # w and b are global variables, updated in place
    global b,w
    w[0]  = w[0] + alpha * (y[row] - y_hat) * x[row][0]
    w[1]  = w[1] + alpha * (y[row] - y_hat) * x[row][1]
    b     = b + alpha * (y[row] - y_hat)

Initialize weights and biases

Initialize the weights and bias.

w = np.random.normal(size=2)
b = np.random.normal()

# learning rate. This is exactly the same term that we have already learnt in gradient descent.
alpha = 0.01

Predict Function

As of now, we are working with global variables to keep things simple. Later, we will make a class out of all this to make things easier going forward. Just one more function to go before we set this in motion. Assuming the model is in place (which we are going to write in a minute), we also need a function to predict a y value, right? Just like any Machine Learning algorithm, we need a predict() method. Once the model fits the data to the right set of weights, this one is very easy. All we have to do is run the data through one forward propagation cycle.

# return the predicted y_hat, for the test data set.
def predict(x) :
    y = []
    
    # the user could be sending multiple rows. compute y_hat for each of the rows in the test dataset.
    for row in x :
        
        # weighted sum
        y_pred = np.dot(row,w) + b
        
        # run the weighted sum through the activation function.
        if y_pred > 0 :
            y_pred = 1
        else :
            y_pred = 0
            
        # append the predicted y (y_hat)to an array
        y.append(y_pred)
        
    # return the predicted array of y_hat values for the corresponding test data (x)
    return y

Training

The individual pieces of the simple perceptron have been coded. Now, we need to write the logic to

  • take the input data. For each row:
    • do one cycle of forward propagation
    • do one cycle of backward propagation and update the weights and bias

This exhausts one pass through the input data. In deep learning, this is called an epoch. We need to repeat the entire process for a whole bunch of epochs.

Let’s write the logic for this.

# number of epochs
for epoch in range(1000) :
    
    # for each row in x (cycle through the dataset)
    for row in range(x.shape[0]) :
        
        # for each row in x, predict y_hat
        y_hat = forward_prop(row)

        # for each row calculate weights
        backward_prop(y_hat,row)

print ( w, b)
[0.01363271 0.25196752] -0.009751486705392132

Predict

It is time to test our network. Let’s quickly print out x and y.

x
array([[1, 0],
       [0, 1],
       [0, 0],
       [1, 1]])
y
array([1, 1, 0, 1])

Since this is a small dataset, we don’t need a confusion matrix to calculate the accuracy. Let’s just use the predict function on the x array to predict y.

predict(x)
[1, 1, 0, 1]

That’s a perfect match. This is a small dataset, though. Let’s look at a slightly larger dataset and see if the perceptron is good enough to do linear separation. Let’s pick up the iris dataset from Scikit Learn’s built-in datasets.

from sklearn import datasets

iris = datasets.load_iris()

data = iris.data

Not all of this data is linearly separable. For example, if you plot the species against the sepal length and width, the species versicolor and virginica are muddled together. Only the first species (setosa) is clearly separated.

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(data[:,0],data[:,1],c=iris.target)

So, let’s just use the first two species. Let’s plot it again.

# visualize just the first 100 rows (so that it contains only the species setosa and versicolor)
# We are specifically not plotting the third species (virginica), because it is not 
# linearly separable.
plt.scatter(data[0:100,0],data[0:100,1],c=iris.target[0:100])
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.title("iris species - Setosa, Versicolor")
plt.savefig("iris.png")

Now, we have a clear, linear separation. Let’s train our perceptron on this data and see if it works.

x = data[0:100,0:2]  # iris sepal data ( sepal length and width )
y = iris.target[0:100] # iris species data (only setosa and versicolor)

w = np.random.normal(size=2)
b = np.random.normal()

# learning rate
alpha = 0.01

# number of epochs
for epoch in range(1000) :
    
    # for each row in x
    for row in range(x.shape[0]) :
        
        # for each row in x, predict y_hat
        y_hat = forward_prop(row)
        # for each row calculate weights
        backward_prop(y_hat,row)

print ( w, b)
[ 0.80437979 -1.08684544] -1.0479456545593953

We can very well do a confusion matrix to check for accuracy.

y_pred = predict(x)

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

print ( confusion_matrix(y,y_pred) )
print ( accuracy_score(y,y_pred))

[[49  1]
 [ 0 50]]
0.99

That’s a pretty good accuracy – 99%. It is a bit more interesting to see this visually – using matplotlib’s meshgrid.

import numpy as np
 
x_all = np.linspace(0,10,100).reshape(-1,1)
y_all = np.linspace(0,10,100).reshape(-1,1)
 
xx,yy = np.meshgrid(x_all,y_all)

x_grid = np.concatenate((xx.ravel().reshape(-1,1),yy.ravel().reshape(-1,1)),axis=1)

x_grid
array([[ 0.       ,  0.       ],
       [ 0.1010101,  0.       ],
       [ 0.2020202,  0.       ],
       ...,
       [ 9.7979798, 10.       ],
       [ 9.8989899, 10.       ],
       [10.       , 10.       ]])
y_grid = predict(x_grid)

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

mpl.rcParams['figure.dpi'] = 200

plt.scatter(x_grid[:,0],x_grid[:,1],c=y_grid,alpha=0.1)
plt.scatter(data[0:100,0],data[0:100,1],c=iris.target[0:100])

Gradient Descent


What is Gradient Descent

In simple terms, Gradient Descent is an algorithm to compute the minimum of a function. OK – So, what is the big deal ? Well, most of the time in most machine learning algorithms, there is always a cost function that needs to be minimized. The best Machine Learning Algorithm is usually the one with the most inclusive and simple cost function. Once a cost function is defined, it is just a matter of solving for a minimum to arrive at the solution. That is why Gradient Descent is extremely useful in the context of Machine learning. Let’s see an example.

Gradient Descent for Linear Regression

Let’s start with the simplest ML problem – Linear Regression. In the Machine Learning in Python tutorial, we have covered Regression in Python in great detail.

Since the problem is simple enough to be solved mathematically, we have used the OLS (Ordinary Least Squares) technique to fit a straight line to the Linear Regression problem. You can view the equation for Ordinary Least Square to solve linear regression here. What is the cost function in this case?

Cost function = Sum of Squares of Residuals

The mathematical solution that minimizes this cost function, as derived by OLS, is

m = Σ ( xᵢ − x̄ )·( yᵢ − ȳ ) / Σ ( xᵢ − x̄ )²        b = ȳ − m·x̄

where x̄ represents the average of x and ȳ represents the average of y.
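As a quick sanity check, here is a minimal sketch of the closed-form OLS estimates, using an illustrative dataset (the same small one used later in this chapter) and cross-checking against numpy's least-squares fit:

import numpy as np

x = np.array([1, 3, 4, 5, 6, 7, 8, 10], dtype=float)
y = np.array([4, 3, 7, 7, 8, 10, 8, 11], dtype=float)

x_bar, y_bar = x.mean(), y.mean()
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # OLS slope
b = y_bar - m * x_bar                                              # OLS intercept

print(m, b)                  # ~0.8448, ~2.6034
print(np.polyfit(x, y, 1))   # cross-check: returns [slope, intercept]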

However, when the number of independent variables increases, OLS is not a good solution. That is where Gradient Descent shines. While OLS is an analytical solution, Gradient Descent is a numerical solution. However, to understand Gradient Descent, we have to be conversant with the following concepts in math:

  • Derivatives
  • Partial Derivatives

Math

Derivatives

A derivative is the slope of a function. Let’s take a simple straight line, say y = 2x.

A simple dataset for this could be

  • x = Number of DNA Samples
  • y = Number of DNA pairs.

Let’s plot a sample dataset and try to compute the slope.

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

x = np.array([1,2,3,4,5,6,7,8,9,10])
y = x * 2

print ( "x = ",x)
print ( "y = ",y)

plt.plot(x,y)
plt.plot(x[1], y[1], marker='o', markersize=10, color="red")
plt.plot(x[3], y[3], marker='o', markersize=10, color="red")

plt.hlines(y=y[1], xmin=x[1], xmax=x[3], color='b')
plt.vlines(x=x[3], ymin=y[1], ymax=y[3], color='b')

plt.text(4.2,5,(y[3] - y[1]))
plt.text(3,3,(x[3] - x[1]))
x =  [ 1  2  3  4  5  6  7  8  9 10]
y =  [ 2  4  6  8 10 12 14 16 18 20]



Text(3, 3, '2')

Now let’s take a function that is not a straight line, say y = x². A simple dataset for this could be

  • x = Reach of a product
  • y = Sales of the product.

Let’s plot a sample dataset and try to compute the slope.

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

x = np.array([1,2,3,4,5,6,7,8,9,10])
y = x ** 2

print ( "x = ",x)
print ( "y = ",y)

plt.plot(x,y)
plt.plot(x[1], y[1], marker='o', markersize=10, color="red")
plt.plot(x[3], y[3], marker='o', markersize=10, color="red")

plt.hlines(y=y[1], xmin=x[1], xmax=x[3], color='b')
plt.vlines(x=x[3], ymin=y[1], ymax=y[3], color='b')

plt.text(4.2,8,(y[3] - y[1]))
plt.text(3,0.0,(x[3] - x[1]))

plt.text(5,10,"slope = 12/2 = 6")

plt.plot(x[5], y[5], marker='o', markersize=10, color="red")
plt.plot(x[7], y[7], marker='o', markersize=10, color="red")

plt.hlines(y=y[5], xmin=x[5], xmax=x[7], color='b')
plt.vlines(x=x[7], ymin=y[5], ymax=y[7], color='b')

plt.text(8.3,50,(y[7] - y[5]))
plt.text(7,30,(x[7] - x[5]))

plt.text(8.58,45,"slope = 14")
x =  [ 1  2  3  4  5  6  7  8  9 10]
y =  [  1   4   9  16  25  36  49  64  81 100]
Text(8.58, 45, 'slope = 14')

In this case, the slope is not constant when measured the same way as before. The slope seems to change with x.

A correct way to define the slope (or derivative) is to take an infinitesimally small increase in x, take the corresponding change in the value of y, and divide them as before. Mathematically, if f(x) is a function of x, the derivative is defined as

f′(x) = lim (Δ → 0) [ f(x + Δ) − f(x) ] / Δ

For example, take x = 4 and increase x by a very small amount, say Δ = 0.0001. Now, let’s compute the value of y as well and plug the numbers into the equation above:

  • x = 4
  • Δ = 0.0001
  • [ f(4.0001) − f(4) ] / 0.0001 = ( 16.00080001 − 16 ) / 0.0001 = 8.0001 ≈ 8 = 2·4

so, the derivative of f(x) = x² is 2x. We have not derived this mathematically – instead, we are trying to understand, with numbers, how a derivative works.
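Here is a minimal sketch of that numerical check:

# numerically approximate the derivative of f(x) = x**2 at x = 4
def f(x):
    return x ** 2

x_val = 4.0
dx = 0.0001
print((f(x_val + dx) - f(x_val)) / dx)   # ~8.0001, i.e. approximately 2 * x_val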

Derivative represents the change in the value of a function with respect to the variable (with which the derivative is being applied)

Partial Derivatives

Partial derivatives are similar to regular derivatives – except that a partial derivative works on one particular variable at a time. For example, say the speed of a car is dependent on

  • engine RPM
  • slope of the road

you can also write this as z = f(engine RPM, slope of the road). Say the relationship is linear and the coefficient of the engine RPM term is 8.

Now, how does the speed (z) of the car vary with a unit increase in the engine RPM? The answer is 8 – pretty straightforward. That is represented mathematically using the partial derivative ∂z/∂(engine RPM) = 8.

Let’s take another example – the equation of a 2-d plane can be generalized as z = a·x + b·y + c.

You can visualize such a plane as a flat, tilted sheet in 3-d space.

In the example shown, the plane intersects the z-axis at 2 (where the values of x & y are 0), so c = 2. Now, how far does the function vary with a unit variation in x? It varies by a, the coefficient of x.

Once again, I want you to take the intuitive meaning out of this –

For a unit change in x, the function changes by so much (the partial derivative ∂z/∂x) in the direction of x – that is a partial derivative.

A plane is simple to understand. However, the interpretation would be the same even if it were a complicated curve in a 3-d space – Like a hill.
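A minimal numerical sketch of this idea, assuming illustrative coefficients a = 3 and b = 5 (only the intercept c = 2 comes from the example above):

# z = a*x + b*y + c
a, b, c = 3.0, 5.0, 2.0   # a and b are assumed values for this sketch; c = 2 as in the text

def z(x, y):
    return a * x + b * y + c

h = 0.0001
x0, y0 = 1.0, 1.0
dz_dx = (z(x0 + h, y0) - z(x0, y0)) / h   # vary x, hold y fixed
print(dz_dx)                              # ~3.0, i.e. the coefficient a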

Gradient Descent for Linear Regression

Now that we understand derivatives (both regular and partial), we are ready to graduate to Gradient Descent. Imagine a set of data points like the ones below.

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

x = np.array([1,3,4,5,6,7,8,10])
y = np.array([4,3,7,7,8,10,8,11])

plt.scatter(x,y)

Say, we want to fit a straight line to this data set using Linear Regression. How would you do it ? Very simple

from sklearn.linear_model import LinearRegression 
model = LinearRegression()

model.fit(x.reshape(-1,1),y) 

slope = model.coef_  
intercept   = model.intercept_

print ( "slope = ", slope)
print ( "intercept = ", intercept)

slope =  [0.84482759]
intercept =  2.6034482758620694
point_1 = slope*0 + intercept 
point_2 = slope*15 + intercept 
print ( point_1, point_2) 
plt.scatter( x,y,alpha=0.5)
plt.plot([0,15], [point_1,point_2],color="red")

[2.60344828] [15.27586207]
plt.scatter( x,y,alpha=0.5)
plt.plot([0,15], [point_1,point_2],color="red")

y_actual    = y
y_predicted = model.predict(x.reshape(-1,1))

for index,x_count in enumerate(x) :
    if y_actual[index] > y_predicted[index] :
        plt.vlines(x=x_count, ymin=y_predicted[index], ymax=y_actual[index], color='b')
    if y_actual[index] <= y_predicted[index] :
        plt.vlines(x=x_count, ymin=y_actual[index], ymax=y_predicted[index], color='b')

The blue lines represent the residuals (or errors). We can calculate the slope (and intercept) of the fit using OLS (Ordinary Least Squares) or using Gradient Descent. We already know how OLS works in Linear Regression; we will now see how Gradient Descent works. The equation for a straight line that would fit all the data points is some variation of y = m·x + b

where

  • m = slope
  • b = intercept

Either way, we are minimizing the Sum of Squares of Errors. We started out with the definition of this at the beginning of the chapter.

Cost Function

Just to make things simple, assume a value of intercept (b) to be fixed at 2.6 ( b = 2.6 as we have previously solved for it). Imagine that we chart the cost function with different values of slope(m).

n = len(y)

cost_function = []
m = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0]
for slope in m : 
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y_actual[i] - slope * x[i] - 2.6) ** 2 )
    cost_function.append(cost)

plt.scatter(m,cost_function)

Visually, we can eyeball the minimum value of the cost function to be somewhere around m = 0.8. This is in line with scikit-learn’s LinearRegression model that we solved above.

But how to mathematically solve for this (without using Ordinary Least Squares) ? That is where Gradient Descent comes in.

Gradient Descent is a technique to find out the minimum of a function numerically.

Imagine a ball put at a random location on the cost curve shown above.

If you were to let the ball go, it would roll down to the bottom. Why does this happen ? Gravity moves the ball from a higher energy state to a lower energy state. You already know this. What is more interesting is the path it takes. The path should always be from a position of higher slope to a position of lower slope.

If you look at the slope of the ball at each of the 4 positions highlighted above, it is pretty clear that the slope (dashed line) decreases with every move down the curve. The slope represents the derivative of the curve – in this case, the derivative of the cost function with respect to the slope m (the x-axis).

How much do you move by ?

The amount and the direction you move is controlled by how much the cost function changes.

How much we move is based on how fast the cost function changes with the slope (or intercept). The way to learn that is by finding out the derivative of the cost function with respect to the slope (and intercept). For now, just to make things simple and to be able to view things in 2D, we are only keeping the slope as the variable (and the intercept as constant). We will see in the next section how to work on minimizing the cost function for both slope and intercept.
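For reference, with the cost function J(m) = (1/(2n)) Σ ( yᵢ − (m·xᵢ + b) )² that the code below uses, differentiating with respect to m gives

∂J/∂m = −(1/n) Σ ( yᵢ − ŷᵢ )·xᵢ

which is exactly the m_der term computed in the loop.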

x = x.astype(float)
y = y.astype(float)

print ( x )
print ( y )
steps = 5
m = 0
n = len(x)
l_rate = 0.0001

# Start Gradient Descent
for step in range(steps) :
    y_pred = m * x + 2.6

    # Derivative of the cost function w.r.t m
    m_der  = (-1/n) * sum( (y - y_pred) * x)
    
    # move m
    m = m -  m_der
    print ( m)

[ 1.  3.  4.  5.  6.  7.  8. 10.]
[ 4.  3.  7.  7.  8. 10.  8. 11.]
31.700000000000003
-1125.3500000000001
41106.975000000006
-1500372.8875000002
54763642.09375

The value of m oscillates hugely. That is because the derivative of the cost function with respect to m gives a general sense of direction, but doesn’t tell us how far we need to go. You can’t go an infinite distance in the direction of the gradient. We need to take baby steps.

Take a small step, evaluate slope, take another small step in the direction of the least slope. This is the essence of Gradient Descent.

It is like a baby learning to take steps. So, there is a concept called learning rate that controls how far we move in the direction of steepest descent. Let’s rewrite the program with a learning rate.

import time

l_rate = 0.001
steps = 1000
m = 0

m_array = []

# Start Gradient Descent
for step in range(steps) :
    y_pred = m * x + 2.6

    # Derivative of the cost function w.r.t m
    m_der  = (-1/n) * sum( (y - y_pred) * x)
    
    # move m
    m = m -  l_rate * m_der
    m_array.append(m)

print ( "optimum slope (m) = ", m)

optimum slope (m) =  0.8453333333333318

Let’s plot the journey of the ball down the cost curve.

# Cost Function 
n = len(y)
y_actual    = y

cost_function = []
m = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0]
for slope in m : 
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y_actual[i] - slope * x[i] - 2.6) ** 2 )
    cost_function.append(cost)


# Steps taken    
n = len(y)

cost_function_m = []
m_steps = m_array
for slope in m_steps : 
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y_actual[i] - slope * x[i] - 2.6) ** 2 )
    cost_function_m.append(cost)

plt.scatter(m_steps,cost_function_m) # steps taken.
plt.scatter(m,cost_function) # cost function
plt.xlabel("m - slope")
plt.ylabel("cost function")
Text(0, 0.5, 'cost function')

As you can see, the ball rolls down towards the minimum in very small steps. That is because of such a small learning rate – 0.001.

Learning Rate

How fast do you go down the path ? It depends on how fast you want to converge (without overshooting). An arbitrary parameter called learning rate( α ) would determine how fast you go down the path. If you want to converge fast, can you increase the learning rate ? Probably not. Here is why.

If you set a learning rate = 0.1, that is roughly how fast you move along the x-axis. However, if you set the learning rate to 0.7 (thinking you could move down the curve faster), here is what would happen – You essentially miss the minimum.

Here is a quick plot of how the ball moves with a learning rate of 0.05 within just 100 iterations. The ball is going back and forth because it is overshooting. However, it finally settles down at the minimum.

Optimize Gradient Descent for both Slope & Intercept

So far, we have optimized Gradient Descent for just slope. How about the intercept ? The moment we introduce the 2nd parameter – intercept , the cost function becomes 3d.

  • x-axis = slope
  • y-axis = intercept
  • z-axis = cost function

Matplotlib provides a rudimentary 3-d scatter plot. We will use it to plot the cost function in 3-d.

import matplotlib.pyplot as plt
import numpy as np

x = np.array([1,3,4,5,6,7,8,10])
y = np.array([4,3,7,7,8,10,8,11])

# This import registers the 3D projection, but is otherwise unused.
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 unused import

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

slope_values     = np.arange(start=0,stop=5,step=0.05)
intercept_values = np.arange(start=0,stop=5,step=0.05)
# y_pred    = slope * x + intercept

n = len(y)

cost_function = []

for index, slope in enumerate(slope_values) : 
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y[i] - slope_values[index] * x[i] - intercept_values[index]) ** 2 )
    cost_function.append(cost)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(slope_values, intercept_values, cost_function,marker='o')

plt.show()

Jupyter notebook doesn’t allow 3-d rotation with the inline backend, but if you try this in your standard IDE (say VS Code), you will be able to get a 3-d look at the plot.

Now, let’s optimize the cost function for both slope and intercept. The partial derivatives of the cost function with respect to the slope (m) and the intercept (b) are

∂J/∂m = −(1/n) Σ ( yᵢ − ŷᵢ )·xᵢ        ∂J/∂b = −(1/n) Σ ( yᵢ − ŷᵢ )

and these are exactly the m_der and b_der terms computed in the code below.

x = np.array([1,3,4,5,6,7,8,10])
y = np.array([4,3,7,7,8,10,8,11])

l_rate = 0.01 # Learning rate
steps = 4000    # number of iterations ( steps )

m = 0 # initial slope
b = 0 # initial intercept

n = float(len(x))

m_array = []
b_array = []

# Start Gradient Descent
for step in range(steps) :
    y_pred = m * x + b

    # Derivative of the cost function w.r.t slope (m)
    m_der  = (-1/n) * sum( (y - y_pred) * x)
    # Derivative of the cost function w.r.t intercept (b)    
    b_der  = (-1/n) * sum( y-y_pred )
    
    # move m
    m = m -  l_rate * m_der
    b = b -  l_rate * b_der
    
    # gather the slope and intercept in an array to plot later 
    m_array.append(m)
    b_array.append(b)
    
print (" optimum slope (m) = ", m)
print ( "optimum intercept (b) = ", b)

optimum slope (m) =  0.8450107631510549
optimum intercept (b) =  2.6022056448336817

Now that we have the optimized slope and intercept (and the path of values the algorithm took at each step), let’s overlay that path on the 3-d plot of the cost function.

slope_values     = np.arange(start=0,stop=3,step=0.05)
intercept_values = np.arange(start=0,stop=3,step=0.05)

n = len(y)

cost_function = []

for index, slope in enumerate(slope_values) : 
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y[i] - slope_values[index] * x[i] - intercept_values[index]) ** 2 )
    cost_function.append(cost)

slope_values_new     = m_array
intercept_values_new = b_array

cost_function_new = []
for index, slope in enumerate(slope_values_new) : 
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y[i] - slope_values_new[index] * x[i] - intercept_values_new[index]) ** 2 )
    cost_function_new.append(cost)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(slope_values, intercept_values, cost_function,marker='o')
ax.scatter(slope_values_new, intercept_values_new, cost_function_new,marker='o')

plt.show()

Stochastic Gradient Descent

For every iteration of the Gradient Descent algorithm, the entire dataset is used to calculate the derivative of the cost function. This becomes very expensive as the dataset grows larger.

Imagine real world problems like image processing that has millions of pixels in a single image. Gradient Descent becomes almost impossible to compute if we don’t optimize.

One possible solution is to use Stochastic Gradient Descent. The word Stochastic stands for random. Instead of using every observation (every row in the dataset), just use one random observation each time the derivative is computed. In standard Gradient Descent, the derivative of the cost function w.r.t. the slope is computed over the whole dataset,

∂J/∂m = −(1/n) Σᵢ ( yᵢ − ŷᵢ )·xᵢ

whereas in Stochastic Gradient Descent it is approximated from a single row,

∂J/∂m ≈ −( yᵢ − ŷᵢ )·xᵢ

where i is the index of a random data row.

In Stochastic Gradient Descent, calculating the cost function is not done for the entire training set. Instead, you pick a random row in the dataset and calculate the cost function for that particular row only.

  • Surprisingly, this gives pretty good results (given the compromise)
  • This is computationally so much more efficient than doing the full dataset.

Let’s do this in python.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

l_rate = 0.001
steps = 1000
m = 0

n = len(x)

x = np.array([1,3,4,5,6,7,8,10])
y = np.array([4,3,7,7,8,10,8,11])

m_array = []
cost_function_m = []

# Start Gradient Descent
for step in range(steps) :
    
    # CHANGE - At every step, get a new random number
    random_index = np.random.randint(0,len(x))
    
    # CHANGE - only calculate the predicted "y" value for that particular data row
    y_pred = m * x[random_index] + 2.6

    # Derivative of the cost function w.r.t m
    # CHANGE - calculate the derivative using only that particular row of the data
    m_der  = -1 * (y[random_index] - y_pred) * x[random_index]
    
    # move m
    m = m -  l_rate * m_der
    m_array.append(m)

m_steps = m_array
for slope in m_steps : 
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y[i] - slope * x[i] - 2.6) ** 2 )
    cost_function_m.append(cost)

plt.scatter(m_steps,cost_function_m) # steps taken.
plt.xlabel("m - slope")
plt.ylabel("cost function")

print ( "optimum slope (m) = ", m)
optimum slope (m) =  0.8793576691371693

That’s pretty close to the real value (as calculated by OLS), right ?

For just a handful of observations, this is not a big deal. Imagine the performance gain if the number of rows were extremely large, as happens in real datasets. However, there is a cost to this trade-off. The solution (the optimum slope in this case) varies with each run. For example, try running the code above 4 or 5 times – each time you get a slightly different solution. Although the difference is not large, stochastic gradient descent results in a slightly different solution with every run. How do we counter this? A compromise between standard gradient descent and stochastic gradient descent is possible – it is called Mini-batch gradient descent.

Mini-batch Gradient Descent

In practice, a technique called mini-batch Gradient Descent is what is mostly used for Gradient Descent problems. It is a hybrid solution between standard gradient descent and stochastic gradient descent. The following picture highlights the difference between the standard, stochastic and mini-batch gradient descent methods.

Let’s program mini-batch Gradient Descent in Python.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

l_rate = 0.001
steps = 1000
m = 0

n = len(x)

x = np.array([1,3,4,5,6,7,8,10])
y = np.array([4,3,7,7,8,10,8,11])

m_array = []
cost_function_m = []

x_range = np.arange(len(x))

# Start Gradient Descent
for step in range(steps) :
    
    # CHANGE - At every step, get a set of new random numbers
    random_index = np.random.choice(x_range, size=3, replace=False)
    
    # CHANGE - only calculate the predicted "y" value for that particular data row
    y_pred = m * x[random_index] + 2.6

    # Derivative of the cost function w.r.t m
    # CHANGE - calculate the derivative only for a particular row in the data
    m_der  = (-1/n) * sum( (y[random_index] - y_pred) * x[random_index])
    
    # move m
    m = m -  l_rate * m_der
    m_array.append(m)

m_steps = m_array
for slope in m_steps : 
    cost = 0
    for i in range(n):
        cost = cost + (1/(2*n)) * ( (y[i] - slope * x[i] - 2.6) ** 2 )
    cost_function_m.append(cost)

plt.scatter(m_steps,cost_function_m) # steps taken.
plt.xlabel("m - slope")
plt.ylabel("cost function")

print ( "optimum slope (m) = ", m)
optimum slope (m) =  0.8472922870088795

This time the slope value is pretty steady.

Mini-batch gradient descent achieves a compromise between the time-consuming but accurate standard Gradient Descent and the quick but slightly noisy Stochastic Gradient Descent.

Gradient Descent is a generic cost minimization algorithm that works as long as the cost function is convex. If there are multiple minima, Gradient Descent may only arrive at a local minimum.

Gradient Descent for Logistic Regression

If you have been through the machine learning tutorial, you must have already seen how logistic regression works and the math behind it. In that tutorial, we used Scikit Learn’s LogisticRegression class to fit the data using Logistic Regression. Here, however, we want to understand how Gradient Descent works for Logistic Regression.

In order to do that, we have to first understand 3 things

  • Logistic Regression equation
  • Cost function
  • Partial Derivative of the Cost function w.r.t. x

The equation for logistic regression is

ŷ = σ( w·x + b ) = 1 / ( 1 + e^−(w·x + b) )

where

  • w is a vector of numbers (the weights)
  • b is a number (the bias)
  • x is a vector of predictors

We haven’t seen w and b in the Linear Regression that we saw previously, right? Where have they sprung from? Well, we are generalizing a bit for the number of predictors. For a single predictor, we can write the Logistic Regression as

ŷ = 1 / ( 1 + e^−(m·x + b) )

where

  • m = slope
  • b = intercept

The output of the Logistic Regression is actually a sigmoid curve.

You can get a visual like so.

from scipy.special import expit
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

x = np.linspace(-10,10)
y = expit (x)
 
plt.plot(x,y)
plt.grid()

The cost function for Logistic Regression could be formulated in the same way as for Linear Regression (as a sum of squares of errors).

However, that would result in a non-convex cost curve for logistic regression. So, instead of using the Sum of Squares of Errors, Logistic Regression uses Cross Entropy for its cost function.
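The equation itself is not reproduced here; the standard binary cross-entropy form is

J(w, b) = −(1/n) Σ [ yᵢ·log(ŷᵢ) + (1 − yᵢ)·log(1 − ŷᵢ) ]

where ŷᵢ is the sigmoid output for row i.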

The partial derivatives of the Cost function w.r.t. w and b are

∂J/∂w = (1/n) Σ ( ŷᵢ − yᵢ )·xᵢ        ∂J/∂b = (1/n) Σ ( ŷᵢ − yᵢ )

Now that we have the 3 required things, let’s write our gradient descent update rule

w ← w − α·∂J/∂w        b ← b − α·∂J/∂b

where α is the learning rate.

For a single predictor, if you think in terms of the slope/intercept (like the Linear Regression above), this becomes

m ← m + α·(1/n) Σᵢ ( yᵢ − ŷᵢ )·xᵢ        b ← b + α·(1/n) Σᵢ ( yᵢ − ŷᵢ )

where i runs over the rows in the dataset.

Now, let’s implement Logistic Regression in Python. Let’s take a simple dataset.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

x = np.array([1,3,4,5,6,7,8,10])
y = np.array([0,0,0,1,1,1,1,1])

plt.scatter(x,y,c=y)

Seems simple enough, right ? Let’s first get a baseline using Scikit Learn’s LogisticRegression model.

from sklearn import linear_model
from scipy.special import expit
 
model_lr = linear_model.LogisticRegression(C=1e5, solver='lbfgs')
model_lr.fit(x.reshape(-1,1), y)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Let’s fit it visually.

x_test = np.linspace(1.0,10.0,100)
# predict dummy y_test data based on the logistic model
y_test = x_test * model_lr.coef_ + model_lr.intercept_
 
sigmoid = expit(y_test)

plt.scatter(x,y, c=y)
 
# ravel to convert the 2-d array to a flat array
plt.plot(x_test,sigmoid.ravel(),c="green", label = "logistic fit")
plt.yticks([0, 0.2, 0.4, 0.5, 0.6, 0.7, 1])
plt.axhline(.5, color="red", label="cutoff")
plt.legend(loc="lower right")

Let’s try and fit this data using Logistic Regression based on Gradient Descent.

model_lr.intercept_
array([-67.57978497])
import numpy as np
from math import exp
import matplotlib.pyplot as plt
%matplotlib inline

l_rate = 0.001
steps = 1000
m = 0

n = len(x)

x = np.array([1,3,4,5,6,7,8,10])
y = np.array([0,0,0,1,1,1,1,1])
import time

l_rate = 1
steps = 1000
m = 0

m_array = []

# Start Gradient Descent
for step in range(steps) :
    y_pred_log = m * x + (-67.57)
    y_pred     = 1/(1 + np.exp(-y_pred_log))

    # Derivative of the cost function w.r.t m
    m_der  = (-1/n) * sum( (y - y_pred) * x)
    
    # move m
    m = m -  l_rate * m_der
    m_array.append(m)

print ( "optimum slope (m) = ", m)
optimum slope (m) =  15.017184591647284

Once again, we got pretty close to the slope as predicted by Scikit Learn’s LogisticRegression model.

model_lr.coef_
array([[15.03246322]])

Now that we understand how Gradient Descent works, let’s move on to the next important topic in Neural Networks – Back Propagation.

Neural Network Basics


Basic Structure

A Neural network is an interconnected set of Neurons, arranged in layers. Input goes on one end and output the other end.

For example, the picture above is a neural network with 4 nodes in the input layer and 3 nodes in the output layer. This is the exact structure that we have used for iris classification that we have solved in our Hello World example on Day 1. The layer in between is called the hidden layer. This is what gives the name – Deep Learning – because the network is deep (with not just the input and output layer, but one or many hidden layers).

This is the basic structure of a neural network. The number of nodes or layers could change, but this is the basic structure of a typical neural network. To understand a neural network better, we have to get started from the basics.

Biological Neuron

Neural networks were inspired by the brain. A human brain consists of billions of interconnected neurons. Here is a quick picture from Wikipedia.

x1, x2, … xn represent the inputs and y1, y2, … yn are the outputs. So, essentially a neuron transforms a set of inputs into a set of outputs. When many such neurons are connected, they form an intelligent system.

Perceptron

The simplest way to represent a neuron mathematically is with a perceptron.

A perceptron receives inputs, adds them up and produces an output. What is the big deal about it? It is just basic addition, right? True – and that’s where the concept of weights comes in.

Each of the inputs is multiplied by a weight. So, instead of just summing up the inputs, you multiply each input by its weight and sum the products (a weighted sum of the inputs). The weighted sum could be a number within a very large range, depending on the input range and the weights. What is the use of having a number that could be anywhere from −∞ to +∞?

To normalize this, a bias or threshold is introduced.
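Putting this together, the perceptron's rule can be written as

output = 1 if ( w1·x1 + w2·x2 + … + wn·xn + b ) > 0, and 0 otherwise

which is exactly the binary step function used in the “Perceptron from scratch” section.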

What does a perceptron achieve

The calculation above seems simple enough, but what exactly does it achieve ? Think of it like a decision making machine. It weighs the input parameters and provides a Yes or No decision. For example, say you want to decide if you want to learn Deep Learning or not, how do you go about it in your mind ?

Inputs                  Weight
Job Prospect            30%
Interesting enough      20%
Future Growth           30%
Salary                  20%

You weigh your inputs (multiply each input by its corresponding weight) and arrive at a figure. In fact, each of these inputs is also given a number internally in your mind. However, the way a human brain functions is far more complicated. Like I said before, neural networks & deep learning are just “based on” how the human brain works. It is not an exact replica.
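
To make this concrete, here is a minimal sketch of that calculation. The ratings and the threshold below are made-up values, purely for illustration; only the weights come from the table above.

import numpy as np

# hypothetical ratings (0 to 10) for: job prospect, interest, future growth, salary
inputs    = np.array([8, 6, 7, 5])
weights   = np.array([0.30, 0.20, 0.30, 0.20])   # weightages from the table above
threshold = 6                                    # assumed cutoff

weighted_sum = np.sum(inputs * weights)
decision = 1 if weighted_sum > threshold else 0  # 1 = go ahead, 0 = don't
print(weighted_sum, decision)                    # weighted sum is about 6.7, so the decision is 1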

Sigmoid Neuron

While a perceptron is good enough for simple tasks, it has its limitations when building complex neural networks. That is where sigmoid neurons come in. If you have seen logistic regression in Machine Learning before, you will already have an idea of what a sigmoid function does. It essentially maps any number between −∞ and +∞ to a value between 0 and 1.

A perceptron outputs either a 0 or a 1 depending on the weighted inputs & threshold. A sigmoid neuron outputs a value between 0 and 1. This makes the sigmoid neuron much more useful in large scale neural networks.

The weighted sum of inputs + bias is calculated just as above.

Now, instead of just outputting this, a sigmoid neuron applies the sigmoid function to the weighted sum + bias and outputs a value between 0 and 1.
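
Here is a minimal sketch of that calculation, with assumed inputs, weights and bias (none of these numbers come from the text, they are just for illustration):

import numpy as np
from scipy.special import expit   # the sigmoid function

inputs  = np.array([8, 6, 7, 5])              # assumed inputs
weights = np.array([0.30, 0.20, 0.30, 0.20])  # assumed weights
bias    = -6.0                                # assumed bias

z = np.dot(weights, inputs) + bias
print(expit(z))   # about 0.67 - a value between 0 and 1, not a hard 0/1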

You can have a visual of the sigmoid function as seen below.

from scipy.special import expit
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

x = np.linspace(-1000,1000)
y = expit (x)
 
plt.plot(x,y)
plt.grid()

This looks like a binary curve (one that can only take a value of 0 or 1), but if you observe the curve closely over a smaller range, say -10 to 10, you can clearly see the gradual progression.

x = np.linspace(-10,10)
y = expit (x)
 
plt.plot(x,y)
plt.grid()

This is the logistic regression curve. Only when the value of the (weighted sum + bias) stays fairly close to 0 do you observe this gradual transition. For extreme values the output is pretty much either a 0 or a 1 (very much like a perceptron).
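
You can verify this numerically with a few sample points: at 0 the sigmoid sits exactly at 0.5, and it saturates quickly towards 0 or 1 as the input moves away from 0.

# sigmoid at a few sample points: 0.5 at z = 0, nearly 0 or 1 at z = -10 or +10
for z in [-10, -2, 0, 2, 10]:
    print(z, expit(z))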

Advantages of Sigmoid Neuron over a Perceptron

Since the output of a sigmoid neuron changes smoothly, small changes in the inputs result in small changes in the output. So, instead of just flipping a switch (0 or 1), the sigmoid function acts more like a slider. This property of the sigmoid’s output makes it very useful for neural network learning.

Changes to your output are essentially a function of changes in the weights and biases. This is the basis of Neural Network learning.

However, to understand this mathematically, we have to understand a little bit of derivatives, partial derivatives, and then Gradient Descent and the back-propagation algorithm itself. These will be the topics of our next chapter.

Hello World in Neural Networks

Hello World in Neural Networks


  Deep Learning

Contents

iris Classification using Tensorflow

The MNIST database (a database of handwritten digits) is typically the Hello World application when introducing Neural Networks for the first time. However, we are going to make it even simpler by taking the iris dataset and creating a Keras-based Tensorflow Neural Network to classify species. Please refer to Iris Data to understand more about the dataset we will be working on. You can also refer to Classification in Python to understand more about a non-neural-network based approach to classifying the species in the iris dataset.

Once you understand how to solve the iris classification problem in Neural Networks, we will move to image recognition. As you will see, structurally there is not a lot of difference in the way we build the neural net for both of these problems.

This is just a “Hello World” tutorial. It is not intended to teach you the internals of Neural Networks. With that background, we are now ready to say hello to Neural Networks using Tensorflow.

import tensorflow as tf
from   tensorflow import keras

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Tensorflow is probably the most popular, open source library from Google that is used to implement Deep Learning. You can build neural networks of any complexity using Tensorflow. However, building a neural net from scratch typically involves defining

  • Layers
  • Linking the layers
  • loss function
  • weight adjustments etc

Defining these manually is very time consuming and daunting for newbies. What is needed is an abstract layer above Tensorflow, that makes building neural nets much quicker and easier.

Keras is the answer. Keras is a high level Python based API that can be used to build neural nets by leveraging Tensorflow. By the way, Tensorflow is not the only deep learning package out there. Here is a quick visual that shows you where Keras and Tensorflow stand in the hierarchy.

# load iris dataset
from sklearn import datasets
iris = datasets.load_iris()

# preview the iris data
print ( iris.data[0:5,:]  ) # data
print ( iris.target[0:5]  ) # target species

# train/test split @ 20% test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data , iris.target, test_size=0.2)  
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0 0 0 0 0]

The following few lines of code are all you need to solve the problem. Quickly execute them to see the output for yourself. However, there is quite a lot of explanation to be done here. Let’s take it step by step.

model = keras.Sequential()
model.add(keras.layers.Dense(4,input_shape=(4,)))
model.add(keras.layers.Dense(8,activation="relu"))
model.add(keras.layers.Dense(8,activation="relu"))
model.add(keras.layers.Dense(3,activation="softmax"))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=100)

y_pred = model.predict(X_test)
y_pred
Epoch 1/100
120/120 [==============================] - 0s 521us/sample - loss: 1.6050 - acc: 0.2917
Epoch 2/100
120/120 [==============================] - 0s 37us/sample - loss: 1.5040 - acc: 0.2917
Epoch 3/100
120/120 [==============================] - 0s 33us/sample - loss: 1.4096 - acc: 0.2917
Epoch 4/100
120/120 [==============================] - 0s 29us/sample - loss: 1.3262 - acc: 0.4333
Epoch 5/100
120/120 [==============================] - 0s 29us/sample - loss: 1.2607 - acc: 0.5667
Epoch 6/100
120/120 [==============================] - 0s 29us/sample - loss: 1.2080 - acc: 0.4667
Epoch 7/100
120/120 [==============================] - 0s 33us/sample - loss: 1.1707 - acc: 0.4917
Epoch 8/100
120/120 [==============================] - 0s 29us/sample - loss: 1.1451 - acc: 0.5000
Epoch 9/100
120/120 [==============================] - 0s 33us/sample - loss: 1.1258 - acc: 0.5167
Epoch 10/100
120/120 [==============================] - 0s 62us/sample - loss: 1.1068 - acc: 0.5417
Epoch 11/100
120/120 [==============================] - 0s 54us/sample - loss: 1.0904 - acc: 0.5833
Epoch 12/100
120/120 [==============================] - 0s 37us/sample - loss: 1.0770 - acc: 0.5833
Epoch 13/100
120/120 [==============================] - 0s 33us/sample - loss: 1.0674 - acc: 0.5250
Epoch 14/100
120/120 [==============================] - 0s 33us/sample - loss: 1.0610 - acc: 0.3917
Epoch 15/100
120/120 [==============================] - 0s 33us/sample - loss: 1.0548 - acc: 0.3583
Epoch 16/100
120/120 [==============================] - 0s 33us/sample - loss: 1.0498 - acc: 0.3417
Epoch 17/100
120/120 [==============================] - 0s 33us/sample - loss: 1.0453 - acc: 0.3417
Epoch 18/100
120/120 [==============================] - 0s 54us/sample - loss: 1.0397 - acc: 0.3417
Epoch 19/100
120/120 [==============================] - 0s 33us/sample - loss: 1.0344 - acc: 0.3417
Epoch 20/100
120/120 [==============================] - 0s 41us/sample - loss: 1.0294 - acc: 0.3417
Epoch 21/100
120/120 [==============================] - 0s 29us/sample - loss: 1.0241 - acc: 0.3417
Epoch 22/100
120/120 [==============================] - 0s 29us/sample - loss: 1.0185 - acc: 0.3417
Epoch 23/100
120/120 [==============================] - 0s 37us/sample - loss: 1.0128 - acc: 0.3417
Epoch 24/100
120/120 [==============================] - 0s 33us/sample - loss: 1.0065 - acc: 0.3417
Epoch 25/100
120/120 [==============================] - 0s 37us/sample - loss: 0.9998 - acc: 0.3417
Epoch 26/100
120/120 [==============================] - 0s 54us/sample - loss: 0.9933 - acc: 0.3417
Epoch 27/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9869 - acc: 0.3583
Epoch 28/100
120/120 [==============================] - 0s 37us/sample - loss: 0.9786 - acc: 0.3667
Epoch 29/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9715 - acc: 0.4250
Epoch 30/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9627 - acc: 0.5500
Epoch 31/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9554 - acc: 0.6083
Epoch 32/100
120/120 [==============================] - 0s 50us/sample - loss: 0.9456 - acc: 0.6250
Epoch 33/100
120/120 [==============================] - 0s 42us/sample - loss: 0.9367 - acc: 0.6250
Epoch 34/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9272 - acc: 0.6333
Epoch 35/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9173 - acc: 0.6333
Epoch 36/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9076 - acc: 0.6500
Epoch 37/100
120/120 [==============================] - 0s 33us/sample - loss: 0.8972 - acc: 0.6500
Epoch 38/100
120/120 [==============================] - 0s 33us/sample - loss: 0.8869 - acc: 0.6500
Epoch 39/100
120/120 [==============================] - 0s 50us/sample - loss: 0.8768 - acc: 0.6583
Epoch 40/100
120/120 [==============================] - 0s 37us/sample - loss: 0.8661 - acc: 0.6583
Epoch 41/100
120/120 [==============================] - 0s 33us/sample - loss: 0.8548 - acc: 0.6583
Epoch 42/100
120/120 [==============================] - 0s 37us/sample - loss: 0.8442 - acc: 0.6583
Epoch 43/100
120/120 [==============================] - 0s 29us/sample - loss: 0.8339 - acc: 0.6667
Epoch 44/100
120/120 [==============================] - 0s 29us/sample - loss: 0.8220 - acc: 0.6667
Epoch 45/100
120/120 [==============================] - 0s 33us/sample - loss: 0.8111 - acc: 0.6667
Epoch 46/100
120/120 [==============================] - 0s 58us/sample - loss: 0.7997 - acc: 0.6750
Epoch 47/100
120/120 [==============================] - 0s 41us/sample - loss: 0.7883 - acc: 0.6833
Epoch 48/100
120/120 [==============================] - 0s 33us/sample - loss: 0.7770 - acc: 0.6833
Epoch 49/100
120/120 [==============================] - 0s 50us/sample - loss: 0.7658 - acc: 0.6750
Epoch 50/100
120/120 [==============================] - 0s 37us/sample - loss: 0.7541 - acc: 0.6750
Epoch 51/100
120/120 [==============================] - 0s 29us/sample - loss: 0.7431 - acc: 0.6917
Epoch 52/100
120/120 [==============================] - 0s 33us/sample - loss: 0.7317 - acc: 0.6917
Epoch 53/100
120/120 [==============================] - 0s 33us/sample - loss: 0.7211 - acc: 0.7167
Epoch 54/100
120/120 [==============================] - 0s 33us/sample - loss: 0.7099 - acc: 0.7250
Epoch 55/100
120/120 [==============================] - 0s 33us/sample - loss: 0.6991 - acc: 0.7167
Epoch 56/100
120/120 [==============================] - 0s 37us/sample - loss: 0.6885 - acc: 0.7167
Epoch 57/100
120/120 [==============================] - 0s 45us/sample - loss: 0.6782 - acc: 0.7083
Epoch 58/100
120/120 [==============================] - 0s 33us/sample - loss: 0.6684 - acc: 0.7083
Epoch 59/100
120/120 [==============================] - 0s 29us/sample - loss: 0.6599 - acc: 0.7167
Epoch 60/100
120/120 [==============================] - 0s 29us/sample - loss: 0.6481 - acc: 0.7667
Epoch 61/100
120/120 [==============================] - 0s 29us/sample - loss: 0.6382 - acc: 0.7583
Epoch 62/100
120/120 [==============================] - 0s 29us/sample - loss: 0.6286 - acc: 0.7750
Epoch 63/100
120/120 [==============================] - 0s 33us/sample - loss: 0.6196 - acc: 0.7667
Epoch 64/100
120/120 [==============================] - 0s 58us/sample - loss: 0.6111 - acc: 0.7667
Epoch 65/100
120/120 [==============================] - 0s 29us/sample - loss: 0.6018 - acc: 0.7833
Epoch 66/100
120/120 [==============================] - 0s 29us/sample - loss: 0.5936 - acc: 0.7917
Epoch 67/100
120/120 [==============================] - 0s 33us/sample - loss: 0.5860 - acc: 0.8000
Epoch 68/100
120/120 [==============================] - 0s 25us/sample - loss: 0.5769 - acc: 0.8250
Epoch 69/100
120/120 [==============================] - 0s 37us/sample - loss: 0.5688 - acc: 0.8167
Epoch 70/100
120/120 [==============================] - 0s 29us/sample - loss: 0.5610 - acc: 0.8250
Epoch 71/100
120/120 [==============================] - 0s 37us/sample - loss: 0.5537 - acc: 0.8417
Epoch 72/100
120/120 [==============================] - 0s 54us/sample - loss: 0.5461 - acc: 0.8500
Epoch 73/100
120/120 [==============================] - 0s 33us/sample - loss: 0.5397 - acc: 0.8417
Epoch 74/100
120/120 [==============================] - 0s 33us/sample - loss: 0.5323 - acc: 0.8417
Epoch 75/100
120/120 [==============================] - 0s 37us/sample - loss: 0.5266 - acc: 0.8500
Epoch 76/100
120/120 [==============================] - 0s 33us/sample - loss: 0.5184 - acc: 0.8583
Epoch 77/100
120/120 [==============================] - 0s 33us/sample - loss: 0.5126 - acc: 0.8583
Epoch 78/100
120/120 [==============================] - 0s 29us/sample - loss: 0.5079 - acc: 0.8500
Epoch 79/100
120/120 [==============================] - 0s 50us/sample - loss: 0.5031 - acc: 0.8583
Epoch 80/100
120/120 [==============================] - 0s 41us/sample - loss: 0.4950 - acc: 0.8500
Epoch 81/100
120/120 [==============================] - 0s 37us/sample - loss: 0.4883 - acc: 0.8500
Epoch 82/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4828 - acc: 0.8583
Epoch 83/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4782 - acc: 0.8583
Epoch 84/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4727 - acc: 0.8583
Epoch 85/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4672 - acc: 0.8750
Epoch 86/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4621 - acc: 0.8750
Epoch 87/100
120/120 [==============================] - 0s 45us/sample - loss: 0.4567 - acc: 0.8750
Epoch 88/100
120/120 [==============================] - 0s 45us/sample - loss: 0.4513 - acc: 0.8750
Epoch 89/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4470 - acc: 0.8833
Epoch 90/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4426 - acc: 0.8917
Epoch 91/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4374 - acc: 0.8833
Epoch 92/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4326 - acc: 0.8750
Epoch 93/100
120/120 [==============================] - 0s 37us/sample - loss: 0.4277 - acc: 0.8833
Epoch 94/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4234 - acc: 0.8833
Epoch 95/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4188 - acc: 0.8917
Epoch 96/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4139 - acc: 0.8917
Epoch 97/100
120/120 [==============================] - 0s 41us/sample - loss: 0.4095 - acc: 0.8917
Epoch 98/100
120/120 [==============================] - 0s 45us/sample - loss: 0.4054 - acc: 0.8917
Epoch 99/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4006 - acc: 0.8917
Epoch 100/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3962 - acc: 0.8917
array([[1.20811760e-02, 3.91645938e-01, 5.96272826e-01],
       [1.69933531e-02, 3.97484392e-01, 5.85522234e-01],
       [9.59960818e-01, 3.92647162e-02, 7.74465443e-04],
       [1.90705255e-01, 6.64598465e-01, 1.44696265e-01],
       [3.11258342e-03, 2.26298794e-01, 7.70588636e-01],
       [8.99805903e-01, 9.60557535e-02, 4.13831137e-03],
       [6.15397003e-03, 2.78556108e-01, 7.15289891e-01],
       [9.58240926e-01, 4.09574024e-02, 8.01660179e-04],
       [8.35558847e-02, 6.37535155e-01, 2.78908908e-01],
       [9.11576152e-01, 8.53549615e-02, 3.06887995e-03],
       [2.80173123e-02, 5.20794570e-01, 4.51188117e-01],
       [9.81949151e-01, 1.78453047e-02, 2.05568867e-04],
       [9.13475394e-01, 8.33630040e-02, 3.16164969e-03],
       [4.98204343e-02, 5.69957256e-01, 3.80222321e-01],
       [2.83193532e-02, 5.36988616e-01, 4.34692025e-01],
       [6.19469536e-03, 2.78104872e-01, 7.15700507e-01],
       [5.04648834e-02, 5.63345432e-01, 3.86189699e-01],
       [9.01798606e-01, 9.46312845e-02, 3.57011799e-03],
       [3.41202389e-03, 2.44403824e-01, 7.52184212e-01],
       [9.06935573e-01, 9.03311223e-02, 2.73334724e-03],
       [9.46662784e-01, 5.19549623e-02, 1.38220214e-03],
       [9.40084696e-01, 5.81936389e-02, 1.72167330e-03],
       [9.40235198e-01, 5.79454526e-02, 1.81935111e-03],
       [9.38879550e-01, 5.96898273e-02, 1.43059052e-03],
       [9.09764946e-01, 8.71445313e-02, 3.09041399e-03],
       [9.40479219e-01, 5.79398200e-02, 1.58103404e-03],
       [2.62060165e-01, 6.20771348e-01, 1.17168434e-01],
       [4.93753655e-03, 2.82564163e-01, 7.12498307e-01],
       [9.47779417e-01, 5.09882867e-02, 1.23227667e-03],
       [1.00235706e-02, 3.16601396e-01, 6.73375070e-01]], dtype=float32)

Step 1 – What type of neural network are we building ?

model = keras.Sequential()

There are two types of Neural networks that can be built in Keras

  • Sequential
  • Functional

This classification is related to the structure of the Neural Network. However, most of the time you will be using the Sequential model. It can solve most problems. In a sequential neural net, neurons are arranged in layers and in sequence. The firing and wiring happen in sequence, hence the name. Later in the course, when we see an example of a functional neural net, the difference will become clear. Here is a quick visual of what we are building.

Finally, when the network is trained, the output node corresponding to the correct species (for the corresponding data point) will light up. When we look at the last step, we will understand what is meant by light up.
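
As a side note, Keras also lets you pass all the layers as a list to keras.Sequential() in one go. The add() style used in this walkthrough is equivalent; the compact form is shown below under a separate variable name so it does not interfere with the step-by-step model.

# equivalent, more compact way of defining the same network
model_compact = keras.Sequential([
    keras.layers.Dense(4, input_shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(3, activation="softmax")
])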

Step 2 – How are the neurons connected ?

We are building a Dense neural network.

model.add(keras.layers.Dense(4,input_shape=(4,)))

A Dense neural network is one in which each neuron is connected to all other neurons in the previous and next layers.

You can see from the visual below that the arrows coming in to each neuron are connected to all the neurons in the previous layer. This is the most typical type of neural network.

Also, with this statement, we are just building the input layer – an input layer with 4 nodes, one node for each of the inputs. Naturally, the assumption at this point would be that there are as many nodes in the input layer as there are inputs. So, why specify the input_shape parameter separately ? In later examples we will see that the input data shape need not always match the number of input nodes. We specify the input_shape parameter as a tuple. In this case the input is a 1-d vector. Once again, later in the course we will see examples of 2-d data.

The parameter input_shape is only used when creating the first layer. The next set of steps (hidden layer and output layer) do not need this parameter.

Step 3 – Hidden Layers

This is where the magic happens. Let’s try it with just one hidden layer.

model.add(keras.layers.Dense(8,activation="relu"))

Irrespective of the layer (input, hidden or output), the way to add layers is using the add function. That should make things easy for us. The new parameter that you see in the hidden layer is the activation parameter.

In this case, we are using a relu activation function. ReLU stands for Rectified Linear Unit. The mathematical definition of relu is relu(x) = max(0, x) – negative inputs are clipped to 0 and positive inputs pass through unchanged.

The output of the activation function looks like this.
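
Since the relu plot is not reproduced here, you can generate it yourself the same way we plotted the sigmoid earlier:

x = np.linspace(-10, 10)
y = np.maximum(0, x)   # relu(x) = max(0, x), element-wise

plt.plot(x, y)
plt.grid()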

Step 4 – Output Layer

After the hidden layer is added, we add the output layer. Since we are doing a multi-class classification, the preferred activation function is called softmax – more on this later. A softmax activation function outputs one probability per class, and the class with the highest probability is the predicted output.

model.add(keras.layers.Dense(3,activation="softmax"))
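
To get a feel for what softmax does, here is a small sketch with made-up scores for the 3 classes. Softmax exponentiates each score and divides by the total, so the results are positive and sum to 1:

# softmax on a made-up set of raw scores for the 3 classes
scores = np.array([2.0, 1.0, 0.1])
probs  = np.exp(scores) / np.sum(np.exp(scores))
print(probs)          # roughly [0.66, 0.24, 0.10]
print(np.sum(probs))  # 1.0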

Step 5 – Compile the model

So far, we have created the structure of the neural net – layer by layer. At each step, we have defined the number of nodes and the activation function to be used. Once we have completed it, we now have to compile the model.

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

So far, we have just defined what the neural net should look like. With the compile() method, Keras translates the parameters you have specified into an optimized series of steps that can then be executed on the computer. Without the compile step, you cannot fit (train) the model. We will see how we use metrics in a bit, but the optimizer and loss parameters require quite a bit of explanation.

Typically, a Machine Learning algorithm requires some kind of loss function to be minimized, and Gradient Descent is a commonly used technique for minimizing it. For classification problems, a common loss function is Cross Entropy, also called Log Loss. Mathematically, for 2 classes, the cross entropy of a predicted probability p against the actual label y can be written as loss = -( y·log(p) + (1-y)·log(1-p) ).

Let’s look at an example below. Say, we are just looking at 2 species of iris flowers.

0 – setosa 1 – virginica

Suppose the actual species is virginica (y = 1). For a range of predicted probabilities p, the loss function can be calculated as follows.

import numpy as np

p = np.array([0.0001,0.001,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99])

# when y = 1, what is the loss function ?
y = 1
l = - (y * np.log10(p) + (1-y) * np.log10(1-p)  )

# now plot it to see how the loss function decreases as the predicted value approaches the actual value (of y = 1)
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(p,l)
plt.xlabel("Different predicted values when the actual value is 1")
plt.ylabel("Loss function")
Text(0, 0.5, 'Loss function')

What this plot means is that the more the predicted value deviates from the actual value, the larger the loss function becomes. For example, as the predicted value gets closer to the actual value (1 in this case), the loss function gets closer and closer to 0.

At this point, you can see a quick summary of the model you have created so far

model.summary()
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_8 (Dense)              (None, 4)                 20        
_________________________________________________________________
dense_9 (Dense)              (None, 8)                 40        
_________________________________________________________________
dense_10 (Dense)             (None, 3)                 27        
=================================================================
Total params: 87
Trainable params: 87
Non-trainable params: 0
_________________________________________________________________
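
The parameter counts in this summary follow directly from the dense connectivity described above: each Dense layer has (number of inputs × number of nodes) weights plus one bias per node. A quick arithmetic check against the summary:

# params of a Dense layer = inputs * nodes + nodes (one bias per node)
print(4 * 4 + 4)   # first layer  : 20
print(4 * 8 + 8)   # hidden layer : 40
print(8 * 3 + 3)   # output layer : 27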

Step 6 – Fit the model with training data

This is where we train the model. The word epoch represents one complete pass over the training dataset. With each epoch (one pass over the entire dataset) the weights are adjusted and the accuracy slowly increases. Since we specified accuracy as a metric in step 5, it is shown at each training epoch. That way you can see how the accuracy improves with each epoch.

model.fit(X_train, y_train, epochs=100)

Epoch 1/100
120/120 [==============================] - 0s 446us/sample - loss: 1.6451 - acc: 0.2167
Epoch 2/100
120/120 [==============================] - 0s 33us/sample - loss: 1.5703 - acc: 0.2750
Epoch 3/100
120/120 [==============================] - 0s 29us/sample - loss: 1.5049 - acc: 0.3083
Epoch 4/100
120/120 [==============================] - 0s 29us/sample - loss: 1.4404 - acc: 0.3083
Epoch 5/100
120/120 [==============================] - 0s 29us/sample - loss: 1.3812 - acc: 0.3167
Epoch 6/100
120/120 [==============================] - 0s 41us/sample - loss: 1.3288 - acc: 0.3333
Epoch 7/100
120/120 [==============================] - 0s 33us/sample - loss: 1.2814 - acc: 0.3333
Epoch 8/100
120/120 [==============================] - 0s 37us/sample - loss: 1.2384 - acc: 0.3417
Epoch 9/100
120/120 [==============================] - 0s 33us/sample - loss: 1.2007 - acc: 0.3417
Epoch 10/100
120/120 [==============================] - 0s 29us/sample - loss: 1.1680 - acc: 0.3500
Epoch 11/100
120/120 [==============================] - 0s 33us/sample - loss: 1.1370 - acc: 0.3917
Epoch 12/100
120/120 [==============================] - 0s 37us/sample - loss: 1.1077 - acc: 0.4583
Epoch 13/100
120/120 [==============================] - 0s 45us/sample - loss: 1.0831 - acc: 0.5167
Epoch 14/100
120/120 [==============================] - 0s 45us/sample - loss: 1.0567 - acc: 0.5417
Epoch 15/100
120/120 [==============================] - 0s 37us/sample - loss: 1.0347 - acc: 0.5833
Epoch 16/100
120/120 [==============================] - 0s 33us/sample - loss: 1.0128 - acc: 0.6000
Epoch 17/100
120/120 [==============================] - 0s 29us/sample - loss: 0.9897 - acc: 0.6000
Epoch 18/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9706 - acc: 0.6167
Epoch 19/100
120/120 [==============================] - 0s 54us/sample - loss: 0.9528 - acc: 0.6167
Epoch 20/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9355 - acc: 0.6167
Epoch 21/100
120/120 [==============================] - 0s 37us/sample - loss: 0.9172 - acc: 0.6083
Epoch 22/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9005 - acc: 0.5917
Epoch 23/100
120/120 [==============================] - 0s 25us/sample - loss: 0.8838 - acc: 0.5833
Epoch 24/100
120/120 [==============================] - 0s 29us/sample - loss: 0.8681 - acc: 0.5917
Epoch 25/100
120/120 [==============================] - 0s 29us/sample - loss: 0.8518 - acc: 0.6417
Epoch 26/100
120/120 [==============================] - 0s 41us/sample - loss: 0.8370 - acc: 0.8000
Epoch 27/100
120/120 [==============================] - 0s 29us/sample - loss: 0.8212 - acc: 0.8500
Epoch 28/100
120/120 [==============================] - 0s 37us/sample - loss: 0.8063 - acc: 0.8750
Epoch 29/100
120/120 [==============================] - 0s 33us/sample - loss: 0.7920 - acc: 0.9000
Epoch 30/100
120/120 [==============================] - 0s 33us/sample - loss: 0.7775 - acc: 0.9000
Epoch 31/100
120/120 [==============================] - 0s 33us/sample - loss: 0.7639 - acc: 0.9083
Epoch 32/100
120/120 [==============================] - 0s 50us/sample - loss: 0.7499 - acc: 0.9000
Epoch 33/100
120/120 [==============================] - 0s 29us/sample - loss: 0.7373 - acc: 0.9000
Epoch 34/100
120/120 [==============================] - 0s 37us/sample - loss: 0.7235 - acc: 0.9000
Epoch 35/100
120/120 [==============================] - 0s 29us/sample - loss: 0.7112 - acc: 0.9000
Epoch 36/100
120/120 [==============================] - 0s 29us/sample - loss: 0.6985 - acc: 0.9000
Epoch 37/100
120/120 [==============================] - 0s 33us/sample - loss: 0.6864 - acc: 0.9083
Epoch 38/100
120/120 [==============================] - 0s 50us/sample - loss: 0.6744 - acc: 0.9250
Epoch 39/100
120/120 [==============================] - 0s 29us/sample - loss: 0.6631 - acc: 0.9333
Epoch 40/100
120/120 [==============================] - 0s 37us/sample - loss: 0.6517 - acc: 0.9333
Epoch 41/100
120/120 [==============================] - 0s 25us/sample - loss: 0.6411 - acc: 0.9417
Epoch 42/100
120/120 [==============================] - 0s 29us/sample - loss: 0.6303 - acc: 0.9333
Epoch 43/100
120/120 [==============================] - 0s 33us/sample - loss: 0.6197 - acc: 0.9333
Epoch 44/100
120/120 [==============================] - 0s 46us/sample - loss: 0.6098 - acc: 0.9333
Epoch 45/100
120/120 [==============================] - 0s 33us/sample - loss: 0.6000 - acc: 0.9333
Epoch 46/100
120/120 [==============================] - 0s 41us/sample - loss: 0.5902 - acc: 0.9333
Epoch 47/100
120/120 [==============================] - 0s 50us/sample - loss: 0.5814 - acc: 0.9333
Epoch 48/100
120/120 [==============================] - 0s 37us/sample - loss: 0.5721 - acc: 0.9333
Epoch 49/100
120/120 [==============================] - 0s 29us/sample - loss: 0.5637 - acc: 0.9417
Epoch 50/100
120/120 [==============================] - 0s 29us/sample - loss: 0.5550 - acc: 0.9417
Epoch 51/100
120/120 [==============================] - 0s 29us/sample - loss: 0.5469 - acc: 0.9417
Epoch 52/100
120/120 [==============================] - 0s 37us/sample - loss: 0.5393 - acc: 0.9667
Epoch 53/100
120/120 [==============================] - 0s 41us/sample - loss: 0.5317 - acc: 0.9667
Epoch 54/100
120/120 [==============================] - 0s 29us/sample - loss: 0.5239 - acc: 0.9667
Epoch 55/100
120/120 [==============================] - 0s 37us/sample - loss: 0.5167 - acc: 0.9667
Epoch 56/100
120/120 [==============================] - 0s 37us/sample - loss: 0.5096 - acc: 0.9667
Epoch 57/100
120/120 [==============================] - 0s 33us/sample - loss: 0.5027 - acc: 0.9667
Epoch 58/100
120/120 [==============================] - 0s 50us/sample - loss: 0.4964 - acc: 0.9667
Epoch 59/100
120/120 [==============================] - 0s 41us/sample - loss: 0.4897 - acc: 0.9667
Epoch 60/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4837 - acc: 0.9667
Epoch 61/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4774 - acc: 0.9667
Epoch 62/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4716 - acc: 0.9667
Epoch 63/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4660 - acc: 0.9667
Epoch 64/100
120/120 [==============================] - 0s 54us/sample - loss: 0.4602 - acc: 0.9667
Epoch 65/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4547 - acc: 0.9667
Epoch 66/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4495 - acc: 0.9667
Epoch 67/100
120/120 [==============================] - 0s 41us/sample - loss: 0.4440 - acc: 0.9667
Epoch 68/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4390 - acc: 0.9667
Epoch 69/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4339 - acc: 0.9667
Epoch 70/100
120/120 [==============================] - 0s 45us/sample - loss: 0.4292 - acc: 0.9667
Epoch 71/100
120/120 [==============================] - 0s 37us/sample - loss: 0.4243 - acc: 0.9667
Epoch 72/100
120/120 [==============================] - 0s 41us/sample - loss: 0.4196 - acc: 0.9667
Epoch 73/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4149 - acc: 0.9750
Epoch 74/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4105 - acc: 0.9750
Epoch 75/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4062 - acc: 0.9750
Epoch 76/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4018 - acc: 0.9750
Epoch 77/100
120/120 [==============================] - 0s 50us/sample - loss: 0.3976 - acc: 0.9750
Epoch 78/100
120/120 [==============================] - 0s 41us/sample - loss: 0.3936 - acc: 0.9750
Epoch 79/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3903 - acc: 0.9750
Epoch 80/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3852 - acc: 0.9750
Epoch 81/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3817 - acc: 0.9750
Epoch 82/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3781 - acc: 0.9750
Epoch 83/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3740 - acc: 0.9750
Epoch 84/100
120/120 [==============================] - 0s 41us/sample - loss: 0.3703 - acc: 0.9750
Epoch 85/100
120/120 [==============================] - 0s 37us/sample - loss: 0.3664 - acc: 0.9750
Epoch 86/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3630 - acc: 0.9750
Epoch 87/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3591 - acc: 0.9750
Epoch 88/100
120/120 [==============================] - 0s 37us/sample - loss: 0.3557 - acc: 0.9750
Epoch 89/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3523 - acc: 0.9750
Epoch 90/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3488 - acc: 0.9750
Epoch 91/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3454 - acc: 0.9750
Epoch 92/100
120/120 [==============================] - 0s 45us/sample - loss: 0.3422 - acc: 0.9750
Epoch 93/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3389 - acc: 0.9750
Epoch 94/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3357 - acc: 0.9750
Epoch 95/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3329 - acc: 0.9750
Epoch 96/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3296 - acc: 0.9750
Epoch 97/100
120/120 [==============================] - 0s 25us/sample - loss: 0.3267 - acc: 0.9750
Epoch 98/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3232 - acc: 0.9750
Epoch 99/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3203 - acc: 0.9750
Epoch 100/100
120/120 [==============================] - 0s 58us/sample - loss: 0.3171 - acc: 0.9750

After finishing 100 epochs, the training accuracy is around 97% – not bad for our first attempt. The exact number will vary from run to run; shortly we will see how adding nodes or another hidden layer affects it.

Step 7 – Predict data

Now that the model is trained, you can start predicting on your test data. This step is pretty straightforward if you have already used sklearn to predict test data with any machine learning model.

y_pred = model.predict(X_test)
y_pred
array([[4.80523752e-03, 1.61355630e-01, 8.33839059e-01],
       [5.65165561e-03, 1.38419360e-01, 8.55928957e-01],
       [9.37631130e-01, 5.78489937e-02, 4.51981043e-03],
       [9.99230742e-02, 7.17553735e-01, 1.82523206e-01],
       [9.50973015e-04, 1.86273277e-01, 8.12775731e-01],
       [8.69464993e-01, 1.15551665e-01, 1.49833458e-02],
       [6.78034406e-03, 2.44014740e-01, 7.49204934e-01],
       [9.12608325e-01, 8.10388103e-02, 6.35283068e-03],
       [6.19196370e-02, 7.35043049e-01, 2.03037351e-01],
       [8.91312957e-01, 9.75011438e-02, 1.11858230e-02],
       [1.72961298e-02, 3.84219527e-01, 5.98484337e-01],
       [9.59164917e-01, 3.88094820e-02, 2.02561123e-03],
       [9.01787043e-01, 8.78127143e-02, 1.04002040e-02],
       [4.25860547e-02, 4.14889425e-01, 5.42524457e-01],
       [2.64684074e-02, 6.92561448e-01, 2.80970186e-01],
       [3.44479713e-03, 2.71285832e-01, 7.25269318e-01],
       [4.74239029e-02, 6.01970136e-01, 3.50605994e-01],
       [8.60095024e-01, 1.25155374e-01, 1.47496713e-02],
       [1.49409531e-03, 2.42252737e-01, 7.56253123e-01],
       [8.37987006e-01, 1.46805704e-01, 1.52072664e-02],
       [8.97249877e-01, 9.39026028e-02, 8.84753559e-03],
       [9.13183808e-01, 7.87739381e-02, 8.04229267e-03],
       [8.98115098e-01, 9.22952741e-02, 9.58956406e-03],
       [8.81850123e-01, 1.08397752e-01, 9.75214690e-03],
       [8.75269353e-01, 1.11916631e-01, 1.28139891e-02],
       [9.16219890e-01, 7.63572678e-02, 7.42286025e-03],
       [1.50582254e-01, 6.88981295e-01, 1.60436377e-01],
       [8.01152084e-03, 4.72976834e-01, 5.19011676e-01],
       [9.26478028e-01, 6.74546957e-02, 6.06729649e-03],
       [8.82636756e-03, 2.28457645e-01, 7.62715995e-01]], dtype=float32)

Since this is a multi-class problem, what the neural net outputs are probabilities – one per class. The class with the highest probability is the predicted value. However, we need to convert these probabilities back to class indices.

y_pred_class = np.argmax(y_pred,axis=1)
y_pred_class
array([2, 2, 0, 1, 2, 0, 2, 0, 1, 0, 2, 0, 0, 2, 1, 2, 1, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 1, 2, 0, 2], dtype=int64)

Step 8 – Evaluate Model

Since the output is categorical data, a quick confusion matrix will show us how well the model performs. scikit-learn’s confusion_matrix should do.

from sklearn.metrics import confusion_matrix
 
cm = confusion_matrix(y_test, y_pred_class)
print ( cm )

[[15  0  0]
 [ 0  5  1]
 [ 0  0  9]]
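
The diagonal of the confusion matrix holds the correctly classified samples (here 15 + 5 + 9 = 29 out of 30), so the accuracy can also be read straight off it:

# accuracy = correctly classified / total = trace of the confusion matrix / its sum
print(cm.trace() / cm.sum())   # 29/30 = 0.9666...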

And of course the final number – accuracy.

from sklearn.metrics import accuracy_score
 
accuracy_score(y_test,y_pred_class)

0.9666666666666667

That is already quite good for this dataset, though the exact number varies from run to run. Let’s see if we can optimize it further.

Step 9 – Optimize Model

There are a couple of ways to optimize for higher accuracy. One way is to increase the nodes in the hidden layer. Let’s try to increase the number of nodes from 8 to 20 and see how the network performs.

model = keras.Sequential()
model.add(keras.layers.Dense(4,input_shape=(4,)))
# BEGIN change - increase the number of nodes from 8 to 20
model.add(keras.layers.Dense(20,activation="relu"))
# END change
model.add(keras.layers.Dense(3,activation="softmax"))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=100)

y_pred = model.predict(X_test)
y_pred_class = np.argmax(y_pred,axis=1)

from sklearn.metrics import accuracy_score
 
accuracy_score(y_test,y_pred_class)

Epoch 1/100
120/120 [==============================] - 0s 484us/sample - loss: 1.9111 - acc: 0.2917
Epoch 2/100
120/120 [==============================] - 0s 29us/sample - loss: 1.7657 - acc: 0.2917
Epoch 3/100
120/120 [==============================] - 0s 33us/sample - loss: 1.6348 - acc: 0.2917
Epoch 4/100
120/120 [==============================] - 0s 33us/sample - loss: 1.5117 - acc: 0.2917
Epoch 5/100
120/120 [==============================] - 0s 33us/sample - loss: 1.4100 - acc: 0.2917
Epoch 6/100
120/120 [==============================] - 0s 33us/sample - loss: 1.3202 - acc: 0.2917
Epoch 7/100
120/120 [==============================] - 0s 29us/sample - loss: 1.2427 - acc: 0.2917
Epoch 8/100
120/120 [==============================] - 0s 33us/sample - loss: 1.1696 - acc: 0.2917
Epoch 9/100
120/120 [==============================] - 0s 45us/sample - loss: 1.1183 - acc: 0.2917
Epoch 10/100
120/120 [==============================] - 0s 37us/sample - loss: 1.0729 - acc: 0.4083
Epoch 11/100
120/120 [==============================] - 0s 29us/sample - loss: 1.0347 - acc: 0.4667
Epoch 12/100
120/120 [==============================] - 0s 29us/sample - loss: 1.0049 - acc: 0.5000
Epoch 13/100
120/120 [==============================] - 0s 29us/sample - loss: 0.9793 - acc: 0.5500
Epoch 14/100
120/120 [==============================] - 0s 29us/sample - loss: 0.9566 - acc: 0.5917
Epoch 15/100
120/120 [==============================] - 0s 29us/sample - loss: 0.9369 - acc: 0.5833
Epoch 16/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9184 - acc: 0.6000
Epoch 17/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9012 - acc: 0.6000
Epoch 18/100
120/120 [==============================] - 0s 33us/sample - loss: 0.8860 - acc: 0.6083
Epoch 19/100
120/120 [==============================] - 0s 33us/sample - loss: 0.8717 - acc: 0.6083
Epoch 20/100
120/120 [==============================] - 0s 29us/sample - loss: 0.8571 - acc: 0.6083
Epoch 21/100
120/120 [==============================] - 0s 33us/sample - loss: 0.8440 - acc: 0.6083
Epoch 22/100
120/120 [==============================] - 0s 50us/sample - loss: 0.8301 - acc: 0.6250
Epoch 23/100
120/120 [==============================] - 0s 29us/sample - loss: 0.8142 - acc: 0.6333
Epoch 24/100
120/120 [==============================] - 0s 33us/sample - loss: 0.7965 - acc: 0.6417
Epoch 25/100
120/120 [==============================] - 0s 37us/sample - loss: 0.7782 - acc: 0.6417
Epoch 26/100
120/120 [==============================] - 0s 29us/sample - loss: 0.7605 - acc: 0.6417
Epoch 27/100
120/120 [==============================] - 0s 37us/sample - loss: 0.7451 - acc: 0.6583
Epoch 28/100
120/120 [==============================] - 0s 45us/sample - loss: 0.7296 - acc: 0.6667
Epoch 29/100
120/120 [==============================] - 0s 33us/sample - loss: 0.7159 - acc: 0.6917
Epoch 30/100
120/120 [==============================] - 0s 29us/sample - loss: 0.7041 - acc: 0.7417
Epoch 31/100
120/120 [==============================] - 0s 37us/sample - loss: 0.6930 - acc: 0.7833
Epoch 32/100
120/120 [==============================] - 0s 37us/sample - loss: 0.6817 - acc: 0.7833
Epoch 33/100
120/120 [==============================] - 0s 50us/sample - loss: 0.6706 - acc: 0.7750
Epoch 34/100
120/120 [==============================] - 0s 33us/sample - loss: 0.6605 - acc: 0.7833
Epoch 35/100
120/120 [==============================] - 0s 37us/sample - loss: 0.6506 - acc: 0.8250
Epoch 36/100
120/120 [==============================] - 0s 33us/sample - loss: 0.6410 - acc: 0.8083
Epoch 37/100
120/120 [==============================] - 0s 37us/sample - loss: 0.6306 - acc: 0.8000
Epoch 38/100
120/120 [==============================] - 0s 41us/sample - loss: 0.6215 - acc: 0.8500
Epoch 39/100
120/120 [==============================] - 0s 37us/sample - loss: 0.6125 - acc: 0.8583
Epoch 40/100
120/120 [==============================] - 0s 37us/sample - loss: 0.6036 - acc: 0.8833
Epoch 41/100
120/120 [==============================] - 0s 33us/sample - loss: 0.5949 - acc: 0.8833
Epoch 42/100
120/120 [==============================] - 0s 46us/sample - loss: 0.5869 - acc: 0.8750
Epoch 43/100
120/120 [==============================] - 0s 29us/sample - loss: 0.5784 - acc: 0.8833
Epoch 44/100
120/120 [==============================] - 0s 29us/sample - loss: 0.5705 - acc: 0.8917
Epoch 45/100
120/120 [==============================] - 0s 37us/sample - loss: 0.5629 - acc: 0.8917
Epoch 46/100
120/120 [==============================] - 0s 33us/sample - loss: 0.5553 - acc: 0.8917
Epoch 47/100
120/120 [==============================] - 0s 50us/sample - loss: 0.5480 - acc: 0.8917
Epoch 48/100
120/120 [==============================] - 0s 29us/sample - loss: 0.5409 - acc: 0.8833
Epoch 49/100
120/120 [==============================] - 0s 29us/sample - loss: 0.5341 - acc: 0.8917
Epoch 50/100
120/120 [==============================] - 0s 33us/sample - loss: 0.5273 - acc: 0.8833
Epoch 51/100
120/120 [==============================] - 0s 33us/sample - loss: 0.5213 - acc: 0.9167
Epoch 52/100
120/120 [==============================] - 0s 33us/sample - loss: 0.5150 - acc: 0.8917
Epoch 53/100
120/120 [==============================] - 0s 41us/sample - loss: 0.5080 - acc: 0.8917
Epoch 54/100
120/120 [==============================] - 0s 37us/sample - loss: 0.5011 - acc: 0.9167
Epoch 55/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4956 - acc: 0.9250
Epoch 56/100
120/120 [==============================] - 0s 37us/sample - loss: 0.4905 - acc: 0.9250
Epoch 57/100
120/120 [==============================] - 0s 37us/sample - loss: 0.4836 - acc: 0.9250
Epoch 58/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4779 - acc: 0.9250
Epoch 59/100
120/120 [==============================] - 0s 50us/sample - loss: 0.4727 - acc: 0.9250
Epoch 60/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4669 - acc: 0.9250
Epoch 61/100
120/120 [==============================] - 0s 25us/sample - loss: 0.4612 - acc: 0.9250
Epoch 62/100
120/120 [==============================] - 0s 41us/sample - loss: 0.4557 - acc: 0.9250
Epoch 63/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4508 - acc: 0.9250
Epoch 64/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4456 - acc: 0.9250
Epoch 65/100
120/120 [==============================] - 0s 46us/sample - loss: 0.4406 - acc: 0.9250
Epoch 66/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4357 - acc: 0.9250
Epoch 67/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4306 - acc: 0.9333
Epoch 68/100
120/120 [==============================] - 0s 46us/sample - loss: 0.4253 - acc: 0.9417
Epoch 69/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4216 - acc: 0.9333
Epoch 70/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4159 - acc: 0.9250
Epoch 71/100
120/120 [==============================] - 0s 41us/sample - loss: 0.4108 - acc: 0.9417
Epoch 72/100
120/120 [==============================] - 0s 41us/sample - loss: 0.4066 - acc: 0.9417
Epoch 73/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4021 - acc: 0.9417
Epoch 74/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3976 - acc: 0.9417
Epoch 75/100
120/120 [==============================] - 0s 37us/sample - loss: 0.3932 - acc: 0.9333
Epoch 76/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3888 - acc: 0.9417
Epoch 77/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3840 - acc: 0.9500
Epoch 78/100
120/120 [==============================] - 0s 49us/sample - loss: 0.3796 - acc: 0.9417
Epoch 79/100
120/120 [==============================] - 0s 37us/sample - loss: 0.3755 - acc: 0.9417
Epoch 80/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3714 - acc: 0.9417
Epoch 81/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3671 - acc: 0.9417
Epoch 82/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3628 - acc: 0.9500
Epoch 83/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3595 - acc: 0.9417
Epoch 84/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3549 - acc: 0.9500
Epoch 85/100
120/120 [==============================] - 0s 41us/sample - loss: 0.3507 - acc: 0.9417
Epoch 86/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3470 - acc: 0.9417
Epoch 87/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3444 - acc: 0.9500
Epoch 88/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3406 - acc: 0.9417
Epoch 89/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3351 - acc: 0.9583
Epoch 90/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3318 - acc: 0.9583
Epoch 91/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3281 - acc: 0.9583
Epoch 92/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3252 - acc: 0.9583
Epoch 93/100
120/120 [==============================] - 0s 41us/sample - loss: 0.3211 - acc: 0.9417
Epoch 94/100
120/120 [==============================] - 0s 45us/sample - loss: 0.3174 - acc: 0.9500
Epoch 95/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3142 - acc: 0.9500
Epoch 96/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3102 - acc: 0.9583
Epoch 97/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3081 - acc: 0.9500
Epoch 98/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3040 - acc: 0.9583
Epoch 99/100
120/120 [==============================] - 0s 33us/sample - loss: 0.2998 - acc: 0.9667
Epoch 100/100
120/120 [==============================] - 0s 29us/sample - loss: 0.2974 - acc: 0.9500
0.9666666666666667

We are again at around 97% accuracy – pretty much what most ML models would achieve on this dataset. Let’s try keeping the number of nodes the same, but add one more hidden layer.

model = keras.Sequential()
model.add(keras.layers.Dense(4,input_shape=(4,)))
model.add(keras.layers.Dense(8,activation="relu"))
# BEGIN Change - add one more hidden layer
model.add(keras.layers.Dense(8,activation="relu"))
# END Change

model.add(keras.layers.Dense(3,activation="softmax"))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=100)

y_pred = model.predict(X_test)
y_pred_class = np.argmax(y_pred,axis=1)

from sklearn.metrics import accuracy_score
 
accuracy_score(y_test,y_pred_class)
Epoch 1/100
120/120 [==============================] - 0s 574us/sample - loss: 1.9741 - acc: 0.2333
Epoch 2/100
120/120 [==============================] - 0s 33us/sample - loss: 1.8711 - acc: 0.3417
Epoch 3/100
120/120 [==============================] - 0s 33us/sample - loss: 1.7824 - acc: 0.3417
Epoch 4/100
120/120 [==============================] - 0s 33us/sample - loss: 1.7114 - acc: 0.3417
Epoch 5/100
120/120 [==============================] - 0s 46us/sample - loss: 1.6428 - acc: 0.3417
Epoch 6/100
120/120 [==============================] - 0s 37us/sample - loss: 1.5780 - acc: 0.3417
Epoch 7/100
120/120 [==============================] - 0s 33us/sample - loss: 1.5234 - acc: 0.3417
Epoch 8/100
120/120 [==============================] - 0s 37us/sample - loss: 1.4688 - acc: 0.3417
Epoch 9/100
120/120 [==============================] - 0s 54us/sample - loss: 1.4162 - acc: 0.3417
Epoch 10/100
120/120 [==============================] - 0s 29us/sample - loss: 1.3703 - acc: 0.3417
Epoch 11/100
120/120 [==============================] - 0s 37us/sample - loss: 1.3316 - acc: 0.3417
Epoch 12/100
120/120 [==============================] - 0s 45us/sample - loss: 1.2942 - acc: 0.3417
Epoch 13/100
120/120 [==============================] - 0s 33us/sample - loss: 1.2585 - acc: 0.3417
Epoch 14/100
120/120 [==============================] - 0s 37us/sample - loss: 1.2258 - acc: 0.3417
Epoch 15/100
120/120 [==============================] - 0s 33us/sample - loss: 1.2001 - acc: 0.3417
Epoch 16/100
120/120 [==============================] - 0s 54us/sample - loss: 1.1726 - acc: 0.3417
Epoch 17/100
120/120 [==============================] - 0s 37us/sample - loss: 1.1488 - acc: 0.3417
Epoch 18/100
120/120 [==============================] - 0s 33us/sample - loss: 1.1179 - acc: 0.3583
Epoch 19/100
120/120 [==============================] - 0s 37us/sample - loss: 1.0965 - acc: 0.3833
Epoch 20/100
120/120 [==============================] - 0s 45us/sample - loss: 1.0837 - acc: 0.4167
Epoch 21/100
120/120 [==============================] - 0s 37us/sample - loss: 1.0740 - acc: 0.4583
Epoch 22/100
120/120 [==============================] - 0s 37us/sample - loss: 1.0643 - acc: 0.4583
Epoch 23/100
120/120 [==============================] - 0s 33us/sample - loss: 1.0557 - acc: 0.4667
Epoch 24/100
120/120 [==============================] - 0s 37us/sample - loss: 1.0479 - acc: 0.4583
Epoch 25/100
120/120 [==============================] - 0s 38us/sample - loss: 1.0408 - acc: 0.4500
Epoch 26/100
120/120 [==============================] - 0s 33us/sample - loss: 1.0339 - acc: 0.4333
Epoch 27/100
120/120 [==============================] - 0s 50us/sample - loss: 1.0268 - acc: 0.4583
Epoch 28/100
120/120 [==============================] - 0s 33us/sample - loss: 1.0209 - acc: 0.4417
Epoch 29/100
120/120 [==============================] - 0s 33us/sample - loss: 1.0147 - acc: 0.4583
Epoch 30/100
120/120 [==============================] - 0s 41us/sample - loss: 1.0085 - acc: 0.4667
Epoch 31/100
120/120 [==============================] - 0s 41us/sample - loss: 1.0032 - acc: 0.4917
Epoch 32/100
120/120 [==============================] - 0s 45us/sample - loss: 0.9976 - acc: 0.5167
Epoch 33/100
120/120 [==============================] - 0s 37us/sample - loss: 0.9920 - acc: 0.5250
Epoch 34/100
120/120 [==============================] - 0s 37us/sample - loss: 0.9871 - acc: 0.5167
Epoch 35/100
120/120 [==============================] - 0s 37us/sample - loss: 0.9814 - acc: 0.5333
Epoch 36/100
120/120 [==============================] - 0s 45us/sample - loss: 0.9770 - acc: 0.5250
Epoch 37/100
120/120 [==============================] - 0s 37us/sample - loss: 0.9714 - acc: 0.5333
Epoch 38/100
120/120 [==============================] - 0s 29us/sample - loss: 0.9655 - acc: 0.5333
Epoch 39/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9607 - acc: 0.5250
Epoch 40/100
120/120 [==============================] - 0s 46us/sample - loss: 0.9554 - acc: 0.5333
Epoch 41/100
120/120 [==============================] - 0s 41us/sample - loss: 0.9513 - acc: 0.5333
Epoch 42/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9445 - acc: 0.5417
Epoch 43/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9397 - acc: 0.5333
Epoch 44/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9336 - acc: 0.5417
Epoch 45/100
120/120 [==============================] - 0s 54us/sample - loss: 0.9280 - acc: 0.5500
Epoch 46/100
120/120 [==============================] - 0s 29us/sample - loss: 0.9226 - acc: 0.5417
Epoch 47/100
120/120 [==============================] - 0s 37us/sample - loss: 0.9163 - acc: 0.5500
Epoch 48/100
120/120 [==============================] - 0s 45us/sample - loss: 0.9104 - acc: 0.5500
Epoch 49/100
120/120 [==============================] - 0s 33us/sample - loss: 0.9049 - acc: 0.5500
Epoch 50/100
120/120 [==============================] - 0s 50us/sample - loss: 0.8987 - acc: 0.5500
Epoch 51/100
120/120 [==============================] - 0s 29us/sample - loss: 0.8940 - acc: 0.5500
Epoch 52/100
120/120 [==============================] - 0s 33us/sample - loss: 0.8874 - acc: 0.5500
Epoch 53/100
120/120 [==============================] - 0s 33us/sample - loss: 0.8819 - acc: 0.5500
Epoch 54/100
120/120 [==============================] - 0s 37us/sample - loss: 0.8735 - acc: 0.5500
Epoch 55/100
120/120 [==============================] - 0s 33us/sample - loss: 0.8543 - acc: 0.5500
Epoch 56/100
120/120 [==============================] - 0s 54us/sample - loss: 0.8298 - acc: 0.6083
Epoch 57/100
120/120 [==============================] - 0s 29us/sample - loss: 0.8125 - acc: 0.8333
Epoch 58/100
120/120 [==============================] - 0s 33us/sample - loss: 0.7931 - acc: 0.8667
Epoch 59/100
120/120 [==============================] - 0s 41us/sample - loss: 0.7758 - acc: 0.9000
Epoch 60/100
120/120 [==============================] - 0s 29us/sample - loss: 0.7576 - acc: 0.8917
Epoch 61/100
120/120 [==============================] - 0s 50us/sample - loss: 0.7374 - acc: 0.9083
Epoch 62/100
120/120 [==============================] - 0s 33us/sample - loss: 0.7191 - acc: 0.9000
Epoch 63/100
120/120 [==============================] - 0s 33us/sample - loss: 0.7007 - acc: 0.9000
Epoch 64/100
120/120 [==============================] - 0s 45us/sample - loss: 0.6813 - acc: 0.9000
Epoch 65/100
120/120 [==============================] - 0s 33us/sample - loss: 0.6627 - acc: 0.9167
Epoch 66/100
120/120 [==============================] - 0s 41us/sample - loss: 0.6429 - acc: 0.9250
Epoch 67/100
120/120 [==============================] - 0s 45us/sample - loss: 0.6240 - acc: 0.9250
Epoch 68/100
120/120 [==============================] - 0s 33us/sample - loss: 0.6069 - acc: 0.9333
Epoch 69/100
120/120 [==============================] - 0s 29us/sample - loss: 0.5888 - acc: 0.9333
Epoch 70/100
120/120 [==============================] - 0s 33us/sample - loss: 0.5711 - acc: 0.9333
Epoch 71/100
120/120 [==============================] - 0s 29us/sample - loss: 0.5546 - acc: 0.9333
Epoch 72/100
120/120 [==============================] - 0s 33us/sample - loss: 0.5388 - acc: 0.9333
Epoch 73/100
120/120 [==============================] - 0s 37us/sample - loss: 0.5228 - acc: 0.9417
Epoch 74/100
120/120 [==============================] - 0s 50us/sample - loss: 0.5071 - acc: 0.9417
Epoch 75/100
120/120 [==============================] - 0s 37us/sample - loss: 0.4926 - acc: 0.9417
Epoch 76/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4793 - acc: 0.9333
Epoch 77/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4666 - acc: 0.9417
Epoch 78/100
120/120 [==============================] - 0s 37us/sample - loss: 0.4537 - acc: 0.9417
Epoch 79/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4419 - acc: 0.9417
Epoch 80/100
120/120 [==============================] - 0s 33us/sample - loss: 0.4310 - acc: 0.9417
Epoch 81/100
120/120 [==============================] - 0s 41us/sample - loss: 0.4205 - acc: 0.9417
Epoch 82/100
120/120 [==============================] - 0s 37us/sample - loss: 0.4103 - acc: 0.9500
Epoch 83/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4006 - acc: 0.9500
Epoch 84/100
120/120 [==============================] - 0s 37us/sample - loss: 0.3916 - acc: 0.9500
Epoch 85/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3833 - acc: 0.9500
Epoch 86/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3753 - acc: 0.9500
Epoch 87/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3673 - acc: 0.9500
Epoch 88/100
120/120 [==============================] - 0s 54us/sample - loss: 0.3601 - acc: 0.9667
Epoch 89/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3529 - acc: 0.9667
Epoch 90/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3459 - acc: 0.9500
Epoch 91/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3391 - acc: 0.9500
Epoch 92/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3339 - acc: 0.9583
Epoch 93/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3268 - acc: 0.9583
Epoch 94/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3209 - acc: 0.9500
Epoch 95/100
120/120 [==============================] - 0s 37us/sample - loss: 0.3152 - acc: 0.9500
Epoch 96/100
120/120 [==============================] - 0s 41us/sample - loss: 0.3095 - acc: 0.9500
Epoch 97/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3042 - acc: 0.9583
Epoch 98/100
120/120 [==============================] - 0s 37us/sample - loss: 0.2990 - acc: 0.9583
Epoch 99/100
120/120 [==============================] - 0s 29us/sample - loss: 0.2942 - acc: 0.9667
Epoch 100/100
120/120 [==============================] - 0s 29us/sample - loss: 0.2887 - acc: 0.9583
0.9333333333333333

That’s about 93% accuracy on the test data. The immediate question you might have is – how should you choose the number of nodes or the number of hidden layers? Unfortunately, the weights and intermediate outputs are essentially a black box to humans – we cannot read much meaning into them directly.

Choosing the size and complexity of a neural network (like the number of nodes and the number of hidden layers) is more art than science.

MNIST handwritten digits classification

If all we had to do in Neural Networks was classify iris data, we wouldn’t need neural networks to start with. We need a more involved dataset to qualify as a “Hello World” program in Neural Networks. Welcome the MNIST digits dataset. It is a dataset of handwritten digit images that have been scanned, standardized and optimized for machine learning. Tensorflow comes built-in with this dataset. Let’s quickly load it to see what these images look like.

import tensorflow as tf
from tensorflow import keras

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# load the data
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Show the first picture in the training dataset.
plt.figure()
plt.imshow(train_images[0])

It is a handwritten digit – 5. Although the image looks colored, these are actually gray-scale images; matplotlib simply applies a default colormap. All of the data is standardized into 28×28 pixels, and each pixel has an intensity value between 0 and 255 (2^8 = 256 possible values). Since this is a small image (just 28×28 pixels), we can write it out to a csv file and look at the numbers in Excel.
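
As a quick sanity check (a minimal sketch using the arrays we just loaded), we can confirm the shape and the pixel range directly:

# 60,000 training images, each 28x28 pixels
print(train_images.shape)

# pixel intensities range from 0 to 255
print(train_images.min(), train_images.max())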

# write the first image to a csv file.
np.savetxt("image_1.csv",train_images[0],delimiter=",")

If you open the csv file in Excel, adjust the column widths and change the number format to zero decimals, you should see a picture like this. Can you identify the digit 5 in there?
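
If you would rather stay inside the notebook, here is a minimal sketch that prints the same 28×28 grid as rough ASCII art (any pixel brighter than 127 is drawn as a '#'):

# print the 28x28 pixel grid as ASCII art
for row in train_images[0]:
    print("".join("#" if pixel > 127 else "." for pixel in row))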

Confirm that the image is in fact a 5.

# Print the first label
train_labels[0]

Prepare the model. The first layer in this case is slightly different from the first layer in the iris example above. As you can see from the input data, each image is a 28×28 numpy array. However, dense layers expect a flat set of neurons, so the first layer is essentially a 784-node (28 x 28 = 784) layer that is created automatically by flattening the input array.

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),       # flatten each 28x28 image into 784 inputs
    keras.layers.Dense(128, activation=tf.nn.relu),   # hidden layer
    keras.layers.Dense(10, activation=tf.nn.softmax)  # one output probability per digit (0-9)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=6)
test_loss, test_acc = model.evaluate(test_images, test_labels)

print('Test accuracy:', test_acc)
predictions = model.predict(test_images)

Each prediction is a vector of 10 probabilities, one per digit. The value with the highest probability in the output array is the predicted digit. For example, let’s look at the label of the first image in the test set.

test_labels[0]

It is a 7. Let’s see how the probabilities are predicted.

np.savetxt("predictions.csv",predictions[0:10],delimiter=",")
predicted_final[0:10]

You can see from the array above that the highest probability value is at index 7. Let’s apply numpy’s argmax function to reduce each output array to the index of its highest probability ( argmax() returns the index of the largest element in a numpy array ).

Let’s apply argmax to the full predictions array and compute the accuracy against the test labels.

predicted_final = np.argmax(predictions,axis=1)
from sklearn.metrics import accuracy_score
 
accuracy_score(test_labels,predicted_final)

That’s a 95% accuracy on the test datasets.

Great !! Our “Hello World” of Neural Networks is complete. Over the next couple of days we will focus on the moving parts of a neural network and on how Gradient Descent is used to optimize the weights. This is essentially how the neural network learns.

R Interview Questions

R Interview Questions


  1. Data Frames
    1. How to “merge” data frames in R
    2. How do you search/replace text in a data frame
    3. How can you add columns to a data frame in R
    4. How can you add rows to a data frame in R
    5. How to delete columns from a data frame in R
    6. How to delete rows from a data frame in R
    7. How to remove all rows in a data frame with NAs in R
    8. How to remove particular values from a data frame in R
    9. How to remove duplicate rows from a data frame in R
  2. Vectors
    1. Vector Recycling in R
    2. How to find out elements in one vector that are not in another vector
  3. Graphics ( EDA )
    1. How to create a pie chart in R
    2. How to create a histogram in R
    3. How to create a box plot in R
    4. Why are box plots used
    5. How to view scatter plots for more than 2 variables in R
    6. How to add grids to plots in R
    7. How to make interactive plots in R
    8. How to visualize multi-dimensional data in R
  4. Data Manipulation
    1. What is ‘Stack’ing and ‘Unstack’ing in R
    2. What is the difference between sapply vs lapply functions
    3. What is the difference between apply vs for loop
  5. Subset data
    1. How to subset a Dataframe
    2. How to subset a Vector
    3. How to subset a List
    4. How to subset a Matrix
  6. Data Structures
    1. What are the basic data structures in R
    2. What is the difference between dataframe and matrix
    3. How to find out the unique elements in a vector
    4. What is the difference between data frame and data table
    5. What is the difference between a dataframe and a list
  7. Commonly used Statistics functions
    1. How is skewness of a distribution calculated in R
    2. How is kurtosis of a distribution calculated in R
    3. How do you calculate mean, median, MAD and SD in R
    4. How to get correlation matrix (or coefficients) for a set of variables in R
    5. How to generate the same set of random numbers in R
    6. What is the difference between runif () and rnorm () functions in R
    7. How to get samples from a data set
    8. What is confusion matrix
  8. Packages
    1. How to install a package in R
    2. What is the difference between library() and require() functions in R
    3. What are the R packages you used most in your previous projects
    4. How to search for packages in R that suit your requirements
  9. Data Loading
    1. What are the most used data import functions in R
    2. What are the most commonly found issues with data import
    3. How to read JSON files
    4. How to load HTML table into R
  10. Data Cleansing
    1. How did you eliminate outliers in R
    2. Why is which () function used
    3. How to convert a factor variable to a numeric variable
    4. What strategies did you use to eliminate NAs
  11. Modularization
    1. How to split your program across multiple files in R
    2. How to write a function in R
  12. Functions
    1. What is the use of … ( ellipsis ) in functions in R
    2. How to write recursive functions in R
    3. How to specify default values in function parameters in R
    4. How to see the source code of a function in R
  13. Version Management
    1. What kind of version management did you use in your previous projects
  14. ML – Regression
    1. How to do Linear Regression in R
    2. How to do multi linear regression in R
    3. How do you specify the formula for a logarithmic regression in R
    4. How do you specify powers in a polynomial regression formula
    5. How did you split the data into training and test data
    6. What plots in R did you produce to communicate your linear regression test results
    7. What packages did you use in R to find out significant variables in multi linear regression
  15. ML – Classification
    1. What packages did you use for KNN classification in R
    2. How to do confusion matrix in R

Just enough SQL

Just enough SQL


  Machine Learning in Python

Contents

SQL

SQL stands for Structured Query Language. It is the language of relational databases. So, before we understand SQL, let’s understand what a relational database is. The next couple of pages will introduce databases at a very high level.

What is a Database

A database is software used to store data. There are many types of databases (relational, graph, document, etc.), but we are going to focus only on relational databases, as they are the most popular format. Also, SQL is specific to relational databases.

Relational databases organize data in tabular format. For example, let’s see how an employee’s data could be structured in a relational database.

This is how we humans would like to see data. However, storing large amounts of data this way does not scale well, and it is not the layout best suited for fast reads and writes. That is where the concept of data normalization comes in. Normalization is a subject in itself and we will not discuss it here, but the process is fairly intuitive overall. Here is how the same data would be structured in a well-designed database.

There are many databases in use today. For example, MySQL, Oracle, Microsoft SQL Server, PostgreSQL etc. In this chapter, we will be focussing on the MySQL database. Here is a quick instruction set on How to install MySQL Database.

By default, the MySQL database comes with very few data tables. MySQL provides a big employee database (with 300K employee records) that you can download from the web and load into your database. You can use it to practice data extraction. The data above is taken from that employee database.

What is SQL

SQL is the language of relational databases. It can be used to create, extract, manipulate and delete data in the database. For example, here is a quick SQL to extract the first 10 rows of data from a database table EMPLOYEES.

SELECT * FROM employees LIMIT 10;

The “*” indicates that all columns need to be selected. As you can see, the syntax is pretty intuitive. It is deliberately designed to be like English.

SQL Connectors

As discussed previously, there are a variety of databases. Each of these databases has a specific connector that you have to install. Since we will be working with the MySQL database, we will install the MySQL connector.

pip install mysql-connector

Typically, you will be given the database details like below.

Server Address : xx.xx.xx.xx
port           : 33xx
schema         : xxxx

user id        : xxxx
password       : xxxx

import mysql.connector

# use the server address, port, user id and password you were given
db = mysql.connector.connect(
  host     = "localhost",
  user     = "ajaytech",
  password = "ajaytech"
)

db

Database interactions are typically handled using something called a cursor. A cursor is just a pointer to a set of data retrieved from the database. It is up to us to iterate over the retrieved data and pull out what we want, typically using a loop. So, this is basically a 2-step process

  1. Execute an SQL statement and get the result into a cursor
  2. Iterate over the cursor to get the data

For example, let’s do these 2 steps to list all the databases. Each database is essentially a collection of tables.

Step 1 – Get the list of tables into a cursor

cur = db.cursor()

cur.execute("SHOW DATABASES")

Step 2 – Iterate over the cursor to get the list of databases

for database in cur:
  print(database)
('information_schema',)
('mysql',)
('performance_schema',)
('sakila',)
('sys',)
('world',)

Once we know the list of databases, we have to select the database first. Once we do that, we can freely go about executing the select statements on that particular database.

cur.execute("use world")

List all the tables in the database.

cur.execute("show tables")

for table in cur:
  print(table)

('city',)
('country',)
('countrylanguage',)

Let’s pick a table – say country. Now, let’s extract all the columns in that table. They will become the columns of our Pandas dataframe.

cur.execute("show columns from country")
column_names = []
for column in cur:
  column_names.append(column[0])

column_names

['Code',
 'Name',
 'Continent',
 'Region',
 'SurfaceArea',
 'IndepYear',
 'Population',
 'LifeExpectancy',
 'GNP',
 'GNPOld',
 'LocalName',
 'GovernmentForm',
 'HeadOfState',
 'Capital',
 'Code2']

Once we got the column names, let’s get the actual data from the table.

cur.execute("select * from country")

import pandas as pd

# collect the rows from the cursor into a list and build a dataframe
rows = []
for data in cur:
  rows.append(list(data))

country_data = pd.DataFrame(rows)

country_data.head()
0 	1 	2 	3 	4 	5 	6 	7 	8 	9 	10 	11 	12 	13 	14
0 	ABW 	Aruba 	North America 	Caribbean 	193.0 	NaN 	103000 	78.4 	828.0 	793.0 	Aruba 	Nonmetropolitan Territory of The Netherlands 	Beatrix 	129.0 	AW
1 	AFG 	Afghanistan 	Asia 	Southern and Central Asia 	652090.0 	1919.0 	22720000 	45.9 	5976.0 	NaN 	Afganistan/Afqanestan 	Islamic Emirate 	Mohammad Omar 	1.0 	AF
2 	AGO 	Angola 	Africa 	Central Africa 	1246700.0 	1975.0 	12878000 	38.3 	6648.0 	7984.0 	Angola 	Republic 	José Eduardo dos Santos 	56.0 	AO
3 	AIA 	Anguilla 	North America 	Caribbean 	96.0 	NaN 	8000 	76.1 	63.2 	NaN 	Anguilla 	Dependent Territory of the UK 	Elisabeth II 	62.0 	AI
4 	ALB 	Albania 	Europe 	Southern Europe 	28748.0 	1912.0 	3401200 	71.6 	3205.0 	2500.0 	Shqipëria 	Republic 	Rexhep Mejdani 	34.0 	AL

Great !!! We just need one last step to finish loading the table into a Pandas dataframe – set the column names that we extracted in the previous step.

country_data.columns = column_names

country_data.head()

Code 	Name 	Continent 	Region 	SurfaceArea 	IndepYear 	Population 	LifeExpectancy 	GNP 	GNPOld 	LocalName 	GovernmentForm 	HeadOfState 	Capital 	Code2
0 	ABW 	Aruba 	North America 	Caribbean 	193.0 	NaN 	103000 	78.4 	828.0 	793.0 	Aruba 	Nonmetropolitan Territory of The Netherlands 	Beatrix 	129.0 	AW
1 	AFG 	Afghanistan 	Asia 	Southern and Central Asia 	652090.0 	1919.0 	22720000 	45.9 	5976.0 	NaN 	Afganistan/Afqanestan 	Islamic Emirate 	Mohammad Omar 	1.0 	AF
2 	AGO 	Angola 	Africa 	Central Africa 	1246700.0 	1975.0 	12878000 	38.3 	6648.0 	7984.0 	Angola 	Republic 	José Eduardo dos Santos 	56.0 	AO
3 	AIA 	Anguilla 	North America 	Caribbean 	96.0 	NaN 	8000 	76.1 	63.2 	NaN 	Anguilla 	Dependent Territory of the UK 	Elisabeth II 	62.0 	AI
4 	ALB 	Albania 	Europe 	Southern Europe 	28748.0 	1912.0 	3401200 	71.6 	3205.0 	2500.0 	Shqipëria 	Republic 	Rexhep Mejdani 	34.0 	AL
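
As a side note, instead of appending rows one by one, the cursor can hand you everything at once via fetchall(). Here is an alternative sketch of the same step (it re-runs the query, since the cursor above has already been consumed):

# fetch all rows in one call and build the dataframe directly
cur.execute("select * from country")
rows = cur.fetchall()
country_data = pd.DataFrame(rows, columns=column_names)
country_data.head()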

Instead of going the cursor route, you can also choose an ORM that abstracts away the metadata associated with the database tables. A detailed description of an ORM is once again a subject in itself and we will not be discussing it today.

ORM stands for Object Relational Mapper. Instead of writing SQL directly, an ORM lets us use an object-oriented paradigm to extract data from a relational database. Just think of an ORM as an OO wrapper around SQL. Here is a quick visual to show the same.

One such ORM is SQLAlchemy.

pip install SQLAlchemy

import pandas as pd
import sqlalchemy as db

# the connection string below uses the PyMySQL driver, which also needs to be installed (pip install pymysql)
engine = db.create_engine('mysql+pymysql://ajaytech:ajaytech@localhost:3306/world')
df = pd.read_sql_table("country",engine)
df.head()

Code 	Name 	Continent 	Region 	SurfaceArea 	IndepYear 	Population 	LifeExpectancy 	GNP 	GNPOld 	LocalName 	GovernmentForm 	HeadOfState 	Capital 	Code2
0 	ABW 	Aruba 	North America 	Caribbean 	193.0 	NaN 	103000 	78.4 	828.0 	793.0 	Aruba 	Nonmetropolitan Territory of The Netherlands 	Beatrix 	129.0 	AW
1 	AFG 	Afghanistan 	Asia 	Southern and Central Asia 	652090.0 	1919.0 	22720000 	45.9 	5976.0 	NaN 	Afganistan/Afqanestan 	Islamic Emirate 	Mohammad Omar 	1.0 	AF
2 	AGO 	Angola 	Africa 	Central Africa 	1246700.0 	1975.0 	12878000 	38.3 	6648.0 	7984.0 	Angola 	Republic 	José Eduardo dos Santos 	56.0 	AO
3 	AIA 	Anguilla 	North America 	Caribbean 	96.0 	NaN 	8000 	76.1 	63.2 	NaN 	Anguilla 	Dependent Territory of the UK 	Elisabeth II 	62.0 	AI
4 	ALB 	Albania 	Europe 	Southern Europe 	28748.0 	1912.0 	3401200 	71.6 	3205.0 	2500.0 	Shqipëria 	Republic 	Rexhep Mejdani 	34.0 	AL

If you want only specific columns, you can specify them using the columns parameter.

df = pd.read_sql_table("country",engine, columns=["Code","Name"])
df.head()
        Code 	Name
0 	ABW 	Aruba
1 	AFG 	Afghanistan
2 	AGO 	Angola
3 	AIA 	Anguilla
4 	ALB 	Albania

If you want to go with plain SQL instead, you can use the read_sql_query() method.

df = pd.read_sql_query("select code, name from country",engine)
df.head()
        Code 	Name
0 	ABW 	Aruba
1 	AFG 	Afghanistan
2 	AGO 	Angola
3 	AIA 	Anguilla
4 	ALB 	Albania

Essentially, the result is the same – you just pass raw SQL instead of a table name.

Loading Data

Instead of working on the sample tables provided as part of the MySQL installation (like country, city etc.), let’s load a bigger dataset into the system. MySQL provides one such database right on their website – and loading it is just as easy. Search Google for “Employee database mysql” to find the download page. Once there, click on the Clone or Download button and then click on the Download Zip button as shown below.

Unzip the file and navigate into the directory.

In the address bar, type cmd and press Enter – that opens a command prompt in the current directory,

like so..

Once inside, type in

mysql -u ajaytech -p -t < employees.sql

where “ajaytech” is the user name you have chosen while installing MySQL – like so..

You have to use the full path of the mysql command if it is not added to your PATH, as shown above:

"C:\Program Files\MySQL\MySQL Server 8.0\bin\mysql"

Once the data has been loaded, refresh your MySQL Workbench and you should be able to see the database (schema).

You should be able to drill-down to find the tables. For example, here is a snapshot of the EMPLOYEES table.

This one is a pretty decent sized database – 300K odd employees as you can see from the screenshot below.

Selecting Data

For the purpose of extracting data for Data Science & Machine Learning, our focus will mainly be on SELECTing data (as opposed to UPDATEs, DELETEs, CREATEs or other administrative aspects of the database).

SELECT Statement

This is probably the most used SQL statement in Data Science or otherwise. We have seen an example of how to select data from the country table above. In the following sections, we will focus on the employees database and how to extract data from the tables inside it.

Here is a snapshot of the SELECT statement syntax in its most generic format. As you can see, it is pretty intense.

Luckily, we only have to deal with very few of these options. Let’s start with the basics.

df = pd.read_sql_query("select emp_no, first_name from employees.employees",engine)
df.head()

        emp_no 	first_name
0 	10001 	Georgi
1 	10002 	Bezalel
2 	10003 	Parto
3 	10004 	Chirstian
4 	10005 	Kyoichi

As you can see, this one took some time. That is because we have pulled in all the records. A quick look at the shape of the dataframe should make this clear.

df.shape
(300024, 2)

Let’s try to just pull in the first 100 records.

df = pd.read_sql_query("SELECT emp_no, first_name FROM employees.employees LIMIT 100",engine)
df.head()
        emp_no 	first_name
0 	10001 	Georgi
1 	10002 	Bezalel
2 	10003 	Parto
3 	10004 	Chirstian
4 	10005 	Kyoichi

This time, the select ran pretty quickly, right? That is because we limited the rows retrieved to just 100. You can quickly verify this with the dataframe’s shape.

df.shape

(100, 2)

To make the statement easier to read, we have capitalized the SQL keywords in the select statement.

SELECT emp_no, first_name FROM employees.employees LIMIT 100

as opposed to

select emp_no, first_name from employees.employees limit 100

That way you know what the SQL statement is doing by quickly looking for the capitalized keywords. Also, there is a quick way to beautify SQL statements in the MySQL Workbench. Just type in the statement and click on the “Beautify/Format” button.

Look at the way the SQL Workbench has formatted it. It took care of capitalization, tabbing and formatting. This form of the SQL statement is more useful in cases where the statement is pretty large.

So far, we have seen three keywords:

  • SELECT – The main keyword to select data from database tables.
  • FROM – used to specify the tables from which the data needs to be extracted.
  • LIMIT – used to limit the number of rows to extract.

Here is a quick visual

In case you are wondering why we prefix the table with the schema or database every time we do a select, that is because there are multiple schemas/databases in any database system. To use a specific database by default, without having to qualify every table name, use the following command.

USE employees

If you do this in the workbench, the corresponding database is highlighted.

In python, when you create the connection using the ORM, specify the database.

import pandas as pd
import mysql.connector

import sqlalchemy as db

engine = db.create_engine('mysql+pymysql://ajaytech:ajaytech@localhost:3306/employees')

And from that point on, you can just specify the table (without specifying the database).

df = pd.read_sql_query("SELECT emp_no, first_name FROM employees LIMIT 100",engine)
df.head()
emp_no 	first_name
0 	10001 	Georgi
1 	10002 	Bezalel
2 	10003 	Parto
3 	10004 	Chirstian
4 	10005 	Kyoichi

Great ! Let’s move on to the next clause – WHERE. Next to FROM, this is probably the most important clause in the SELECT statement.

df = pd.read_sql_query("SELECT emp_no, first_name, birth_date FROM employees WHERE emp_no = 10002",engine)
df.head()

emp_no 	first_name 	birth_date
0 	10002 	Bezalel 	1964-06-02

Let’s add some more filter conditions using the WHERE and AND clauses.

df = pd.read_sql_query("SELECT emp_no, first_name, gender  \
                       FROM employees \
                       WHERE gender = 'M' AND \
                             emp_no <= 10003",engine)
df.head()

emp_no 	first_name 	gender
0 	10001 	Georgi 	M
1 	10003 	Parto 	M

You can use most of the comparison operators you are used to from other programming languages, as shown in the example after this list.

  • Less than (<), Greater than (>)
  • Greater than or equal to (>=), Less than or equal to (<=)
  • Not equal to (!= or <>)
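
For example, the not-equal operator works the same way. This is illustrative only – it reuses the employees table and engine from above:

df = pd.read_sql_query("SELECT emp_no, first_name, gender  \
                       FROM employees \
                       WHERE gender != 'M' AND \
                             emp_no <= 10010",engine)
df.head()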

BETWEEN clause

df = pd.read_sql_query("SELECT emp_no, first_name, gender  \
                       FROM employees \
                       WHERE emp_no BETWEEN 10001 AND 10003",engine)
df.head()

        emp_no 	first_name 	gender
0 	10001 	Georgi 	        M
1 	10002 	Bezalel 	F
2 	10003 	Parto 	        M

IN Clause

What if you wanted to select specific employee numbers ?

df = pd.read_sql_query("SELECT *  \
                       FROM employees \
                       WHERE emp_no IN (10001, 10003,10004)",engine)
df.head()

emp_no 	birth_date 	first_name 	last_name 	gender 	hire_date
0 	10001 	1953-09-02 	Georgi 	Facello 	M 	1986-06-26
1 	10003 	1959-12-03 	Parto 	Bamford 	M 	1986-08-28
2 	10004 	1954-05-01 	Chirstian 	Koblick 	M 	1986-12-01

LIKE clause

The LIKE clause matches patterns – 'G%' matches any first name starting with G. Note that the wildcard % is doubled up as %% in the query string below, because the underlying database driver uses % for its own parameter placeholders.

df = pd.read_sql_query("SELECT *  FROM employees WHERE first_name LIKE 'G%%' AND emp_no < 10010 ",engine)
df.head()

emp_no 	birth_date 	first_name 	last_name 	gender 	hire_date
0 	10001 	1953-09-02 	Georgi 	Facello 	M 	1986-06-26

ORDER BY clause

df = pd.read_sql_query("select * from employees where emp_no < 10005 order by emp_no desc ",engine)
df.head()

emp_no 	birth_date 	first_name 	last_name 	gender 	hire_date
0 	10004 	1954-05-01 	Chirstian 	Koblick 	M 	1986-12-01
1 	10003 	1959-12-03 	Parto 	Bamford 	M 	1986-08-28
2 	10002 	1964-06-02 	Bezalel 	Simmel 	F 	1985-11-21
3 	10001 	1953-09-02 	Georgi 	Facello 	M 	1986-06-26

AS clause

df = pd.read_sql_query("SELECT  emp_no AS employee_number,first_name \
                        FROM employees WHERE emp_no < 10005 ORDER BY emp_no desc ",engine)
df.head()
    employee_number 	first_name
0 	10004 	Chirstian
1 	10003 	Parto
2 	10002 	Bezalel
3 	10001 	Georgi

Aggregate Functions

MIN function

df = pd.read_sql_query("SELECT  emp_no ,MIN(birth_date),first_name \
                        FROM employees WHERE emp_no < 10005 ORDER BY emp_no desc ",engine)
df.head()

        emp_no 	MIN(birth_date) first_name
0 	10001 	1953-09-02 	Georgi

The other aggregate functions supported by MySQL are

  • AVG () – Average
  • SUM () – Summation
  • COUNT () – Row count
  • MAX () – Maximum value

AVG aggregate function

df = pd.read_sql_query("SELECT  AVG(salary) FROM salaries \
                        WHERE emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()

       AVG(salary)
0 	69713.25

MAX aggregate function

df = pd.read_sql_query("SELECT  MAX(salary) FROM salaries \
                        WHERE emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()
      MAX(salary)
0 	88958
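
COUNT() and SUM() work the same way. For example, here is a quick sketch (using the same salaries table and engine as above) that counts and totals the current salary records for those employees:

df = pd.read_sql_query("SELECT  COUNT(salary), SUM(salary) FROM salaries \
                        WHERE emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()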

Table Joins

Suppose you wanted to find out the salaries of the first few employees and print the following information

  • emp_no
  • first_name, last_name
  • salary

How would you do it? The data is available in two different tables

  • EMPLOYEES
  • SALARIES

You would have to join information from both of these tables to get the final output you need. Here is how you do it.

df = pd.read_sql_query("SELECT  first_name, last_name, salary \
                        FROM employees JOIN salaries \
                        WHERE employees.emp_no = salaries.emp_no AND \
                              employees.emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()

first_name 	last_name 	salary
0 	Georgi 	Facello 	88958
1 	Bezalel 	Simmel 	72527
2 	Parto 	Bamford 	43311
3 	Chirstian 	Koblick 	74057

Say, you wanted to output the employee number also – Try this.

df = pd.read_sql_query("SELECT  emp_no, first_name, last_name, salary \
                        FROM employees JOIN salaries \
                        WHERE employees.emp_no = salaries.emp_no AND \
                              employees.emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()

---------------------------------------------------------------------------
InternalError: (pymysql.err.InternalError) (1052, "Column 'emp_no' in field list is ambiguous")

SQL is complaining that column ’emp_no’ in the field list is ambiguous. That is because both tables (employees, salaries) have a column called emp_no – and MySQL doesn’t know which one we are referring to. Let’s make this clear for the database.

df = pd.read_sql_query("SELECT  employees.emp_no, first_name, last_name, salary \
                        FROM employees JOIN salaries \
                        WHERE employees.emp_no = salaries.emp_no AND \
                              employees.emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()

emp_no 	first_name 	last_name 	salary
0 	10001 	Georgi 	Facello 	88958
1 	10002 	Bezalel 	Simmel 	72527
2 	10003 	Parto 	Bamford 	43311
3 	10004 	Chirstian 	Koblick 	74057

Most of the time, regular JOINs are good enough. The default JOIN used to combine tables is an INNER JOIN, and for the kind of data extraction we do here it is usually all you need.
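
For completeness, the same query can also be written with an explicit ON clause, which is the more common way to spell out a join. This is a sketch using the same tables and engine as above:

df = pd.read_sql_query("SELECT  employees.emp_no, first_name, last_name, salary \
                        FROM employees INNER JOIN salaries \
                          ON employees.emp_no = salaries.emp_no \
                        WHERE employees.emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()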