# Gradient Descent

## Contents

- What is Gradient Descent
- Gradient Descent for Linear Regression
- Math
- Gradient Descent for Linear Regression
- Stochastic Gradient Descent
- Mini-batch Gradient Descent
- Gradient Descent for Logistic Regression

### What is Gradient Descent

In simple terms, Gradient Descent is an algorithm to **compute the minimum of a function**.
OK – So, what is the big deal ? Well, most of the time in most machine
learning algorithms, there is always a cost function that needs to be
minimized. The best Machine Learning Algorithm is usually the one with
the most inclusive and simple cost function. Once a cost function is
defined, it is just a matter of solving for a minimum to arrive at the
solution. That is why *Gradient Descent* is extremely useful in the context of Machine learning. Let’s see an example.

### Gradient Descent for Linear Regression

Let’s start with the simplestML problem – Linear Regression. In the Machine Learning in Python Tutorial, we have covered Regression in Python in great detail.

Since the problem is simple enough to be solved mathematically, we have used the **OLS** (Ordinary Least Squares) technique to fit a straight line to the Linear Regression problem. You can view the equation for Ordinary Least Square to solve linear regression here. What is the cost function in this case?

picture here

**Cost function** = Sum of Squares of Residuals

The mathematical solution to minimize this cost function as derived by **OLS** is as follows.

where *x*¯ represents the average of x and *y*¯

represents the average of y

However, when the number of independent variables increase, **OLS** is not a good solution. That is where **Gradient Descent** shines. While OLS is an analytical solution, Gradient Descent is a numerical solution. However, to understand *Gradient Descent*, we have to be conversant with the following concepts in Math.

- Derivatives
- Partial Derivatives

### Math

#### Derivatives

A derivative is the slope of a function. Let’s take a simple straight line –

A simple dataset for this could be

- x = Number of DNA Samples
- y = Number of DNA pairs.

Let’s plot a sample dataset and try to compute the slope.

import matplotlib.pyplot as plt import numpy as np %matplotlib inline x = np.array([1,2,3,4,5,6,7,8,9,10]) y = x * 2 print ( "x = ",x) print ( "y = ",y) plt.plot(x,y) plt.plot(x[1], y[1], marker='o', markersize=10, color="red") plt.plot(x[3], y[3], marker='o', markersize=10, color="red") plt.hlines(y=y[1], xmin=x[1], xmax=x[3], color='b') plt.vlines(x=x[3], ymin=y[1], ymax=y[3], color='b') plt.text(4.2,5,(y[3] - y[1])) plt.text(3,3,(x[3] - x[1]))

x = [ 1 2 3 4 5 6 7 8 9 10] y = [ 2 4 6 8 10 12 14 16 18 20]

Text(3, 3, '2')

A simple dataset for this could be

- x = Reach of a product
- y = Sales of the product.

Let’s plot a sample dataset and try to compute the slope.

import matplotlib.pyplot as plt import numpy as np %matplotlib inline x = np.array([1,2,3,4,5,6,7,8,9,10]) y = x ** 2 print ( "x = ",x) print ( "y = ",y) plt.plot(x,y) plt.plot(x[1], y[1], marker='o', markersize=10, color="red") plt.plot(x[3], y[3], marker='o', markersize=10, color="red") plt.hlines(y=y[1], xmin=x[1], xmax=x[3], color='b') plt.vlines(x=x[3], ymin=y[1], ymax=y[3], color='b') plt.text(4.2,8,(y[3] - y[1])) plt.text(3,0.0,(x[3] - x[1])) plt.text(5,10,"slope = 12/2 = 6") plt.plot(x[5], y[5], marker='o', markersize=10, color="red") plt.plot(x[7], y[7], marker='o', markersize=10, color="red") plt.hlines(y=y[5], xmin=x[5], xmax=x[7], color='b') plt.vlines(x=x[7], ymin=y[5], ymax=y[7], color='b') plt.text(8.3,50,(y[7] - y[5])) plt.text(7,30,(x[7] - x[5])) plt.text(8.58,45,"slope = 14")

x = [ 1 2 3 4 5 6 7 8 9 10] y = [ 1 4 9 16 25 36 49 64 81 100]

Text(8.58, 45, 'slope = 14')

In this case, the slope is not constant as measured by same metric as we have done previously. The slope seems to be changing with x.

A correct way to define slope (or derivative) is to take an infinitesimally small increase in x and the corresponding value of y and divide them as before. Mathematically, it is defined as,

If *f*(*x*) is a function of x,

For example, if x = 4, increase x by a very small amount, say Δ=0.0001

. Now, let’s compute the value of y as well and plug them into the equation above

*x*= 4-
*dx*= 0.0001

so, the derivative of *f*(*x*)=*x*2 is 2*x*

. We have not derived this mathematically – instead, we are trying to understanding with numbers, how a derivative works.

Derivativerepresents the change in the value of a function with respect to the variable(with which the derivative is being applied)

#### Partial Derivatives

Partial derivatives are almost similar to regular derivatives – except that partial derivatives work only on a particular variable. For example, say the speed of a car is dependent on

- engine RPM
- slope of the road

you can also write it as

Now, how does the speed (*z*)
of the car vary with a unit increase in the engine RPM ? The answer is 8
– pretty straightforward. That is represented mathematically using

Let’s take another example – The equation of a 2-d plane can be generalized as below

You can visualize a plane like this –

As you can see, the plane intersects the z-axis at 2 ( Where the value of x & y are 0 ). Now, how far does the function vary, with a unit variation in x ?

Once again, I want you to take the intuitive meaning out of this –

For a unit change in x, the function changes by so much in the direction of x – That is a partial derivative.

A plane is simple to understand. However, the interpretation would be the same even if it were a complicated curve in a 3-d space – Like a hill.

### Gradient Descent for Linear Regression

Now that we understand derivaties (both regular and partial), we are ready to graduate to Gradient Descent. Imagine a set of data points like so.

import matplotlib.pyplot as plt import numpy as np %matplotlib inline x = np.array([1,3,4,5,6,7,8,10]) y = np.array([4,3,7,7,8,10,8,11]) plt.scatter(x,y)

Say, we want to fit a straight line to this data set using Linear Regression. How would you do it ? Very simple

from sklearn.linear_model import LinearRegression model = LinearRegression() model.fit(x.reshape(-1,1),y) slope = model.coef_ intercept = model.intercept_ print ( "slope = ", slope) print ( "intercept = ", intercept)

slope = [0.84482759] intercept = 2.6034482758620694

point_1 = slope*0 + intercept point_2 = slope*15 + intercept print ( point_1, point_2) plt.scatter( x,y,alpha=0.5) plt.plot([0,15], [point_1,point_2],color="red")

[2.60344828] [15.27586207]

plt.scatter( x,y,alpha=0.5) plt.plot([0,15], [point_1,point_2],color="red") y_actual = y y_predicted = model.predict(x.reshape(-1,1)) for index,x_count in enumerate(x) : if y_actual[index] > y_predicted[index] : plt.vlines(x=x_count, ymin=y_predicted[index], ymax=y_actual[index], color='b') if y_actual[index] <= y_predicted[index] : plt.vlines(x=x_count, ymin=y_actual[index], ymax=y_predicted[index], color='b')

The blue lines represent the residuals (or errors). We can calculate the slope (and intercept) of the fit using OLS (Ordinary Least Squares) or using Gradient Descent. We already know how OLS works in Linear Regression. We will see how Gradient Descent works. The equation for a straight line that would fit all the data points is some variation of

where

- m = slope
- b = intercept

Either way, we are minimizing the *Sum of Squares of Errors*. We started out with the definition of this at the beginning of the chapter.

### Cost Function

Just to make things simple, assume a value of intercept (b) to be fixed at 2.6 ( b = 2.6 as we have previously solved for it). Imagine that we chart the cost function with different values of slope(m).

n = len(y) cost_function = [] m = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0] for slope in m : cost = 0 for i in range(n): cost = cost + (1/(2*n)) * ( (y_actual[i] - slope * x[i] - 2.6) ** 2 ) cost_function.append(cost)

plt.scatter(m,cost_function)

Visually, we can eyeball the minimum value of the cost function to be somewhere around 0.8. This is inline with the scikit learn’s LinearRegression model that we have solved above.

But how to mathematically solve for this (without using Ordinary Least Squares) ? That is where Gradient Descent comes in.

Gradient Descent is a technique to find out the minimum of a function numerically.

Imagine a ball put at a random location on the cost curve shown above.

If you were to let the ball go, it would roll down to the bottom. Why does this happen ? Gravity moves the ball from a higher energy state to a lower energy state. You already know this. What is more interesting is the path it takes. The path should always be from a position of higher slope to a position of lower slope.

If you see the slope of the ball at each of the 4 positions highlighted above, it is pretty clear that the slope(dashed line) is decreasing with every move down the curve. Slope represents the derivative of the curve. In this case derivative of the cost function with respect to the slope (x-axis).

#### How much do you move by ?

The amount and the direction you move is controlled by how much the cost function changes.

How much we move is based on how fast the cost function changes with the slope (or intercept). The way to learn that is by finding out the derivative of the cost function with respect to the slope (and intercept). For now, just to make things simple and to be able to view things in 2D, we are only keeping the slope as the variable (and the intercept as constant). We will see in the next section how to work on minimizing the cost function for both slope and intercept.

x = x.astype(float) y = y.astype(float) print ( x ) print ( y ) steps = 5 m = 0 n = len(x) l_rate = 0.0001 # Start Gradient Descent for step in range(steps) : y_pred = m * x + 2.6 # Derivative of the cost function w.r.t m m_der = (-1/n) * sum( (y - y_pred) * x) # move m m = m - m_der print ( m)

[ 1. 3. 4. 5. 6. 7. 8. 10.] [ 4. 3. 7. 7. 8. 10. 8. 11.] 31.700000000000003 -1125.3500000000001 41106.975000000006 -1500372.8875000002 54763642.09375

The value of *m* oscillates hugely. That is because, *m*

gives a general sense of direction, but doesn’t tell us how far we need to go. You can’t go an infinite distance in the direction of m. We need to take baby steps.

Take a small step, evaluate slope, take another small step in the direction of the least slope. This is the essence of Gradient Descent.

It is like a baby learning to take steps. So, there is a concept called **learning rate** that controls how far we move in the direction of the most descent. Let’s rewrite the program with *learning rate*.

import time l_rate = 0.001 steps = 1000 m = 0 m_array = [] # Start Gradient Descent for step in range(steps) : y_pred = m * x + 2.6 # Derivative of the cost function w.r.t m m_der = (-1/n) * sum( (y - y_pred) * x) # move m m = m - l_rate * m_der m_array.append(m) print ( "optimum slope (m) = ", m)

optimum slope (m) = 0.8453333333333318

Let’s plot the journey of the ball down the cost curve.

# Cost Function n = len(y) y_actual = y cost_function = [] m = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0] for slope in m : cost = 0 for i in range(n): cost = cost + (1/(2*n)) * ( (y_actual[i] - slope * x[i] - 2.6) ** 2 ) cost_function.append(cost) # Steps taken n = len(y) cost_function_m = [] m_steps = m_array for slope in m_steps : cost = 0 for i in range(n): cost = cost + (1/(2*n)) * ( (y_actual[i] - slope * x[i] - 2.6) ** 2 ) cost_function_m.append(cost) plt.scatter(m_steps,cost_function_m) # steps taken. plt.scatter(m,cost_function) # cost function plt.xlabel("m - slope") plt.ylabel("cost function")

Text(0, 0.5, 'cost function')

As you can see, even after 500 steps, the ball was not able to roll down to it’s minimum. That is because of such a small learning rate – 0.0001.

### Learning Rate

How fast do you go down the path ? It depends on how fast you want to converge (without overshooting). An arbitrary parameter called learning rate( α ) would determine how fast you go down the path. If you want to converge fast, can you increase the learning rate ? Probably not. Here is why.

If you set a learning rate = 0.1, that is roughly how fast you move along the x-axis. However, if you set the learning rate to 0.7 (thinking you could move down the curve faster), here is what would happen – You essentially miss the minimum.

Here is a quick plot of how the ball moves with a learning rate of 0.05 within just 100 iterations. The ball is going back and forth because it is overshooting. However, it finally settles down at the minimum.

#### Optimize Gradient Descent for both Slope & Intercept

So far, we have optimized Gradient Descent for just *slope*. How about the intercept ? The moment we introduce the 2nd parameter – *intercept* , the cost function becomes 3d.

- x-axis =
**slope** - y-axis =
**intercept** - z-axis =
**cost function**

Matplotlib provides a rudementary 3d scatter plot. We will be using that to plot the 3d plot of the cost function.

import matplotlib.pyplot as plt import numpy as np x = np.array([1,3,4,5,6,7,8,10]) y = np.array([4,3,7,7,8,10,8,11]) # This import registers the 3D projection, but is otherwise unused. from mpl_toolkits.mplot3d import Axes3D # noqa: F401 unused import import matplotlib.pyplot as plt import numpy as np %matplotlib inline slope_values = np.arange(start=0,stop=5,step=0.05) intercept_values = np.arange(start=0,stop=5,step=0.05) # y_pred = slope * x + intercept n = len(y) cost_function = [] for index, slope in enumerate(slope_values) : cost = 0 for i in range(n): cost = cost + (1/(2*n)) * ( (y[i] - slope_values[index] * x[i] - intercept_values[index]) ** 2 ) cost_function.append(cost) fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.scatter(slope_values, intercept_values, cost_function,marker='o') plt.show()

Jupyter notebook doesn’t allo 3-d rotation, but if you try this in your standard IDE (say VS Code), you would be able to get a 3-d look at the plot.

Now, let’s optimize the cost function for both slope and intercept. Here are the partial derivatives of the cost function for the slope and intercept.

x = np.array([1,3,4,5,6,7,8,10]) y = np.array([4,3,7,7,8,10,8,11]) l_rate = 0.01 # Learning rate steps = 4000 # number of iterations ( steps ) m = 0 # initial slope b = 0 # initial intercept n = float(len(x)) m_array = [] b_array = [] # Start Gradient Descent for step in range(steps) : y_pred = m * x + b # Derivative of the cost function w.r.t slope (m) m_der = (-1/n) * sum( (y - y_pred) * x) # Derivative of the cost function w.r.t intercept (b) b_der = (-1/n) * sum( y-y_pred ) # move m m = m - l_rate * m_der b = b - l_rate * b_der # gather the slope and intercept in an array to plot later m_array.append(m) b_array.append(b) print (" optimim slope(m) = ", m) print ( "optimum intercept (m) = ", b)

optimim slope(m) = 0.8450107631510549 optimum intercept (m) = 2.6022056448336817

Now that we have the

slope_values = np.arange(start=0,stop=3,step=0.05) intercept_values = np.arange(start=0,stop=3,step=0.05) n = len(y) cost_function = [] for index, slope in enumerate(slope_values) : cost = 0 for i in range(n): cost = cost + (1/(2*n)) * ( (y[i] - slope_values[index] * x[i] - intercept_values[index]) ** 2 ) cost_function.append(cost) slope_values_new = m_array intercept_values_new = b_array cost_function_new = [] for index, slope in enumerate(slope_values_new) : cost = 0 for i in range(n): cost = cost + (1/(2*n)) * ( (y[i] - slope_values_new[index] * x[i] - intercept_values_new[index]) ** 2 ) cost_function_new.append(cost) fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.scatter(slope_values, intercept_values, cost_function,marker='o') ax.scatter(slope_values_new, intercept_values_new, cost_function_new,marker='o') plt.show()

### Stochastic Gradient Descent

For every iteration in Gradient Descent algo, the entire dataset is used to calculate the derivative of the cost function. This is very expensive if the dataset gets larger.

Imagine real world problems like image processing that has millions of pixels in a single image. Gradient Descent becomes almost impossible to compute if we don’t optimize.

One possible solution is to use **Stochastic Gradient Descent**. The word **Stochastic** stands for *random*.
Instead of using every observation ( rows in the dataset), just use a
random observation each time the derivative is being computed. In a
standard Gradient Descent, the derivative of the cost function w.r.t
slope is defined like this.

where i is the index of a random data row.

In Stochastic Gradient Descent, calculating the cost function is not done for the entire training set. Instead, you pick a random row in the dataset and calculate the cost function for that particular row only.

- Surprisingly, this gives pretty good results (given the compromise)
- This is computationally so much more efficient than doing the full dataset.

Let’s do this in python.

import numpy as np import matplotlib.pyplot as plt %matplotlib inline l_rate = 0.001 steps = 1000 m = 0 n = len(x) x = np.array([1,3,4,5,6,7,8,10]) y = np.array([4,3,7,7,8,10,8,11]) m_array = [] cost_function_m = [] # Start Gradient Descent for step in range(steps) : # CHANGE - At every step, get a new random number random_index = np.random.randint(0,len(x)) # CHANGE - only calculate the predicted "y" value for that particular data row y_pred = m * x[random_index] + 2.6 # Derivative of the cost function w.r.t m # CHANGE - calculate the derivative only for a particular row in the data m_der = (-1/n) * sum( (y[random_index] - y_pred) * x) # move m m = m - l_rate * m_der m_array.append(m) m_steps = m_array for slope in m_steps : cost = 0 for i in range(n): cost = cost + (1/(2*n)) * ( (y[i] - slope * x[i] - 2.6) ** 2 ) cost_function_m.append(cost) plt.scatter(m_steps,cost_function_m) # steps taken. plt.xlabel("m - slope") plt.ylabel("cost function") print ( "optimum slope (m) = ", m)

optimum slope (m) = 0.8793576691371693

That’s pretty close to the real value (as calculated by OLS), right ?

For just 10 observations, this is not a big deal. Imagine the performance gain if the number of rows were extremely large, as would happen in real datasets. However, there is a cost to this trade-off. The solution (optimum slope in this case) varies with each run. For example, try running the code above 4 or 5 times – each time you get a different solution. Although the difference is not a lot, still the stochastic gradient descent results in a slightly different solution with every run. How to counter this ? A compromise between standard gradient descent and stochastic gradient descent is possible – It is called Mini-batch gradient descent.

### Mini-batch Gradient Descent

In practice though, a technique called mini-batch Gradient Descent is used mostly for Gradient Descent problems. It is hybird solution between standard gradient descent and stochastic gradient descent. The following picture highlights the difference between standard vs stochastic vs mini-batch gradient descent methods.

Let’s program Stochastic Gradient Descent in Python.

import numpy as np import matplotlib.pyplot as plt %matplotlib inline l_rate = 0.001 steps = 1000 m = 0 n = len(x) x = np.array([1,3,4,5,6,7,8,10]) y = np.array([4,3,7,7,8,10,8,11]) m_array = [] cost_function_m = [] x_range = np.arange(len(x)) # Start Gradient Descent for step in range(steps) : # CHANGE - At every step, get a set of new random numbers random_index = np.random.choice(x_range, size=3, replace=False) # CHANGE - only calculate the predicted "y" value for that particular data row y_pred = m * x[random_index] + 2.6 # Derivative of the cost function w.r.t m # CHANGE - calculate the derivative only for a particular row in the data m_der = (-1/n) * sum( (y[random_index] - y_pred) * x[random_index]) # move m m = m - l_rate * m_der m_array.append(m) m_steps = m_array for slope in m_steps : cost = 0 for i in range(n): cost = cost + (1/(2*n)) * ( (y[i] - slope * x[i] - 2.6) ** 2 ) cost_function_m.append(cost) plt.scatter(m_steps,cost_function_m) # steps taken. plt.xlabel("m - slope") plt.ylabel("cost function") print ( "optimum slope (m) = ", m)

optimum slope (m) = 0.8472922870088795

This time the slope value is pretty steady.

Mini-batch gradient descent achieves a compromise between the time-consuming, but accurate Gradient Descent and a quick, but slighlty inaccurate Stochastic Gradient Descent.

Gradient Descent is a generic Cost Minimization algorithm. As long as there is a convex cost function. If there are multiple minima, Gradient Decent only arrives at a local minima.

### Gradient Descent for Logistic Regression

If you have been through the machine learning tutorial, you must have already seen what is logistic regression works and the math behind Logistic Regression. In the tutorial, we have used Scikit Learn’s LogisticRegression class to fit the data using Logistic Regression. However, we want to understand how Gradient Descent works for Logistic Regression.

In order to do that, we have to first understand 3 things

- Logistic Regression equation
- Cost function
- Partial Derivative of the Cost function w.r.t. x

The equation for logistic regression is

where

- w is a vector of numbers
- b is a number
- x is a vector of predictors

We haven’t seen w and b in Linear Regression that we saw previously, right ? Where have them sprung from ? Well, we are generalizing for the number of predictors a bit. For a single predictor, we can write the Logistic Regression as follows.

where

- m slope
- b intercept

The output of the Logistic Regression is actually a sigmoid curve.

You can get a visual like so.

from scipy.special import expit import numpy as np import matplotlib.pyplot as plt %matplotlib inline x = np.linspace(-10,10) y = expit (x) plt.plot(x,y) plt.grid()

The cost function for Logistic Regression can be formulated similar to a Linear Regression

However, this can result in a non-convex curve for logistic regression. So, instead of using Sum of Squares of Error, Logistic Regresion uses Cross Entropy for it’s cost function.

The partial derivative of the Cost function w.r.t (w,b) is

Now that we have the 3 required things, let’s write our gradient descent learning algorithm

where *α*

is the learning rate

For a single predictor, if you think of the slope/intercept (like the Linear Regression above), this would become

where i is the number of rows in the dataset

Now, let’s implement Logistic Regression in Python. Let’s take a simple dataset.

import numpy as np import matplotlib.pyplot as plt %matplotlib inline x = np.array([1,3,4,5,6,7,8,10]) y = np.array([0,0,0,1,1,1,1,1]) plt.scatter(x,y,c=y)

Seems simple enough, right ? Let’s first get a baseline using Scikit Learn’s LogisticRegression model.

from sklearn import linear_model from scipy.special import expit model_lr = linear_model.LogisticRegression(C=1e5, solver='lbfgs') model_lr.fit(x.reshape(-1,1), y)

LogisticRegression(C=100000.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

Let’s fit it visually.

x_test = np.linspace(1.0,10.0,100) # predict dummy y_test data based on the logistic model y_test = x_test * model_lr.coef_ + model_lr.intercept_ sigmoid = expit(y_test) plt.scatter(x,y, c=y) # ravel to convert the 2-d array to a flat array plt.plot(x_test,sigmoid.ravel(),c="green", label = "logistic fit") plt.yticks([0, 0.2, 0.4, 0.5, 0.6, 0.7, 1]) plt.axhline(.5, color="red", label="cutoff") plt.legend(loc="lower right")

Let’s try and fit this data using Logisic Regression based on Gradient Descent.

model_lr.intercept_

array([-67.57978497])

import numpy as np from math import exp import matplotlib.pyplot as plt %matplotlib inline l_rate = 0.001 steps = 1000 m = 0 n = len(x) x = np.array([1,3,4,5,6,7,8,10]) y = np.array([0,0,0,1,1,1,1,1]) import time l_rate = 1 steps = 1000 m = 0 m_array = [] # Start Gradient Descent for step in range(steps) : y_pred_log = m * x + (-67.57) y_pred = 1/(1 + np.exp(-y_pred_log)) # Derivative of the cost function w.r.t m m_der = (-1/n) * sum( (y - y_pred) * x) # move m m = m - l_rate * m_der m_array.append(m) print ( "optimum slope (m) = ", m)

optimum slope (m) = 15.017184591647284

Once again, we got pretty close to the slope as predicted by Scikit Learn’s LogisticRegression model.

model_lr.coef_

array([[15.03246322]])

Now that we understand how Gradient Descent works, let’s move on to the next important topic in Neural Networks – Back Propogation.