Machine Learning Lifecycle

  Machine Learning in Python


Machine Learning Lifecycle

Just like any software engineering project, a Machine Learning project also has a lifecycle. Since Machine Learning is more data-oriented, the bulk of the time is spent working with data. At a high level, the machine learning lifecycle looks something like this.

We are not talking about some of the much higher level project activities like

  • Project Objectives
  • Staffing
  • Risk Management, etc.

Those will be talked about in the context of pure Project Management. In this section, we will be talking about the activities that you would have to be part of as either a Machine Learning Engineer or Project lead.

If you are wondering why the boxes are not of equal size, it is to signify the amount of time you will be spending in each of these activities. As you can see, the bulk of the activities are centered around the Data Ingestion process – and that will be the focus of this section. Modeling is what the rest of this course will focus on. Deployment covers how the actual Machine Learning solution will be deployed in a live environment and how the results will be distributed to the users.

Data Ingestion

This is where you will be spending most of your time as an ML engineer. Data is messy – there are so many things to be done, like finding the right data sources, cleansing, deduplication, validation etc. These are pretty broad topics that require a variety of skills, like SQL, data pre-processing techniques, good Excel skills and so on. We will not be discussing all of the steps in data ingestion. We will only be focusing on the following activities, specifically in the context of NumPy, Pandas & Scikit Learn.

  • Data Import
    • Excel files
    • Flat files
    • Web Scraping
    • API
    • Databases
  • Feature Extraction
  • Data Preprocessing
    • Feature Scaling
    • Non-linear transformations
    • Encoding Categorical Features
      • Ordinal Encoding
      • One-Hot Encoding
  • Imputation of missing values
    • Simple Imputer
    • Predictive Imputer
  • Dimensionality Reduction*

* Will be dealt with on day 18

Data Import

Data import is not a tedious step, but it is typically time consuming. Sourcing the data is not all that straightforward most of the time.

  • Easy – Sometimes, data is readily available. For example, if you were building a movie recommendation algorithm at Netflix, most of the data is readily available in their database.
  • Medium – Data is readily available, but in different silos/formats. Continuing the same example, imagine you were to get data related to external movie ratings (on top of Netflix’s own movie data). This would require some level of data mangling, munging, mixing etc.
  • Hard – Data is sometimes hard to get using regular methods. You might have to resort to special techniques like data scraping, or write bulk downloaders using APIs. In some of these cases, the quality of the data might also be questionable.

We will be dealing with some of the simpler methods of importing data.

Import Data from Excel files

Using NumPy

Numpy does not have functionality to load data directly from Excel (in .xls or .xlsx format). However, you can save the sheet as a CSV in Excel and use the genfromtxt() function.

import numpy as np

data = np.genfromtxt("./data/iris.csv",delimiter=",",skip_header=1)
data[0:4,:]
array([[5.1, 3.5, 1.4, 0.2, 0. ],
       [4.9, 3. , 1.4, 0.2, 0. ],
       [4.7, 3.2, 1.3, 0.2, 0. ],
       [4.6, 3.1, 1.5, 0.2, 0. ]])

Using Pandas

To read Excel files, a Python package called xlrd is required (newer versions of Pandas read .xlsx files with openpyxl instead). Once installed, you can use Pandas’ read_excel() function.

> pip install xlrd
import pandas as pd

data = pd.read_excel("./data/shopping_cart.xlsx")
data.head()
InvoiceNo 	StockCode 	Description 	Quantity 	InvoiceDate 	UnitPrice 	CustomerID 	Country
0 	536365 	85123A 	WHITE HANGING HEART T-LIGHT HOLDER 	6 	2010-12-01 08:26:00 	2.55 	17850 	United Kingdom
1 	536365 	71053 	WHITE METAL LANTERN 	6 	2010-12-01 08:26:00 	3.39 	17850 	United Kingdom
2 	536365 	84406B 	CREAM CUPID HEARTS COAT HANGER 	8 	2010-12-01 08:26:00 	2.75 	17850 	United Kingdom
3 	536365 	84029G 	KNITTED UNION FLAG HOT WATER BOTTLE 	6 	2010-12-01 08:26:00 	3.39 	17850 	United Kingdom
4 	536365 	84029E 	RED WOOLLY HOTTIE WHITE HEART. 	6 	2010-12-01 08:26:00 	3.39 	17850 	United Kingdom

Import Data from Flat files

Using NumPy

We have already seen loading data from a CSV into an array using numpy’s genfromtxt() function. However, you can use other delimiters, like

  • tab ( \t )
  • pipe delimited ( | ) etc
import numpy as np

data = np.genfromtxt("./data/iris.txt",delimiter="\t",skip_header=1)
data[0:4,:]
array([[5.1, 3.5, 1.4, 0.2, 0. ],
       [4.9, 3. , 1.4, 0.2, 0. ],
       [4.7, 3.2, 1.3, 0.2, 0. ],
       [4.6, 3.1, 1.5, 0.2, 0. ]])

Using Pandas

Pandas has a function (read_csv) to load data with any kind of delimiter, like

  • tab ( \t )
  • pipe delimited ( | ) etc
import pandas as pd

data = pd.read_csv("./data/iris.csv")
data.head()
sepal_length 	sepal_width 	petal_length 	petal_width 	species
0 	5.1 	3.5 	1.4 	0.2 	0
1 	4.9 	3.0 	1.4 	0.2 	0
2 	4.7 	3.2 	1.3 	0.2 	0
3 	4.6 	3.1 	1.5 	0.2 	0
4 	5.0 	3.6 	1.4 	0.2 	0
import pandas as pd

data = pd.read_csv("./data/iris.txt",delimiter="\t")
data.head()

sepal_length 	sepal_width 	petal_length 	petal_width 	species
0 	5.1 	3.5 	1.4 	0.2 	0
1 	4.9 	3.0 	1.4 	0.2 	0
2 	4.7 	3.2 	1.3 	0.2 	0
3 	4.6 	3.1 	1.5 	0.2 	0
4 	5.0 	3.6 	1.4 	0.2 	0

Import Data using Web Scraping

Downloading HTML tables using Excel

Simple HTML tables on the web can be downloaded using Excel’s data function. For example, in some of the chapters of this course, I have downloaded population data from Wikipedia using Excel.

To download it from Excel, go to the Data → From Web menu.

Enter the URL and click Import.

Data is downloaded into Excel cells.

Scrape Websites

Sometimes the only form of data available is on the browser – for example, you are a third party aggregator trying to gather the best promotion on flight tickets from multiple websites. The actual website might not be willing to give you the data straight away. In cases like this, you have to literally scrape the price/discount off of their website.

Luckily, there are some libraries in Python that can do all the heavy lifting (HTTP handshake, parsing, creating deep data structures etc). One such library is Beautiful Soup. Let’s see how to scrape a page with it.

Install Beautiful Soup version 4

> pip install beautifulsoup4

Let’s find out the price of the iPhone XS. Go to the Apple website, navigate to the iPhone XS page, and save that page’s URL in a variable.

url = "https://www.apple.com/shop/buy-iphone/iphone-xs"

Beautiful Soup does not actually go out to the web and get the web page. For that we have to use another Python library called requests (installable with pip install requests). It is a basic HTTP library that can fetch content on the web for us. Once we get the actual content of the web page, Beautiful Soup can parse it and present it as a searchable object.

from bs4 import BeautifulSoup
import requests

Get the web page content and give it to Beautiful Soup to parse.

html = requests.get(url).content

soup = BeautifulSoup(html,'html.parser')

Now that we have the content, we have to figure out where exactly the prices are stored. To find the tag where the price is stored, just right-click on the web page in the browser and select View Page Source. In the page source, search for the price you are looking for. For example, if the listed price of the iPhone is 999 dollars, search the page source for 999.

The prices are displayed using a span tag with class current_price. Pull out all the tags with a class value of “current_price”. There are multiple ways to do it, but we will just look at one.

soup.select(".current_price")[0]

<span class="current_price">From <b>$549</b></span>

We are just looking at the first match – there are many more prices (based on the options selected).

Beautiful Soup is good enough for low-volume web scraping. For high-volume web scraping (search-engine-level web scraping), use Scrapy.
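As an aside, for a one-off extraction like this you can even stay in the standard library – html.parser can pull the same span out without installing anything. A minimal sketch, assuming the same markup shown above:

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collects the text inside every <span class="current_price">."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while we are inside a matching span
        self.prices = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1     # nested tags inside the span, e.g. <b>
        elif tag == "span" and dict(attrs).get("class") == "current_price":
            self.depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.prices.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)

parser = PriceParser()
parser.feed('<span class="current_price">From <b>$549</b></span>')
print(parser.prices)   # ['From $549']
```

Beautiful Soup is still the better tool once pages get even slightly messy; this is just to show there is no magic in the extraction itself.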

API

API stands for Application Programming Interface. It is a way to give programmatic access to a resource. For example, your Alexa device automatically (programmatically) fetches the weather data for a particular zip code from a weather service. How does it do it?

Weather services provide APIs to programmatically fetch weather data. Other examples could be xe.com providing an API for exchange rates, or Bloomberg providing an API for stock tickers.

In this section, let’s use Python to get the weather information for a particular city. In order to avoid abuse and keep track of requests, an API key is required most of the time. You can sign up for OpenWeatherMap (the weather API used below) and a key will be provided to you. Without that key, the service would not honour API requests.

APIs are typically exposed as URLs. For example, to get the weather for a city, OpenWeatherMap exposes a URL of the form shown below.

Let’s use Python to extract the weather for a city in India – say Hyderabad. Don’t forget to append the API key using the parameter appid. See the URL formation below.

import requests

url = "http://api.openweathermap.org/data/2.5/weather?q=Hyderabad&appid="
key = "37a81ae1e682ac******b0a3727080a6"

url = url + key

html = requests.get(url).content

html
b'{"coord":{"lon":78.47,"lat":17.36},"weather":[{"id":803,"main":"Clouds","description":"broken clouds","icon":"04d"}],"base":"stations","main":{"temp":303.29,"pressure":1008,"humidity":66,"temp_min":302.59,"temp_max":304.15},"visibility":6000,"wind":{"speed":5.7,"deg":250},"clouds":{"all":75},"dt":1561704820,"sys":{"type":1,"id":9214,"message":0.0071,"country":"IN","sunrise":1561680862,"sunset":1561728236},"timezone":19800,"id":1269843,"name":"Hyderabad","cod":200}'

Incidentally, OpenWeatherMap provides data in a specific format called JSON. JSON stands for JavaScript Object Notation. Once again, Python provides a standard library called json that can parse JSON data for us.

import json

data = json.loads(html)
data

{'coord': {'lon': 78.47, 'lat': 17.36},
 'weather': [{'id': 803,
   'main': 'Clouds',
   'description': 'broken clouds',
   'icon': '04d'}],
 'base': 'stations',
 'main': {'temp': 303.29,
  'pressure': 1008,
  'humidity': 66,
  'temp_min': 302.59,
  'temp_max': 304.15},
 'visibility': 6000,
 'wind': {'speed': 5.7, 'deg': 250},
 'clouds': {'all': 75},
 'dt': 1561704820,
 'sys': {'type': 1,
  'id': 9214,
  'message': 0.0071,
  'country': 'IN',
  'sunrise': 1561680862,
  'sunset': 1561728236},
 'timezone': 19800,
 'id': 1269843,
 'name': 'Hyderabad',
 'cod': 200}

Once you have the data parsed into a Python dictionary, you can use simple subscript notation to extract the data. For example, to get the city name, use

data["name"]
'Hyderabad'

To get the minimum and maximum temperature, use

data["main"]["temp_min"]
302.59
data["main"]["temp_max"]
304.15

Just in case you are wondering why the temperature values are so large, it is because the unit of temperature is Kelvin.
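Converting those Kelvin readings to Celsius is a one-liner; a quick sketch using the values extracted above:

```python
def kelvin_to_celsius(k):
    """Convert a temperature from Kelvin to degrees Celsius."""
    return k - 273.15

# The min/max temperatures from the API response above
print(round(kelvin_to_celsius(302.59), 2))  # 29.44
print(round(kelvin_to_celsius(304.15), 2))  # 31.0
```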

Database

Sometimes you might be asked to pull the data straight from an SQL database. To do this, though, you will need to understand the language of relational databases – SQL, or Structured Query Language. And luckily, we can do all of this straight from Python (or outside of Python, using any SQL client).

Typically, you will be given the database details like below.

Server Address : xx.xx.xx.xx
port           : 33xx
schema         : xxxx

user id        : xxxx
password       : xxxx

For example, I have installed a simple MySql database on my local machine. I will be showing how to connect to the database right from inside the python environment. You can also use any other SQL interaction tools, like SQL Workbench etc.

To be able to connect to SQL server, you would need a Python connector. Installing it is pretty simple.

> pip install mysql-connector

import mysql.connector

db = mysql.connector.connect(
  host     = "localhost",
  user     = "root",
  passwd   = "xxxxxxxx"
)

db

Database results are typically retrieved using something called a cursor. A cursor is just a pointer to a set of data retrieved from the database. It is up to us to iterate over the retrieved data and get what we want, typically using a loop. So, this is basically a two-step process

  1. Execute an SQL statement and get the result into a cursor
  2. Iterate over the cursor to get the data

For example, let’s do these 2 steps to list all the databases. Each database is essentially a collection of tables.

Step 1 – Get the list of tables into a cursor

cur = db.cursor()

cur.execute("SHOW DATABASES")

Step 2 – Iterate over the cursor to get the list of databases

for database in cur:   # avoid reusing the name db, which is our connection
  print(database)

('information_schema',)
('mysql',)
('performance_schema',)
('sakila',)
('sys',)
('world',)

Once we know the list of databases, we have to select the database first. Once we do that, we can freely go about executing the select statements on that particular database.

cur.execute("use world")

List all the tables in the database.

cur.execute("show tables")
for table in cur:
  print(table)
('city',)
('country',)
('countrylanguage',)

Let’s pick a table – say country. Now, let’s extract all the columns in that table. They will become the columns of our Pandas dataframe.

cur.execute("show columns from country")
column_names = []
for column in cur:
  column_names.append(column[0])

column_names
['Code',
 'Name',
 'Continent',
 'Region',
 'SurfaceArea',
 'IndepYear',
 'Population',
 'LifeExpectancy',
 'GNP',
 'GNPOld',
 'LocalName',
 'GovernmentForm',
 'HeadOfState',
 'Capital',
 'Code2']

Once we have the column names, let’s get the actual data from the table.

cur.execute("select * from country")

import pandas as pd

rows = []
for data in cur:
  rows.append(list(data))

country_data = pd.DataFrame(rows)
country_data.head()
0 	1 	2 	3 	4 	5 	6 	7 	8 	9 	10 	11 	12 	13 	14
0 	ABW 	Aruba 	North America 	Caribbean 	193.0 	NaN 	103000 	78.4 	828.0 	793.0 	Aruba 	Nonmetropolitan Territory of The Netherlands 	Beatrix 	129.0 	AW
1 	AFG 	Afghanistan 	Asia 	Southern and Central Asia 	652090.0 	1919.0 	22720000 	45.9 	5976.0 	NaN 	Afganistan/Afqanestan 	Islamic Emirate 	Mohammad Omar 	1.0 	AF
2 	AGO 	Angola 	Africa 	Central Africa 	1246700.0 	1975.0 	12878000 	38.3 	6648.0 	7984.0 	Angola 	Republic 	José Eduardo dos Santos 	56.0 	AO
3 	AIA 	Anguilla 	North America 	Caribbean 	96.0 	NaN 	8000 	76.1 	63.2 	NaN 	Anguilla 	Dependent Territory of the UK 	Elisabeth II 	62.0 	AI
4 	ALB 	Albania 	Europe 	Southern Europe 	28748.0 	1912.0 	3401200 	71.6 	3205.0 	2500.0 	Shqipëria 	Republic 	Rexhep Mejdani 	34.0 	AL

Great! We just need one last step before we finish creating the table as a Pandas dataframe – set the column names that we extracted in a previous step.

country_data.columns = column_names

country_data.head()
Code 	Name 	Continent 	Region 	SurfaceArea 	IndepYear 	Population 	LifeExpectancy 	GNP 	GNPOld 	LocalName 	GovernmentForm 	HeadOfState 	Capital 	Code2
0 	ABW 	Aruba 	North America 	Caribbean 	193.0 	NaN 	103000 	78.4 	828.0 	793.0 	Aruba 	Nonmetropolitan Territory of The Netherlands 	Beatrix 	129.0 	AW
1 	AFG 	Afghanistan 	Asia 	Southern and Central Asia 	652090.0 	1919.0 	22720000 	45.9 	5976.0 	NaN 	Afganistan/Afqanestan 	Islamic Emirate 	Mohammad Omar 	1.0 	AF
2 	AGO 	Angola 	Africa 	Central Africa 	1246700.0 	1975.0 	12878000 	38.3 	6648.0 	7984.0 	Angola 	Republic 	José Eduardo dos Santos 	56.0 	AO
3 	AIA 	Anguilla 	North America 	Caribbean 	96.0 	NaN 	8000 	76.1 	63.2 	NaN 	Anguilla 	Dependent Territory of the UK 	Elisabeth II 	62.0 	AI
4 	ALB 	Albania 	Europe 	Southern Europe 	28748.0 	1912.0 	3401200 	71.6 	3205.0 	2500.0 	Shqipëria 	Republic 	Rexhep Mejdani 	34.0 	AL
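Incidentally, none of this cursor choreography is MySQL-specific. Python ships with sqlite3, which follows the same connect/cursor/iterate pattern and needs no server, so it is handy for experimenting. The table and figures below are made up purely for illustration:

```python
import sqlite3

# An in-memory database stands in for the MySQL server (no credentials needed)
db = sqlite3.connect(":memory:")
cur = db.cursor()

cur.execute("CREATE TABLE city (name TEXT, population INTEGER)")
cur.executemany("INSERT INTO city VALUES (?, ?)",
                [("Hyderabad", 6809970), ("Chennai", 4646732)])

# Step 1 - execute an SQL statement and get the result into a cursor
cur.execute("SELECT name, population FROM city ORDER BY name")

# Step 2 - iterate over the cursor to get the data
rows = [row for row in cur]
print(rows)   # [('Chennai', 4646732), ('Hyderabad', 6809970)]
```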

Data Preprocessing

This is a pretty important step. Think of it as standardizing data into a format that is more useful for Machine Learning algorithms. We will discuss 3 important steps in data pre-processing.

Feature Scaling

We have already seen (in the Introduction to Classification) that scaled data performs much better with Machine Learning algorithms than un-scaled data. For example, let’s take a quick sample and plot it before and after scaling to understand what is happening visually.

# Generate 25 points of random data with mean of 5 and sd of 1

import numpy as np

x_unscaled = np.random.normal(loc=5, scale=1, size=25)
y_unscaled = np.random.normal(loc=5, scale=1, size=25)

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(x_unscaled, y_unscaled)

Now, let’s scale it.

from sklearn import preprocessing

x_scaled = preprocessing.scale(x_unscaled)
y_scaled = preprocessing.scale(y_unscaled)
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(x_scaled, y_scaled, color="green")
plt.scatter(x_unscaled, y_unscaled, color="red")

Scaled data has a mean of zero and a variance of 1.

x_scaled.mean()
-2.015054789694659e-16
x_scaled.std()
1.0

Scaling does not typically work well if the data

  • is sparse
  • contains outliers.

In cases like these, sklearn provides special scalers (for example, MaxAbsScaler for sparse data and RobustScaler for data with outliers).
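For the standard case, the transformation itself is tiny – subtract the mean, divide by the standard deviation. A stdlib-only sketch of what preprocessing.scale does (sklearn uses the population standard deviation, which statistics.pstdev matches):

```python
from statistics import mean, pstdev

def scale(values):
    """Centre to zero mean and unit variance, like preprocessing.scale."""
    m, s = mean(values), pstdev(values)   # population sd, as sklearn uses
    return [(v - m) / s for v in values]

scaled = scale([2.0, 4.0, 6.0, 8.0])
print(scaled)   # symmetric values around 0, roughly +/-1.34 and +/-0.45
```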

Non-Linear Transformations

Sometimes, some of the features are not on a linear scale. One of the most frequently encountered examples of this is logarithmic data. Here are some simple examples of such data.

  • Alexa Page rank or Google Domain Authority
  • Income data
  • Some Engineering data (for ex., hardness of material) etc

As you know by now (or are going to learn in the next sections as you study more ML algorithms), most ML algorithms are based on finding the distance between points. Exponential data fits poorly with most ML algorithms, so it is suggested that we transform it using logarithmic functions. Numpy offers functions for this. For example, look at the following data.

x_unscaled = np.random.normal(loc=5, scale=1, size=25)
y_unscaled = np.exp(np.random.normal(loc=5, scale=1, size=25))
plt.scatter(x_unscaled,y_unscaled)

Just by looking at the data above, you will instantly realize that this data needs to be transformed. If not, the distances on the y-axis are so large (in comparison to the x-axis) that they will dominate the influence of the x-axis. A simple solution in cases like this is some kind of non-linear transformation, like log.

y_scaled = np.log(y_unscaled)

plt.scatter(x_unscaled,y_scaled)

This looks much more balanced, doesn’t it?

There are other types of non-linear transforms like

  • Quantile Transforms
  • Power Transforms

that we will not be discussing in this section.

Encoding Categorical Features

Categorical features are by definition non-numeric, so they are not ideal for most Machine Learning algorithms.

Most of the time, there are very few distinct values for a categorical feature. For example,

  • sex
    • Male
    • Female
  • Browser
    • Chrome
    • Firefox
    • Edge

There are two types of encoders sklearn provides for encoding categorical values

  • Ordinal Encoder
  • One-Hot Encoder

Ordinal Encoder

import pandas as pd

user_id = [1,2,3,4,5,6,7,8,9,10]
sex     = ["Male","Male","Female","Male","Female","Female","Male","Female","Female","Female"]
browser = ["Chrome","Chrome","Chrome","Firefox","Edge","Firefox","Edge","Chrome","Firefox","Chrome"]

data_dict    = {"user_id": user_id, "sex": sex, "browser": browser}

browser_data = pd.DataFrame(data_dict)

browser_data
user_id 	sex 	browser
0 	1 	Male 	Chrome
1 	2 	Male 	Chrome
2 	3 	Female 	Chrome
3 	4 	Male 	Firefox
4 	5 	Female 	Edge
5 	6 	Female 	Firefox
6 	7 	Male 	Edge
7 	8 	Female 	Chrome
8 	9 	Female 	Firefox
9 	10 	Female 	Chrome
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder().fit(browser_data)

encoder.transform(browser_data)

array([[0., 1., 0.],
       [1., 1., 0.],
       [2., 0., 0.],
       [3., 1., 2.],
       [4., 0., 1.],
       [5., 0., 2.],
       [6., 1., 1.],
       [7., 0., 0.],
       [8., 0., 2.],
       [9., 0., 0.]])

Basically, OrdinalEncoder encodes categorical data into ordinal (integer) data. All it has done in this case is transform each column based on the following simple assignments

  • user_id 1 through 10 = 0 through 9
  • Female = 0, Male = 1
  • Chrome = 0, Edge = 1, Firefox = 2
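Each column’s mapping can be sketched in a few lines of plain Python – sort the unique categories and map each to its index (a sketch of the idea, not sklearn’s actual implementation):

```python
def ordinal_encode(column):
    """Map each category to its position in the sorted unique values."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(column)))}
    return [mapping[value] for value in column]

sex     = ["Male", "Male", "Female", "Male", "Female",
           "Female", "Male", "Female", "Female", "Female"]
browser = ["Chrome", "Chrome", "Chrome", "Firefox", "Edge",
           "Firefox", "Edge", "Chrome", "Firefox", "Chrome"]

print(ordinal_encode(sex))      # [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(ordinal_encode(browser))  # [0, 0, 0, 2, 1, 2, 1, 0, 2, 0]
```

These match the second and third columns of the encoded array above.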

One-Hot Encoder

However, sometimes this kind of encoding is not suitable for some Machine Learning algorithms – simply because the numbers don’t represent an actual magnitude. Meaning, there is no numeric meaning in the transformation for sex.

Male = 1 does not mean it is in any way greater than Female = 0.

Scikit Learn provides another encoder for categorical data that sidesteps this problem – the One-Hot Encoder, or dummy encoder.

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder().fit(browser_data.iloc[:,1:3])

encoder.transform(browser_data.iloc[:,1:3]).toarray()

array([[0., 1., 1., 0., 0.],
       [0., 1., 1., 0., 0.],
       [1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0.],
       [1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 1., 0., 0.]])
encoder.categories_
[array(['Female', 'Male'], dtype=object),
 array(['Chrome', 'Edge', 'Firefox'], dtype=object)]
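Under the hood, the idea is again simple enough to sketch in plain Python – one indicator slot per sorted category, with a 1 marking each row’s value (a sketch of the idea, not sklearn’s implementation):

```python
def one_hot_encode(column):
    """One 0/1 indicator per sorted category; a 1 marks each row's value."""
    categories = sorted(set(column))
    return [[1.0 if value == cat else 0.0 for cat in categories]
            for value in column]

sex = ["Male", "Male", "Female", "Male", "Female",
       "Female", "Male", "Female", "Female", "Female"]

print(one_hot_encode(sex)[:3])  # [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
```

These rows match the first two columns of the encoder output above; exactly one slot is 1 in each row, hence the name “one-hot”.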

Missing Values

Missing values is a huge problem in real datasets. Most of the time, this happens with datasets that is manually collected. Sometimes this is a necessary evil when merging data.

# Let's introduce some missing values in the iris dataset.
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()

iris_small = iris.data[0:10,:]
iris_small[iris_small[:,:]>5.2  ] = np.nan

iris_small
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [nan, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])
iris_small[~np.isnan(iris_small).any(axis=1)]
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

Dropping unknown values (NaN) is typically a decent strategy when dealing with missing values. However, if the missing data itself is biased, the algorithm’s fit could also be biased. For example, imagine the person collecting the iris data could not find enough flowers of a particular species. In cases like that, the missing values would be very skewed. Dropping NAs in such cases could result in biased data, and the resulting algorithm would be biased too.

An alternative way of dealing with NAs is called imputation.

Imputation of missing values

Here is the definition of Impute according to the dictionary.

assign (a value) to something by inference from the value of the products or processes to which it contributes.

iris_small
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [nan, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])
iris_without_na = iris_small[~np.isnan(iris_small).any(axis=1)]
iris_without_na


array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])
iris_without_na.mean(axis=0)
array([4.8       , 3.24444444, 1.42222222, 0.2       ])
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(iris_small)

iris_new = imputer.transform(iris_small)
iris_new
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [4.8, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

Simple Imputer

mean is just one of the strategies available to fill NAs. SimpleImputer supports a couple more strategies that are pretty easy to understand.

  • median
  • most_frequent
  • constant

The trick to choosing a strategy is to understand the type of data we are trying to impute. Here are some scenarios that call for a specific strategy.

  • median – Income data or other data with a large variance at the extremes (or large outliers).
  • most_frequent – This is essentially imputing with the mode of the feature.
  • constant – This is typically used to fill NAs in categorical variables (say Male/Female, plus another constant category such as “Unknown”).
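The mean strategy used earlier boils down to: average the observed values in a column, then substitute that average wherever a value is missing. A stdlib-only sketch on the first column of iris_small (with None standing in for np.nan):

```python
from statistics import mean

def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

# First column of iris_small above, None marking the missing value
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, None, 4.6, 5.0, 4.4, 4.9]
filled = impute_mean(sepal_length)
print(filled[5])  # ~4.8, the mean of the nine observed values
```

That 4.8 is exactly the value SimpleImputer placed in the sixth row above.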

Iterative Imputer

There is one more method for imputation supported by scikit-learn – IterativeImputer. It predicts missing values by modeling each feature as a function of the rest of the features.

iris_small_na = iris_small.copy()  # copy, so the original array is untouched

iris_small_na[3,1] = np.nan
iris_small_na[7,2] = np.nan
iris_small_na[1,3] = np.nan

iris_small_na

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, nan],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, nan, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [nan, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, nan, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])
> pip install --user --upgrade scikit-learn

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(iris_small_na)

iris_small_no_na = imputer.transform(iris_small_na)

iris_small_no_na
array([[5.1       , 3.5       , 1.4       , 0.2       ],
       [4.9       , 3.        , 1.4       , 0.08307883],
       [4.7       , 3.2       , 1.3       , 0.2       ],
       [4.6       , 3.097572  , 1.5       , 0.2       ],
       [5.        , 3.6       , 1.4       , 0.2       ],
       [4.84755009, 3.9       , 1.7       , 0.4       ],
       [4.6       , 3.4       , 1.4       , 0.3       ],
       [5.        , 3.4       , 1.45167903, 0.2       ],
       [4.4       , 2.9       , 1.4       , 0.2       ],
       [4.9       , 3.1       , 1.5       , 0.1       ]])

Data Modeling

Modeling is the second step in the machine learning pipeline. There are many modeling algorithms and we will be studying most of them in Weeks 2, 3 and 4. However, in this section, we will focus on how to validate models.

Model Evaluation Metrics

The actual metric used to evaluate the model varies based on the class of Machine Learning Algorithm used.

  • Classification – Confusion Matrix, Area Under Curve (AUC), ROC
  • Regression – Root Mean Squared Error (RMSE), Mean Absolute Error (MAE)


Model Validation

The evaluation metrics discussed above are run on the test datasets. But that doesn’t necessarily mean that the accuracy metrics apply equally well to real data or even to the dataset at hand (just based on the train/test split). We have to ensure that the model didn’t just memorize the data (overfitting) or is biased (underfitting).

In case you are wondering why we are doing this, it is because almost all the time we only work with a subset of the real dataset. How do we know that our subset is a true representation of the real data ?

This is where validation comes in – validation averages our model’s performance over multiple different subsets of the available data, so that we don’t just rely on one test. Think of model validation as rinse-and-repeat modeling that ensures we get a more realistic result.

Model Evaluation checks for model accuracy on a single test set. Model Validation runs the model evaluation multiple times over different subsets of the data available

There are many techniques to validate models. We will just learn about the following important ones that are widely used in the industry.

  • Hold-out
  • K-fold cross-validation
  • Bootstrapping

Hold-out

Hold-out is the simplest validation method. It is what we have used when we learnt about the basics of Regression and Classification in the previous chapters.

It is a very simple one step process.

from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

iris = datasets.load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=100)
model = LogisticRegression(solver="lbfgs",multi_class="auto", max_iter=200 )
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=200,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
y_predict = model.predict(X_test)

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

print ( "confusion matrix = \n" , confusion_matrix(y_test, y_predict) )
print ( "accuracy score = ",accuracy_score(y_test,y_predict) )

confusion matrix = 
 [[11  0  0]
 [ 0  5  1]
 [ 0  0 13]]
accuracy score =  0.9666666666666667

Cross-Validation

Think of cross-validation as a repeated version of the hold-out method. For example, in hold-out we just hold out a percentage of the data (say 20%) and train the model on the remaining data. Finally, we test it on the held-out 20% of the data.

The problem with this approach is that it could result in over-fitting. To avoid this, we use cross-validation.

In cross-validation, we essentially do the same thing as the “hold-out” method – except that it is done in multiple iterations. The following picture shows a quick visual on how we would do this on just 10 rows of data. In this specific case, it is a 5 fold validation, because we have divided the entire dataset into 5 folds (2 rows each).

And finally, the test results over each of the iterations are averaged to provide a better estimation of the algorithm’s performance. That way, we get a more realistic performing algorithm (compared to the hold-out method).
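The bookkeeping described above is easy to sketch by hand for the 10-row, 5-fold case – indices only, with scikit-learn doing the real work in practice:

```python
# 10 rows split into 5 folds of 2 rows each, as described above
n_rows, n_folds = 10, 5
indices = list(range(n_rows))
fold_size = n_rows // n_folds

folds = []
for k in range(n_folds):
    # rows held out for testing in iteration k
    test_idx = indices[k * fold_size:(k + 1) * fold_size]
    # everything else is used for training
    train_idx = [i for i in indices if i not in test_idx]
    folds.append((test_idx, train_idx))
    print(f"fold {k}: test rows {test_idx}, train rows {train_idx}")
```

Every row lands in the test set exactly once, which is why averaging the 5 test scores uses all of the data.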

Scikit Learn provides readymade implementation of Cross validation.

from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

iris = datasets.load_iris()
k_fold = model_selection.KFold(n_splits=10, random_state=100)
model = LogisticRegression(solver="lbfgs",multi_class="auto", max_iter=200 )
scores = model_selection.cross_val_score(model, iris.data, iris.target, cv=k_fold)

scores
array([1.        , 1.        , 1.        , 1.        , 0.93333333,
       0.86666667, 1.        , 0.86666667, 0.86666667, 0.93333333])
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.95 (+/- 0.12)

So, the cross-validated accuracy is 95% (as opposed to 96.6% from a single train/test split using the hold-out method above). In this case, the data is pretty good and there is not a lot of difference between the hold-out and cross-validation accuracy. However, on real-world datasets where the variance is large, cross-validation does make a huge difference.

What have we achieved here ?

  • Reduced over-fitting
  • Reduced model bias

k-fold cross-validation is the preferred method of cross-validation. There are other validation methods, like

  • LOOCV (Leave One Out Cross Validation)
  • LPOCV (Leave P Out Cross Validation)
  • Stratified k-fold (a variant of k-fold that preserves class proportions in each fold)