Machine Learning Lifecycle
Just like any project following the software engineering process, Machine Learning also has a lifecycle. Since Machine Learning is more data oriented, the bulk of the time is spent with data. At a high level, the machine learning lifecycle looks something like this.

We are not talking about some of the much higher-level project activities like
- Project Objectives
- Staffing
- Risk Management, etc.
Those will be talked about in the context of pure Project Management. In this section, we will be talking about the activities that you would have to be part of as either a Machine Learning Engineer or a Project Lead.
If you are wondering why the boxes are not even in size, it is to signify the amount of time you will be spending on each of these activities. As you can see, the bulk of the activities are centered around the Data Ingestion process – and that will be the focus of this section. Modeling is what the rest of this course will focus on. Deployment will cover how the actual Machine Learning solution is deployed in a live environment and how the results are distributed to the users.
Data Ingestion
This is where you will be spending most of your time as an ML engineer. Data is messy – there are so many things to be done, like finding the right data sources, cleansing, deduplication, validation, etc. These are pretty broad topics that require a variety of skills, like SQL, data pre-processing techniques, good Excel skills and so on. We will not be discussing all of the steps in data ingestion. We will only be focusing on the following activities, specifically in the context of NumPy, Pandas & Scikit Learn.
- Data Import
  - Excel files
  - Flat files
  - Web Scraping
  - API
  - Databases
- Feature Extraction
- Data Preprocessing
  - Feature Scaling
  - Non-linear transformations
  - Encoding Categorical Features
    - Ordinal Encoding
    - One-Hot Encoding
  - Imputation of missing values
    - Simple Imputer
    - Predictive Imputer
- Dimensionality Reduction*
* Will be dealt with on day 18
Data Import
Data import is not a tedious step, but it is typically time consuming. Sourcing the data is not all that straightforward most of the time.
- Easy – Sometimes, data is readily available. For example, if you were building a movie recommendation algorithm at Netflix, most of the data is readily available in their database.
- Medium – Data is readily available, but in different silos/formats. For example, continuing with the example above, imagine you had to get data related to external movie ratings (on top of Netflix’s own movie data). This would require some level of data mangling, munging, mixing, etc.
- Hard – Data is sometimes hard to get using regular methods. You might have to resort to special techniques like data scraping, or writing bulk downloaders using APIs, etc. In some of these cases, the quality of the data might also be questionable.
We will be dealing with some of the simpler methods of importing data.
Import Data from Excel files
Using NumPy
NumPy does not have functionality to load data directly from Excel (in .xls or .xlsx format). However, you can convert it to a CSV in Excel and use the genfromtxt() function.
import numpy as np
data = np.genfromtxt("./data/iris.csv",delimiter=",",skip_header=1)
data[0:4,:]
array([[5.1, 3.5, 1.4, 0.2, 0. ],
[4.9, 3. , 1.4, 0.2, 0. ],
[4.7, 3.2, 1.3, 0.2, 0. ],
[4.6, 3.1, 1.5, 0.2, 0. ]])
Using Pandas
To read Excel files, the Python package xlrd is required. Once installed, you can use Pandas’ read_excel() function.
> pip install xlrd
import pandas as pd
data = pd.read_excel("./data/shopping_cart.xlsx")
data.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
Import Data from Flat files
Using NumPy
We have already seen loading data from a CSV into an array using NumPy’s genfromtxt() function. However, you can use other delimiters like
- tab ( \t )
- pipe ( | ) etc.
import numpy as np
data = np.genfromtxt("./data/iris.txt",delimiter="\t",skip_header=1)
data[0:4,:]
array([[5.1, 3.5, 1.4, 0.2, 0. ],
[4.9, 3. , 1.4, 0.2, 0. ],
[4.7, 3.2, 1.3, 0.2, 0. ],
[4.6, 3.1, 1.5, 0.2, 0. ]])
Using Pandas
Pandas has a function (read_csv) that can load data with any kind of delimiter (tab delimited, pipe delimited, etc.).
import pandas as pd
data = pd.read_csv("./data/iris.csv")
data.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
import pandas as pd
data = pd.read_csv("./data/iris.txt",delimiter="\t")
data.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
Import Data using Web Scraping
Downloading HTML tables using Excel
Simple HTML tables on the web can be downloaded using Excel’s data function. For example, in some of the chapters of this course, I have downloaded population data from Wikipedia using Excel.

To download it from Excel, go to the following menu location.

Enter the URL and click Import.

Data is downloaded into Excel cells.

Scrape Websites
Sometimes the only form of data available is in the browser – for example, you are a third-party aggregator trying to gather the best promotions on flight tickets from multiple websites. The actual website might not be willing to give you the data straight away. In cases like this, you have to literally scrape the price/discount off of their website.
Luckily, there are some libraries in Python that can do all the heavy lifting (HTTP handshake, parsing, creating deep data structures, etc.). One such library is Beautiful Soup. Let’s see how to scrape a page with it.
Install Beautiful Soup version 4
> pip install beautifulsoup4
Let’s find out the price of the iPhone XS. Go to the Apple website and navigate to the iPhone XS page. Set that page’s URL as a variable.
url = "https://www.apple.com/shop/buy-iphone/iphone-xs"
Beautiful Soup does not actually go out to the web and get the web page. For that, we have to use another Python library called requests. It is a basic HTTP request library that can go out and get content on the web for us. Once we get the actual content of the web page, Beautiful Soup can parse it and present it in a searchable object.
from bs4 import BeautifulSoup
import requests
Get the web page content and give it to Beautiful Soup to parse.
html = requests.get(url).content
soup = BeautifulSoup(html,'html.parser')
Now that we have the content, we have to figure out where exactly the prices are stored. In order to find the tag where the price is stored, just right-click on the web page in the browser and select View Page Source. In the page source, search for the price you are looking for. For example, the current price of the iPhone is 999 dollars, so search the page source for 999.

The prices are displayed using a span tag with the class current_price. Pull out all the tags with a class value of “current_price”. There are multiple ways to do it, but we will just look at one.
soup.select(".current_price")[0]
<span class="current_price">From <b>$549</b></span>
We are just looking at the first match; there are many more prices (based on the options selected).
Beautiful Soup is good enough for low-volume web scraping. For high-volume web scraping (search-engine-level scraping), use Scrapy – a rough sketch follows.
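To give a feel for it, here is a minimal, hypothetical Scrapy spider targeting the same page and the same .current_price selector assumed above; treat it as a sketch rather than a production scraper.
> pip install scrapy
# save as price_spider.py and run with: scrapy runspider price_spider.py
import scrapy

class PriceSpider(scrapy.Spider):
    name = "price_spider"
    start_urls = ["https://www.apple.com/shop/buy-iphone/iphone-xs"]

    def parse(self, response):
        # same ".current_price" assumption as the Beautiful Soup example;
        # the selector may change whenever the page layout changes
        for price in response.css(".current_price"):
            yield {"price": price.css("b::text").get()}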
API
API stands for Application Programming Interface. It is a way to give programmatic access to a resource. For example, your Alexa device automatically (programmatically) goes out and fetches the weather data for a particular zip code from weather.com. How does it do it?
Weather.com provides an API to programmatically fetch weather data. Other examples could be xe.com providing an API for exchange rates, or Bloomberg providing an API for stock tickers, etc.
In this section, let’s use Python to get the weather information for a particular city. In order to avoid abuse and keep track of requests, most of the time an API key is required. You can sign up with OpenWeatherMap (the API used below) and a key will be provided to you. Without that key, the API requests will not be honoured.

APIs are typically exposed as URLs. For example, to get the weather by a city, use the following API.

Let’s use Python to extract the weather for a city in India – say Hyderabad. Don’t forget to append the API key using the parameter appid. See the URL formation below.
import requests
url = "http://api.openweathermap.org/data/2.5/weather?q=Hyderabad&appid="
key = "37a81ae1e682ac******b0a3727080a6"
url = url + key
html = requests.get(url).content
html
b'{"coord":{"lon":78.47,"lat":17.36},"weather":[{"id":803,"main":"Clouds","description":"broken clouds","icon":"04d"}],"base":"stations","main":{"temp":303.29,"pressure":1008,"humidity":66,"temp_min":302.59,"temp_max":304.15},"visibility":6000,"wind":{"speed":5.7,"deg":250},"clouds":{"all":75},"dt":1561704820,"sys":{"type":1,"id":9214,"message":0.0071,"country":"IN","sunrise":1561680862,"sunset":1561728236},"timezone":19800,"id":1269843,"name":"Hyderabad","cod":200}'
Incidentally, the API returns data in a specific format called JSON. JSON stands for JavaScript Object Notation. Once again, Python provides a standard library called json that can parse JSON data for us.
import json
data = json.loads(html)
data
{'coord': {'lon': 78.47, 'lat': 17.36},
'weather': [{'id': 803,
'main': 'Clouds',
'description': 'broken clouds',
'icon': '04d'}],
'base': 'stations',
'main': {'temp': 303.29,
'pressure': 1008,
'humidity': 66,
'temp_min': 302.59,
'temp_max': 304.15},
'visibility': 6000,
'wind': {'speed': 5.7, 'deg': 250},
'clouds': {'all': 75},
'dt': 1561704820,
'sys': {'type': 1,
'id': 9214,
'message': 0.0071,
'country': 'IN',
'sunrise': 1561680862,
'sunset': 1561728236},
'timezone': 19800,
'id': 1269843,
'name': 'Hyderabad',
'cod': 200}
Once you have the data parsed into a Python dictionary, you can just use simple key lookups to extract the data. For example, to get the city, use
data["name"]
'Hyderabad'
To get the minimum and maximum temperature, use
data["main"]["temp_min"]
302.59
data["main"]["temp_max"]
304.15
Just in case you are wondering why the temperature is so large, it is because the unit of temperature is Kelvin.
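If you prefer Celsius, the conversion is just a subtraction of 273.15 (continuing with the data object above):
# Kelvin to Celsius conversion
data["main"]["temp_min"] - 273.15   # roughly 29.44
data["main"]["temp_max"] - 273.15   # roughly 31.0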
Database
Sometimes you might be asked to pull the data straight from a SQL database. To do this, though, you will need to understand the language of SQL databases – SQL, or Structured Query Language. Luckily, we can do all of this straight from Python, or outside of Python.
Typically, you will be given the database details like below.
Server Address : xx.xx.xx.xx
port : 33xx
schema : xxxx
user id : xxxx
password : xxxx
For example, I have installed a simple MySQL database on my local machine. I will be showing how to connect to the database right from inside the Python environment. You can also use any other SQL interaction tool, like SQL Workbench, etc.
To be able to connect to the MySQL server, you need a Python connector. Installing it is pretty simple.
> pip install mysql-connector
import mysql.connector
db = mysql.connector.connect(
    host = "localhost",
    user = "root",
    passwd = "xxxxxxxx"
)
db
Database interactions typically retrieve data using something called a cursor. A cursor is just a pointer to a set of data retrieved from the database. It is up to us to iterate over the retrieved data and get what we want. Typically this is done using a loop. So, this is basically a 2-step process
- Execute an SQL statement and get the result into a cursor
- Iterate over the cursor to get the data
For example, let’s do these 2 steps to list all the databases. Each database is essentially a collection of tables.
Step 1 – Get the list of tables into a cursor
cur = db.cursor()
cur.execute("SHOW DATABASES")
Step 2 – Iterate over the cursor to get the list of databases
# use a different name than db so we don't overwrite the connection object
for database in cur:
    print(database)
('information_schema',)
('mysql',)
('performance_schema',)
('sakila',)
('sys',)
('world',)
Once we know the list of databases, we have to select the database first. Once we do that, we can freely go about executing the select statements on that particular database.
cur.execute("use world")
List all the tables in the database.
cur.execute("show tables")
for table in cur:
    print(table)
('city',)
('country',)
('countrylanguage',)
Let’s pick a table – say country. Now, let’s extract all the columns in that table. They will become the columns of our Pandas dataframe.
cur.execute("show columns from country")
column_names = []
for column in cur:
    column_names.append(column[0])
column_names
['Code',
'Name',
'Continent',
'Region',
'SurfaceArea',
'IndepYear',
'Population',
'LifeExpectancy',
'GNP',
'GNPOld',
'LocalName',
'GovernmentForm',
'HeadOfState',
'Capital',
'Code2']
Once we got the column names, let’s get the actual data from the table.
cur.execute("select * from country")
import pandas as pd
rows = []
for row in cur:
    rows.append(list(row))
country_data = pd.DataFrame(rows)
country_data.head()
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 ABW Aruba North America Caribbean 193.0 NaN 103000 78.4 828.0 793.0 Aruba Nonmetropolitan Territory of The Netherlands Beatrix 129.0 AW
1 AFG Afghanistan Asia Southern and Central Asia 652090.0 1919.0 22720000 45.9 5976.0 NaN Afganistan/Afqanestan Islamic Emirate Mohammad Omar 1.0 AF
2 AGO Angola Africa Central Africa 1246700.0 1975.0 12878000 38.3 6648.0 7984.0 Angola Republic José Eduardo dos Santos 56.0 AO
3 AIA Anguilla North America Caribbean 96.0 NaN 8000 76.1 63.2 NaN Anguilla Dependent Territory of the UK Elisabeth II 62.0 AI
4 ALB Albania Europe Southern Europe 28748.0 1912.0 3401200 71.6 3205.0 2500.0 Shqipëria Republic Rexhep Mejdani 34.0 AL
Great! We just need one last step to finish turning the table into a Pandas dataframe: set the column names that we extracted in a previous step.
country_data.columns = column_names
country_data.head()
Code Name Continent Region SurfaceArea IndepYear Population LifeExpectancy GNP GNPOld LocalName GovernmentForm HeadOfState Capital Code2
0 ABW Aruba North America Caribbean 193.0 NaN 103000 78.4 828.0 793.0 Aruba Nonmetropolitan Territory of The Netherlands Beatrix 129.0 AW
1 AFG Afghanistan Asia Southern and Central Asia 652090.0 1919.0 22720000 45.9 5976.0 NaN Afganistan/Afqanestan Islamic Emirate Mohammad Omar 1.0 AF
2 AGO Angola Africa Central Africa 1246700.0 1975.0 12878000 38.3 6648.0 7984.0 Angola Republic José Eduardo dos Santos 56.0 AO
3 AIA Anguilla North America Caribbean 96.0 NaN 8000 76.1 63.2 NaN Anguilla Dependent Territory of the UK Elisabeth II 62.0 AI
4 ALB Albania Europe Southern Europe 28748.0 1912.0 3401200 71.6 3205.0 2500.0 Shqipëria Republic Rexhep Mejdani 34.0 AL
Data Preprocessing
This is a pretty important step – think of it as standardizing data into a format that is more useful for Machine Learning algorithms. We will discuss the following important steps in data pre-processing: feature scaling, non-linear transformations, encoding categorical features, and imputation of missing values.
Feature Scaling
We have already seen (in the Introduction to Classification) that scaled data performs much better with Machine Learning algorithms than un-scaled data. For example, let’s take a quick sample and plot it before and after scaling to understand what is happening visually.
# Generate 25 points of random data with a mean of 5 and sd of 1
import numpy as np
x_unscaled = np.random.normal(loc=5, scale=1, size=25)
y_unscaled = np.random.normal(loc=5, scale=1, size=25)
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x_unscaled, y_unscaled)

Now, let’s scale it.
from sklearn import preprocessing
x_scaled = preprocessing.scale(x_unscaled)
y_scaled = preprocessing.scale(y_unscaled)
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x_scaled, y_scaled, color="green")
plt.scatter(x_unscaled, y_unscaled, color="red")

Scaled data has a mean of zero and a variance of 1.
x_scaled.mean()
-2.015054789694659e-16
x_scaled.std()
1.0
Scaling does not typically work well if the data
- is sparse
- contains outliers.
In cases like these, sklearn provides special scalers – see the sketch below.
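For example, here is a minimal sketch (with made-up data containing one large outlier) using sklearn’s RobustScaler, which centers on the median and scales by the interquartile range; sklearn also offers MaxAbsScaler, which handles sparse data.
import numpy as np
from sklearn.preprocessing import RobustScaler

# a single feature with one large outlier (made-up data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# median/IQR based scaling keeps the outlier from distorting the other points
x_robust = RobustScaler().fit_transform(x)
x_robust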
Non-Linear Transformations
Sometimes, some of the features are not on a linear scale. One of the most frequently encountered examples is data that spans several orders of magnitude. Here are some simple examples of such data.
- Alexa Page rank or Google Domain Authority
- Income data
- Some Engineering data (for ex., hardness of material) etc
As you know by now (or will learn in the next sections as you study more ML algorithms), most ML algorithms are based on computing distances between points. Exponentially distributed data fits poorly with most ML algorithms, so it is suggested that we transform such data using a logarithmic function. NumPy offers functions for this. For example, look at the following data.
x_unscaled = np.random.normal(loc=5, scale=1, size=25)
y_unscaled = np.exp(np.random.normal(loc=5, scale=1, size=25))
plt.scatter(x_unscaled,y_unscaled)

Just by looking at the data above, you will instantly realize that this data needs to be transformed. If not, the distances on the y-axis are so large (in comparison to the x-axis) that they will dominate the influence of the x-axis. A simple solution in cases like this is some kind of non-linear transformation, like a log transform.
y_scaled = np.log(y_unscaled)
plt.scatter(x_unscaled,y_scaled)

This looks much more balanced, doesn’t it?
There are other types of non-linear transforms like
- Quantile Transforms
- Power Transforms
that we will not be discussing in this section.
Encoding Categorical Features
Categorical features are by definition non-numeric, so they are not ideal for most Machine Learning algorithms.
Most of the time, a categorical feature has only a few distinct values. For example,
- sex
- Male
- Female
- Browser
- Chrome
- Firefox
- Edge
There are two types of encoders sklearn provides for encoding categorical values
- Ordinal Encoder
- One-Hot Encoder
Ordinal Encoder
import pandas as pd
user_id = [1,2,3,4,5,6,7,8,9,10]
sex = ["Male","Male","Female","Male","Female","Female","Male","Female","Female","Female"]
browser = ["Chrome","Chrome","Chrome","Firefox","Edge","Firefox","Edge","Chrome","Firefox","Chrome"]
data_dict = {"user_id": user_id, "sex": sex, "browser": browser}
browser_data = pd.DataFrame(data_dict)
browser_data
user_id sex browser
0 1 Male Chrome
1 2 Male Chrome
2 3 Female Chrome
3 4 Male Firefox
4 5 Female Edge
5 6 Female Firefox
6 7 Male Edge
7 8 Female Chrome
8 9 Female Firefox
9 10 Female Chrome
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder().fit(browser_data)
encoder.transform(browser_data)
array([[0., 1., 0.],
[1., 1., 0.],
[2., 0., 0.],
[3., 1., 2.],
[4., 0., 1.],
[5., 0., 2.],
[6., 1., 1.],
[7., 0., 0.],
[8., 0., 2.],
[9., 0., 0.]])
Basically, OrdinalEncoder encodes categorical data into ordinal data. All it has done in this case is transform the data based on the following simple assignment (which you can verify as shown below).
- Female = 0 , Male = 1
- Chrome = 0 , Edge = 1, Firefox = 2
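Continuing with the fitted encoder from above, you can check this mapping via the categories_ attribute, which lists the categories of each column in the order they were encoded:
encoder.categories_
# expect three arrays: the sorted user ids (1-10), ['Female', 'Male'],
# and ['Chrome', 'Edge', 'Firefox']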
One-Hot Encoder
However, sometimes this kind of data is not suitable for some Machine Learning algorithms – simply because the numbers don’t represent an actual value. Meaning, there is no numeric meaning in the transformation for sex.
Male = 1 does not mean it is in any way greater than Female = 0.
Scikit Learn provides another encoder for categorical data that gets around this problem – the One-Hot Encoder, or dummy encoder.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder().fit(browser_data.iloc[:,1:3])
encoder.transform(browser_data.iloc[:,1:3]).toarray()
array([[0., 1., 1., 0., 0.],
[0., 1., 1., 0., 0.],
[1., 0., 1., 0., 0.],
[0., 1., 0., 0., 1.],
[1., 0., 0., 1., 0.],
[1., 0., 0., 0., 1.],
[0., 1., 0., 1., 0.],
[1., 0., 1., 0., 0.],
[1., 0., 0., 0., 1.],
[1., 0., 1., 0., 0.]])
encoder.categories_
[array(['Female', 'Male'], dtype=object),
array(['Chrome', 'Edge', 'Firefox'], dtype=object)]
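As a side note, if your data already lives in a Pandas dataframe, pandas offers a similar one-hot encoding via get_dummies(). A minimal sketch, reusing the browser_data dataframe from above:
import pandas as pd

# one-hot encode only the categorical columns, leaving user_id untouched
pd.get_dummies(browser_data, columns=["sex", "browser"])
# expect new columns like sex_Female, sex_Male, browser_Chrome, browser_Edge, browser_Firefox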
Missing Values
Missing values are a huge problem in real datasets. Most of the time, this happens with datasets that are manually collected. Sometimes this is a necessary evil when merging data.
# Let's introduce some missing values into the iris dataset.
from sklearn import datasets
import numpy as np
iris = datasets.load_iris()
iris_small = iris.data[0:10,:]
iris_small[iris_small[:,:]>5.2 ] = np.nan
iris_small
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[nan, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1]])

iris_small[~np.isnan(iris_small).any(axis=1)]
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1]])
Dropping unknown values (NaN) is typically a decent strategy when dealing with missing values. However, if the missingness itself is biased, the algorithm’s fit could also be biased. For example, imagine the person collecting the iris data could not find enough flowers of a particular species. In cases like that, the missing values would be very skewed. Dropping NAs could then result in biased data, and the resulting algorithm could also be biased.
An alternative way of dealing with NA’s is called imputation.
Imputation of missing values
Here is the definition of Impute according to the dictionary.
assign (a value) to something by inference from the value of the products or processes to which it contributes.
iris_small
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[nan, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1]])

iris_without_na = iris_small[~np.isnan(iris_small).any(axis=1)]
iris_without_na
# ~np.isnan(iris_small).any(axis=1)
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1]])
iris_without_na.mean(axis=0)
array([4.8 , 3.24444444, 1.42222222, 0.2 ])
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(iris_small)
imputer.transform(iris_small)
iris_new = imputer.transform(iris_small)
iris_new
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[4.8, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1]])
Simple Imputer
mean is just one of the strategies to fill NAs. SimpleImputer supports a couple more strategies that are pretty easy to understand.
- median
- most_frequent
- constant
The trick to choosing the strategy is to understand the type of data we are trying to impute. Here are some scenarios that would call for a specific strategy.
- median – Income data or other data with a long tail (or even large outliers).
- most_frequent – This is essentially imputing with the mode of the feature.
- constant – This is typically used for filling NAs of categorical variables (say Male/Female, plus another category, say “Unknown”) – see the sketch below.
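For instance, here is a minimal sketch (with made-up data) of the constant strategy filling missing values in a categorical feature with an “Unknown” category:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up categorical column with a couple of missing values
sex = pd.DataFrame({"sex": ["Male", "Female", np.nan, "Female", np.nan]})

imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
imputer.fit_transform(sex)
# the two NaNs are replaced with "Unknown"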
Iterative Imputer
There is one more method for imputation that is supported by scikit learn – IterativeImputer. It is based on predicting the values by modeling each of the features based on the rest of the features.

iris_small_na = iris_small
iris_small_na[3,1] = np.nan
iris_small_na[7,2] = np.nan
iris_small_na[1,3] = np.nan
iris_small_na
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, nan],
[4.7, 3.2, 1.3, 0.2],
[4.6, nan, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[nan, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, nan, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1]])
IterativeImputer is still experimental in scikit-learn, so you may need to upgrade to a recent version first.
> pip install --user --upgrade scikit-learn
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(iris_small_na)
iris_small_no_na = imputer.transform(iris_small_na)
iris_small_no_na
array([[5.1 , 3.5 , 1.4 , 0.2 ],
[4.9 , 3. , 1.4 , 0.08307883],
[4.7 , 3.2 , 1.3 , 0.2 ],
[4.6 , 3.097572 , 1.5 , 0.2 ],
[5. , 3.6 , 1.4 , 0.2 ],
[4.84755009, 3.9 , 1.7 , 0.4 ],
[4.6 , 3.4 , 1.4 , 0.3 ],
[5. , 3.4 , 1.45167903, 0.2 ],
[4.4 , 2.9 , 1.4 , 0.2 ],
[4.9 , 3.1 , 1.5 , 0.1 ]])
Data Modeling
Modeling is the second step in the machine learning pipeline. There are many modeling algorithms, and we will be studying most of them in Weeks 2/3/4. However, in this section, we will focus on how to validate models.
Model Evaluation Metrics
The actual metric used to evaluate a model varies based on the class of Machine Learning algorithm used.
- Classification – Confusion Matrix, Area Under Curve (AUC), ROC
- Regression – Root Mean Squared Error (RMSE), Mean Absolute Error (MAE)
Here is a quick summary of these metrics.
Classification | Regression |
---|---|
Confusion Matrix, ROC Curve, Area under ROC curve | RMSE, MAE |
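The classification metrics appear in the hold-out example below; for the regression side, here is a minimal sketch (with made-up y values) of how RMSE and MAE are computed with scikit-learn:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# made-up actual vs predicted values, just to show the metric calls
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # Root Mean Squared Error
mae = mean_absolute_error(y_true, y_pred)           # Mean Absolute Error
print(rmse, mae)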
Model Validation
The evaluation metrics discussed above are run on the test dataset. But that doesn’t necessarily mean that the accuracy metrics apply equally well to real data, or even to the dataset at hand (given just one train/test split). We have to ensure that the model didn’t just memorize the data (overfitting) and isn’t biased (underfitting).
In case you are wondering why we are doing this, it is because almost all the time we only work with a subset of the real dataset. How do we know that our subset is a true representation of the real data ?
This is where validation comes in – validation averages our model’s performance over multiple different subsets of the data we have, so that we don’t just rely on one test. Think of Model Validation as rinse-and-repeat modeling to ensure we get a more realistic result.
Model Evaluation checks for model accuracy on a single test set. Model Validation runs the model evaluation multiple times over different subsets of the data available
There are many techniques to validate models. We will just learn about the following important ones that are widely used in the industry.
- Hold-out
- K-fold cross-validation
- Bootstrapping
Hold-out
Hold-out is the simplest validation method. It is what we used when we learnt the basics of Regression and Classification in the previous chapters.
It is a very simple, one-step process.

from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=100)
model = LogisticRegression(solver="lbfgs",multi_class="auto", max_iter=200 )
model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=200,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
y_predict = model.predict(X_test)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
print ( "confusion matrix = \n" , confusion_matrix(y_test, y_predict) )
print ( "accuracy score = ",accuracy_score(y_test,y_predict) )
confusion matrix =
[[11 0 0]
[ 0 5 1]
[ 0 0 13]]
accuracy score = 0.9666666666666667
Cross-Validation
Think of cross-validation as a repeated version of the hold-out method. In hold-out, we just hold out a percentage of the data (say 20%), train the model on the remaining data, and finally test it on the held-out 20%.
The problem with this approach is that the result depends heavily on that one particular split, which can give an over-optimistic (over-fitted) estimate. To avoid this, we use cross-validation.
In cross-validation, we essentially do the same thing as the hold-out method – except that it is done in multiple iterations. The following picture shows a quick visual of how we would do this on just 10 rows of data. In this specific case, it is a 5-fold validation, because we have divided the entire dataset into 5 folds (2 rows each).
And finally, the test results over each of the iterations are averaged to provide a better estimation of the algorithm’s performance. That way, we get a more realistic performing algorithm (compared to the hold-out method).

Scikit Learn provides readymade implementation of Cross validation.
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
iris = datasets.load_iris()
k_fold = model_selection.KFold(n_splits=10, random_state=100)
model = LogisticRegression(solver="lbfgs",multi_class="auto", max_iter=200 )
scores = model_selection.cross_val_score(model, iris.data, iris.target, cv=k_fold)
scores
array([1. , 1. , 1. , 1. , 0.93333333,
0.86666667, 1. , 0.86666667, 0.86666667, 0.93333333])
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.95 (+/- 0.12)
So, a more realistic estimate of the accuracy is 95% (as opposed to the 96.6% determined by a single train/test split using the hold-out method above). In this case, the data is pretty clean and there is not a lot of difference between the hold-out and cross-validation accuracy. However, in real-world datasets where the variance is large, cross-validation does make a huge difference.
What have we achieved here ?
- Reduced over-fitting
- Reduced model bias
k-fold cross-validation is the preferred cross-validation method. There are other validation methods, sketched briefly after this list, like
- LOOCV (Leave One Out Cross Validation)
- LPOCV (Leave P Out Cross Validation)
- Stratified k-fold (a specific way of creating the folds in k-fold validation)
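For reference, here is a minimal sketch of stratified k-fold and leave-one-out cross-validation on the same iris data; the exact scores will depend on the data and model.
from sklearn.model_selection import StratifiedKFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

iris = datasets.load_iris()
model = LogisticRegression(solver="lbfgs", multi_class="auto", max_iter=200)

# Stratified k-fold keeps the class proportions roughly equal in every fold
stratified = StratifiedKFold(n_splits=5)
print(cross_val_score(model, iris.data, iris.target, cv=stratified).mean())

# LOOCV trains on all rows except one and tests on the single left-out row
loo = LeaveOneOut()
print(cross_val_score(model, iris.data, iris.target, cv=loo).mean())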