Hello World in Neural Networks


Contents

Iris Classification using Tensorflow
MNIST handwritten digits classification

The MNIST database (a database of handwritten digits) is typically the Hello World application when introducing Neural Networks for the first time. However, we are going to make it even simpler by taking the iris dataset and creating a Keras based Tensorflow Neural Network to classify species. Please refer to Iris Data to understand more about the dataset we will be working on. You can also refer to Classification in Python for a non-neural-network approach to classifying the species in the iris dataset.

Once you understand how to solve the iris classification problem in Neural Networks, we will move to image recognition. As you will see, structurally there is not a lot of difference in the way we build the neural net for both of these problems.

This is just a “Hello World” tutorial. It is not intended to teach you the internals of Neural Networks. With that background, we are now ready to say hello to Neural Networks using Tensorflow.

import tensorflow as tf
from   tensorflow import keras

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Tensorflow is probably the most popular open source library from Google for implementing Deep Learning. You can build neural networks of any complexity using Tensorflow. However, building a neural net from scratch typically involves

  • defining the layers
  • linking the layers
  • defining the loss function
  • adjusting the weights, etc.

Defining these manually is very time consuming and daunting for newbies. What is needed is an abstract layer above Tensorflow, that makes building neural nets much quicker and easier.

Keras is the answer. Keras is a high level Python based API that can be used to build neural nets by leveraging Tensorflow. By the way, Tensorflow is not the only deep learning package out there. Here is a quick visual that shows you where Keras and Tensorflow stand in the hierarchy.

# load iris dataset
from sklearn import datasets
iris = datasets.load_iris()

# preview the iris data
print ( iris.data[0:5,:]  ) # data
print ( iris.target[0:5]  ) # target species

# train/test split @ 20% test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data , iris.target, test_size=0.2)  
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0 0 0 0 0]

The following 8 lines of code are all you need to solve the problem. Quickly execute it to see the output for yourself. However, there is quite a lot of explanation to be done here. Let’s take it step by step.

model = keras.Sequential()
model.add(keras.layers.Dense(4,input_shape=(4,)))
model.add(keras.layers.Dense(8,activation="relu"))
model.add(keras.layers.Dense(8,activation="relu"))
model.add(keras.layers.Dense(3,activation="softmax"))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=100)

y_pred = model.predict(X_test)
y_pred
Epoch 1/100
120/120 [==============================] - 0s 521us/sample - loss: 1.6050 - acc: 0.2917
Epoch 2/100
120/120 [==============================] - 0s 37us/sample - loss: 1.5040 - acc: 0.2917
... ... 
... ...
Epoch 98/100
120/120 [==============================] - 0s 45us/sample - loss: 0.4054 - acc: 0.8917
Epoch 99/100
120/120 [==============================] - 0s 29us/sample - loss: 0.4006 - acc: 0.8917
Epoch 100/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3962 - acc: 0.8917
array([[1.20811760e-02, 3.91645938e-01, 5.96272826e-01],
       [1.69933531e-02, 3.97484392e-01, 5.85522234e-01],
       [9.59960818e-01, 3.92647162e-02, 7.74465443e-04],
       [1.90705255e-01, 6.64598465e-01, 1.44696265e-01],
       ... ... ... ...
       [9.47779417e-01, 5.09882867e-02, 1.23227667e-03],
       [1.00235706e-02, 3.16601396e-01, 6.73375070e-01]], dtype=float32)

Step 1 – What type of neural network are we building?

model = keras.Sequential()

There are two types of Neural Networks that can be built in Keras

  • Sequential
  • Functional

This classification is related to the structure of the Neural Network. However, most of the time you will be using the Sequential model, which can solve most problems. In a sequential neural net, neurons are arranged in layers and in sequence. The firing and wiring happen in sequence, hence the name. Later in the course, when we see an example of a functional neural net, the difference will be clear. Here is a quick visual of what we are building.

Finally, when the network is trained, the outputs corresponding to the species (for the corresponding data points) will light up. We will understand what is meant by light up when we look at the last step.
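
Going back to the Sequential vs Functional distinction for a moment – here is a minimal sketch (not from the original tutorial) of how the same stack of layers could be written with the Functional API. Instead of add()-ing layers to a Sequential container, each layer is called on the output of the previous one, which is what makes non-sequential wiring possible later.

# A minimal Functional API equivalent of the Sequential model we are building
# (a sketch, just to contrast the two styles - same 4 -> 8 -> 3 stack of layers)
from tensorflow import keras

inputs  = keras.Input(shape=(4,))                              # 4 input features
hidden  = keras.layers.Dense(8, activation="relu")(inputs)     # hidden layer
outputs = keras.layers.Dense(3, activation="softmax")(hidden)  # one probability per species

functional_model = keras.Model(inputs=inputs, outputs=outputs)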

Step 2 – How are the neurons connected?

We are building a Dense neural network.

model.add(keras.layers.Dense(4,input_shape=(4,)))

A Dense neural network is one in which each neuron is connected to every neuron in the previous layer and the next layer.

You can see from the visual below that the arrows coming into each neuron connect it to all the neurons in the previous layer. This is the most common type of neural network.

Also, with this statement, we are just building the input layer – an input layer with 4 nodes, one node for each of the inputs. Naturally, the assumption at this point would be that there are as many nodes in the input layer as there are inputs. So, why specify the input_shape parameter at all? In later examples we will see that the shape of the input data need not always match the number of input nodes. We specify the input_shape parameter as a tuple. In this case the input is a 1-d vector; later in the course we will see examples of 2-d data.

The parameter input_shape is only used when creating the first layer. The next set of steps (hidden layer and output layer) do not need this parameter.
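
As a hedged aside (not part of the original code), input_shape is always a tuple describing the shape of a single sample, which is why it generalizes naturally to higher-dimensional inputs later on.

from tensorflow import keras

# For iris, each sample is a flat vector of 4 numbers, hence the 1-d tuple
keras.layers.Dense(4, input_shape=(4,))
# Later, for 28x28 pixel images, the first layer will declare a 2-d shape instead
keras.layers.Flatten(input_shape=(28, 28))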

Step 3 – Hidden Layers

This is where the magic happens. Let’s try it with just one hidden layer.

model.add(keras.layers.Dense(8,activation="relu"))

Irrespective of the type of layer (input, hidden or output), layers are added with the add function. That should make things easy for us. The new parameter that you see in the hidden layer is the activation parameter.

In this case, we are using a relu activation function. ReLU stands for Rectified Linear Unit. Mathematically, relu is defined as

relu(x) = max(0, x)

that is, the output is 0 for negative inputs and equal to the input for positive inputs. Plotted, the activation looks like a hinge at zero.
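
Here is a small sketch (an aside, not from the original tutorial) that recreates that plot with numpy and matplotlib so you can see the shape for yourself.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 100)
relu = np.maximum(0, x)        # relu(x) = max(0, x)

plt.plot(x, relu)
plt.xlabel("input x")
plt.ylabel("relu(x)")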

Step 4 – Output Layer

After the hidden layer is added, we add the output layer. Since we are doing a multi-class classification, the preferred activation function is softmax – more on this later. A softmax activation function outputs a probability for each class, and the class with the highest probability is the predicted output.

model.add(keras.layers.Dense(3,activation="softmax"))
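
To make the idea concrete, here is a small sketch (not from the original tutorial, using made-up raw scores) of how softmax turns the raw outputs of the last layer into probabilities that sum to 1.

import numpy as np

raw_scores = np.array([2.0, 1.0, 0.1])            # hypothetical raw outputs for the 3 species
probs = np.exp(raw_scores) / np.sum(np.exp(raw_scores))

print(probs)        # approximately [0.66 0.24 0.10] - the first species would be the prediction
print(probs.sum())  # 1.0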

Step 5 – Compile the model

So far, we have created the structure of the neural net – layer by layer. At each step, we have defined the number of nodes and the activation function to be used. Once we have completed it, we now have to compile the model.

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

So far, we have just defined what the neural net should look like. With the compile() method, Keras translates the parameters you have specified into an optimized series of steps that can then be executed on the computer. Without the compile step, you cannot fit (train) the model. We will see how we can use metrics in a bit, but the optimizer and loss parameters require quite a bit of explanation.
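
One detail worth calling out first (an aside, not covered in the original): the loss has a "sparse" prefix because our labels are plain integers (0, 1, 2), which is exactly what sparse_categorical_crossentropy expects. If the labels were one-hot encoded instead, we would use categorical_crossentropy. A quick sketch of the difference:

from tensorflow import keras
import numpy as np

labels = np.array([0, 2, 1])                                 # integer labels - fine for sparse_categorical_crossentropy
one_hot = keras.utils.to_categorical(labels, num_classes=3)  # one-hot labels - would need categorical_crossentropy
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]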

Typically, a Machine Learning algorithm requires some kind of loss function to be minimized; an optimizer such as Gradient Descent (adam, used above, is a variant of it) is then used to do the minimizing. For classification problems, a common loss function is Cross Entropy, also called Log Loss. For 2 classes, the cross entropy loss for a single prediction can be written as

loss = - ( y * log(p) + (1 - y) * log(1 - p) )

where y is the actual class (0 or 1) and p is the predicted probability of class 1.

Let’s look at an example below. Say, we are just looking at 2 species of iris flowers.

  • 0 – setosa
  • 1 – virginica

Suppose the actual species is virginica (y = 1). If the model predicts it with a probability of only 0.2, the loss should be high. The code below calculates the loss for a whole range of predicted probabilities.

import numpy as np

p = np.array([0.0001,0.001,0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99])

# when y = 1, what is the loss function ?
y = 1
l = - (y * np.log10(p) + (1-y) * np.log10(1-p)  )

# now plot it to see how the loss function decreases as the predicted value approaches the actual value (of y = 1)
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(p,l)
plt.xlabel("Different predicted values when the actual value is 1")
plt.ylabel("Loss function")
Text(0, 0.5, 'Loss function')

What this plot shows is that the more the predicted value deviates from the actual value, the larger the loss. Conversely, as the predicted value gets closer to the actual value (1 in this case), the loss gets closer and closer to 0.

At this point, you can see a quick summary of the model you have created so far

model.summary()
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_8 (Dense)              (None, 4)                 20        
_________________________________________________________________
dense_9 (Dense)              (None, 8)                 40        
_________________________________________________________________
dense_10 (Dense)             (None, 3)                 27        
=================================================================
Total params: 87
Trainable params: 87
Non-trainable params: 0
_________________________________________________________________
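
The Param # column is easy to verify by hand (an aside, not in the original): a Dense layer has one weight per input-to-node connection plus one bias per node.

# (inputs * nodes) weights + one bias per node
print(4 * 4 + 4)   # 20 - input layer: 4 inputs feeding 4 nodes
print(4 * 8 + 8)   # 40 - hidden layer: 4 inputs feeding 8 nodes
print(8 * 3 + 3)   # 27 - output layer: 8 inputs feeding 3 nodes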

Step 6 – Fit the model with training data

This is where we train the model on the training dataset. The word epoch represents one complete pass over the training dataset. With each epoch, the weights are adjusted and the accuracy slowly increases. Since we specified accuracy as a metric in step 5, it is shown at each training epoch, so you can see how the accuracy increases with each epoch.

model.fit(X_train, y_train, epochs=100)

Epoch 1/100
120/120 [==============================] - 0s 446us/sample - loss: 1.6451 - acc: 0.2167
Epoch 2/100
120/120 [==============================] - 0s 33us/sample - loss: 1.5703 - acc: 0.2750

...
...
Epoch 98/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3232 - acc: 0.9750
Epoch 99/100
120/120 [==============================] - 0s 33us/sample - loss: 0.3203 - acc: 0.9750
Epoch 100/100
120/120 [==============================] - 0s 58us/sample - loss: 0.3171 - acc: 0.9750

After finishing 100 epochs, the training accuracy is around 97% – not bad for our first attempt. In a moment we will see whether adding more nodes or another hidden layer changes anything.

Step 7 – Predict data

Now that the model is trained, you can start predicting on your test data. This step is pretty straightforward if you have already used sklearn to predict test data with any machine learning model.

y_pred = model.predict(X_test)
y_pred
array([[4.80523752e-03, 1.61355630e-01, 8.33839059e-01],
       [5.65165561e-03, 1.38419360e-01, 8.55928957e-01],
       ... ... ...
       [9.26478028e-01, 6.74546957e-02, 6.06729649e-03],
       [8.82636756e-03, 2.28457645e-01, 7.62715995e-01]], dtype=float32)

Since this is a multi-class problem, what the neural net outputs are probabilities – one per class. The class with the highest probability among the three is the predicted value, so we need to convert these probabilities back to class indices.

y_pred_class = np.argmax(y_pred,axis=1)
y_pred_class
array([2, 2, 0, 1, 2, 0, 2, 0, 1, 0, 2, 0, 0, 2, 1, 2, 1, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 1, 2, 0, 2], dtype=int64)

Step 8 – Evaluate Model

Since the output is categorical data, a quick confusion matrix will show us how well the model is doing. scikit-learn’s confusion_matrix should do. Each row corresponds to the actual class and each column to the predicted class, so the off-diagonal entries are misclassifications.

from sklearn.metrics import confusion_matrix
 
cm = confusion_matrix(y_test, y_pred_class)
print ( cm )

[[15  0  0]
 [ 0  5  1]
 [ 0  0  9]]

And of course the final number – accuracy.

from sklearn.metrics import accuracy_score
 
accuracy_score(y_test,y_pred_class)

0.9666666666666667

That is already in line with what most Machine Learning models achieve on this dataset. Still, let’s see what we can do to optimize it.

Step 9 – Optimize Model

There are a couple of ways to optimize for higher accuracy. One way is to increase the nodes in the hidden layer. Let’s try to increase the number of nodes from 8 to 20 and see how the network performs.

model = keras.Sequential()
model.add(keras.layers.Dense(4,input_shape=(4,)))
# BEGIN change - increase the number of nodes from 8 to 20
model.add(keras.layers.Dense(20,activation="relu"))
# END change
model.add(keras.layers.Dense(3,activation="softmax"))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=100)

y_pred = model.predict(X_test)
y_pred_class = np.argmax(y_pred,axis=1)

from sklearn.metrics import accuracy_score
 
accuracy_score(y_test,y_pred_class)

Epoch 1/100
120/120 [==============================] - 0s 484us/sample - loss: 1.9111 - acc: 0.2917
Epoch 2/100
120/120 [==============================] - 0s 29us/sample - loss: 1.7657 - acc: 0.2917
Epoch 3/100
120/120 [==============================] - 0s 33us/sample - loss: 1.6348 - acc: 0.2917
... ...
... ...
Epoch 98/100
120/120 [==============================] - 0s 29us/sample - loss: 0.3040 - acc: 0.9583
Epoch 99/100
120/120 [==============================] - 0s 33us/sample - loss: 0.2998 - acc: 0.9667
Epoch 100/100
120/120 [==============================] - 0s 29us/sample - loss: 0.2974 - acc: 0.9500
0.9666666666666667

We are still at about 96% accuracy – pretty much what most ML models would achieve. Let’s try to keep the number of nodes the same, but add one more hidden layer.

model = keras.Sequential()
model.add(keras.layers.Dense(4,input_shape=(4,)))
model.add(keras.layers.Dense(8,activation="relu"))
# BEGIN Change - add one more hidden layer
model.add(keras.layers.Dense(8,activation="relu"))
# END Change

model.add(keras.layers.Dense(3,activation="softmax"))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=100)

y_pred = model.predict(X_test)
y_pred_class = np.argmax(y_pred,axis=1)

from sklearn.metrics import accuracy_score
 
accuracy_score(y_test,y_pred_class)
Epoch 1/100
120/120 [==============================] - 0s 574us/sample - loss: 1.9741 - acc: 0.2333
Epoch 2/100
120/120 [==============================] - 0s 33us/sample - loss: 1.8711 - acc: 0.3417
Epoch 3/100
120/120 [==============================] - 0s 33us/sample - loss: 1.7824 - acc: 0.3417
...  ...
...  ...
Epoch 99/100
120/120 [==============================] - 0s 29us/sample - loss: 0.2942 - acc: 0.9667
Epoch 100/100
120/120 [==============================] - 0s 29us/sample - loss: 0.2887 - acc: 0.9583
0.9333333333333333

That’s about 93% accuracy – actually a little lower on this particular run. The immediate question you might have is – how should you choose the number of nodes or the number of hidden layers? Unfortunately, the meaning of the individual weights and outputs is essentially a black box to humans, meaning we cannot make much sense out of it.

Choosing the size and complexity of a neural network (like the number of nodes and the number of hidden layers) is more art than science.
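
Since there is no formula for this, the practical approach is simply to experiment. The sketch below (not part of the original tutorial) reuses the X_train/X_test split from above and loops over a few hidden-layer sizes; the exact accuracies will vary from run to run.

import numpy as np
from tensorflow import keras
from sklearn.metrics import accuracy_score

for nodes in [4, 8, 16, 32]:
    m = keras.Sequential()
    m.add(keras.layers.Dense(4, input_shape=(4,)))
    m.add(keras.layers.Dense(nodes, activation="relu"))
    m.add(keras.layers.Dense(3, activation="softmax"))
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    m.fit(X_train, y_train, epochs=100, verbose=0)   # verbose=0 hides the per-epoch log
    acc = accuracy_score(y_test, np.argmax(m.predict(X_test), axis=1))
    print(nodes, "hidden nodes ->", round(acc, 3))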

MNIST handwritten digits classification

If all we had to do with Neural Networks was classify iris data, we wouldn’t need Neural Networks to start with. We need a more involved dataset to qualify as a “Hello World” program in Neural Networks. Enter the MNIST digits dataset – a dataset of handwritten digit images that are scanned, standardized and optimized for machine learning. Tensorflow comes built-in with this dataset. Let’s quickly load it to see what these images look like.

import tensorflow as tf
from tensorflow import keras

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# load the data
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Show the first picture in the training dataset.
plt.figure()
plt.imshow(train_images[0])

It is a handwritten digit – 5. Although the image appears in color, these are actually gray-scale images. All of the data is standardized into 28×28 pixels, and each pixel has an intensity value between 0 and 255. Since this is a small image (just 28×28 pixels), we can write it to a csv file and look at the numbers directly.

# write the first image to a csv file.
np.savetxt("image_1.csv",train_images[0],delimiter=",")


If you open the csv file in Excel, adjust the column sizes and change the number format to zero decimals, you should see a picture like this. Can you identify the digit 5 in there?

Confirm that the image is in fact a 5.

# Print the first label
train_labels[0]

Prepare the model. The first layer in this case is slightly different from the first layer in the iris example above. As you can see, each input image is a 28×28 numpy array. However, we are going to be working with flat layers (a flat set of neurons in each layer). So, the first layer will essentially be a 784-node layer (28 x 28 = 784) that is created automatically by flattening the input array.
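
Before building the model, a quick sanity check (an aside, not in the original) that flattening a 28×28 image really does give a 784-element vector – exactly what the Flatten layer does internally.

print(train_images[0].shape)              # (28, 28)
print(train_images[0].reshape(-1).shape)  # (784,)
print(28 * 28)                            # 784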

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=6)
test_loss, test_acc = model.evaluate(test_images, test_labels)

print('Test accuracy:', test_acc)
predictions = model.predict(test_images)

The prediction for each image is a vector of 10 probabilities, and the index with the highest probability is the predicted value. For example, let’s check the label of the first image in the test set.

test_labels[0]

It is a 7. Let’s see how the probabilities are predicted.

# save the first ten prediction vectors to a csv file, then look at the first one
np.savetxt("predictions.csv",predictions[0:10],delimiter=",")
predictions[0]

You can see from the array above that the highest probability is at index 7 (numpy’s argmax() returns the index of the largest element in an array).

Let’s apply argmax across the entire predictions array to convert each probability vector into a predicted digit.

predicted_final = np.argmax(predictions,axis=1)
from sklearn.metrics import accuracy_score
 
accuracy_score(test_labels,predicted_final)

That’s a 95% accuracy on the test dataset.

Great!! Our “Hello World” of Neural Networks is complete. Over the next couple of days we will focus on the moving parts of a neural network and how Gradient Descent is used to optimize the weights – this is how the neural network essentially learns.
