Just enough Linear Algebra
Intro
What is Linear Algebra
When you want to work with huge volumes of data that is similar in nature, Linear Algebra is helpful. Data Science & Machine Learning just happens to be some of the areas where we need to deal with high volumes of homogeneous data. However, these are not the only areas that use Linear Algebra. Digital Signal Processing, Computer Graphics, structural engineering are some of the other areas in engineering that use Linear Algebra very heavily.
Why should you learn Linear Algebra
Everybody starting off with Machine Learning should have atleast a fair idea of Linear Algebra. In Machine Learning, here are some of the areas where Linear Algebra is used. In addition, Linear Albegra is almost exclusively the run-time math engine behind Deep Learning.
For starters, here are some uses cases where Linear Algebra is used in ML.
Image Processing – An Image is represented in the computer using a sequence of numbers. Each of these numbers in the sequence could represent the intensity of the pixel or the color of the pixel. So, just to represent this kind of data you need matrices. And any further processing of this data ( like transformations, summations etc ) require linear algebra.
Linear Regression – Linear regression is an ML technique used to fit (approximate) data points on a linear line/plane. For example, if you want to find out how the median price of a house is dependent on parameters like crime rate, pollution, tax rates etc, simple linear algebra techniques like matrix inverse, matrix transpose and dot product can solve the problem.
crim zn indus chas nox rm age dis rad t ax ptratio b lstat medv
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
2 0.02731 0 7.07 0 0.469 6.421 8.9 4.9671 2 242 17.8 396.90 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7

Language Processing – Sparse matrices are extensively used in language processing to represent word count. With the sample along the x-axis and the word count across the y-axis, even for a small data set the size of the sparse matrix could get as big as 10000 x 10000. However, most of the matrix is just zeros ( and hence called sparse matrix ). So, some of the techniques from linear algebra are used to efficient work with these matrices to avoid space/time complexity that might otherwise result from these huge matrices.
There are many more examples, that we will see later likePrincipal Component Analyzis ( PCA ) for dimensionality reduction of higher dimensional dataSingular Value Decomposition ( SVD ) for processing sparse matrices in recommender systems and NLPOne Hot Encoding for encoding categorical data to numeric data etc , not to mention deep learning of course.
Our approach to learn Linear Algebra
Linear algebra is a very broad subject that has applications across a wide variety of science and engineering areas. It is generally taken as a formal course in atleast 1 semester to cover it generically. However, we will do a shallow learning of just enough Linear Algebra to cover only those topics that are necessary to understand and solve the particular ML problem.
For example, we will not learn about SVD until we come to Naive Bayes or some other Language Processing Algorithms. Similarly, we will lean inverse and transpose of a matrix only when we solve linear regression problems.
For now, we will just be learning the basics of Linear Algebra.
Linear Algebra Basics
We will cover the fundamental building blocks of linear algebra, which are vectors and matrices. Higher dimensional data structures like tensors are also important, but for now, we will do with just vectors and matrices.
Vectors
Before we understand a vector, we have to understand what a scalar is. Let’s take a simple example.
Say a bunch of students are taking home schooling and they are trying to decide which teacher to select. Each teacher has specific skillsets and have proven to boost the performance of students by a certain percentage. Let’s take the simplest case here. A single student “Ajay” gets an average CGPA of 3.0 . And a teacher t1 can boost the performance by 20 % .
What is the expected CGPA of Ajay, after getting coached from teacher 1 ?

That’s right – 3.6 . It is a simple multiplication of 2 numbers. How do you program it ?
student_1 = 3.0
teacher_1 = 1.2
student_1 = student_1 * teacher_1
round( student_1, 2 )
3.6
Say, the student Ajay has 3 subjects – Math, Physics and Chemistry. What will be the individual performance boost to each of the subjects ?

This seems simple enough as well. However, there is a concept called Broadcasting that is happening here. Although the multiplication process seems intuitive enough, we have to understand the way the multiplication is happening here.
Broadcasting
Look at the picture below. Check out how the scalar is actually extended ( broadcasted ) to become a vector.

A vector can also be multiplied by a vector. For example, if the teacher can be ranked based on their specific boost scores across subjects, then we essentially have a vector x vector multiplication.

import numpy as np
student_1 = np.array([3.0,2.5,2.0])
teacher_1 = np.array([1.2,1.1,1.3])
result = student_1 * teacher_1
result
array([3.6 , 2.75, 2.6 ])
Matrix
Matrices extend vectors across another dimension. In our example, say it is not just 1 student, but a bunch of them. How do you represent data now ?


Now, let’s go back to the simpler case of a teacher boosting the performance of any student in any subject by say 20 % . How do you represent it in linear algebra ?

Now, how do you do this multiplication ? Same as before – except the broadcasting happens across the entire matrix.

students = np.array([[3.0,2.5,2.0],
[3.1,2.0,2.5],
[2.5,3.0,3.1]])
teacher_1 = 1.2
results = students * teacher_1
results
array([[3.6 , 3. , 2.4 ],
[3.72, 2.4 , 3. ],
[3. , 3.6 , 3.72]])
Operations on vectors and matrices
While we have been doing just multiplications all along, the same logic applies to additions, substractions and divisions. Moving along, let’s take some more examples.
Say, there are 3 students and the teacher can have separate rating for each subject. In which case, the scenario becomes a matrix x vector multiplication.

students = np.array([[3.0,2.5,2.0],
[3.1,2.0,2.5],
[2.5,3.0,3.1]])
teacher_1 = np.array([1.2,1.1,1.3])
results = students * teacher_1
results
array([[3.6 , 2.75, 2.6 ],
[3.72, 2.2 , 3.25],
[3. , 3.3 , 4.03]])
And finally, we can very well multiply a matrix with a matrix. For example, there are 3 students and 3 teachers each with individual subject rating. How do we calculate the net results ?

students = np.array([[3.0,2.5,2.0],
[3.1,2.0,2.5],
[2.5,3.0,3.1]])
teachers = np.array([[1.5,1.1,1.3],
[1.3,1.1,1.6],
[1.3,1.1,1.2]])
results = students * teacher_1
results
array([[3.6 , 2.75, 2.6 ],
[3.72, 2.2 , 3.25],
[3. , 3.3 , 4.03]])
Dot Product
Dot product is a bit difficult to grasp initially. Let’s drop down to a single dimension to understand the concept better.
Scenario 1 – For example, think about a tennis ball moving at 30 km/hr and you are applying a force that will increase its speed by 4 times. What will be the speed of the ball, after applying the force ?

Simple, you multiply the 2.
speed = 30
mult = 4
net_speed = speed * mult
net_speed
120
If we can vectorize this, we can put the same calculation as
speed = np.array([30,0])
mult = np.array([4 ,0])
net_speed = speed.dot(mult)
net_speed
120
Essentially, we are saying that speed is 30 kmph in the x direction and 0 in the y direction. Similarly, force multiplier is 4 times in the x direction and 0 in the y direction. So, a dot product is essentially multiplication, but we are going to see how it is different.
Projection
Scenario 2 Say the force is being applied on the ball at a 45 o angle. What is the net speed of the ball now ? In order to answer this question, we have to look at projections.What is a Projection
In simple terms,it is quite literally the projection of a line onto another.

So, in this case, a force multiplier of 4 times at a 45 o angle to the x-axis, will result in 4 x cos(45) = 4 x 0.707 = 2.828 in the x-axisDirectional Multiplication
Directional Multiplication
When the ball is hit at an angle, you would want to find out the effect of it in a particular direction. Dot Product is essentially this – Directional Multiplication
a . b = |a| x |b| x cos(θ)
Where
a, b are vectors
|a| represents the magnitude of a
θ is the angle between a and bGemoetrically Speaking
Geometric Interpretation
The same dot product can be defined geometrically using co-ordinates.

In this case, the vectors become
speed = np.array([30,0])
mult = np.array([2.828 ,2.828])
net_speed = speed.dot(mult)
net_speed
84.83999999999999
So, the net speed now becomes 84.83 . Whereas, if the force was in the same direction, the net speed is 120. 120 is the full effect of the force. 84.83 is a partial effect of the force. So, the dot product of 2 vectors gives the directional multiplication of one vector with another.
Scenario 3 What if the ball is moving at an angle – say 30 o ? What would be the net speed based on the new vectors ?

Dot Product of Matrices
We can evaluate it in 2 ways.
- Using cosines
- Using co-ordinates
import math
abs_speed = 30
abs_mult = 4
angle = 15
net_speed = abs_speed * abs_mult * math.cos(math.radians(angle))
net_speed
115.9110991546882
speed = np.array ( [25.98, 15])
mult = np.array ( [2.828, 2.828])
net_speed = speed.dot(mult)
net_speed
115.9110991546882
As you can see, both of them should produce the same results.Dot Products of Matrices
When you evaluate the dot product of multiple vectors together, it is a dot product of matrices. Let’s revisit the students and teachers example to see how a dot product of matrices look like.
What if you want to evaluate the overall boost ( across all subjects ) by teacher ? This is where dot product comes in.

As you can see, the resulting matrix shows the overall improvement across subjects for each student by teacher. Since, we are looking at the overall performance, the “subjects” dimension is gone and we only have 2 dimensions now in the resultant matrix – student and teacher. A better way to represent this is with the following picture.

students = np.array([[3.0,2.5,2.0],
[3.1,2.0,2.5],
[2.5,3.0,3.1]])
teachers = np.array([[1.5,1.3],
[1.1,1.1],
[1.3,1.6]])
results = np.dot(students,teachers)
results
array([[ 9.85, 9.85],
[10.1 , 10.23],
[11.08, 11.51]])
You cannot perform dot product across all different matrices. For example, in this case, if you did not have the performance metrics across all the subjects ( say you just had math and physics for each of the teachers), you would not be able to perform the dot product. Numpy would complain that the shapes of the source and target matrices are not aligned
students = np.array([[3.0,2.5,2.0],
[3.1,2.0,2.5],
[2.5,3.0,3.1]])
teachers = np.array([[1.5,1.3],
[1.1,1.1]])
results = np.dot(students,teachers)
results
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-b77490298670> in <module>
5 [1.1,1.1]])
6
----> 7 results = np.dot(students,teachers)
8 results
ValueError: shapes (3,3) and (2,2) not aligned: 3 (dim 1) != 2 (dim 0)
Dot products are used extensively in machine learning and it makes sense to spend some time understanding the meaning of dot product. There are a couple of ways to interpret dot product in layman terms. Let’s see some examples.
Dot Product in Machine Learning
Imagine the US army, trying to ascertain if an enemy location is in a friendly neighbourhood or a hostile neighbourhood. There would already be a mapping available of the location and the drone identifies the target location sends it back to the HQ. Now, what is the probability that the target is a friendly or a hostile ?
Or, say we are mapping urban and sub-urban areas of a city. Given a new set of co-ordinates, can you identify if the location is urban or sub-urban ?

Finding the dot product between 2 vectors tell us a lot about how similar they are.
Smaller the angle, more the similarity.
Miscelleneous Topics
Linear Equations
Linear equations can be very easily represented using matrices. Let’s see a couple of examples. Continuing with the same students example above, say they apply to the University and each University has their own system of giving weightage to each of the subjects. For example, a certain University gives 30% weightage to math 40% to Physics and 30% to Chemistry. Finally, they give a consolidated CGPA that determines entry to that university.
For example,

From this picture it is obvious that this university is giving weightages as follows
- Math – 25 %
- Phys – 35 %
- Chem – 50 %
What if we didn’t know the weightages, but knew the final cumulative CGPA ? Like so..

Our goal is to now calculate the weightages by subject. In terms of linear equations, it can be put as follows.
How do you solve this equation ? Manually, using elimination. Or using matrices. We are more interested in the matrix model because it is more scalable. Let me show you how.

In terms of matrices, this equation can be written as
A . X = C
where
- A = CGPA matrix
- X = Weightage matrix
- C = Cumulative matrix
X = A -1 . C
Matrix Inverse
If A is a matrix, A -1 is called the inverse of a matrix. Think of matrix inverse , mathematically, as a reciprocal. If you take a number say 7, the reciprocal is 1 divided by that number.
Similarly, in linear algebra, the reciprocal of a matrix is called the inverse of a matrix. Numpy gives you readymade function for this as linalg.inv ( )
A = np.array([[3.0,2.5,2.0],
[3.1,2.0,2.5],
[2.5,3.0,3.1]])
C = np.array([2.425,2.475,2.915])
X = np.linalg.inv(A).dot(C)
X
array([0.25, 0.35, 0.4 ])
Further Topics
There are many other matrix operations like
- Transpose
- Trace
- Norm
- Determinant
- Rank
- Extensions
and many different types of matrices like
- Sparse Matrices
- Square and Symmetric Matrices
- Triangular, diagonal and Orthogonal Matrices
- Identity Matrix
We will also look at a couple of factorization methods on matrices like
- Matrix Decomposition
- Eigen Decomposition
- Singular Value Decomposition ( SVD )
and some other techniques like
- Principal Component Analysis ( PCA )
- Solving Linear Regression using
- Inverse
- QR Decomposition and SVD
Since some of the techniques are used specifically in certain machine learning algorithms, we will learn more about each of these in that algorithm’s specific context.