Scatter Plot


Scatter plots are used to represent the relation between two variables, with one variable plotted along the x-axis and the other along the y-axis. They help to spot trends and to judge the correlation between the two variables, i.e., how closely a change in one variable goes together with a change in the other. In a scatter plot, a dot or small circle represents a single data point. If the data points lie close to a line or curve, the relation between the two variables is strong. If the data points are widely spread over the plot with no clear pattern, the variables have a weak relation.
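The strength of a linear relation can also be expressed as a number. The sketch below is only an illustration on synthetic data: the variable names and the noise levels are assumptions made for this example, and NumPy's `np.corrcoef` is used to compute the Pearson correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
x = rng.uniform(0, 50, 200)

# same underlying line, different amounts of noise (illustrative values)
y_strong = 2 * x + rng.normal(0, 5, 200)     # points stay close to the line
y_weak = 2 * x + rng.normal(0, 200, 200)     # points are widely spread

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the Pearson coefficient
r_strong = np.corrcoef(x, y_strong)[0, 1]
r_weak = np.corrcoef(x, y_weak)[0, 1]
print(r_strong, r_weak)
```

A coefficient near 1 or -1 corresponds to points that hug a straight line, while a coefficient near 0 corresponds to a widely spread cloud.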

2D Scatter Plot

Correlation

In the example below, a simple scatter plot of two variables x and y is drawn using the ‘scatter’ function defined in the pyplot module. Here x holds 200 random integers and y is computed from x with a quadratic formula, so the dots fall exactly on a curve. It can be seen that the two variables have a positive correlation, which means that if the value of one variable increases, the value of the other variable increases as well.

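%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

# 200 random integers from 0 to 49 (the upper bound 50 is excluded)
x = np.random.randint(0, 50, 200)
# y depends on x through a quadratic formula
y = x**2 + x + 2

# magenta dots with a thin green outline
plt.scatter(x, y, c='magenta', linewidth=1, edgecolor='green')
plt.title('2D Scatter Plot')
plt.xlabel('x-->')
plt.ylabel('y-->')
plt.show()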

Analysing the Iris dataset using Scatter Plots

Visualizing Patterns

In the next example, we will analyse the iris dataset using a scatter plot. The iris dataset contains four measurements in centimeters (sepal length, sepal width, petal length and petal width) for 150 flowers from 3 species of iris: Iris setosa, Iris versicolor and Iris virginica. The dataset contains 50 instances of each species. The data was collected by Edgar Anderson, a botanist, and the statistician Ronald Fisher used it in 1936 to show that the measurements could be used to differentiate between the species. The iris flower dataset is now widely used as a test case in statistics and machine learning. For the scatter plot, we will use petal length and petal width as the two variables.

# read the data file
import csv
with open(r'C:\Users\Ajay Tech\Documents\Matplotlib\iris.csv') as csv_file:
    input_file = csv.reader(csv_file, delimiter=',')
    header = next(input_file)  # skip the row of column names
    sepal_length = []
    sepal_width = []
    petal_length = []
    petal_width = []
    for row in input_file:
        sepal_length.append(float(row[0]))
        sepal_width.append(float(row[1]))
        petal_length.append(float(row[2]))
        petal_width.append(float(row[3]))

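# plot the data: rows 0-49 are setosa, 50-99 versicolor, 100-149 virginica
# (this assumes the file lists the 50 flowers of each species in that order)
plt.scatter(petal_length[:50], petal_width[:50], c='red', label='Iris-setosa')
plt.scatter(petal_length[50:100], petal_width[50:100], c='green', label='Iris-versicolor')
plt.scatter(petal_length[100:], petal_width[100:], c='blue', label='Iris-virginica')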
plt.title('Iris Data')
plt.xlabel('petal_length')
plt.ylabel('petal_width')
plt.legend()
plt.show()

We have used different colours to differentiate between the three species in the plot. The data points fall into three groups, with a slight overlap between Versicolor and Virginica. From the distribution of data points on the plot we can infer that the three species have different petal sizes. Flowers belonging to the Setosa species can be easily distinguished from the other two species, as the petal sizes of Setosa do not overlap with those of the other two. By examining the above scatter plot we also see an overall positive correlation between petal length and petal width across the three species.
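Splitting the rows by position only works if the file is sorted by species. As a hedged alternative, the sketch below groups the rows by a species label instead. It assumes that the file has a fifth column holding the species name, which the code above does not confirm, and it reads a few illustrative rows from an in-memory string rather than from iris.csv.

```python
import csv
import io
from collections import defaultdict

# A few illustrative rows standing in for iris.csv. The column layout
# (four measurements followed by a species name) is an assumption.
sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

reader = csv.reader(sample)
header = next(reader)  # skip the row of column names

# species name -> (list of petal lengths, list of petal widths)
petals = defaultdict(lambda: ([], []))
for row in reader:
    lengths, widths = petals[row[4]]
    lengths.append(float(row[2]))
    widths.append(float(row[3]))

for species, (lengths, widths) in petals.items():
    print(species, lengths, widths)
    # with matplotlib: plt.scatter(lengths, widths, label=species)
```

Grouping by label keeps the plot correct even if the rows are shuffled or the species do not have equal counts.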

Multidimensional Scatter Plot

A scatter plot is a two-dimensional visualization tool, but we can easily add further dimensions to the 2D plot using visual variables such as color, size and shape. For example, if you want to see the correlation between three variables, you can map the third variable to the marker size of each data point in the plot. The marker size then represents an additional third dimension.

We can plot all the data points in the same color by specifying a color name, or we can plot the data points in varying colors. For example, we can change the color intensity of the data points from bright to dark. In this case the color for each data point is retrieved from a color map. A color map, also called a color look-up table, can be thought of as a three-column matrix with one row per color that it defines. Each row of the matrix defines a particular color by specifying three values in the range 0 to 1. These three values define the intensities of the red, green and blue components respectively. Color maps are usually used to highlight the data or apply effects to your plots. The default colormap in Matplotlib is “viridis”.

Sample color map:

Red     Green   Blue    Color
0       0       0       black
1       1       1       white
1       0       0       red
0       1       0       green
0       0       1       blue
1       1       0       yellow
1       0       1       magenta
0       1       1       cyan
1       0.25    0.25    coral red
1       1       0.19    daffodil
0.5     0.5     0.5     grey
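Rows like the ones in the sample color map can be turned into a Matplotlib colormap with `ListedColormap`. The following is a minimal sketch: the name `small_map` and the choice of four rows are assumptions made for illustration.

```python
from matplotlib.colors import ListedColormap

# each row is one colour: red, green and blue intensities between 0 and 1
rgb_rows = [
    [0, 0, 0],   # black
    [1, 0, 0],   # red
    [1, 1, 0],   # yellow
    [1, 1, 1],   # white
]
small_map = ListedColormap(rgb_rows, name='small_map')

print(small_map.N)      # number of colours in the map
print(small_map(0))     # an integer index looks up a row and returns RGBA
print(small_map(1.0))   # a float in [0, 1] is scaled over the rows; 1.0 is the last row
```

The fourth value in each returned tuple is the opacity (alpha), which Matplotlib adds to the three RGB intensities.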

To plot the data points in varying colors we need to map values in our data to colors in a plot. In order to see the range of colors in the colormap and the assignment of numerical values in the third dimension we can use a colorbar. A colorbar displays the current colormap along with numerical rulings so that the color scale can be interpreted.
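The mapping from data values to colors happens in two steps: the values are first scaled to the range 0 to 1, and the scaled values are then looked up in the colormap. A rough sketch of these two steps, using Matplotlib's `Normalize` and `ScalarMappable` on a small hypothetical array `values`, could look like this:

```python
import numpy as np
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

values = np.array([10.0, 25.0, 40.0])   # hypothetical third-dimension data

# step 1: scale the data linearly so the minimum maps to 0 and the maximum to 1
norm = Normalize(vmin=values.min(), vmax=values.max())
print(norm(values))                      # scaled to 0.0, 0.5 and 1.0

# step 2: look the scaled values up in a colormap
mapper = ScalarMappable(norm=norm, cmap='viridis')
rgba = mapper.to_rgba(values)            # one RGBA row per data value
print(rgba.shape)
```

When numeric values are passed to `plt.scatter` through the `c` argument together with `cmap`, the same kind of scaling and lookup is applied, and the colorbar shows the resulting scale.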

4D Scatter Plot

In the example below, we have a scatter plot with the variables x and y as the first two dimensions. The third variable (z) is represented as a third dimension by mapping it to the marker size, so the marker size corresponds to the values in the variable ‘z’. We have also added a fourth dimension to the plot, which is the color of the data points. The colors correspond to the values in the numpy array ‘color’.

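import numpy as np
from matplotlib import pyplot as plt

# four independent sets of 100 random integers from 1 to 99
x = np.random.randint(1, 100, 100)
y = np.random.randint(1, 100, 100)
z = 5 * np.random.randint(1, 100, 100)   # multiplied by 5 to get visible marker sizes
color = np.random.randint(1, 100, 100)

# s sets the marker size, c the value looked up in the 'ocean' colormap
plt.scatter(x, y, s=z, c=color, alpha=0.5, cmap='ocean')
plt.title('4D Scatter Plot')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.colorbar()
plt.show()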

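The above scatter plot does not show any evident relationship between the variables, as the dots are scattered across the entire plot area. This is expected, because x, y, z and color are generated independently at random, so the correlation between any two of them is close to zero.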
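In the 4D example the marker size is simply five times z. The `s` argument of `scatter` is interpreted as the marker area in points squared, so it can help to rescale a variable into a fixed range of areas before plotting. The helper `scale_sizes` below is hypothetical and its default limits are arbitrary choices. It is a sketch of one possible approach, not part of Matplotlib.

```python
import numpy as np

def scale_sizes(values, smallest=20.0, largest=400.0):
    """Linearly rescale values to marker areas between smallest and largest."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # all values are equal: use a mid-sized marker for every point
        return np.full(values.shape, (smallest + largest) / 2)
    return smallest + (values - values.min()) / span * (largest - smallest)

z = np.array([1, 50, 99])
sizes = scale_sizes(z)
print(sizes)
# with matplotlib: plt.scatter(x, y, s=sizes)
```

With a fixed range, the smallest value always stays visible and the largest value does not cover its neighbours, whatever the units of the third variable are.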
