Just enough SQL

  Machine Learning in Python

SQL

SQL stands for Structured Query Language. It is the language of relational databases. So, before we get to SQL, let’s understand what a relational database is. The next couple of pages introduce databases at a very high level.

What is a Database

A database is software used to store data. There are many types of databases (relational, graph, document etc.), but we are going to focus only on relational databases, as they are the most popular format and SQL is specific to them.

Relational databases store data in tabular format. For example, let’s see how an employee’s data could be structured in a relational database.

This is how we humans would like to see data. However, storing large amounts of data this way does not scale; it needs to be stored in a form that is more suitable for fast reads and writes. That is where the concept of data normalization comes in. Normalization is a subject in itself and we will not be discussing it here, but the process is fairly intuitive overall. Here is how the same data would be structured in a well-designed database.
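
To make the idea concrete, here is a minimal sketch of normalization using Python’s built-in sqlite3 module, so it runs without any database server. The table layout and the names (Sales, Engineering, Asha, Ravi, Mei) are made up for illustration; the point is that the department name is stored once and referenced by id, and the human-friendly view is recovered with a join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Instead of repeating the department name on every employee row,
# store departments once and reference them by id
cur.execute("CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT)")
cur.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER)")

cur.executemany("INSERT INTO departments VALUES (?, ?)",
                [(1, "Sales"), (2, "Engineering")])
cur.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(10, "Asha", 1), (11, "Ravi", 2), (12, "Mei", 2)])

# The human-friendly view is recovered by joining the two tables
rows = cur.execute("""SELECT e.name, d.dept_name
                      FROM employees e JOIN departments d
                        ON e.dept_id = d.dept_id
                      ORDER BY e.emp_id""").fetchall()
print(rows)  # [('Asha', 'Sales'), ('Ravi', 'Engineering'), ('Mei', 'Engineering')]
```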

There are many relational databases in use today, for example MySQL, Oracle, Microsoft SQL Server and PostgreSQL. In this chapter, we will be focusing on the MySQL database. Here is a quick instruction set on How to install MySQL Database.

By default, a MySQL installation comes with very few data tables. However, MySQL provides a big employee database (with 300K employee records) that you can download from the web and load into your database. You can use it to practice data extraction; the data above is taken from that employee database.

What is SQL

SQL is the language of relational databases. It can be used to create, extract, manipulate and delete data in the database. For example, here is a quick SQL statement to extract the first 10 rows of data from a database table EMPLOYEES.

SELECT * FROM employees LIMIT 10;

The “*” indicates that all columns should be selected. As you can see, the syntax is pretty intuitive; it is deliberately designed to read like English.

SQL Connectors

As discussed previously, there are a variety of databases, and each database has a specific connector that you have to install. Since we will be working with a MySQL database, we will install the MySQL connector.

pip install mysql-connector-python

Typically, you will be given the database details like below.

Server Address : xx.xx.xx.xx
port           : 33xx
schema         : xxxx

user id        : xxxx
password       : xxxx

import mysql.connector

db = mysql.connector.connect(
  host     = "localhost",
  user     = "ajaytech",
  password   = "ajaytech"
)

db

Database interactions typically retrieve data using something called a cursor. A cursor is just a pointer to a set of data retrieved from the database. It is up to us to iterate over the retrieved data and get what we want, typically using a loop. So, this is basically a two-step process:

  1. Execute an SQL statement and get the result into a cursor
  2. Iterate over the cursor to get the data
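
The same two-step cursor pattern applies to any Python DB-API driver. Here is a minimal standalone sketch using Python’s built-in sqlite3 module (the demo table is made up for illustration), so it runs without a MySQL server:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (id INTEGER)")
con.executemany("INSERT INTO demo VALUES (?)", [(1,), (2,), (3,)])

# Step 1 - execute an SQL statement and get the result into a cursor
cur = con.cursor()
cur.execute("SELECT id FROM demo ORDER BY id")

# Step 2 - iterate over the cursor to get the data
for row in cur:
    print(row)
```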

For example, let’s do these 2 steps to list all the databases. Each database is essentially a collection of tables.

Step 1 – Get the list of tables into a cursor

cur = db.cursor()

cur.execute("SHOW DATABASES")

Step 2 – Iterate over the cursor to get the list of databases

for database in cur:
  print(database)
('information_schema',)
('mysql',)
('performance_schema',)
('sakila',)
('sys',)
('world',)

Once we know the list of databases, we have to select one first. Once we do that, we can freely execute select statements against that particular database.

cur.execute("use world")

List all the tables in the database.

cur.execute("show tables")

for table in cur:
  print(table)

('city',)
('country',)
('countrylanguage',)

Let’s pick a table – say country. Now, let’s extract all the columns in that table. They will become the columns of our Pandas dataframe.

cur.execute("show columns from country")
column_names = []
for column in cur:
  column_names.append(column[0])

column_names

['Code',
 'Name',
 'Continent',
 'Region',
 'SurfaceArea',
 'IndepYear',
 'Population',
 'LifeExpectancy',
 'GNP',
 'GNPOld',
 'LocalName',
 'GovernmentForm',
 'HeadOfState',
 'Capital',
 'Code2']

Once we have the column names, let’s get the actual data from the table.

cur.execute("select * from country")

import pandas as pd

rows = []
for data in cur:
  rows.append(list(data))

country_data = pd.DataFrame(rows)

country_data.head()
0 	1 	2 	3 	4 	5 	6 	7 	8 	9 	10 	11 	12 	13 	14
0 	ABW 	Aruba 	North America 	Caribbean 	193.0 	NaN 	103000 	78.4 	828.0 	793.0 	Aruba 	Nonmetropolitan Territory of The Netherlands 	Beatrix 	129.0 	AW
1 	AFG 	Afghanistan 	Asia 	Southern and Central Asia 	652090.0 	1919.0 	22720000 	45.9 	5976.0 	NaN 	Afganistan/Afqanestan 	Islamic Emirate 	Mohammad Omar 	1.0 	AF
2 	AGO 	Angola 	Africa 	Central Africa 	1246700.0 	1975.0 	12878000 	38.3 	6648.0 	7984.0 	Angola 	Republic 	José Eduardo dos Santos 	56.0 	AO
3 	AIA 	Anguilla 	North America 	Caribbean 	96.0 	NaN 	8000 	76.1 	63.2 	NaN 	Anguilla 	Dependent Territory of the UK 	Elisabeth II 	62.0 	AI
4 	ALB 	Albania 	Europe 	Southern Europe 	28748.0 	1912.0 	3401200 	71.6 	3205.0 	2500.0 	Shqipëria 	Republic 	Rexhep Mejdani 	34.0 	AL

Great! We just need one last step to finish creating the Pandas dataframe: set the column names that we extracted in a previous step.

country_data.columns = column_names

country_data.head()

Code 	Name 	Continent 	Region 	SurfaceArea 	IndepYear 	Population 	LifeExpectancy 	GNP 	GNPOld 	LocalName 	GovernmentForm 	HeadOfState 	Capital 	Code2
0 	ABW 	Aruba 	North America 	Caribbean 	193.0 	NaN 	103000 	78.4 	828.0 	793.0 	Aruba 	Nonmetropolitan Territory of The Netherlands 	Beatrix 	129.0 	AW
1 	AFG 	Afghanistan 	Asia 	Southern and Central Asia 	652090.0 	1919.0 	22720000 	45.9 	5976.0 	NaN 	Afganistan/Afqanestan 	Islamic Emirate 	Mohammad Omar 	1.0 	AF
2 	AGO 	Angola 	Africa 	Central Africa 	1246700.0 	1975.0 	12878000 	38.3 	6648.0 	7984.0 	Angola 	Republic 	José Eduardo dos Santos 	56.0 	AO
3 	AIA 	Anguilla 	North America 	Caribbean 	96.0 	NaN 	8000 	76.1 	63.2 	NaN 	Anguilla 	Dependent Territory of the UK 	Elisabeth II 	62.0 	AI
4 	ALB 	Albania 	Europe 	Southern Europe 	28748.0 	1912.0 	3401200 	71.6 	3205.0 	2500.0 	Shqipëria 	Republic 	Rexhep Mejdani 	34.0 	AL

Instead of going the cursor route, you can also choose an ORM that abstracts away the metadata associated with the database tables. A detailed description of ORMs is once again a subject in itself and we will not be discussing it today.

ORM stands for Object Relational Mapper. Instead of focusing only on SQL, an ORM lets us use an object-oriented paradigm to extract data from a relational database. Just think of an ORM as an OO wrapper around SQL. Here is a quick visual to show the same.

One such ORM is SQLAlchemy.

pip install SQLAlchemy pymysql

(The connection string below uses the mysql+pymysql dialect, so the PyMySQL driver needs to be installed as well.)

import pandas as pd

import sqlalchemy as db

engine = db.create_engine('mysql+pymysql://ajaytech:ajaytech@localhost:3306/world')
df = pd.read_sql_table("country",engine)
df.head()

Code 	Name 	Continent 	Region 	SurfaceArea 	IndepYear 	Population 	LifeExpectancy 	GNP 	GNPOld 	LocalName 	GovernmentForm 	HeadOfState 	Capital 	Code2
0 	ABW 	Aruba 	North America 	Caribbean 	193.0 	NaN 	103000 	78.4 	828.0 	793.0 	Aruba 	Nonmetropolitan Territory of The Netherlands 	Beatrix 	129.0 	AW
1 	AFG 	Afghanistan 	Asia 	Southern and Central Asia 	652090.0 	1919.0 	22720000 	45.9 	5976.0 	NaN 	Afganistan/Afqanestan 	Islamic Emirate 	Mohammad Omar 	1.0 	AF
2 	AGO 	Angola 	Africa 	Central Africa 	1246700.0 	1975.0 	12878000 	38.3 	6648.0 	7984.0 	Angola 	Republic 	José Eduardo dos Santos 	56.0 	AO
3 	AIA 	Anguilla 	North America 	Caribbean 	96.0 	NaN 	8000 	76.1 	63.2 	NaN 	Anguilla 	Dependent Territory of the UK 	Elisabeth II 	62.0 	AI
4 	ALB 	Albania 	Europe 	Southern Europe 	28748.0 	1912.0 	3401200 	71.6 	3205.0 	2500.0 	Shqipëria 	Republic 	Rexhep Mejdani 	34.0 	AL

If you want specific columns, you can specify them using the columns parameter.

df = pd.read_sql_table("country",engine, columns=["Code","Name"])
df.head()
        Code 	Name
0 	ABW 	Aruba
1 	AFG 	Afghanistan
2 	AGO 	Angola
3 	AIA 	Anguilla
4 	ALB 	Albania

If you want to write the SQL yourself, you can use the read_sql_query() method.

df = pd.read_sql_query("select code, name from country",engine)
df.head()
        Code 	Name
0 	ABW 	Aruba
1 	AFG 	Afghanistan
2 	AGO 	Angola
3 	AIA 	Anguilla
4 	ALB 	Albania

Essentially, the same result.

Loading Data

Instead of working on the sample tables provided as part of the MySQL installation (like country, city etc.), let’s load a bigger dataset into the system. MySQL provides one such database right on their website, and loading it is just as easy. Just search Google for “Employee database mysql” or click here. The link to download the actual data is available here. Once there, click on the Clone or Download button and then click on the Download Zip button as shown below.

Unzip the file and navigate into the directory.

In the address bar, type cmd and that will open a command prompt at the current directory.

like so..

Once inside, type in

mysql -u ajaytech -p -t < employees.sql

where “ajaytech” is the user name you chose while installing MySQL – like so..

You have to use the full path of the mysql command, if it is not added to your PATH, as shown above.

"C:\Program Files\MySQL\MySQL Server 8.0\bin\mysql"

Once the data has been loaded, refresh your MySQL Workbench and you should be able to see the database (schema).

You should be able to drill-down to find the tables. For example, here is a snapshot of the EMPLOYEES table.

This one is a pretty decent sized database, with 300K odd employees as you can see from the screenshot below.

Selecting Data

For the purpose of extracting data for Data Science & Machine Learning, our focus will mainly be on SELECTing data (as opposed to UPDATEs, DELETEs, CREATEs or other administrative aspects of the database).

SELECT Statement

This is probably the most used SQL statement in Data Science or otherwise. We have seen an example of how to select data from the country table above. In the following sections, we will focus on the employees database and how to extract data from the tables inside it.

Here is a snapshot of the SELECT statement syntax in its most generic format. As you can see, it is pretty intense.

Luckily, we only have to deal with very few of these options. Let’s start with the basics.

df = pd.read_sql_query("select emp_no, first_name from employees.employees",engine)
df.head()

        emp_no 	first_name
0 	10001 	Georgi
1 	10002 	Bezalel
2 	10003 	Parto
3 	10004 	Chirstian
4 	10005 	Kyoichi

As you can see, this one took some time. That is because we pulled in all the records. A quick look at the shape of the dataframe makes this clear.

df.shape
(300024, 2)

Let’s try to just pull in the first 100 records.

df = pd.read_sql_query("SELECT emp_no, first_name FROM employees.employees LIMIT 100",engine)
df.head()
        emp_no 	first_name
0 	10001 	Georgi
1 	10002 	Bezalel
2 	10003 	Parto
3 	10004 	Chirstian
4 	10005 	Kyoichi

This time, the select ran pretty quickly, right? That is because we limited the rows retrieved to just 100. You can quickly verify this with the dataframe’s shape.

df.shape

(100, 2)

Just to make the syntax easier to read, we have capitalized the SQL keywords in the select statement.

SELECT emp_no, first_name FROM employees.employees LIMIT 100

as opposed to

select emp_no, first_name from employees.employees limit 100

That way, you can tell what the SQL statement is doing by quickly looking for the capitalized keywords. Also, there is a quick way to beautify SQL statements in MySQL Workbench. Just type in the statement and click on the “Beautify/Format” button.

Look at the way MySQL Workbench has formatted it. It took care of capitalization, indentation and formatting. This form of the SQL statement is more useful when the statement is pretty large.

So far, we have seen three keywords:

  • SELECT – The main keyword to select data from database tables.
  • FROM – used to specify the tables from which the data needs to be extracted.
  • LIMIT – used to limit the number of rows to extract.

Here is a quick visual

In case you are wondering why we specify the schema or database every time we do a select, that is because there are multiple schemas/databases in any database system. To use a specific database by default, without having to specify it every time, use the following command.

USE employees

If you do this in the workbench, the corresponding database is highlighted.

In python, when you create the connection using the ORM, specify the database.

import pandas as pd
import mysql.connector

import sqlalchemy as db

engine = db.create_engine('mysql+pymysql://ajaytech:ajaytech@localhost:3306/employees')

And from that point on, you can just specify the table (without specifying the database).

df = pd.read_sql_query("SELECT emp_no, first_name FROM employees LIMIT 100",engine)
df.head()
emp_no 	first_name
0 	10001 	Georgi
1 	10002 	Bezalel
2 	10003 	Parto
3 	10004 	Chirstian
4 	10005 	Kyoichi

Great! Let’s move on to the next clause – WHERE. Next to FROM, this is probably the most important clause in the SELECT statement.

df = pd.read_sql_query("SELECT emp_no, first_name, birth_date FROM employees WHERE emp_no = 10002",engine)
df.head()

emp_no 	first_name 	birth_date
0 	10002 	Bezalel 	1964-06-02

Let’s add some more filter conditions using the WHERE and AND clauses.

df = pd.read_sql_query("SELECT emp_no, first_name, gender  \
                       FROM employees \
                       WHERE gender = 'M' AND \
                             emp_no <= 10003",engine)
df.head()

emp_no 	first_name 	gender
0 	10001 	Georgi 	M
1 	10003 	Parto 	M

You can use many of the comparison operators you are used to from other programming languages:

  • Less than (<), Greater than (>)
  • Greater than or equal to (>=), Less than or equal to (<=)
  • Not equal to (!= or <>)
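
Here is a quick standalone sketch of these operators, again using sqlite3 with a tiny made-up employees table so it runs without a MySQL server; the SQL itself is the same:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE employees (emp_no INTEGER, first_name TEXT)")
cur.executemany("INSERT INTO employees VALUES (?, ?)",
                [(10001, "Georgi"), (10002, "Bezalel"), (10003, "Parto")])

# != and <> are interchangeable ways of saying "not equal to"
rows = cur.execute("SELECT emp_no FROM employees "
                   "WHERE emp_no <> 10002 ORDER BY emp_no").fetchall()
print(rows)  # [(10001,), (10003,)]
```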

BETWEEN clause

df = pd.read_sql_query("SELECT emp_no, first_name, gender  \
                       FROM employees \
                       WHERE emp_no BETWEEN 10001 AND 10003",engine)
df.head()

        emp_no 	first_name 	gender
0 	10001 	Georgi 	        M
1 	10002 	Bezalel 	F
2 	10003 	Parto 	        M

IN Clause

What if you wanted to select specific employee numbers?

df = pd.read_sql_query("SELECT *  \
                       FROM employees \
                       WHERE emp_no IN (10001, 10003,10004)",engine)
df.head()

emp_no 	birth_date 	first_name 	last_name 	gender 	hire_date
0 	10001 	1953-09-02 	Georgi 	Facello 	M 	1986-06-26
1 	10003 	1959-12-03 	Parto 	Bamford 	M 	1986-08-28
2 	10004 	1954-05-01 	Chirstian 	Koblick 	M 	1986-12-01

LIKE clause

df = pd.read_sql_query("SELECT *  FROM employees WHERE first_name LIKE 'G%%' AND emp_no < 10010 ",engine)
df.head()

emp_no 	birth_date 	first_name 	last_name 	gender 	hire_date
0 	10001 	1953-09-02 	Georgi 	Facello 	M 	1986-06-26

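A note on the query above: the % wildcard is written as %% because the underlying driver treats a single % as a parameter placeholder in the query string. LIKE supports two wildcards: % matches any run of characters and _ matches exactly one character. A standalone sketch with sqlite3 (made-up names, and no escaping needed since no driver parameters are involved):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE employees (first_name TEXT)")
cur.executemany("INSERT INTO employees VALUES (?)",
                [("Georgi",), ("Gita",), ("Parto",)])

# % matches any run of characters; _ matches exactly one character
g_names = cur.execute("SELECT first_name FROM employees "
                      "WHERE first_name LIKE 'G%' ORDER BY first_name").fetchall()
one_char = cur.execute("SELECT first_name FROM employees "
                       "WHERE first_name LIKE 'G_ta'").fetchall()
print(g_names)   # [('Georgi',), ('Gita',)]
print(one_char)  # [('Gita',)]
```
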
ORDER BY clause

df = pd.read_sql_query("select * from employees where emp_no < 10005 order by emp_no desc ",engine)
df.head()

emp_no 	birth_date 	first_name 	last_name 	gender 	hire_date
0 	10004 	1954-05-01 	Chirstian 	Koblick 	M 	1986-12-01
1 	10003 	1959-12-03 	Parto 	Bamford 	M 	1986-08-28
2 	10002 	1964-06-02 	Bezalel 	Simmel 	F 	1985-11-21
3 	10001 	1953-09-02 	Georgi 	Facello 	M 	1986-06-26

AS clause

df = pd.read_sql_query("SELECT  emp_no AS employee_number,first_name \
                        FROM employees WHERE emp_no < 10005 ORDER BY emp_no desc ",engine)
df.head()
    employee_number 	first_name
0 	10004 	Chirstian
1 	10003 	Parto
2 	10002 	Bezalel
3 	10001 	Georgi

Aggregate Functions

MIN function

df = pd.read_sql_query("SELECT  emp_no ,MIN(birth_date),first_name \
                        FROM employees WHERE emp_no < 10005 ORDER BY emp_no desc ",engine)
df.head()

        emp_no 	MIN(birth_date) first_name
0 	10001 	1953-09-02 	Georgi

The other aggregate functions supported by MySQL are

  • AVG () – Average
  • SUM () – Summation
  • COUNT () – Row count
  • MAX () – Maximum value
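
(Incidentally, the MIN query above returns emp_no and first_name alongside the aggregate only because MySQL’s relaxed ONLY_FULL_GROUP_BY mode permits it; strict mode would reject that mix.) The aggregate functions can be sketched standalone with sqlite3 and a tiny made-up salaries table; the salary figures mirror the ones used later in this section:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE salaries (emp_no INTEGER, salary INTEGER)")
cur.executemany("INSERT INTO salaries VALUES (?, ?)",
                [(10001, 88958), (10002, 72527), (10003, 43311)])

# COUNT, SUM and MAX collapse the whole table into a single row
row = cur.execute("SELECT COUNT(*), SUM(salary), MAX(salary) FROM salaries").fetchone()
print(row)  # (3, 204796, 88958)
```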

AVG aggregate function

df = pd.read_sql_query("SELECT  AVG(salary) FROM salaries \
                        WHERE emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()

       AVG(salary)
0 	69713.25

MAX aggregate function

df = pd.read_sql_query("SELECT  MAX(salary) FROM salaries \
                        WHERE emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()
      MAX(salary)
0 	88958

Table Joins

Suppose you wanted to find out employee salaries and print the following information

  • emp_no
  • first_name, last_name
  • salary

How would you do it? The data is available in 2 different tables

  • EMPLOYEES
  • SALARIES

You would have to join information from both of these tables to get the final output you need. Here is how you do it.

df = pd.read_sql_query("SELECT  first_name, last_name, salary \
                        FROM employees JOIN salaries \
                        WHERE employees.emp_no = salaries.emp_no AND \
                              employees.emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()

first_name 	last_name 	salary
0 	Georgi 	Facello 	88958
1 	Bezalel 	Simmel 	72527
2 	Parto 	Bamford 	43311
3 	Chirstian 	Koblick 	74057

Say, you wanted to output the employee number also – Try this.

df = pd.read_sql_query("SELECT  emp_no, first_name, last_name, salary \
                        FROM employees JOIN salaries \
                        WHERE employees.emp_no = salaries.emp_no AND \
                              employees.emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()

---------------------------------------------------------------------------
InternalError: (pymysql.err.InternalError) (1052, "Column 'emp_no' in field list is ambiguous")

SQL is complaining that column ’emp_no’ in the field list is ambiguous. That is because both tables (employees, salaries) have a column called emp_no, and MySQL doesn’t know which one we are referring to. Let’s make this clear for the database.

df = pd.read_sql_query("SELECT  employees.emp_no, first_name, last_name, salary \
                        FROM employees JOIN salaries \
                        WHERE employees.emp_no = salaries.emp_no AND \
                              employees.emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()

emp_no 	first_name 	last_name 	salary
0 	10001 	Georgi 	Facello 	88958
1 	10002 	Bezalel 	Simmel 	72527
2 	10003 	Parto 	Bamford 	43311
3 	10004 	Chirstian 	Koblick 	74057

Most of the time, regular JOINs are good enough. The default JOIN used to join tables is an INNER JOIN.
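
The queries above put the join condition in the WHERE clause; the more common modern style is an explicit ON clause, which produces the same inner join. A standalone sketch with sqlite3, using a couple of made-up rows shaped like the employees and salaries tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE employees (emp_no INTEGER, first_name TEXT, last_name TEXT)")
cur.execute("CREATE TABLE salaries (emp_no INTEGER, salary INTEGER)")
cur.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(10001, "Georgi", "Facello"), (10002, "Bezalel", "Simmel")])
cur.executemany("INSERT INTO salaries VALUES (?, ?)",
                [(10001, 88958), (10002, 72527)])

# JOIN ... ON keeps the join condition separate from the row filters
rows = cur.execute("""SELECT e.emp_no, e.first_name, s.salary
                      FROM employees e
                      JOIN salaries s ON e.emp_no = s.emp_no
                      WHERE e.emp_no < 10005
                      ORDER BY e.emp_no""").fetchall()
print(rows)  # [(10001, 'Georgi', 88958), (10002, 'Bezalel', 72527)]
```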