Just enough SQL
Contents
SQL
SQL stands for Structured Query Language. It is the language of relational databases. So, before we understand SQL, let’s understand what a relational database is. The next couple of pages introduce databases at a very high level.
What is a Database
A database is software that is used to store data. There are many types of databases (relational, graph, document etc.), but we are going to focus only on relational databases, as they are the most popular database format. Also, SQL is specifically used for relational databases.
Relational databases store data in tabular format. For example, let’s see how an employee’s data could be structured in a relational database.

This is how we humans would like to see data. However, storing large amounts of data this way does not scale; databases store data in a form that is more suitable for fast reads and writes. That is where the concept of data normalization comes in. Normalization is a subject in itself and we will not be discussing it here. However, the process is pretty intuitive overall. Here is how the same data would be structured in a well-designed database.

There are many databases in use today. For example, MySQL, Oracle, Microsoft SQL Server, PostgreSQL etc. In this chapter, we will be focusing on the MySQL database. Here is a quick instruction set on how to install the MySQL database.
By default, the MySQL database comes with very few data tables. MySQL provides a big employee database (with around 300K employee records) that you can download from the web and load into your database. You can use it to practice data extraction. The data above is taken from that employee database.
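To make the normalization idea above concrete, here is a toy sketch in pandas (the column names, department names and keys are hypothetical, not from the employee database): a flat table is split into linked tables, and the human-friendly view is rebuilt with joins.

```python
import pandas as pd

# Hypothetical flat table: the department name repeats on every row
flat = pd.DataFrame({
    "emp_no": [1, 2, 3],
    "first_name": ["Georgi", "Bezalel", "Parto"],
    "dept_name": ["Marketing", "Finance", "Marketing"],
})

# Normalized: each department is stored once and referenced by a key
employees = flat[["emp_no", "first_name"]]
departments = pd.DataFrame({"dept_no": ["d001", "d002"],
                            "dept_name": ["Marketing", "Finance"]})
dept_emp = pd.DataFrame({"emp_no": [1, 2, 3],
                         "dept_no": ["d001", "d002", "d001"]})

# The original view is recovered by joining the tables back together
rebuilt = (employees
           .merge(dept_emp, on="emp_no")
           .merge(departments, on="dept_no"))
print(rebuilt[["emp_no", "first_name", "dept_name"]])
```

Joins like the final `merge` chain are exactly what SQL does for us on the database side.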
What is SQL
SQL is the language of relational databases. It can be used to create, extract, manipulate and delete data in the database. For example, here is a quick SQL to extract the first 10 rows of data from a database table EMPLOYEES.
SELECT * FROM employees LIMIT 10;

The “*” indicates that all columns need to be selected. As you can see, the syntax is pretty intuitive. It is deliberately designed to be like English.
SQL Connectors
As discussed previously, there are a variety of databases. Each of these databases has a specific connector that you have to install. Since we will be working with the MySQL database, we will install the official MySQL connector package.
pip install mysql-connector-python
Typically, you will be given the database details like below.
Server Address : xx.xx.xx.xx
port : 33xx
schema : xxxx
user id : xxxx
password : xxxx
import mysql.connector

# Connect to the local MySQL server
db = mysql.connector.connect(
    host = "localhost",
    user = "ajaytech",
    password = "ajaytech"
)
db
Database query results are typically retrieved using something called a cursor. A cursor is just a pointer to a set of data retrieved from the database. It is up to us to iterate over the retrieved data and get what we want. Typically this is done using a loop. So, this is basically a 2-step process:
- Execute an SQL statement and get the result into a cursor
- Iterate over the cursor to get the data
For example, let’s do these 2 steps to list all the databases. Each database is essentially a collection of tables.
Step 1 – Get the list of tables into a cursor
cur = db.cursor()
cur.execute("SHOW DATABASES")
Step 2 – Iterate over the cursor to get the list of databases
for database in cur:
    print(database)
('information_schema',)
('mysql',)
('performance_schema',)
('sakila',)
('sys',)
('world',)
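The same two-step cursor pattern works with any Python DB-API driver, not just the MySQL connector. Here is a self-contained sketch using the built-in sqlite3 module with a made-up table, so you can try it without a MySQL server:

```python
import sqlite3

# An in-memory SQLite database stands in for MySQL here;
# the cursor pattern is identical for any DB-API driver.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employees (emp_no INTEGER, first_name TEXT)")
db.execute("INSERT INTO employees VALUES (10001, 'Georgi'), (10002, 'Bezalel')")

# Step 1 - execute an SQL statement and get the result into a cursor
cur = db.cursor()
cur.execute("SELECT * FROM employees")

# Step 2 - iterate over the cursor to get the data
for row in cur:
    print(row)
```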
Once we know the list of databases, we have to select one first. Once we do that, we can freely execute select statements against that particular database.
cur.execute("use world")
List all the tables in the database.
cur.execute("show tables")
for table in cur:
    print(table)
('city',)
('country',)
('countrylanguage',)
Let’s pick a table – say country. Now, let’s extract all the columns in that table. They will become the columns of our Pandas dataframe.
cur.execute("show columns from country")
column_names = []
for column in cur:
    column_names.append(column[0])
column_names
['Code',
'Name',
'Continent',
'Region',
'SurfaceArea',
'IndepYear',
'Population',
'LifeExpectancy',
'GNP',
'GNPOld',
'LocalName',
'GovernmentForm',
'HeadOfState',
'Capital',
'Code2']
Now that we have the column names, let’s get the actual data from the table.
cur.execute("select * from country")
import pandas as pd
rows = []
for data in cur:
    rows.append(list(data))
country_data = pd.DataFrame(rows)
country_data.head()
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 ABW Aruba North America Caribbean 193.0 NaN 103000 78.4 828.0 793.0 Aruba Nonmetropolitan Territory of The Netherlands Beatrix 129.0 AW
1 AFG Afghanistan Asia Southern and Central Asia 652090.0 1919.0 22720000 45.9 5976.0 NaN Afganistan/Afqanestan Islamic Emirate Mohammad Omar 1.0 AF
2 AGO Angola Africa Central Africa 1246700.0 1975.0 12878000 38.3 6648.0 7984.0 Angola Republic José Eduardo dos Santos 56.0 AO
3 AIA Anguilla North America Caribbean 96.0 NaN 8000 76.1 63.2 NaN Anguilla Dependent Territory of the UK Elisabeth II 62.0 AI
4 ALB Albania Europe Southern Europe 28748.0 1912.0 3401200 71.6 3205.0 2500.0 Shqipëria Republic Rexhep Mejdani 34.0 AL
Great !!! We just need one last step to finish loading the table into a Pandas dataframe – set the column names that we extracted in a previous step.
country_data.columns = column_names
country_data.head()
Code Name Continent Region SurfaceArea IndepYear Population LifeExpectancy GNP GNPOld LocalName GovernmentForm HeadOfState Capital Code2
0 ABW Aruba North America Caribbean 193.0 NaN 103000 78.4 828.0 793.0 Aruba Nonmetropolitan Territory of The Netherlands Beatrix 129.0 AW
1 AFG Afghanistan Asia Southern and Central Asia 652090.0 1919.0 22720000 45.9 5976.0 NaN Afganistan/Afqanestan Islamic Emirate Mohammad Omar 1.0 AF
2 AGO Angola Africa Central Africa 1246700.0 1975.0 12878000 38.3 6648.0 7984.0 Angola Republic José Eduardo dos Santos 56.0 AO
3 AIA Anguilla North America Caribbean 96.0 NaN 8000 76.1 63.2 NaN Anguilla Dependent Territory of the UK Elisabeth II 62.0 AI
4 ALB Albania Europe Southern Europe 28748.0 1912.0 3401200 71.6 3205.0 2500.0 Shqipëria Republic Rexhep Mejdani 34.0 AL
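As a side note, most DB-API cursors expose the column names of the last SELECT through cur.description, which lets you skip the separate “show columns” query. A standalone sketch with sqlite3 and made-up data:

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE country (Code TEXT, Name TEXT, Population INTEGER)")
con.execute("INSERT INTO country VALUES ('ABW', 'Aruba', 103000),"
            " ('AFG', 'Afghanistan', 22720000)")

cur = con.cursor()
cur.execute("SELECT * FROM country")

# cur.description holds one entry per column of the last SELECT;
# the first element of each entry is the column name
column_names = [col[0] for col in cur.description]
country_data = pd.DataFrame(cur.fetchall(), columns=column_names)
print(country_data)
```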
Instead of going the cursor route, you can also choose an ORM that abstracts away the metadata associated with the database table. A detailed description of ORMs is once again a subject in itself and we will not be discussing it today.
ORM stands for Object Relational Mapper. Instead of focusing purely on SQL, an ORM lets us use an object-oriented paradigm to extract data from a relational database. Just think of an ORM as an OO wrapper around SQL. Here is a quick visual to show the same.

One such ORM is SQLAlchemy. The connection string below uses the pymysql driver, so install that as well.
pip install SQLAlchemy pymysql
import pandas as pd
import mysql.connector
import sqlalchemy as db
engine = db.create_engine('mysql+pymysql://ajaytech:ajaytech@localhost:3306/world')
df = pd.read_sql_table("country",engine)
df.head()
Code Name Continent Region SurfaceArea IndepYear Population LifeExpectancy GNP GNPOld LocalName GovernmentForm HeadOfState Capital Code2
0 ABW Aruba North America Caribbean 193.0 NaN 103000 78.4 828.0 793.0 Aruba Nonmetropolitan Territory of The Netherlands Beatrix 129.0 AW
1 AFG Afghanistan Asia Southern and Central Asia 652090.0 1919.0 22720000 45.9 5976.0 NaN Afganistan/Afqanestan Islamic Emirate Mohammad Omar 1.0 AF
2 AGO Angola Africa Central Africa 1246700.0 1975.0 12878000 38.3 6648.0 7984.0 Angola Republic José Eduardo dos Santos 56.0 AO
3 AIA Anguilla North America Caribbean 96.0 NaN 8000 76.1 63.2 NaN Anguilla Dependent Territory of the UK Elisabeth II 62.0 AI
4 ALB Albania Europe Southern Europe 28748.0 1912.0 3401200 71.6 3205.0 2500.0 Shqipëria Republic Rexhep Mejdani 34.0 AL
If you wanted specific columns, you could specify them using the columns parameter.
df = pd.read_sql_table("country",engine, columns=["Code","Name"])
df.head()
Code Name
0 ABW Aruba
1 AFG Afghanistan
2 AGO Angola
3 AIA Anguilla
4 ALB Albania
If you want to write the SQL yourself, you can use the read_sql_query() method.
df = pd.read_sql_query("select code, name from country",engine)
df.head()
Code Name
0 ABW Aruba
1 AFG Afghanistan
2 AGO Angola
3 AIA Anguilla
4 ALB Albania
Essentially the same result, just written out in SQL.
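read_sql_query works with any connection pandas understands, not just a SQLAlchemy engine. Here is a runnable sketch against an in-memory sqlite3 database with made-up data:

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE country (code TEXT, name TEXT)")
con.executemany("INSERT INTO country VALUES (?, ?)",
                [("ABW", "Aruba"), ("AFG", "Afghanistan")])

# pandas runs the query and builds the dataframe in one step
df = pd.read_sql_query("SELECT code, name FROM country", con)
print(df)
```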
Loading Data
Instead of working on the sample tables provided as part of the MySQL installation (like country, city etc.), let’s load a bigger dataset into the system. MySQL provides one such database right on their website – and loading it is just as easy. Just search Google for “Employee database mysql” or click here. The link to download the actual data is available here. Once there, click on the Clone or Download button and then click on the Download Zip button as shown below.

Unzip the file and navigate into the directory.

In the address bar, type cmd and press Enter; that opens a command prompt in the current directory.

like so..

Once inside, type in
mysql -u ajaytech -p -t < employees.sql
where “ajaytech” is the user name you have chosen while installing MySQL – like so..

You have to use the full path of the mysql command if it is not added to your PATH, as shown above:
"C:\Program Files\MySQL\MySQL Server 8.0\bin\mysql"
Once the data has been loaded, refresh your MySQL Workbench and you should be able to see the database (schema).

You should be able to drill-down to find the tables. For example, here is a snapshot of the EMPLOYEES table.

This one is a pretty decent sized database – 300K odd employees as you can see from the screenshot below.

Selecting Data
For the purpose of extracting data for Data Science & Machine Learning, we will mostly be SELECTing data. So, our focus will mainly be on selecting data (as opposed to UPDATEs, DELETEs, CREATEs or other administrative aspects of the database).
SELECT Statement
This is probably the most used SQL statement in Data Science or otherwise. We have seen an example of how to select data from the country table above. In the following sections, we will focus on the employees database and how to extract data from the tables inside it.
Here is a snapshot of the SELECT statement syntax in its most generic form. As you can see, it is pretty intense.

Luckily, we only have to deal with very few of these options. Let’s start with the basics.
df = pd.read_sql_query("select emp_no, first_name from employees.employees",engine)
df.head()
emp_no first_name
0 10001 Georgi
1 10002 Bezalel
2 10003 Parto
3 10004 Chirstian
4 10005 Kyoichi
As you can see, this one took some time. That is because we have pulled in all the records. A quick look at the shape of the dataframe should make this clear.
df.shape
(300024, 2)
Let’s try to just pull in the first 100 records.
df = pd.read_sql_query("SELECT emp_no, first_name FROM employees.employees LIMIT 100",engine)
df.head()
emp_no first_name
0 10001 Georgi
1 10002 Bezalel
2 10003 Parto
3 10004 Chirstian
4 10005 Kyoichi
This time, the select ran pretty quick, right? That is because we limited the rows retrieved to just 100. You can quickly verify this with the dataframe’s shape.
df.shape
(100, 2)
Just to make the syntax look simple, we have capitalized the SQL keywords in the select statement.
SELECT emp_no, first_name FROM employees.employees LIMIT 100
as opposed to
select emp_no, first_name from employees.employees limit 100
That way you know what the SQL statement is doing by quickly looking for the capitalized keywords. Also, there is a quick way to beautify SQL statements in the MySQL Workbench. Just type in the statement and click on the “Beautify/Format” button.

Look at the way MySQL Workbench has formatted it. It took care of capitalization, indentation and formatting. This form of the SQL statement is more useful when the statement is pretty large.

So far, we have seen 3 keywords:
- SELECT – The main keyword to select data from database tables.
- FROM – used to specify the tables from which the data needs to be extracted.
- LIMIT – used to limit the number of rows to extract.
Here is a quick visual

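The three keywords can be exercised end to end without a MySQL server. In this sqlite3 sketch, 300 made-up employee rows stand in for the real table, and LIMIT trims the result:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (emp_no INTEGER, first_name TEXT)")
con.executemany("INSERT INTO employees VALUES (?, ?)",
                [(10000 + i, "emp_" + str(i)) for i in range(1, 301)])

# LIMIT caps how many rows come back - a cheap way to peek at a big table
rows = con.execute(
    "SELECT emp_no, first_name FROM employees LIMIT 100").fetchall()
print(len(rows))
```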
In case you are wondering why we are qualifying the table with the schema or database every time we do a select, that is because there are multiple schemas/databases in any database system. To use a specific database by default, without having to qualify every table name, use the following command.
USE employees
If you do this in the workbench, the corresponding database is highlighted.

In python, when you create the connection using the ORM, specify the database.
import pandas as pd
import mysql.connector
import sqlalchemy as db
engine = db.create_engine('mysql+pymysql://ajaytech:ajaytech@localhost:3306/employees')
And from that point on, you can just specify the table (without specifying the database).
df = pd.read_sql_query("SELECT emp_no, first_name FROM employees LIMIT 100",engine)
df.head()
emp_no first_name
0 10001 Georgi
1 10002 Bezalel
2 10003 Parto
3 10004 Chirstian
4 10005 Kyoichi
Great! Let’s move on to the next clause – WHERE. Next to FROM, this is probably the most important clause in the SELECT statement.
df = pd.read_sql_query("SELECT emp_no, first_name, birth_date FROM employees WHERE emp_no = 10002",engine)
df.head()
emp_no first_name birth_date
0 10002 Bezalel 1964-06-02
Let’s add some more filter conditions using the WHERE and AND clauses.
df = pd.read_sql_query("SELECT emp_no, first_name, gender \
FROM employees \
WHERE gender = 'M' AND \
emp_no <= 10003",engine)
df.head()
emp_no first_name gender
0 10001 Georgi M
1 10003 Parto M

You can use most of the comparison operators that you are used to from other programming languages:
- Less than (<), Greater than (>)
- Greater than or equal to (>=), Less than or equal to (<=)
- Not equal to (!= or <>)
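These operators can be combined freely in a WHERE clause. A standalone sqlite3 sketch with made-up rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees"
            " (emp_no INTEGER, first_name TEXT, gender TEXT)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(10001, "Georgi", "M"), (10002, "Bezalel", "F"),
                 (10003, "Parto", "M"), (10004, "Chirstian", "M")])

# <> (not equal) combined with <= using AND
rows = con.execute("SELECT emp_no, first_name FROM employees "
                   "WHERE gender <> 'F' AND emp_no <= 10003").fetchall()
print(rows)
```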
BETWEEN clause

df = pd.read_sql_query("SELECT emp_no, first_name, gender \
FROM employees \
WHERE emp_no BETWEEN 10001 AND 10003",engine)
df.head()
emp_no first_name gender
0 10001 Georgi M
1 10002 Bezalel F
2 10003 Parto M
IN Clause
What if you wanted to select specific employee numbers?
df = pd.read_sql_query("SELECT * \
FROM employees \
WHERE emp_no IN (10001, 10003,10004)",engine)
df.head()
emp_no birth_date first_name last_name gender hire_date
0 10001 1953-09-02 Georgi Facello M 1986-06-26
1 10003 1959-12-03 Parto Bamford M 1986-08-28
2 10004 1954-05-01 Chirstian Koblick M 1986-12-01

LIKE clause

Note that the “%” wildcard is doubled (“%%”) here – when raw SQL goes through the pymysql/SQLAlchemy route, a bare “%” is treated as a parameter placeholder marker, so it has to be escaped.
df = pd.read_sql_query("SELECT * FROM employees WHERE first_name LIKE 'G%%' AND emp_no < 10010",engine)
df.head()
emp_no birth_date first_name last_name gender hire_date
0 10001 1953-09-02 Georgi Facello M 1986-06-26
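BETWEEN, IN and LIKE can all be tried in one standalone sqlite3 sketch (made-up rows again):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (emp_no INTEGER, first_name TEXT)")
con.executemany("INSERT INTO employees VALUES (?, ?)",
                [(10001, "Georgi"), (10002, "Bezalel"),
                 (10003, "Parto"), (10004, "Chirstian")])

between = con.execute("SELECT emp_no FROM employees "
                      "WHERE emp_no BETWEEN 10001 AND 10003").fetchall()
in_list = con.execute("SELECT emp_no FROM employees "
                      "WHERE emp_no IN (10001, 10004)").fetchall()
# A single % suffices here - the doubled %% is an escape needed only
# when raw SQL goes through the pymysql/SQLAlchemy route
like = con.execute("SELECT first_name FROM employees "
                   "WHERE first_name LIKE 'G%'").fetchall()
print(between, in_list, like)
```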
ORDER BY clause

df = pd.read_sql_query("SELECT * FROM employees WHERE emp_no < 10005 ORDER BY emp_no DESC",engine)
df.head()
emp_no birth_date first_name last_name gender hire_date
0 10004 1954-05-01 Chirstian Koblick M 1986-12-01
1 10003 1959-12-03 Parto Bamford M 1986-08-28
2 10002 1964-06-02 Bezalel Simmel F 1985-11-21
3 10001 1953-09-02 Georgi Facello M 1986-06-26
AS clause

df = pd.read_sql_query("SELECT emp_no AS employee_number,first_name \
FROM employees WHERE emp_no < 10005 ORDER BY emp_no desc ",engine)
df.head()
employee_number first_name
0 10004 Chirstian
1 10003 Parto
2 10002 Bezalel
3 10001 Georgi
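ORDER BY and AS work together nicely: the alias even becomes the column name the cursor reports. A standalone sqlite3 sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (emp_no INTEGER, first_name TEXT)")
con.executemany("INSERT INTO employees VALUES (?, ?)",
                [(10001, "Georgi"), (10002, "Bezalel"), (10003, "Parto")])

cur = con.execute("SELECT emp_no AS employee_number, first_name "
                  "FROM employees ORDER BY emp_no DESC")
cols = [col[0] for col in cur.description]   # the alias shows up here
rows = cur.fetchall()
print(cols)
print(rows)
```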
Aggregate Functions
MIN function

df = pd.read_sql_query("SELECT emp_no ,MIN(birth_date),first_name \
FROM employees WHERE emp_no < 10005 ORDER BY emp_no desc ",engine)
df.head()
emp_no MIN(birth_date) first_name
0 10001 1953-09-02 Georgi
Note that mixing an aggregate like MIN() with non-aggregated columns (emp_no, first_name) is something MySQL only tolerates in certain sql_mode settings – the values of the non-aggregated columns come from an arbitrary matching row, so use this pattern with care. The other aggregate functions supported by MySQL are
- AVG () – Average
- SUM () – Summation
- COUNT () – Row count
- MAX () – Maximum value
AVG aggregate function

df = pd.read_sql_query("SELECT AVG(salary) FROM salaries \
WHERE emp_no < 10005 AND \
to_date = '9999-01-01'",engine)
df.head()
AVG(salary)
0 69713.25
MAX aggregate function

df = pd.read_sql_query("SELECT MAX(salary) FROM salaries \
WHERE emp_no < 10005 AND \
to_date = '9999-01-01'",engine)
df.head()
MAX(salary)
0 88958
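All of these aggregates can be tried without a MySQL server. The sqlite3 sketch below uses four made-up salary rows, chosen to reproduce the AVG and MAX shown above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE salaries (emp_no INTEGER, salary INTEGER)")
con.executemany("INSERT INTO salaries VALUES (?, ?)",
                [(10001, 88958), (10002, 72527),
                 (10003, 43311), (10004, 74057)])

# All five aggregates in one pass over the table
row = con.execute("SELECT MIN(salary), MAX(salary), AVG(salary), "
                  "SUM(salary), COUNT(*) FROM salaries").fetchone()
print(row)
```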
Table Joins
Suppose you wanted to find out the salary of employee 10001, and print the following information
- emp_no
- first_name, last_name
- salary
How would you do it? The data is available in 2 different tables:
- EMPLOYEES
- SALARIES
You would have to join information from both of these tables to get the final output you need. Here is how you do it.

df = pd.read_sql_query("SELECT first_name, last_name, salary \
FROM employees JOIN salaries \
WHERE employees.emp_no = salaries.emp_no AND \
employees.emp_no < 10005 AND \
to_date = '9999-01-01'",engine)
df.head()
first_name last_name salary
0 Georgi Facello 88958
1 Bezalel Simmel 72527
2 Parto Bamford 43311
3 Chirstian Koblick 74057
Say, you wanted to output the employee number also – Try this.
df = pd.read_sql_query("SELECT emp_no, first_name, last_name, salary \
FROM employees JOIN salaries \
WHERE employees.emp_no = salaries.emp_no AND \
employees.emp_no < 10005 AND \
to_date = '9999-01-01'",engine)
df.head()
---------------------------------------------------------------------------
InternalError: (pymysql.err.InternalError) (1052, "Column 'emp_no' in field list is ambiguous")
SQL is complaining that column 'emp_no' in the field list is ambiguous. That is because both tables (employees, salaries) have a column called emp_no – and MySQL doesn’t understand which one we are referring to. Let’s make this clear for the database.

df = pd.read_sql_query("SELECT employees.emp_no, first_name, last_name, salary \
FROM employees JOIN salaries \
WHERE employees.emp_no = salaries.emp_no AND \
employees.emp_no < 10005 AND \
to_date = '9999-01-01'",engine)
df.head()
emp_no first_name last_name salary
0 10001 Georgi Facello 88958
1 10002 Bezalel Simmel 72527
2 10003 Parto Bamford 43311
3 10004 Chirstian Koblick 74057
The default JOIN used to join tables is an INNER JOIN, and most of the time it is good enough.
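The WHERE-style join above also has an explicit JOIN … ON form, which many people prefer because it separates the join condition from the filters. A standalone sqlite3 sketch with made-up rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees"
            " (emp_no INTEGER, first_name TEXT, last_name TEXT)")
con.execute("CREATE TABLE salaries (emp_no INTEGER, salary INTEGER)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(10001, "Georgi", "Facello"), (10002, "Bezalel", "Simmel")])
con.executemany("INSERT INTO salaries VALUES (?, ?)",
                [(10001, 88958), (10002, 72527)])

# INNER JOIN ... ON keeps the join condition separate from the filters
rows = con.execute(
    "SELECT employees.emp_no, first_name, last_name, salary "
    "FROM employees JOIN salaries ON employees.emp_no = salaries.emp_no "
    "WHERE employees.emp_no < 10005").fetchall()
print(rows)
```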