Chapter 23. Databases and SQL

Memory is man’s greatest friend and worst enemy.

Gilbert Parker

The data you need will often live in databases, systems designed for efficiently storing and querying data. The bulk of these are relational databases, such as Oracle, MySQL, and SQL Server, which store data in tables and are typically queried using Structured Query Language (SQL), a declarative language for manipulating data.

SQL is a pretty essential part of the data scientist’s toolkit. In this chapter, we’ll create NotQuiteABase, a Python implementation of something that’s not quite a database. We’ll also cover the basics of SQL while showing how they work in our not-quite database, which is the most “from scratch” way I could think of to help you understand what they’re doing. My hope is that solving problems in NotQuiteABase will give you a good sense of how you might solve the same problems using SQL.


A relational database is a collection of tables (and of relationships among them). A table is simply a collection of rows, not unlike the matrices we’ve been working with. However, a table also has associated with it a fixed schema consisting of column names and column types.

For example, imagine a users data set containing for each user her user_id, name, and num_friends:

users = [[0, "Hero", 0],
         [1, "Dunn", 2],
         [2, "Sue", 3],
         [3, "Chi", 3]]

In SQL, we might create this table with:

    user_id INT NOT NULL,
    name VARCHAR(200),
    num_friends ...

Get Data Science from Scratch now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.