SQL & Optional Data

SQL includes null, which is used to indicate no value. The problem is that it doesn’t explain why there is no value. Here we look at adding more meaning to null to allow for optional data.

Here are some possibile interpretions of null:

The value is unknown
The value is not important
There is no value yet

To test for null, you cannot use something=null. Instead, you must use something is null. This reflects an interpretation of null as being unknown.

An Example

For example, suppose you have a table of authors, and you choose to record the date of birth and the date of death.

CREATE TABLE authors (
    id INT PRIMARY KEY,
    name VARCHAR(40),
    born varchar,
    died varchar
);

(The born and died values are varchar here to simplify the example; in reality, they should be date types or similar).

What does it mean for the date of death to be null? What about the date of birth?

For the date of birth, it is easy to interpret a null as unknown.

For the date of death, however, there are two possible interpretations:

The author hasn’t died (yet)
The author has died, but the actual data is unknown (or irrelevant)

SQL does not, of itself, accommodate alternative interpretations of null, so the reason remains a mystery.

One-Maybe Relationship

(This is my own term; many call it a one-to-zero-or-one, which is not quite as cool).

One-Many Relationship

Normally, tables are related using a Foreign Key to Primary Key relationship, from a child to parent table:

CREATE TABLE child (
    id INT PRIMARY KEY,
    parentid INT REFERENCES parent(id)
    --  etc
);

Since many child rows can reference the same parent row, we say that there is a one to many relationship between the parent and child.

One-Maybe Relationship

A one to one relationship is that of a single row in one table to a single row to another. Generally, it is between their primary keys.

Very few true one to one relationships exist, as that would involve creating both rows at the same time. In reality, One table is the main table, while the other is a secondary table, providing additional, possibly optional, columns.

You can create this relationship without changing your main table by creating a secondary table like this:

CREATE TABLE secondary (
    id INT PRIMARY KEY REFERENCES main(id)
    --  etc
);

Note that the id is both a primary key and a foreign key. This is the key, so to speak, of this type of relationship.

If you are tempted to auto-increment your primary key, don’t. The whole point is that it matches one of rows of the main table. The primary key nature ensures that it doesn’t happen twice.

The Example Again

Here is a possible authors table, without a date of death:

CREATE TABLE authors (
    id INT PRIMARY KEY,
    name VARCHAR(40),
    born varchar
);
INSERT INTO authors(id,name,born) VALUES
    (1,'Fred','now'),
    (2,'Wilma','then'),
    (3,'Betty','whenever'),
    (4,'Barney','sometime');

Here is a possible secondary table which will record only deaths, if any:

CREATE TABLE deaths (
    id INT PRIMARY KEY REFERENCES authors(id),
    died varchar
);
INSERT INTO deaths(id,died) VALUES
    (1,null),
    (3,'yesterday'),
    (4,'anotherday');

Note:

id 1 has died, but the date is unknown
id 2 is missing, and can be presumed living

Fetching the data

To fetch the full data, we will need a JOIN, and since there may be missing rows in the secondary table, we will need an OUTER JOIN; in this case it will be a LEFT JOIN.

SELECT *
FROM authors a LEFT JOIN deaths d ON a.id=d.id;

You will get something like this:

a.id	a.name	a.born	d.id	d.died
1	Freddy	now	1	NULL
2	Wilma	then	NULL	NULL
3	Betty	whenever	3	yesterday
4	Barney	sometime	4	anotherday

Missing vs Empty

The problem is with the above result set is the d.died column. For id 1, the d.died column is null because the date is unknown. For id 2, it is null because there is no entry in the deaths table.

How do we distinguish between them? The solution is in the d.id column. Because the row is missing, it is null for id 2.

Knowing this, we can implement the following logic:

if d.id is null, then the row is missing, and the author is living.
else the author has died:
- if d.died is not null, we can use the date
- else the date is unknown

In SQL, this can be expressed with the CASE operator:

CASE
    WHEN d.id IS NULL THEN
        'living'
    ELSE WHEN d.died is NOT NULL THEN
        d.died
    ELSE
        'unknown'
END

If a column may or may not be null, you can use the coalesce function to provide an alternative:

coalesce(d.died,'unknown')

This means we will use the d.died value, if any, or unknown if it is null. This is a neater version of the two ELSE clauses above

CASE
    WHEN d.id IS NULL THEN
        'living'
    ELSE coalesce(d.died,'unknown')
END

The Finished Statement

Combining the above, we can write:

SELECT
    a.id,a.name,a.born,
    CASE WHEN d.id IS NULL THEN 'living'
        ELSE coalesce(d.died,'unknown') END AS died
FROM authors a LEFT JOIN deaths d ON a.id=d.id;

This gives:

id	name	born	died
1	Freddy	now	unknown
2	Wilma	then	living
3	Betty	whenever	yesterday
4	Barney	sometime	anotherday