Comparing Db2 and MySQL: NULL in Indexes

Posted by

I’ve been spending some time working with indexes as I learn MySQL, and there were a few things I thought were were worth mentioning. As always, I’m writing from my own knowledge with deep experience in IBM DB2, and as a learner when it comes to MySQL.

NULL in Unique Indexes

MySQL handles NULL in a unique index very differently than Db2. In Db2 (on LUW), NULL is somewhat treated as a value and a unique index can have only one NULL value. In MySQL a unique index can have any number of NULL values and they are still considered unique. A quick and simple example to show this. In Db2:

$ db2 "create table t1 ( c1 int, c2 char(10))"
DB20000I  The SQL command completed successfully.
$ db2 "create unique index idx_t1_c1 on t1 (c1)"
DB20000I  The SQL command completed successfully.
$ db2 "insert into t1 values (1,'text1')"
DB20000I  The SQL command completed successfully.
$ db2 "insert into t1 values (1,'text1')"
DB21034E  The command was processed as an SQL statement because it was not a
valid Command Line Processor command.  During SQL processing it returned:
SQL0803N  One or more values in the INSERT statement, UPDATE statement, or
foreign key update caused by a DELETE statement are not valid because the
primary key, unique constraint or unique index identified by "1" constrains
table "DB2INST1.T1" from having duplicate values for the index key.
SQLSTATE=23505
$ db2 "insert into t1 values (null,'text_null')"
DB20000I  The SQL command completed successfully.
$ db2 "insert into t1 values (null,'text_null')"
DB21034E  The command was processed as an SQL statement because it was not a
valid Command Line Processor command.  During SQL processing it returned:
SQL0803N  One or more values in the INSERT statement, UPDATE statement, or
foreign key update caused by a DELETE statement are not valid because the
primary key, unique constraint or unique index identified by "1" constrains
table "DB2INST1.T1" from having duplicate values for the index key.
SQLSTATE=23505

The same actions in MYSQL (8):

mysql> create table t1 ( c1 int, c2 char(10));
Query OK, 0 rows affected (0.03 sec)

mysql> create unique index idx_t1_c1 on t1 (c1);
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> insert into t1 values (1,'text1');
Query OK, 1 row affected (0.00 sec)

mysql> insert into t1 values (1,'text1');
ERROR 1062 (23000): Duplicate entry '1' for key 't1.idx_t1_c1'
mysql> insert into t1 values (null,'text_null');
Query OK, 1 row affected (0.00 sec)

mysql> insert into t1 values (null,'text_null');
Query OK, 1 row affected (0.00 sec)

In these examples, we insert a row, and then try to insert the same row, and get an error for an attempt to insert a duplicate value on both platforms. We then try to insert a row with a NULL in the column the unique index is on. Db2 allows the first insertion, and then the second insertion fails because the unique index already has a NULL in it. MySQL allows the first insertion AND allows the second insertion.

This is confusing behavior the first time you see it, no matter which side you are used to.

My understanding, without thorough research, is that Oracle and PostgreSQL also take the MySQL approach, while MS SQL Server takes the Db2 approach.

Indexing NULLs

MySQL does index NULLs. Db2 Indexes them by default, but NULLs can be excluded on most table/index types by using the EXCLUDE NULLS syntax. Indexing NULLs is generally good for performance, as it means indexes in MySQL can be used to efficiently find NULLs. The performance of IS NOT NULL in MySQL suffers from the same performance problems that plague many NOT conditions in MySQL. While an index can be efficiently used to find a value or even a range of values, it usually means resorting to a leaf page scan or table scan to look for the absence of a value. Db2 does not generally suffer from this by rewriting a lot of predicates.

MySQL

Here’s an example of that in MySQL. I am using the sakila database. I’ve created a dummy table using the rental table as a starting point, increasing the amount of data, maintaining a balance of 50% null values for the return_date column.

mysql> create table t2 like rental;
Query OK, 0 rows affected (0.12 sec)

mysql> insert into t2 select * from rental where return_date is not null limit 183;
Query OK, 183 rows affected (0.03 sec)
Records: 183  Duplicates: 0  Warnings: 0

mysql> insert into t2 select * from rental where return_date is null;
Query OK, 183 rows affected (0.12 sec)
Records: 183  Duplicates: 0  Warnings: 0

mysql> insert into t2 (rental_date, inventory_id, customer_id, return_date, staff_id) select date_add(rental_date, interval 1 day), inventory_id, customer_id, return_date, staff_id from t2;
Query OK, 366 rows affected (0.03 sec)
Records: 366  Duplicates: 0  Warnings: 0

mysql> insert into t2 (rental_date, inventory_id, customer_id, return_date, staff_id) select date_add(rental_date, interval 2 day), inventory_id, customer_id, return_date, staff_id from t2;
Query OK, 732 rows affected (0.06 sec)
Records: 732  Duplicates: 0  Warnings: 0

mysql> insert into t2 (rental_date, inventory_id, customer_id, return_date, staff_id) select date_add(rental_date, interval 4 day), inventory_id, customer_id, return_date, staff_id from t2;
Query OK, 1464 rows affected (0.06 sec)
Records: 1464  Duplicates: 0  Warnings: 0

mysql> insert into t2 (rental_date, inventory_id, customer_id, return_date, staff_id) select date_add(rental_date, interval 8 day), inventory_id, customer_id, return_date, staff_id from t2;
Query OK, 2928 rows affected (0.16 sec)
Records: 2928  Duplicates: 0  Warnings: 0

mysql> insert into t2 (rental_date, inventory_id, customer_id, return_date, staff_id) select date_add(rental_date, interval 16 day), inventory_id, customer_id, return_date, staff_id from t2;
Query OK, 5856 rows affected (0.21 sec)
Records: 5856  Duplicates: 0  Warnings: 0

mysql> insert into t2 (rental_date, inventory_id, customer_id, return_date, staff_id) select date_add(rental_date, interval 32 day), inventory_id, customer_id, return_date, staff_id from t2;
Query OK, 11712 rows affected (0.29 sec)
Records: 11712  Duplicates: 0  Warnings: 0

mysql> insert into t2 (rental_date, inventory_id, customer_id, return_date, staff_id) select date_add(rental_date, interval 64 day), inventory_id, customer_id, return_date, staff_id from t2;
Query OK, 23424 rows affected (0.57 sec)
Records: 23424  Duplicates: 0  Warnings: 0

mysql> create index test_idx on t2 (return_date);
Query OK, 0 rows affected (0.16 sec)
Records: 0  Duplicates: 0  Warnings: 0

The significant amount of data is to deal with the tendency to always pick a table scan when tables are really small. When I then run a query based on the return_date column, you can see the different expected access plans for null vs not null:

mysql> explain select * from t2 where return_date is null;
+----+-------------+-------+------------+------+---------------+----------+---------+-------+-------+----------+-----------------------+
| id | select_type | table | partitions | type | possible_keys | key      | key_len | ref   | rows  | filtered | Extra                 |
+----+-------------+-------+------------+------+---------------+----------+---------+-------+-------+----------+-----------------------+
|  1 | SIMPLE      | t2    | NULL       | ref  | test_idx      | test_idx | 6       | const | 23495 |   100.00 | Using index condition |
+----+-------------+-------+------------+------+---------------+----------+---------+-------+-------+----------+-----------------------+
1 row in set, 1 warning (0.00 sec)

mysql> explain select * from t2 where return_date is not null;
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | t2    | NULL       | ALL  | test_idx      | NULL | NULL    | NULL | 46990 |    50.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

The explain on the IS NULL condition is using the the test_idx index that I created to get the data. The IS NOT NULL condition is doing a full table scan.

Db2

Here’s the same experiment in Db2. I’ve used essentially the sakila database, with a few tweaks. Note that I can’t really use the syntax to create a table like RENTAL in Db2 because when I do, not only are the indexes not created (a side effect of the fact that Db2 index names must be unique per database, while MySQL must only be unique per table), but Db2 also does not maintain the identity (auto-increment) for the primary key.

$ db2 -tvf t2_create.ddl
CREATE TABLE "DB2INST1"."T2"  ( "RENTAL_ID" INTEGER generated by default as identity NOT NULL primary key, "RENTAL_DATE" DATE NOT NULL , "INVENTORY_ID" INTEGER NOT NULL , "CUSTOMER_ID" INTEGER NOT NULL , "RETURN_DATE" DATE WITH DEFAULT NULL , "STAFF_ID" SMALLINT NOT NULL , "LAST_UPDATE" DATE NOT NULL WITH DEFAULT CURRENT DATE ) IN "USERSPACE1" ORGANIZE BY ROW
DB20000I  The SQL command completed successfully.

CREATE INDEX "DB2INST1"."T2_IDX_RENTAL_FK_INVENTORY_ID" ON "DB2INST1"."T2" ("INVENTORY_ID" ASC) COMPRESS NO INCLUDE NULL KEYS ALLOW REVERSE SCANS
DB20000I  The SQL command completed successfully.

CREATE INDEX "DB2INST1"."T2_IDX_RENTAL_FK_CUSTOMER_ID" ON "DB2INST1"."T2" ("CUSTOMER_ID" ASC) COMPRESS NO INCLUDE NULL KEYS ALLOW REVERSE SCANS
DB20000I  The SQL command completed successfully.

CREATE INDEX "DB2INST1"."T2_IDX_RENTAL_FK_STAFF_ID" ON "DB2INST1"."T2" ("STAFF_ID" ASC) COMPRESS NO INCLUDE NULL KEYS ALLOW REVERSE SCANS
DB20000I  The SQL command completed successfully.

CREATE UNIQUE INDEX "DB2INST1"."T2_IDX_RENTAL_UQ" ON "DB2INST1"."T2" ("RENTAL_DATE" ASC, "INVENTORY_ID" ASC, "CUSTOMER_ID" ASC) COMPRESS NO INCLUDE NULL KEYS ALLOW REVERSE SCANS
DB20000I  The SQL command completed successfully.

$ db2 "insert into t2 select * from rental where return_date is null"
DB20000I  The SQL command completed successfully.
$ db2 "insert into t2 select * from rental where return_date is not null fetch first 183 rows only"
DB20000I  The SQL command completed successfully.
$ db2 "insert into t2 (rental_date, inventory_id, customer_id, return_date, staff_id) select rental_date +1 day, inventory_id, customer_id, return_date, staff_id from t2"
DB20000I  The SQL command completed successfully.
$ db2 "insert into t2 (rental_date, inventory_id, customer_id, return_date, staff_id) select rental_date +2 day, inventory_id, customer_id, return_date, staff_id from t2"
DB20000I  The SQL command completed successfully.
$ db2 "insert into t2 (rental_date, inventory_id, customer_id, return_date, staff_id) select rental_date +4 day, inventory_id, customer_id, return_date, staff_id from t2"
DB20000I  The SQL command completed successfully.
$ db2 "insert into t2 (rental_date, inventory_id, customer_id, return_date, staff_id) select rental_date +8 day, inventory_id, customer_id, return_date, staff_id from t2"
DB20000I  The SQL command completed successfully.
$ db2 "insert into t2 (rental_date, inventory_id, customer_id, return_date, staff_id) select rental_date +16 day, inventory_id, customer_id, return_date, staff_id from t2"
DB20000I  The SQL command completed successfully.
$ db2 "insert into t2 (rental_date, inventory_id, customer_id, return_date, staff_id) select rental_date +32 day, inventory_id, customer_id, return_date, staff_id from t2"
DB20000I  The SQL command completed successfully.
$ db2 "insert into t2 (rental_date, inventory_id, customer_id, return_date, staff_id) select rental_date +64 day, inventory_id, customer_id, return_date, staff_id from t2"
DB20000I  The SQL command completed successfully.

$ db2 "create index test_idx on t2 (return_date) allow reverse scans"
DB20000I  The SQL command completed successfully.
$ db2 "runstats on table db2inst1.t2 with distribution and detailed indexes all"
DB20000I  The RUNSTATS command completed successfully.

The explains then look something like this. Note that I’m using an unusual explain method here that will make the output look a tad more like MS SQL Server and more compact to fit in the screen. It’s just a view on the explain tables freely available at the link above.

$ db2 set current explain mode explain
DB20000I  The SQL command completed successfully.
$ db2 "select * from t2 where return_date is null"
SQL0217W  The statement was not executed as only Explain information requests
are being processed.  SQLSTATE=01604
$ db2 set current explain mode no
DB20000I  The SQL command completed successfully.
$ db2 "select * from last_explained"

Explain Plan
----------------------------------------------------------------------------------------------------
ID | Operation           |                     Rows | Cost
 1 | RETURN              |                          |  317
 2 |  FETCH T2           | 23424 of 23424 (100.00%) |  317
 3 |   RIDSCN            | 23424 of 23424 (100.00%) |   97
 4 |    SORT (UNIQUE)    | 23424 of 23424 (100.00%) |   97
 5 |     IXSCAN TEST_IDX | 23424 of 46848 ( 50.00%) |   94

Predicate Information
 2 - SARG Q1.RETURN_DATE IS NULL
 5 - START Q1.RETURN_DATE IS NULL
      STOP Q1.RETURN_DATE IS NULL

$ db2 set current explain mode explain
DB20000I  The SQL command completed successfully.
$ db2 "select * from t2 where return_date is not null"
SQL0217W  The statement was not executed as only Explain information requests
are being processed.  SQLSTATE=01604
$ db2 set current explain mode no
DB20000I  The SQL command completed successfully.
$ db2 "select * from last_explained"

Explain Plan
----------------------------------------------------------------------------------------------------
ID | Operation           |                     Rows | Cost
 1 | RETURN              |                          |  317
 2 |  FETCH T2           | 23424 of 23424 (100.00%) |  317
 3 |   RIDSCN            | 23424 of 23424 (100.00%) |   97
 4 |    SORT (UNIQUE)    | 23424 of 23424 (100.00%) |   97
 5 |     IXSCAN TEST_IDX | 23424 of 46848 ( 50.00%) |   94

Predicate Information
 2 - SARG Q1.RETURN_DATE IS NOT NULL
 5 - STOP Q1.RETURN_DATE IS NOT NULL

In Db2, an index is used for each of these, and you can see that the expected cost is identical. If I drop the index to force a a table scan, the explain looks like this:

$ db2 drop index test_idx
DB20000I  The SQL command completed successfully.
$ db2 "runstats on table db2inst1.t2 with distribution and detailed indexes all"
DB20000I  The RUNSTATS command completed successfully.
$ db2 set current explain mode explain
DB20000I  The SQL command completed successfully.
$ db2 "select * from t2 where return_date is not null"
SQL0217W  The statement was not executed as only Explain information requests
are being processed.  SQLSTATE=01604
$ db2 set current explain mode no
DB20000I  The SQL command completed successfully.
$ db2 "select * from last_explained"

Explain Plan
----------------------------------------------------------------------------------------------------
ID | Operation  |                     Rows | Cost
 1 | RETURN     |                          |  392
 2 |  TBSCAN T2 | 23424 of 46848 ( 50.00%) |  392

Predicate Information
 2 - SARG Q1.RETURN_DATE IS NOT NULL

Not a huge difference in cost at this table size, but a measurable one.

Ember is always curious and thrives on change. She has built internationally recognized expertise in IBM Db2, and is now pivoting to focus on learning MySQL. Ember shares both posts about her core skill set and her journey learning MySQL. Ember lives in Denver and work from home

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.