Many RDBMSes have similar structures and concepts. In this post, I’m going to compare and contrast IBM Db2’s concept of a clustered index and MySQL’s concept of a clustered index.
Index vs. Clustered Index
An index is generally a way of organizing data in a way that it will be quicker to access than scanning through all of a table. Typically, Relational Database Management Systems have b+tree indexes. This uses a structure of pages where the root and intermediate levels store ranges of the index keys, pointing to leaf pages which store exact key values along with some kind of pointer to the actual data.
Clustered indexes are slightly different. With clustered indexes, the table data is actually physically stored in the order of the index. This can lead to great efficiencies when the data is being scanned through in order of the clustered index.
Db2’s Clustered Index
In Db2, we can create a clustered index by adding the
CLUSTER keyword to a
CREATE INDEX statement. After this, the table should be reorganized, if there was any data in it.
Db2 does not guarantee that the data in a table will be in the order of the clustering index. The clustering index is still just an index on disk, with pointers to the rows associated with the keys. Db2 makes an effort when data is added to insert it where it belongs in the clustering order, but this is not always possible when there is not sufficient free space on the right pages. A reorg will bring data back into appropriate clustered order.
This is a visual representation of a clustered index in Db2:
In this image, the index is represented in light green, while the table is represented in grey. Note that most keys are in order, but we see one line indicating a key that is out of order. This is possible in a Db2 clustered index. Note also that the table data is fully separate from the index – they are different objects.
In fact, you can simulate a clustering index without actually adding one by specifying an index as part of a reorg statement.
A clustering index can consist of more than one column, and may be unique or not unique.
Selecting Column(s) for a Clustering Index
Generally the primary key of a table is not the best choice for a clustering index in Db2. Ask yourself how often a table is queried based on a range of a particular column or set of columns. If the answer is “frequently” then that column or set of columns might be a good candidate for a clustering index. Clustering indexes are particularly helpful for low-cardinality columns that are frequently part of a query. While indexes with low cardinality are generally poor choices for indexing, the one exception is if the data on disk happens to be very well clustered over that index key.
With Db2, it is critical to have a reasonable value of PCTFREE for a table with a clustering index – the more data you expect to insert in the middle of the key range, the larger PCTFREE should be so that there is space to maintain clustering for the table. The other side of that coin is that the percentage of the pages specified by PCTFREE is kept empty by design, and the larger this value is, the more empty/unused space you’ll have allocated.
Regular reorgs are also critical, as a table with a clustering index can lose cluster ratio over time.
MySQL’s Clustered Index
Note: I’m still a rookie at MySQL as I write this, so please comment below or contact me if I’ve misrepresented anything.
This is written from the perspective of the InnoDB storage engine. Other storage engines are out of scope for this article
In MySQL, using the InnoDB storage engine, the primary key is automatically used as the clustering index. If there is no primary key, MySQL will use the first unique index composed of columns that do not allow nulls as the clustering index. If that also does not exist, MySQL will create a hidden column to use for this purpose.
In MySQL, using the InnoDB storage engine, the index used for the clustering index must be unique, and in order to choose it, you must set it as the primary key for the table.
Another significant difference from the way that Db2 handles clustered indexes is that MySQL guarantees the clustering of the table on the clustering index. In fact, the table data is stored in a b+ tree structure directly, with the table data on the leaf pages. One of the advantages of this approach is that there is no need for a reorg process to re-cluster data. One of the disadvantages is that there are scenarios in which the table itself is rebuilt – which is similar to a reorg.
This is a visual representation of a clustered index in MySQL (using InnoDB):
Note that here the data is imbedded within the clustering index and not separate from it.
Because this primary key index is used to organize the table as a whole, other indexes are generally referred to as “secondary indexes”.
Selecting a Column for a Clustering Index
The only way to change the clustering index for a table is to change the primary key, which may have deeper implications, and should generally not be taken lightly.
There are scenarios which can require the rebuild of a table, which is similar to Db2’s reorg, and may be online or offline for the table, depending on the scenario. The MySQL documentation contains a nice summary of operations that may lead to table rebuilds. Think about something like changing the primary key – while that’s not a common task, it would require the table to be rebuilt in MySQL, where in Db2 that operation would not even require a reorg. Note that not every table rebuild makes the table unavailable for DML – much like Db2 reorgs not always requiring full unavailability of the table.
Side note on the documentation – I really love how clear that documentation page is for MySQL. In Db2, understanding what operations require or suggest a Reorg is a magic trick that I struggle with every time, despite being intimately familiar with the documentation. There are a couple of different pages to read deeply to figure it out instead of something as clear as this MySQL page.
It is very interesting as a DBA with a deep knowledge of one RDBMS learning another RDBMS. While there are vast similarities, there are also differences. Hopefully this type of comparison article helps others while they are learning.
I’d love to hear from others in the comments below if you’d like to add anything for Db2, MySQL, or maybe another RDBMS!
Thank you for sharing your knowledge and making me smile – that there are other people doing stuff with both of these rdbms. Perhaps even Klaas B. will smile when seeing this
In Db2 z/OS, “The first index that you define on the table serves implicitly as the clustering index unless you explicitly specify CLUSTER when you create or alter another index”. Is the statement true for Db2 LUW?
No, that is for mysql. It is not true for Db2 LUW.
“The only way to change the clustering index for a table is to change the primary key, which may have deeper implications, and should generally not be taken lightly.”
One of these deeper implications is that each row in the primary key must be in primary key (clustering) order, always. You do not reorganize the table, because that happens all of the time, for every insert and delete. It can also not really be batched well, so with random inserts and deletes in various primary key locations you are going to see a lot of page splits or page merges. In MySQL 5.7, there is also no good way to set a fill factor for a tables pages. They are always 15/16 full in bulk loads or index builds.
The change buffer exists to batch certain updates, but is not useful in all situations.
InnoDB is a masterpiece in isolating change when a row moves around. You have seen the Jeremy Cole stuff, so you know that InnoDB uses “physical addresses” not much, but addresses rows through the primary key. InnoDB can move rows around physically inside a page, and that change is isolated to a single page — the secondary indexes and even other pages in the primary key are unchanged.
The same happens when a page split or merge in the primary key is necessary: The change happens inside the primary key, or is even isolated to a subtree of the primary key, if you are lucky. None of the secondary indexes will see the change, as they refer to items in the PK tree by PK value, which is kind of like a logical row address and is stable.
In the age of rotating hard disk storage, this was minimizing the number of changed blocks and extremely valuable. These days it is a bit less important, but still useful.
This also means that MySQL InnoDB is extremely dependent on the primary key lookups to be fast and efficient. If you cannot fit all of the working set of the database in memory, at least the non-leaf pages of the active primary keys must fit. If not, given the number of primary key lookups that happen all of the time, things would get extremely slow.
And a flatter PK tree is of course faster than a deeper PK tree. The depth of the B+ tree that makes up the PK is dependent on the number of rows that fit in a page, so smaller rows -> higher fanout -> flatter tree -> fewer lookups. But as long as you have “multiple dozen” to “hundreds of rows” in a page, you gave a fanout of hundreds, and ln(rows)/ln(rows per page) is a small number.
If you do the math, you will find that the magic number is approximately 4 for almost any table size and a rows_per_page value of 100 or so. That is, without any caching, MySQL would pay four disk accesses for any PK lookup. With the core tree in memory, only the final page load for the actual row data is an actual disk access (you get a disk bound database). With all of the working set in memory, no disk reads happen (you get a CPU bound database, and have a comfortable life as a DBA).
Now consider a primary key value change. What will happen?
1. The row will be physically removed from where it is currently in the B+ tree. This may trigger page merges, if you are really unlucky. At least one page, maybe many pages are dirty.
2. The row will be physically inserted to where it now belongs in the B+ tree. This may trigger page splits, if you are really unlucks. At least one page, maybe many pages are dirty.
3. Every single secondary index will need to be adjusted to reflect the changed PK value. That will cause a cascade of pages to become dirty because of the SK changes.
While the design of InnoDB minimizes page changes when non-PK values change, changing a PK value is really abysmally bad for performance and writeback.
Thank. you for all of your thoughtful and thorough comments! I’m learning a lot from them!