Many RDBMSes have similar structures and concepts. In this post, I’m going to compare and contrast IBM Db2’s concept of a clustered index and MySQL’s concept of a clustered index.
Index vs. Clustered Index
An index is generally a way of organizing data in a way that it will be quicker to access than scanning through all of a table. Typically, Relational Database Management Systems have b+tree indexes. This uses a structure of pages where the root and intermediate levels store ranges of the index keys, pointing to leaf pages which store exact key values along with some kind of pointer to the actual data.
Clustered indexes are slightly different. With clustered indexes, the table data is actually physically stored in the order of the index. This can lead to great efficiencies when the data is being scanned through in order of the clustered index.
Db2’s Clustered Index
In Db2, we can create a clustered index by adding the
CLUSTER keyword to a
CREATE INDEX statement. After this, the table should be reorganized, if there was any data in it.
Db2 does not guarantee that the data in a table will be in the order of the clustering index. The clustering index is still just an index on disk, with pointers to the rows associated with the keys. Db2 makes an effort when data is added to insert it where it belongs in the clustering order, but this is not always possible when there is not sufficient free space on the right pages. A reorg will bring data back into appropriate clustered order.
This is a visual representation of a clustered index in Db2:
In this image, the index is represented in light green, while the table is represented in grey. Note that most keys are in order, but we see one line indicating a key that is out of order. This is possible in a Db2 clustered index. Note also that the table data is fully separate from the index – they are different objects.
In fact, you can simulate a clustering index without actually adding one by specifying an index as part of a reorg statement.
A clustering index can consist of more than one column, and may be unique or not unique.
Selecting Column(s) for a Clustering Index
Generally the primary key of a table is not the best choice for a clustering index in Db2. Ask yourself how often a table is queried based on a range of a particular column or set of columns. If the answer is “frequently” then that column or set of columns might be a good candidate for a clustering index. Clustering indexes are particularly helpful for low-cardinality columns that are frequently part of a query. While indexes with low cardinality are generally poor choices for indexing, the one exception is if the data on disk happens to be very well clustered over that index key.
With Db2, it is critical to have a reasonable value of PCTFREE for a table with a clustering index – the more data you expect to insert in the middle of the key range, the larger PCTFREE should be so that there is space to maintain clustering for the table. The other side of that coin is that the percentage of the pages specified by PCTFREE is kept empty by design, and the larger this value is, the more empty/unused space you’ll have allocated.
Regular reorgs are also critical, as a table with a clustering index can lose cluster ratio over time.
MySQL’s Clustered Index
Note: I’m still a rookie at MySQL as I write this, so please comment below or contact me if I’ve misrepresented anything.
This is written from the perspective of the InnoDB storage engine. Other storage engines are out of scope for this article
In MySQL, using the InnoDB storage engine, the primary key is automatically used as the clustering index. If there is no primary key, MySQL will use the first unique index composed of columns that do not allow nulls as the clustering index. If that also does not exist, MySQL will create a hidden column to use for this purpose.
In MySQL, using the InnoDB storage engine, the index used for the clustering index must be unique, and in order to choose it, you must set it as the primary key for the table.
Another significant difference from the way that Db2 handles clustered indexes is that MySQL guarantees the clustering of the table on the clustering index. In fact, the table data is stored in a b+ tree structure directly, with the table data on the leaf pages. One of the advantages of this approach is that there is no need for a reorg process to re-cluster data. One of the disadvantages is that there are scenarios in which the table itself is rebuilt – which is similar to a reorg.
This is a visual representation of a clustered index in MySQL (using InnoDB):
Note that here the data is imbedded within the clustering index and not separate from it.
Because this primary key index is used to organize the table as a whole, other indexes are generally referred to as “secondary indexes”.
Selecting a Column for a Clustering Index
The only way to change the clustering index for a table is to change the primary key, which may have deeper implications, and should generally not be taken lightly.
There are scenarios which can require the rebuild of a table, which is similar to Db2’s reorg, and may be online or offline for the table, depending on the scenario. The MySQL documentation contains a nice summary of operations that may lead to table rebuilds. Think about something like changing the primary key – while that’s not a common task, it would require the table to be rebuilt in MySQL, where in Db2 that operation would not even require a reorg. Note that not every table rebuild makes the table unavailable for DML – much like Db2 reorgs not always requiring full unavailability of the table.
Side note on the documentation – I really love how clear that documentation page is for MySQL. In Db2, understanding what operations require or suggest a Reorg is a magic trick that I struggle with every time, despite being intimately familiar with the documentation. There are a couple of different pages to read deeply to figure it out instead of something as clear as this MySQL page.
It is very interesting as a DBA with a deep knowledge of one RDBMS learning another RDBMS. While there are vast similarities, there are also differences. Hopefully this type of comparison article helps others while they are learning.
I’d love to hear from others in the comments below if you’d like to add anything for Db2, MySQL, or maybe another RDBMS!