I’ve spent some time understanding how MySQL structures tables and indexes, and wanted to share what I’ve learned. This is not the structure as created in the
CREATE TABLE or
CREATE INDEX statements, but how that data is generally structured on disk, both logically and physically. Any statements about MySQL apply to the InnoDB storage engine, and I’m newer to MySQL than I am to Db2, so please comment if you have a correction.
In MySQL, generally I’ve seen table per file tablespaces. These are most like Db2s SMS tablespaces (which are largely deprecated) because each table has its own file, though unlike Db2 SMS, the table data and indexes are in the same file. Also while we very frequently use multiple “disks” (or logical volumes) for Db2, from what I’ve seen and been told it is rather unusual to use more than one logical disk for MySQL. Db2 also adds in a layer of abstraction called a storage group between tablespaces and the logical disk.
To understand both the placement of pages within the table file and the structure of the pages in MySQL InnoDB, there is an excellent series of blog entries by Jeremy Cole. These blog entries dig into more detail that I’ve ever seen from IBM on Db2, outside of conference presentations. Jeremy has also represented the InnoDB code in Ruby, which may be significantly easier to understand than looking at the source C++ code.
Both RDBMSes use a relatively similar structure within pages – the concepts of a page header, a page footer, and filling the data into pages are similar. There are a few more full pages which are headers themselves in db2, which is necessitated by storing multiple tables and indexes in the same file. A table header isn’t really needed when the entire file is about that table. Both have concepts on free space control information at reasonable intervals.
One of the most obvious differences in how MySQL and Db2 store tables on disk is that everything in MySQL is essentially an index. The table data is stored on the leaf pages of the clustering index, which is nearly always the primary key. If there is no primary key and no suitable alternate key, MySQL generates one to use for table structure.
Db2, by contrast, has separate structures for tables, which may vary for different table types like MDC or column-organized tables. Focusing on fairly standard tables, there are several implications to be aware of here. The first is that MySQL essentially has the primary key as a hidden inclusion in every secondary (i.e. not primary key) index. If rows to satisfy a query are identified by a secondary index, the primary key is then used to fetch the remaining data from the table itself. Db2 also has a hidden value to serve this purpose for each row in an index. Db2 uses what is called a RID or Row ID. The RID is composed of a page number (relative to the table space for most table spaces or relative to the table for SMS table spaces) and a slot number on that page. There might be a slight speed advantage for Db2 here, as accessing any index can result in very specific pages needed rather than having to navigate the B+ tree primary key index to get to a specific page. The RID approach does limit table size, though. This is actually the difference between the old regular tablespaces and the newer large tablespaces. On 4k page sizes, the RID was the thing that limited total table size to 64 GB in regular tablespaces. The limits increased in large tablespaces as more bits are allocated to the page number portion of the RID.
Db2’s typical table structure does not allow guaranteed clustering on an index, though. A clustering index can be added, but clustering is not fully guaranteed and may require regular reorgs to maintain, depending on activity. On the other hand, I’ve become a believer over the years that the best clustering index is usually not the primary key. To my mind, the best clustering index would cluster rows that are often queried together on the same page or pages. Range scans on primary key aren’t as common, at least in Db2 OLTP databases. There are other mechanisms that allow for clustering that may be more powerful in Db2 such as Multi-Dimensional Clustering(MDC) or Insert-Time Clustering(ITC), though those have somewhat limited use cases and specific drawbacks.
This structure also makes changing the primary key in MySQL quite a daunting process. In Db2, you could add a unique index and then simply change the primary key, requiring no restructuring of the table. For MySQL, changing the primary key would require a rebuild of the table. Changing a primary key is not a frequent task.
Both MySQL and Db2 primarily or entirely use B+ tree indexes. A B+ tree index is much like a b-tree index, but includes pointers on each page to neighboring pages. This makes scans through the B+ tree more efficient than a plain b-tree.
As far as how they are structured on disk, MySQL InnoDB indexes are stored within the file for the table. Essentially a directory or header within the file tells where the root page of each index is, and from there, the indexes can be navigated.
In Db2, the indexes can be stored in the same table space as the data, or they may be in a separate table space. In the bad old days, the best practice was to separate them, but that’s largely not needed today. Similar to MySQL, Db2 has a table space header that points to what page a particular index starts on, which can then be used to navigate the rest of the index.
Both Db2 and MySQL try to save space for future index entries with varying levels of success. MySQL tries to fill index pages 15/16ths full. Db2 uses PCTFREE to specify a default of 10% free space target on index pages. Both RDBMSes have operations that respect this space and operations that do not. For example,
REORG in Db2 will respect this, but it is meant to be violated by individual inserts. Reorganizing (classic) or reorganizing indexes in Db2 will rebuild the index to respect PCTFREE. In MySQL there isn’t really a way to rebuild indexes in an ideal way that I am aware of.
Both RDBMSes still run into situations where data does not fit on the index page where it belongs. In these cases, a page split occurs, and the RDBMS must take part of the data from the old page, place it on a new page. Both RDBMSes will also do this in the buffer pool whenever possible, making the impact not too bad.
MySQL will also attempt to merge a page while doing a write. If it tries to write a record to a page that is too sparse, it will pull that page’s neighbors, and see if either of them are sparse enough to combine into only one page. Db2 does not perform this generally, but before pseudo deleting of index keys was implemented, you could use MINPCTUSED to enable it. Realistically, Db2 requires a reorg to do this type of work.
MySQL will not reclaim/release pages from the index that have been completely emptied. Recent versions of Db2 can reclaim/release these pages using a
REORG with the
CLEANUP ... or
RECLAIM EXTENTS options.
Physical location of pages within files
Both Db2 and MySQL just can’t manage to keep the pages for a particular table or index really fully contiguous (next to each other) within the files that are used to store tables and tablespaces. Even if a table is alone in a file, every insert also requires data to be added to indexes, and therefore objects inevitably have their pages in amongst other objects.
This is why some of the directory structures such as file headers and table/table space headers are needed – to locate the pages within the file. In Db2, REORG will likely do some defragmentation, but absolutely will not guarantee contiguous pages for an object. You see this particularly if you ever had to go through some of the painful exercises to lower the high water mark of a non-reclaimable table space in Db2.
“To my mind, the best clustering index would cluster rows that are often queried together on the same page or pages.”
Some of that maybe can achieves which covering indexes. I do not know if that is a common concept in DB/2, it was not in Postgres for a very long time, and MySQL always had it. If you overspecify an index by adding more columns that are actually needed, the additional columns are part of the index and are sorted with the index. The optimizer takes the data then from the index and no lookup on the primary key and the original row is needed. This is shown in EXPLAIN as “index used”.
So “SELECT id, a FROM t WHERE b = 10” can run with an INDEX(b), which would cover b and id. The column a is missing and MySQL would do a PK lookup.
But with INDEX(b,a) we get an index on (b,a,id) and the above query can be solved with data from the secondary index alone. The optimizer will print “index used” and not go to the primary key, because the value of a is knows from the index.
This is not actually clustering, but often good enough. Domas Mituzas used this technique extensively on the MySQL powering Wikipedia, and it was one of the main reasons Wikipedia could keep up with demand.