What is an index in an RDBMS(Relational DataBase Management System), and what are the benefits and drawbacks of using an index? Why don’t we index everything? How is a clustered index different? Indexes are powerful tools in relational databases. They can speed up performance of queries exponentially, but they can also slow down inserts, updates, and deletes. Let’s take a look at indexing and why it is so important to RDBMS performance, good or bad.
What is an Index?
An index is like a phone directory. You know the value for one column or combination of columns (think first and last name), but you want to use that to return other information (such as phone number). In this example, the first and last name are the index key. The RDBMS then uses the index to find the rows (often using a hidden row identifier or RID) needed, and fetches the data from the table by the specific RID. An index is not just a logical construct, but exists physically on disk. This means it takes up space and must be maintained.
Different RDBMSes may offer different indexing options. The basic index is a b-tree index. Other index types may include bitmap and hash indexes, along with specialized indexes to support various types of clustering or partitioning, such as block indexes and partition indexes. A clustered index is a specialized type of b-tree index covered towards the bottom of this article.
Indexes in most RDBMS systems are B-tree indexes. A b-tree index starts with a root node (or page). The root node contains pointers to intermediate nodes. The intermediate nodes contain pointers to the leaf pages where the actual RIDs are stored. The leaf pages contain both the key values and a RID list for all RIDs matching each set of key values. If an index is a unique index, there is only one RID for each set of key values.
For example, in the image below:
If the RDBMS is traversing the index looking to the key value of ‘E’, it would look at the root node, and be pointed to the intermediate node that ends with ‘J’. It would then from that intermediate node, be pointed to the leaf page that ends with ‘F’ and contains the RID associated with ‘E’. Once it has that RID, it goes back to the table data and pulls just that one RID from the table.
There can be multiple levels of intermediate nodes depending on the size of the table and the size of the index.
An index may also be defined as a unique index. If an index is defined as unique, that means that in addition to the normal structures, it restricts the index and therefore the table from having duplicate values for the index keys. Many RDBMSes also allow covering indexes. A covering index is a unique index that includes the values of additional columns outside of the index key. The additional columns are called the include list or include columns. This may be desirable to enable index-only access in some scenarios.
How are Indexes Accessed?
The ideal access methodology for an index is that you have the values for all of the columns in the index key, and the RDBMS then uses those to come up with a RID or list of RIDs to fetch from the table. There are other access methodologies that may be used depending on your platform. In Db2, we can see what is called a jump scan, where we have only part of the index key, but the index can still be used for that part. A more common access methodology is a leaf-page scan. A leaf page scan occurs when instead of navigating down a b-tree index, the rdbms simply reads all of the leaf pages. This kind of access is marginally better than a table scan in some scenarios – if less data can be scanned this way than a table scan, it may be chosen as an access method.
In some cases, all of the data needed for a query may be available in the index. If this happens, then the RDBMS does not have to use the RIDs to fetch data from the table, and the query may run faster. Indexing with index-only access as a goal should only be done for the most critical or most problematic queries, though, as it tends to lead to over-indexing. Index-only access as a goal also leads to very wide indexes – those with many columns in the index key or in the combination of the index key and the include list (for covering indexes). If the same narrower index will contain the data needed may be faster than a wider index.
The data in relational tables themselves is in any order, and generally may be returned in any order. Often this is in the order the data was inserted in the table, but there are a large number of factors that make it so that order of data in the table is some other order. Clustered indexes go one step beyond all the other index details discussed here. In addition to being a traditional index, the table data is also ordered to match the clustering index. Depending on the RDBMS implementation and settings for a specific table, how good this clustering is or stays over time may vary. The reason a table is often not guaranteed to be clustered on a clustering index has to do with how data changes over time. Only so much space is reserved for future insertions that would go in the middle of the table, and RDBMSes vary on how they maintain clustering. Often there is a reorganization process to re-cluster a table on an index.
Some defaults either by RDBMSes or by human DBAs select the primary key as a clustering index. This only makes sense if the table is likely to be scanned in order by primary key. Often clustering on a low-cardinality index that is frequently used in join conditions or filtering conditions is more likely to be helpful to performance.
Function based indexes are now supported by a number of RDBMS platforms. These indexes allow the index key to not only consist of columns from a table, but also to consist of functions applied to columns. One of the most frequent uses of this is to transform a mixed-case text column to either UPPER or lower case for comparison.
How can Indexes Help Performance?
Indexes speed access to data. Nearly always, an index is smaller than the table, and if it isn’t you may want to reconsider your index design. This means that even a leaf page scan is more efficient than a table scan. Comparison of index access to a table scan for a single row looks like this:
- Index Access To access a single row, the RDBMS will first read the root page. It will then read one or more intermediate node pages. That will direct it to a single leaf page. Once it has the RID from the leaf page, it will access a single page from the table to get exactly the data it needs.
- Table Scan Without an index the RDBMS will scan every page of the table, sequentially, until it finds all of the rows it needs. This can be thousands or millions of pages
The difference in these two access methods is one of the frequent reasons why accessing data may be so much slower with a large database as compared to a small database. If using the index method, the access times will be fairly close, changed only by how many intermediate levels there are. If using a table scan, access in a small database is very fast, where in a large database, it is much slower while the entire table is scanned.
A clustered index most helps performance when the clustered index is on an index key that is low-cardinality – the index key has few unique values when compared to the number of rows in the table. The reason this is beneficial is because it means that the rows a query needs are likely to be on the same or adjacent pages. This means it is likely that fewer pages are needed to return the same number of rows, reducing the I/Os and memory space needed.
Indexes can also help RDBMS optimizers/compilers more accurately estimate the data that will be returned and therefore choose appropriate access methods and realistic memory sizes.
How can Indexes Hurt Performance?
Indexes have to be maintained. Both in the sense that they must be updated with every insert, update, and delete to the table, but also in the sense that over time, reorganization and/or rebuilding may be required. Some kinds of replication, and all high availability methods must also replicate that activity to other servers.
In more that one problem performance scenario, the solution I have found was to actually drop indexes rather than adding them. The time required to maintain indexes is not insignificant.
The Art of Indexing
Indexing in a relational database is an art. I, and others, can give you general guidelines on what should be faster and what should be slower, but there can be some odd edge cases. There are also quirks to each RDBMS. As a result, it may take some trial and error to come up with the highest performing solution, particularly in light of the multiple concurrent workloads that many databases see. It also makes sense to engage an expert in the RDBMS you are using to help you navigate the particular details of that system.
Finally, Indexing is rarely at a state of rest. Changing data patterns, changing query patterns and a dozen other factors mean that indexing should be revisited anywhere from once a week to several times a year.
Good morning and great article,
in fact I am going to send it to our team’s newcomers as mandatory reading before our ‘index’ lesson.
Though every time I read “cluster index keeps data ordered” my fists clench automatically.
The documentation says it only ATTEMPTS to keep the data sorted.
I’ve seen so many really really slow tables made almost completely from overflow data – because new data didn’t fit where cluster index wanted to put them and daily reorgs couldn’t be afforded.
To me cluster index only ever made sense for data that are already naturally sorted,
eg if you insert current_timestamp and you want to tell database that data in this table are sorted by that column too.