Indexes are magic, right? Add one and magically the performance of a query improves.
Well, not really. Each index has a cost, and thoughtful multi-column indexes will go a lot further than individual indexes. Let’s dig into the details.
What an Optimizer Does
Relational databases require queries to specify what data they want returned with no specification of where that data is on disk. On the one hand, this is an amazing strength, but on the other, it means that something is required to translate between the query of what data to return and how to actually get that data. Each RDBMS has an optimizer to do this. Different optimizers are structured differently. Some are rule-based, while others are cost-based. Most are some balance of both. In any case, it is the optimizer’s job to decide what access method to use to get the data requested. There are often hundreds or thousands of access plans that an optimizer can choose from. It is the optimizer that chooses which indexes (if any) to use.
What is an Index?
An index is an ordered set of one or more columns. It is persisted on disk. It consists of the actual data from one or more columns of a single table, along with a unique identifier for the row that allows the row to be immediately and directly located on disk. For Db2, this unique identifier is the hidden row id for the row, which consists of a page id and an offset within the page. For MySQL this unique identifier is usually the primary key, and all of the table data is actually stored on the leaf pages of the primary key index.
This allows a particular value or range of values to be very quickly found, without having to scan through every page on disk. Every row from the table which has a value in the specified columns also has an entry in the index. While this makes finding a specific value or range of values easy, it adds more work to be done for each insert, update, or delete which impacts columns in the index. Depending on the platform and the settings of the RDBMS, NULL values may or may not be included in the index.
How Indexes are Used in Relational Databases
With a few exceptions, either zero or one indexes are used to access each table each time it is accessed within a query. While there can be index ANDING (joining, essentially), it is pretty rare to see it actually used (at least in Db2). What this means is that if every column in the query is individually indexed, the optimizer must choose one and only one index to use to access the table, and most frequently scan through the table pages to apply any other predicates or ordering in the query.
Indexes take up space on disk, in memory, and add overhead to most update, insert, and delete operations, so it is critical to find the right set of indexes for a specific data model and workload. Without knowing the workload, it is impossible to to know the right set of indexes for a data model. Without knowing the data model, it is impossible to know the right set of indexes for a workload.
Modern RDBMSes support multi-column indexes. These are (usually non-unique) indexes that include more than one table column. Selecting a subset of multi-column indexes based on the workload will nearly always out-perform single-column indexes on every column, and may even take less space on disk. Like many areas of database performance, we can say “it depends”. I have no doubt there are workloads out there which are better served by single-column indexes, but they are the exception and not the norm.
Does this mean a single index with all of the columns in it is the best of all worlds? Absolutely not. I have ONCE in a 20-year career seen a case where that was true, but it is extremely rare.
Column Order Matters
The order of columns in a multi-column index matters. Let’s say we have an index on columns (A,B,C). A query comes along with conditions on A and C, but not on B. The index can be navigated to find everything that satisfies the condition on A, but then all pages matching that need to be scanned to satisfy the condition on C. That scan may be scans of the index itself or on the rows fetched from the table, but in either case it is less than ideal.
Functions that are applied to table data often reduce or eliminate the use of indexes as well.
For example, the condition
date(o.timeplaced) = current date - 1 day
is taking the table data (o.timeplaced) and comparing it to a static timestamp. Without specialized techniques like matching function-based indexes, the optimizer cannot navigate an index on o.timeplaced to satisfy this query, as it would have to apply this same transformation to every value in the index, forcing a scan of every leaf page.
If we take the same condition and instead write it like this:
o.timeplaced between timestamp(current date - 1 day,'00.00.00') and timestamp(current date - 1 day,'23.59.59')
Now, the index can be used as intended, because the value we’re comparing has not been transformed in any way.
The syntax for these two conditions is valid in Db2. Date/time syntax is one of the areas that is most likely to be different in SQL for different RDBMSes, but the concepts apply across platforms.
This is only one example. Many functions transform data in ways that the RDBMS cannot fully understand (even when it is the one doing the transformation), and that means that indexes are often excluded when functions are applied to columnar data.
In addition to helping with finding data quickly, indexes are really useful for ordering data. Not all queries require ordering, but either the query can require an order or an order can be imposed by the optimizer for certain join strategies (in Db2, anyway). If a query requires ordering, an index can be used to provide that order, and even possibly avoid things like spilling sorts or temporary data to disk.
Column order matters here, too. If I have an index on (A,B) and a query needs data in the order of B without specifying a reasonable predicate on A, then the index cannot be used for ordering on B.
Factors for Considering What to Include in Indexes
If I had a recipe for the perfect index, I would give it to you. I would also garner crazy consulting fees for a few years and then retire. I don’t have that. More than any other area of database performance, there are rarely hard and fast rules for what exactly will give the best performance when dealing with the optimizer. Also, workloads are constantly changing. Even if you don’t change the code and have a hard review on every change to SQL, the frequency at which various queries are executed still changes with how a website or application is used.
Testing reasonable combinations is the only sane way to really say which of several reasonable paths is best.
Importance of Performance
The first thing to consider is how important the performance of a particular query is. In many cases, we cannot have the perfect index for every query in the workload, so it is important to focus on indexing for queries and access patterns that are most critical for performance. For example, in an OLTP database serving an e-commerce workload, the performance of anything related to checkout is absolutely critical. An end user is sitting waiting on that and there could be a real-dollar cost if they get frustrated and go away. However, we update the description of a product much more infrequently and a few extra milliseconds there may not matter as much.
There may be a very few queries where it is worth an index tailored specifically to that one query. There shouldn’t be a large number of these.
Queries that are identified as a significant load on the database are also very appropriate to focus on. Even if their performance is not critical for the reasons above, hogging database resources may cause the most critical workloads to perform poorly.
Finding random queries in the database, even if their access plans or general stats don’t look good, isn’t a worthy way to find things important to index for. Most DBAs could do that for the rest of their lives in most databases out there. There is a huge ROI in finding problem queries and addressing them in some way, but a typical 80/20 rule applies in that we likely get 80% of the impact in the first 20% of effort.
Columns Often Queried Together
Columns that are often queried together should often be in indexes together. It may be worth playing with column order among related columns. In databases that support it, it may be worth putting one column in ascending order while putting another in descending order. Db2 supports this, while MySQL does not.
It often makes sense to put a column that larger result sets are ordered on in an index. This ensures that when results are retrieved, they are already in order. This makes less difference if only a few rows are being returned. Subsequent columns in an index can only be used for ordering if all columns before them in the index have equality or range predicates on them.
Columns that are included in most queries are likely to be most beneficial in the first column of an index. For the most part, a query that can use an index on (A) can just as easily use an index on (A,B). The latter might be slightly less efficient, but not by much. A query on (B) cannot really make use of this index.
Cardinality / Filtering
The cardinality of any column is not the total number of values, but the total number of unique values. Indexes that are only on a low-cardinality column are particularly bad for performance, and may actually perform WORSE than a full table scan. However, if those indexes include another column with higher cardinality (which is also used by the query), that performance penalty becomes far less likely. This doesn’t really influence the order in which you place them in a multi-column index, but always avoid bad indexes – mostly those whose total cardinality across the full index key is less than 3% of the total number of rows in the table.
Some details here will change by RDBMS. Many RDBMSes offer tools that can be used to mitigate some of the issues mentioned here. Function-based indexes and even adding generated columns come immediately to mind. Be sure to look into the indexing options of your platform, and be sure to thoroughly test any indexing decisions.