Clustering a Table on an Index

I have been playing a fair amount lately with clustering indexes and have been rehashing my views on reorging tables on an index. This is still a work in progress, but thought I’d share some details and see if others out there have any thoughts to share with me and others on it.

Clustering a Table on an Index

I’ve long been of the opinion that as long as I’m doing a table reorg in an OLTP database, I might as well reorg the table on an index. If there is a clustering index on the table, this is the easiest call in the world – of course the table will be reorged on that index – you can’t do it on any other index. But when you don’t have a clustering index, determining the right index is more difficult.

I have long reorged tables on the primary key when there is no clustering index defined. But with research I’ve done over the last couple of years, I have come around on this. I don’t see the performance gain there in many cases. When my tables are being accessed, how likely is it that they’ll be accessed by sequential primary key, especially in my WebSphere Commerce databases where the primary keys are almost always simply generated numbers in sequential order?

I’m increasingly convinced that my tables should be clustered on a low-cardinality index – I can get the most performance gain there. WebSphere Commerce databases have plenty of low-cardinality indexes that I’m not ready to drop outright. For a low-cardinality index, I am much more likely to be accessing a chunk of the data by the index value at a time. See my developerWorks article, Why low cardinality indexes negatively impact performance for a description of why clustering can make such a performance difference when these indexes are used to access a table.

What is a Clustering Index?

Before we dig in farther, let’s first start with the basics. A clustering index is created when you specify the CLUSTER keyword on index creation(if a new index on an existing table, the table should then be reorged on the clustering index). DB2 then attempts to maintain the physical data in the table in the same order as the index. It respects a clustering index on LOAD and even on insert, DB2 attempts to physically place rows where they belong in the order of the clustering index. However, the clustering is not guaranteed, and there are a number of situations in which DB2 is unable to maintain the table in the order of the index. This is where regular reorgs come into play – assuming data is changed or inserted across the range of values in the index column or columns, regular reorgs will be needed to keep the table clustered on the index.

MDC is a concept introduced all the way back in DB2 version 8. It is guaranteed clustering over one or more columns. I’d love to get my hands on it, but WebSphere Commerce considers it a “customization”, so I won’t be talking much about it here.

Performance Impact on Reorg of Reorging on an Index

The Reorg itself will run longer if you’re doing inplace reorgs on a index as opposed to a reorg that does not cluster the table on any index. The reason for this is that a reorg without specifying an index will scan the table starting at the end, and move things around as it makes sense, and at the end may have a range of pages that can be released from the table. It never has to scan an index. An inplace reorg on an index, though, must first clear a range of pages at the beginning of the table, then scan an index to determine what rows need to be in what order and move those rows into the proper positions. This is a lot more data to move around, and can take significantly longer. Especially in larger databases, the amount of time a reorg takes may be a limiting factor in the kinds of reorgs you do, especially considering that inplace reorgs are not fully online (see When is a Reorg Really Online?).

Determining Indexes for Clustering and Clustering Reorgs

The absolute best way to determine the ideal clustering indexes that you’re going to use is to look at your SQL workload on your database and find where a clustering index might help you there. This is time consuming as you consider each table and the SQL related to that table (which can be hard to find without a tool such as DBI’s Brother-Panther). It is also rigid, and likely requires you to define your indexes either as clustering indexes or to store the indexes you want to reorg tables on somewhere in an unchanging format. For DBAs supporting hundreds of databases with hundreds of tables in each database, this may be impossible.

Even when I consider only WebSphere Commerce databases, there is a vast variety in where clustering indexes are going to make the most impact. I have one client that uses multiple stores, but only one language, so there I’m looking strongly at clustering several tables on storeent_id. Another client doesn’t have many stores, but does have many languages that are being used. Certainly for the DESC tables, there I need to work more towards using language_id as a common clustering index. I wrote recently about how I helped a critical query using a couple of clustering indexes, and that was based on SOLR functionality that is only in the latest WCS Fix/Feature Pack, so changes between WCS versions can change things as well.

Some Queries in the Right Direction

What I really want is a query I can run that will tell me not just what index for any given table will make the most sense as the clustering index or index that I regularly cluster on using a reorg, but also how much of an improvement in my workload they will give me. There may be a way I can script some of that – explaining and/or advising the SQL in my package cache, and perhaps looking at explain/advis tables? Sounds like a fun science project, but I’m not there yet.

What I’m starting out with is something a bit more simple. The below two ksh scripts make a start in this direction. The first simply describes all the indexes in the table, in a way I like better than describe indexes. I’m still playing with it:

$ cat desc_ind.sh
tabname=`echo $1 | tr '[:lower:]' '[:upper:]'`;
db2 connect to wc005p01;
db2 -v "select i.lastused, substr(indname,1,20) as indname, substr(colnames,1,30) as colnames, fullkeycard as fullkeycard, card as table_card, decimal(clusterfactor,10,5) as clustefactor, indextype, index_scans from syscat.indexes i join syscat.tables t on i.tabname=t.tabname and i.tabschema=t.tabschema join table(mon_get_index( i.tabschema, i.tabname, -2 )) mgi on mgi.iid = i.iid where t.tabschema='WSCOMUSR' and t.tabname='$tabname'  with ur";
db2 connect reset;

This script is executed simply with ./desc_ind.sh TABNAME. It could easily be altered to also specify a schema name – that’s just not something I need at the moment. The output of this script looks something like this:

$ ./desc_ind.sh XSTORESKU

   Database Connection Information

 Database server        = DB2/LINUXX8664 9.7.7
 SQL authorization ID   = DB2INST1
 Local database alias   = WC005P01

select i.lastused, substr(indname,1,20) as indname, substr(colnames,1,30) as colnames, fullkeycard as fullkeycard, card as table_card, decimal(clusterfactor,10,5) as clustefactor, indextype, index_scans from syscat.indexes i join syscat.tables t on i.tabname=t.tabname and i.tabschema=t.tabschema join table(mon_get_index( i.tabschema, i.tabname, -2 )) mgi on mgi.iid = i.iid where t.tabschema='WSCOMUSR' and t.tabname='XSTORESKU'  with ur

LASTUSED   INDNAME              COLNAMES                       FULLKEYCARD          TABLE_CARD           CLUSTEFACTOR INDEXTYPE INDEX_SCANS
---------- -------------------- ------------------------------ -------------------- -------------------- ------------ --------- --------------------
09/15/2014 P_XSTORESKU          +XSTORESKU_ID                               3550167              3550167      0.85567 REG                     586462
09/04/2014 I_XSTORESKU01        +STLOC_ID                                      2681              3550167      0.18429 REG                         41
09/16/2014 I_XSTORESKU02        +UPC                                          15955              3550167      0.03371 REG                        334
09/16/2014 U_XSTORESKU01        +STLOC_ID+UPC                               3550167              3550167      0.15293 REG                     406417
01/01/0001 U_XSTORESKU02        +UPC+STLOC_ID                               3550167              3550167      0.00156 REG                          0

  5 record(s) selected.


DB20000I  The SQL command completed successfully.

This brings data together from several sources I want to know about when looking at my indexes in this context. Notice that I’m pulling the lastused to see if the index is used at all (in this case, there are some stupid indexes on the table), and also pulling from MON_GET_INDEX to look at the number of index scans.

The next script I’ve been playing with is designed to return the name of a single index that appears to be the best candidate for a clustering index without actually looking at the SQL. It finds indexes where the index cardinality is less than 50% of the table (not sure I’m happy with that number yet), but with a cardinality of greater than 1, because clustering a table on an index with a cardinality of one is patently ridiculous. Then it orders them in descending order of the number of index scans reported in MON_GET_INDEX, and returns only the index with the most scans for this table. This tells me that DB2 is already using this low-cardinality index even without the clustering, which tends to tell me that it may be causing potential performance problems.

 $ cat clus_ind.sh
tabname=`echo $1 | tr '[:lower:]' '[:upper:]'`;
db2 connect to wc005p01;
db2 -v "select substr(indname,1,20) as indname, float(fullkeycard)/float(card)
from syscat.indexes i join syscat.tables t on i.tabname=t.tabname and i.tabschema=t.tabschema join table(mon_get_index( i.tabschema, i.tabname, -2 )) mgi on mgi.iid = i.iid
where t.tabschema='WSCOMUSR' and t.tabname='$tabname' and float(fullkeycard)/float(card) < .5 and fullkeycard > 1 order by index_scans desc fetch first 1 row only  with ur";
db2 connect reset;

The problem with this approach is that I’m not looking at what index could have the most impact – maybe one of these indexes would be used very heavily if only the cluster factor was higher, since DB2 tends to avoid the usage of low-cardinality indexes when a table is not well clustered over them.

What other factors should I be considering here? Does anyone have any suggestions on how to make this methodology (or some other method) more relevantly find indexes on which a table would most optimally be clustered on? I feel like my approach could be heavily improved on, but wonder how to do it without extremes like explaining all the SQL in my package cache.

I’m still sure I should be dropping the lowest cardinality indexes in many situations. Perhaps choosing a clustering index is really more applicable to the “medium” cardinality indexes.

Ember Crooks
Ember Crooks

Ember is always curious and thrives on change. She has built internationally recognized expertise in IBM Db2, spent a year working with high-volume MySQL, and is now learning Snowflake. Ember shares both posts about her core skill sets and her journey learning Snowflake.

Ember lives in Denver and work from home

Articles: 549

7 Comments

  1. Benefits of clustering depend a lot on how data is accessed. There are some natural places this occurs:
    – If there is RI, then the child tables would benefit by being clustered by the parent key.
    – Many applications do not declare RI, but the relationship in the data still exist
    – Batch processes that do Merge type function. For example, take a sorted input file and update the master table. If both are sorted in the same field, enormous amounts of IO can be eliminated. Dynamic pre-fetch kicks in and all your data pages are ready before you need them.

    In the old days, clustering was also used to avoid hot spots on physical disk. You would pick a very random column for the clustering index. This caused inserts to be spread all over the table. This is not a problem today with SANs.

    There is a penalty to declaring a clustering index. Inserting data takes more work as DB2 tries to find the best place for the row. If inserts are not off the end, then additional reorgs are required to re-cluster the table. It’s a great tool to use, but declaring clustering indexes for no reason hides the cases where clustering is important.

    Norm

  2. Ember:

    Brilliant post. I read your article in IBM developer works on a similar topic (You have given the link to it above as well). Also, I was at IDUG listening to your presentation on this topic 🙂 . Now, I have two questions:

    1) When DB2 does not have the distribution statistics, does it assume equal or normal distribution?

    2) Cluster indices need to be maintained with Reorgs which can be expensive on big tables, Do you think an MDC (if appropriate) is better(provided we can spare extra storage)? Especially if we have multiple medium card indices.

    Thanks for the informative blogs as always.

    Ravi

    • 1) It assumes equal distribution. “Normal” would be a bell curve, and DB2 does not do that.
      2) I love the idea of MDC tables for this kind of thing, should your application allow that. WebSphere Commerce does not allow MDC. Obviously you don’t want to choose a dimension with too high of a cardinality for MDC for fear of wasting space.

  3. Hello,

    if i create a table and i create 2 indexes on them and none of them has the cluster keyword. In which order tries db2 to maintain the data?
    Is that in the order of the first index created?

    • DB2 does not try to maintain the data in any order. It is likely to be roughly in order of when it was inserted, but that’s not guaranteed. While SQL server may cluster on PK by default, DB2 definitely does NOT.

  4. Hi Ember,

    This is Yash and had a question on order of the rows returned by the DB2

    Suppose, there is table with 5 columns and all the 5 columns are part of the unique clustered index and all the 5 columns are mentioned in the fetch query, will the rows be always returned in the order they are stored or do we need to specify the Order by clause ?

    Thanks,
    Yash

    • There is never a guaranteed return order, and if order matters to you, you should specify it in the query. Db2 could choose a different return order at any time. While you might see a somewhat consistent return order, there are a number of things that can change it.

Leave a Reply to Ember CrooksCancel Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.