Runstats Issue With Columns Where There is a Long Common Prefix in Character Data

This technote caught my eye when @db2_support tweeted it the other day. It was titled “DB2 might choose a sub-optimal query execution plan due to distribution statistics”. The title alone intrigued me. As I read through it, I thought I needed to do a bit of investigation to better understand the issue. It also seemed like something I could write a query or two to investigate, to see whether it was occurring in my own databases.

Description of the Issue

This particular problem appears to be based on the fact that DB2’s runstats utility only looks at the first 32 characters of a longer text field when collecting distribution statistics. I always recommend distribution statistics because the more information the DB2 optimizer has, the more likely it is to generate a truly great access plan.

Distribution statistics are the quantile and frequently occurring values stored in SYSSTAT.COLDIST. By default, distribution statistics will store the 10 most frequently occurring values and 20 quantile values. Quantile values mean that DB2 stores the value found at each 1/20th of the way through the table: if the table had just 20 rows, every value would be stored; if the table had 40 values for this column, every other one would be stored. With the default settings, 20 quantile values are stored for each and every column in each and every table, no matter what the size of the table.
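
For reference, here is a sketch of the kind of runstats command that collects these statistics, with the defaults spelled out explicitly (the table name is just an example):

-- Collect distribution statistics, explicitly specifying the defaults
-- of 10 frequent values and 20 quantiles (table name is illustrative).
RUNSTATS ON TABLE wscomusr.atchast
    WITH DISTRIBUTION ON ALL COLUMNS
    DEFAULT NUM_FREQVALUES 10 NUM_QUANTILES 20
    AND INDEXES ALL;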

The problem appears to be that the runstats utility considers only the first 32 characters of a string, no matter how long that string is. This matters because DB2 relies on these values to estimate how many rows a query will return. If DB2 estimates incorrectly, it might choose the wrong access path to retrieve the data, possibly resulting in a query that performs worse than it should.

I have a fair number of mid-sized (50-1000 character) VARCHAR fields in the WebSphere Commerce databases that I support. I could also immediately think of two tables likely to have fields where the first 32 characters were the same, but the strings differed later on.

Looking to see if the Conditions for the Issue Exist

The first thing I did was write a query to help me find tables and columns where the potential for this problem existed. With thousands of tables, each with many columns, it’s just too much for me to know immediately where I might face this issue, so I wrote this query to pull potential problem areas out of SYSSTAT.COLDIST.

SELECT substr(tabname,1,20) AS tabname,
    substr(colname,1,20) AS colname,
    length(colvalue) AS col_length,
    valcount,
    substr(colvalue,1,40) AS colvalue,
    seqno
FROM sysstat.coldist
WHERE tabschema = 'WSCOMUSR'
    AND tabname NOT LIKE 'TI_%'
    AND colvalue IS NOT NULL
    AND length(colvalue) > 30
    AND type = 'F'
    AND valcount > 100
ORDER BY 3 DESC,
    tabname
WITH UR;

What this query does is look only at the frequently occurring values (type 'F'), restricted to those with higher numbers of occurrences, where the length of the stored value is over 30 characters. I found it interesting that there were no values in my database longer than 37 characters, which confirms for me that there is some internal limit. (Note that the stored values include the surrounding single quotes, which account for two of those characters.) Oddly enough, the COLVALUE column in SYSSTAT.COLDIST is defined as VARCHAR(254). When I ran this query, I came up with output like the following:

TABNAME              COLNAME              COL_LENGTH  VALCOUNT             COLVALUE                                 SEQNO 
-------------------- -------------------- ----------- -------------------- ---------------------------------------- ------
ATCHAST              ATCHASTPATH                   35                  438 'images/catalog/apparel/apparel_55'           2
ATCHAST              ATCHASTPATH                   35                  402 'images/catalog/apparel/apparel_40'           3
ATCHAST              ATCHASTPATH                   35                  355 'images/catalog/kitchenware/kitche'           4
...
SRCHSTAT             SUGGESTION                    32                19001 'fallen, face, fast, fade, else'              4
CTXDATA              SERVALUE                      31               178211 'null&null&null&null&null&null'               4
PX_ELEMENT           TYPE                          31                 8486 'IncludeCatalogEntryIdentifier'               1
PX_ELEMENT           TYPE                          31                 1909 'ExcludeCatalogEntryIdentifier'               3

  310 record(s) selected. 

310 records, covering even fewer distinct tables, is at least a manageable number to look through.

From this point, it requires a bit more knowledge of your application or your data model to dig down into what might actually be a problem area. Just because a value is more than 30 characters does not mean it is a problem. But if you look at what the field is used for and put that in context of the table, you may be able to find areas where the value changes in the later characters.

In the case of one of the tables here, ATCHAST, the table is used to store information about ATtaCHment ASseTs. The data stored includes the paths to files. It is assumed that many of the files are stored in similar locations, and therefore their paths may start similarly, but change at the end. This is exactly the kind of column I need to be looking into.

Once you have used a query like the one above and applied your own knowledge of the data model to determine which tables and columns might be a problem, you’ll want to examine whether the problem actually exists.

In the case of the ATCHAST table above, I would run something like this:

SELECT substr(atchastpath,1,70) AS atchastpath
FROM atchast
WHERE atchastpath LIKE 'images/catalog/apparel/apparel_40%'
WITH UR;

ATCHASTPATH                                                                     
----------------------------------------------------------------------
images/catalog/apparel/apparel_40x40/IMG_0034_q.jpg                   
images/catalog/apparel/apparel_40x40/IMG_0055_q.jpg                   
images/catalog/apparel/apparel_40x40/IMG_0077_q.jpg                   
images/catalog/apparel/apparel_40x40/IMG_0062_q.jpg                   
images/catalog/apparel/apparel_40x40/IMG_0054_q.jpg                   
images/catalog/apparel/apparel_40x40/IMG_0058_q.jpg                   
images/catalog/apparel/apparel_40x40/IMG_0020_q.jpg
...

I can see in this case that the values are indeed the same until the last few characters. This tells me clearly that this problem has real potential to occur here.
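
To quantify how much the 32-character limit hides, a quick check along these lines compares the number of truly distinct values with the number of distinct 32-character prefixes; a large gap means runstats sees far fewer distinct values than actually exist. This is just a sketch against the table above, assuming the 32-character limit from the technote:

-- Compare full distinct values to distinct 32-character prefixes.
SELECT COUNT(DISTINCT atchastpath) AS full_distinct,
    COUNT(DISTINCT substr(atchastpath,1,32)) AS prefix_distinct
FROM atchast
WITH UR;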

Now the next piece is to consider whether this column is used in a way that matters. Would I ever have a WHERE clause that looks for other information in the table by specifying a value for this column? Or a query that looks for values greater than or less than a given value? There may be some cases where the answer is no. In 10.1 and higher, it may be useful to create a usage list to capture the SQL that hits the table in question, as sketched below.
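
A minimal sketch of the usage list approach, with a made-up usage list name; MON_GET_TABLE_USAGE_LIST returns executable IDs that can then be matched to statement text in the package cache:

-- Create and activate a usage list for the table (DB2 10.1 and higher).
-- The usage list name is illustrative.
CREATE USAGE LIST atchast_ul FOR TABLE wscomusr.atchast;
SET USAGE LIST atchast_ul STATE = ACTIVE;

-- Later, review what has touched the table (NULLs let the function
-- resolve the usage list by name and report across all members).
SELECT member,
    executable_id,
    num_references
FROM TABLE(MON_GET_TABLE_USAGE_LIST(NULL, 'ATCHAST_UL', NULL));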

When reviewing SQL in this context, remember that SQL with parameter markers (?) cannot make use of distribution statistics, so discount any SQL you see along those lines.
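
To illustrate the distinction, compare these two hypothetical query shapes:

-- Distribution statistics can inform the row estimate here, because the
-- literal value is known when the access plan is compiled:
SELECT COUNT(*) FROM atchast
WHERE atchastpath LIKE 'images/catalog/apparel/apparel_40%';

-- With a parameter marker, the value is unknown at compile time, so the
-- frequent-value and quantile statistics cannot be applied:
SELECT COUNT(*) FROM atchast
WHERE atchastpath LIKE ?;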

Fix for the Issue

Here’s an interesting part: this is a technote and not an APAR. To me, this means that IBM has no intention of fixing it. C’mon, IBM, can runstats be improved to at least handle 254 characters for distribution statistics?

The stated fix is not to use distribution statistics at all for the specific columns affected. That would probably reduce the scope of the problem when it is causing access plans to be significantly off, but all that does is make DB2 assume a uniform distribution, which is just another way of estimating incorrectly, just hopefully not as far off.
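
In practice, that would look something like the sketch below: basic statistics on every column, but distribution statistics only on a named list of columns that leaves the problem column out. The column list here is hypothetical:

-- Collect basic statistics on all columns, but distribution statistics
-- only on the listed columns, omitting ATCHASTPATH (column list is
-- illustrative, not a recommendation).
RUNSTATS ON TABLE wscomusr.atchast
    ON ALL COLUMNS
    WITH DISTRIBUTION ON COLUMNS (atchast_id)
    AND INDEXES ALL;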

Intriguing thought: SYSSTAT tables are updatable. I wonder if I could manually calculate values to replace problem entries like these, and then update SYSSTAT.COLDIST? If I did, would the optimizer be able to handle the longer values?
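
Purely as an experiment, such an update might look like the sketch below. All the values are made up, and note that character COLVALUE entries are stored with surrounding single quotes, hence the doubled quotes in the literal:

-- Hypothetical manual correction of one frequent-value entry; the value
-- and count here are illustrative, not real statistics.
UPDATE sysstat.coldist
SET colvalue = '''images/catalog/apparel/apparel_40x40/IMG_0034_q.jpg''',
    valcount = 438
WHERE tabschema = 'WSCOMUSR'
    AND tabname = 'ATCHAST'
    AND colname = 'ATCHASTPATH'
    AND type = 'F'
    AND seqno = 2;

If nothing else, it would be an interesting experiment on a sandbox database.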
