This technote caught my eye when @db2_support tweeted it the other day ago. It was titled as “DB2 might choose a sub-optimal query execution plan due to distribution statistics”. The title alone intrigued me. As I read through, I thought that I needed to do a bit of investigation to better understand the issue. It also seemed like something I could write a query or two to investigate and see if it was occurring in my databases.
Description of the Issue
This particular problem appears to be based on the fact that DB2’s runstats utility only looks at the first 32 characters of a longer text field when collecting distribution statistics. I always recommend distribution statistics because the more information the DB2 optimizer has, the more likely it is to generate a truly great access plan.
Distribution statistics means the quantiles and frequently occurring values are stored in SYSSTAT.COLDIST. By default, distribution statistics will store the 10 most frequently occurring values and 20 quantile values. Quantile values means that DB2 stores the value found 1/20th of the way through the table. If the table had just 20 rows, this means that every value would be stored. If the table had 40 values for this column, every other would be stored. With the default settings, 20 quantile values are stored for each and every column in each and every table, no matter what the size of the table.
The problem appears to be that the runstats utility considers only the first 32 characters of a string, no matter what the length of that string is. This presents a problem because DB2 relies on these values to estimate how many rows a query will return. If DB2 estimates incorrectly, it might choose the wrong access path to retrieve the data, possibly resulting in a query that performs worse than it might.
I have a fair number of mid-sized (50-1000 character) VARCHAR fields in the WebSphere Commerce databases that I support. I could also immediately think of two tables where it would be likely that I would have fields where the first 32 characters were the same, but the string was different later on.
Looking to see if the Conditions for the Issue Exist
The first thing I did was to develop a query to help me find tables/columns where the potential for this problem existed. With thousands of tables each with many columns, it’s just too much for me to know immediately where I might face this issue. So I developed this query to pull potential areas out of sysstat.coldist.
SELECT substr(tabname,1,20) AS tabname, substr(colname,1,20) AS colname, length(colvalue) AS col_length, valcount, substr(colvalue,1,40) AS colvalue, seqno FROM sysstat.coldist WHERE tabschema='WSCOMUSR' AND tabname NOT LIKE 'TI_%' AND colvalue is not null AND length(colvalue) > 30 AND type ='F' AND valcount>100 ORDER BY 3 DESC, tabname WITH UR;
What this query does is look only at the frequently occurring values, and look for ones with higher numbers of occurrences, where the length of the value is over 30 characters. I found it interesting that there were no values in my database longer than 37 characters, which confirms for me that there is some internal limit. Oddly enough, the COLVALUE column in SYSSTAT.COLDIST is defined as VARCHAR(254). When I ran this query, I came up with output like the following:
TABNAME COLNAME COL_LENGTH VALCOUNT COLVALUE SEQNO -------------------- -------------------- ----------- -------------------- ---------------------------------------- ------ ATCHAST ATCHASTPATH 35 438 'images/catalog/apparel/apparel_55' 2 ATCHAST ATCHASTPATH 35 402 'images/catalog/apparel/apparel_40' 3 ATCHAST ATCHASTPATH 35 355 'images/catalog/kitchenware/kitche' 4 ... SRCHSTAT SUGGESTION 32 19001 'fallen, face, fast, fade, else' 4 CTXDATA SERVALUE 31 178211 'null&null&null&null&null&null' 4 PX_ELEMENT TYPE 31 8486 'IncludeCatalogEntryIdentifier' 1 PX_ELEMENT TYPE 31 1909 'ExcludeCatalogEntryIdentifier' 3 310 record(s) selected.
310 or fewer tables is something I can deal with looking at, at least.
From this point, it requires a bit more knowledge of your application or your data model to dig down into what might actually be a problem area. Just because a value is more than 30 characters does not mean it is a problem. But if you look at what the field is used for and put that in context of the table, you may be able to find areas where the value changes in the later characters.
In the case of one of the tables here, ATCHAST, the table is used to store information about ATtaCHment ASseTs. The data stored includes the paths to files. It is assumed that many of the files are stored in similar locations, and therefore their paths may start similarly, but change at the end. This is exactly the kind of column I need to be looking into.
Once you have used a query like the one above, and have applied your own knowledge about the data model to determine which tables/columns might be a problem, you’ll want to examine if the problem actually exists
In the case of the ATCHAST table above, I would run something like this:
SELECT substr(atchastpath,1,70) AS atchastpath FROM atchast WHERE atchastpath LIKE 'images/catalog/apparel/apparel_40%' WITH UR ATCHASTPATH ---------------------------------------------------------------------- images/catalog/apparel/apparel_40x40/IMG_0034_q.jpg images/catalog/apparel/apparel_40x40/IMG_0055_q.jpg images/catalog/apparel/apparel_40x40/IMG_0077_q.jpg images/catalog/apparel/apparel_40x40/IMG_0062_q.jpg images/catalog/apparel/apparel_40x40/IMG_0054_q.jpg images/catalog/apparel/apparel_40x40/IMG_0058_q.jpg images/catalog/apparel/apparel_40x40/IMG_0020_q.jpg ` ...
I can see in this case, that indeed the values are the same until the last few characters. This tells me clearly that this problem has a real potential for happening in this case.
Now the next piece is to consider if this column is used in a way that matters. Would I ever have a where clause that would look for other information in the table by specifying the value in this column? Or a query that would look for values that were greater than or less than this value? There may be some cases where the answer is no. In 10.1 and higher, it may be useful to create a usage list to capture the SQL on the table in question.
When reviewing SQL in this context, remember that SQL with parameter markers (
?) cannot ever make use of distribution statistics, so remember to discount any SQL you see along those lines.
Fix for the Issue
Here’s an interesting part. This is a technote and not an APAR. To me, this means that they have no intention of fixing it. C’mon, IBM, can runstats be improved to at least handle 254 characters for distribution statistics?
The stated fix is not to use distribution statistics at all, for the specific columns. That would probably lower the scope of the problem when it is causing access plans to be significantly off, but all that does is make it so DB2 assumes a normal distribution – which is just another way of estimating incorrectly, just hopefully not as far off.
Intriguing thought – SYSSTAT tables are update-able. I wonder if I could manually calculate the values to replace problem values like these, and then update SYSSTAT.COLDIST? If I did, would the optimizer be able to handle the longer values?