On looking over my previous posts, I found no fewer than 25 of them that mentioned runstats. But I wanted to cover runstats in a different level of detail. With this post, I aim to answer 2 questions – What is runstats? and Why should you run runstats?
Runstats is a utility, which is why we say "run runstats". You may also hear "collect runstats" or "collect statistics". Other RDBMSs have a similar concept, though they may call it something slightly different. Runstats collects statistical information about the data in tables and indexes. You can view (and even update) this information through the views in the SYSSTAT schema; the SYSCAT catalog views expose much of the same data as well.
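To see what runstats has stored, you can query the SYSSTAT views directly. A minimal sketch from the command line, assuming a database MYDB and a table MYSCHEMA.MYTABLE (both placeholders):

```shell
db2 connect to mydb

# Table-level statistics; CARD = -1 means runstats has never been run
db2 "select card, npages, fpages, overflow
     from sysstat.tables
     where tabschema = 'MYSCHEMA' and tabname = 'MYTABLE'"

# Column-level statistics for the same table
db2 "select colname, colcard, high2key, low2key, numnulls
     from sysstat.columns
     where tabschema = 'MYSCHEMA' and tabname = 'MYTABLE'"
```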
There are a ton of options on the runstats command. Let’s look at my favorite syntax and what each part means.
db2 runstats on table schema.table with distribution and detailed indexes all
with distribution
This tells DB2 to collect distribution statistics. Distribution statistics include two things:
- Frequent value statistics – DB2 records the most frequent values in each column (by default, the 10 most frequent). Using the above syntax, this is collected for every column. You can change the number of frequent values collected using the NUM_FREQVALUES database configuration parameter, or the NUM_FREQVALUES clause on the runstats command.
- Quantile statistics – DB2 divides the values into NUM_QUANTILES (default: 20) sections to describe the distribution of the data. With the default of 20, the maximum estimation error is 1/(2×20), meaning the optimizer should be able to estimate the number of rows that would meet any one-sided range predicate within 2.5% of the actual value.
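The collected distribution statistics land in SYSCAT.COLDIST, where TYPE = 'F' rows are the frequent values and TYPE = 'Q' rows are the quantiles. A sketch for inspecting one column, and (if I recall the clause placement correctly) for overriding the defaults on a skewed table; all names and numbers here are placeholders:

```shell
# Frequent-value and quantile entries for one column
db2 "select type, seqno, colvalue, valcount
     from syscat.coldist
     where tabschema = 'MYSCHEMA' and tabname = 'MYTABLE'
       and colname = 'MYCOLUMN'
     order by type, seqno"

# Raising the defaults for the whole table
db2 "runstats on table myschema.mytable with distribution
     default num_freqvalues 20 num_quantiles 40"
```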
and detailed indexes all
Collecting index statistics helps DB2 decide which indexes, if any, to use to satisfy a particular query. Collecting detailed index statistics allows DB2 to more accurately estimate the page fetches that will be required, which in turn lets DB2 properly estimate the cost of accessing a table through an index. DB2 uses this data along with buffer pool information to estimate how many of those page fetches are likely to require physical I/O.
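You can check whether detailed index statistics have been gathered by looking at SYSSTAT.INDEXES. My understanding is that CLUSTERFACTOR and PAGE_FETCH_PAIRS are only populated by detailed statistics (CLUSTERFACTOR stays -1 otherwise, and the coarser CLUSTERRATIO is used instead). A sketch, with placeholder names:

```shell
# CLUSTERFACTOR = -1 and an empty PAGE_FETCH_PAIRS suggest detailed
# index statistics have not been collected for that index
db2 "select indname, clusterratio, clusterfactor, page_fetch_pairs
     from sysstat.indexes
     where tabschema = 'MYSCHEMA' and tabname = 'MYTABLE'"
```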
I recommend always collecting detailed index statistics for e-commerce databases.
DB2 offers a facility for running runstats "as needed": it watches for a certain percentage of data change and collects statistics when a table crosses that threshold. But you cannot easily tell it to use the syntax you prefer, which is one reason I don't like to use it. The other is that I'm a control freak: I want to know exactly when runstats is running (since it can cause mild performance degradation), and I want to know that runstats for all tables was done at a specific time.
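That facility is controlled through the database configuration. A sketch of checking and disabling it, assuming a database named MYDB (a placeholder):

```shell
# The automatic table maintenance hierarchy:
#   AUTO_MAINT > AUTO_TBL_MAINT > AUTO_RUNSTATS
db2 get db cfg for mydb | grep -i -E "auto_maint|auto_tbl_maint|auto_runstats"

# Turn automatic statistics collection off (or use ON to enable it)
db2 update db cfg for mydb using AUTO_RUNSTATS OFF
```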
Why do we gather all of this information, especially when it can extend the time it takes to run runstats? DB2's optimizer is very powerful, but its ability to choose the best access path depends heavily on having current statistics. If you call DB2 support about a query performance issue, one of the very first questions they will ask is when the last runstats was done and whether it covered all tables. The same goes for contacting WebSphere Commerce support about database performance issues, if you're using WebSphere Commerce. I have seen old or missing runstats literally cause a site-down issue that was immediately resolved once runstats and a rebind (db2rbind) were complete. Give the optimizer the best chance of returning excellent query performance, and do runstats at least weekly for all non-volatile tables.
Info center entry on distribution statistics: http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0005093.html
Info center entry on detailed index statistics: http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0005095.html
I wanted to say that if you want to control how statistics are collected, you can probably use a profile, and then automatic maintenance will respect that profile. This is a good reason to activate automatic runstats and still keep control, by configuring statistics profiles.
“Automatic statistics collection respects the profile a user has specified by using the registered profile option in the SYSCAT.TABLES catalog table. If no profile is specified, automatic statistics collection uses the following options when collecting statistics:”
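Registering a profile is done with the SET PROFILE clause on the runstats command itself. A sketch using my preferred syntax (table and schema names are placeholders):

```shell
# Collect statistics now AND record this syntax as the table's profile
db2 "runstats on table myschema.mytable with distribution
     and detailed indexes all set profile"

# Or record the profile without collecting statistics right now
db2 "runstats on table myschema.mytable with distribution
     and detailed indexes all set profile only"

# The registered profile is visible in the catalog
db2 "select statistics_profile from syscat.tables
     where tabschema = 'MYSCHEMA' and tabname = 'MYTABLE'"
```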
But I do have to set a profile for each individual table, right? I can see that working, but as a control freak, I also want to know exactly when runstats happened across the board. When someone asks, I want to be able to say “Saturday Morning” or “Last Night” without even thinking about it.
Do I need to execute runstats for each table? Can't we run it by schema?
Not even in version 10 is that option available. A direct quote from the 9.7 Info Center page on Runstats: “The fully qualified name or alias in the form: schema.object-name must be used.” From: http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.cmd.doc/doc/r0001980.html
The 10.1 info center changes that “must” to “can”, but still offers no choices as to what else to specify.
I use one of my basic scripting tricks if I need to do a single schema: https://datageek.blog/2012/02/06/basic-scripting-tricks-for-db2-lu/
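The basic shape of that trick, for anyone who doesn't want to click through: list the tables with one db2 command and feed them to another. A sketch, assuming a database MYDB and schema MYSCHEMA (both placeholders):

```shell
#!/bin/ksh
# Run my preferred runstats syntax against every regular table in one schema
db2 connect to mydb

db2 -x "select rtrim(tabschema) || '.' || rtrim(tabname)
        from syscat.tables
        where type = 'T' and tabschema = 'MYSCHEMA'" | while read TABLE; do
    db2 "runstats on table ${TABLE} with distribution and detailed indexes all"
done

db2 connect reset
```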
What is the recommendation for running runstats? Say my data changes by x%, then I should do runstats. What is the recommended x%: 10%, 20%, 30%, etc.?
I don’t have that number. I don’t like that method of running runstats, so I don’t know the thresholds associated with it.
Is there any tool or SQL statement to know when to run runstats, as we have with REORGCHK?
Usually, if you're doing runstats manually, you have a fixed interval at which you do runstats for all tables, or perhaps you categorize tables based on how often they need runstats given their data change rates (daily, weekly, monthly, etc.).
The driving factor is often data change rates, but I don’t have a specific query to track that. DB2’s automated runstats facilities do make decisions based on data change rates. But I don’t use them because I’m a control freak and I want to know when the last runstats was. Also because the data I’m querying in an e-commerce database tends to be the most recently added data.
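For the fixed-interval approach, the STATS_TIME column in SYSCAT.TABLES makes it easy to verify nothing is slipping through. A sketch that flags tables with no statistics or statistics more than 7 days old (the 7-day threshold is just an example):

```shell
db2 "select tabschema, tabname, stats_time
     from syscat.tables
     where type = 'T'
       and tabschema not like 'SYS%'
       and (stats_time is null
            or stats_time < current timestamp - 7 days)
     order by stats_time"
```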
Personally I do runstats on all tables either daily or weekly depending on whether I’m in a build or post-go-live phase where data changes quickly or if I’m just in normal operation. This works well in e-commerce databases. I can see a point being made for different frequencies in different types of databases.
A few tables use range partitioning and have 1 to 5 billion rows, so there are approximately 450 million rows per partition (with one nonpartitioned index and 4 partitioned indexes). Can the optimizer use the statistics of the partition, if available, rather than the table statistics?
I don’t know.
If we use the runstats command similar to Ember’s (db2 runstats on table schema.table with distribution and detailed indexes all), do I need to run runstats on the individual indexes of that table (db2 runstats on table schema.table for index …..) ?
First, you cannot do runstats on an individual index unless you are using range partitioning. Second, no, that syntax does all indexes in addition to the table.
Thanks Ember. How do you determine which columns currently have statistics collected on them? We have "HELP STATS tablename" in Teradata; is there anything similar in DB2? I'm accessing DB2 from TSO.
In DB2, we generally go with all columns. I don't have the SQL off the top of my head, but I think you'd be able to see whether only specific columns have distribution stats by querying SYSCAT.COLDIST. A powerful strategy can also be to gather statistics on column groups: columns that are frequently queried or joined together and have some correlation. See https://www.ibm.com/developerworks/data/library/techarticle/dm-0612kapoor/ for information on how to see if that has been done.
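To make that concrete: distribution statistics are stored one row per value per column in SYSCAT.COLDIST, so counting entries per column shows which columns actually have them. A sketch, with placeholder names:

```shell
db2 "select colname, count(*) as dist_entries
     from syscat.coldist
     where tabschema = 'MYSCHEMA' and tabname = 'MYTABLE'
     group by colname
     order by colname"
```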
Hi Ember – can I get your opinion? Would you do runstats on tables that never have data inserted into or updated in them? Is there any value in performing runstats on tables that are only ever read from?
The only value in NOT doing runstats on them is that you save the time and CPU cycles required. If you can reliably define such tables and have a regular (annual?) check to verify they’re still being as static as you think they are, I’m just fine with skipping runstats on them.
There are three potential problems with not doing runstats on them. First, usage patterns may change and you may not be aware of the change. Second, you can't use stock runstats scripts unless they have an option for an exception list; I have written these scripts with an exception list in the past, but I've also written them without that option. Third, if you (or a consultant or vendor) run a query to evaluate runstats within the database, you or they might mistakenly think something is being missed. Those are mostly minor annoyances. I have excluded tables in the right set of circumstances in the past, but I don't go looking for opportunities to exclude tables.