Some of the most complicated work a DBA does involves analyzing a single query. Whether the analysis is proactive or in response to a performance problem, many factors go into query performance. Even when a query has a clear performance problem, there is only occasionally a single, obvious cause for all of the problems.
Defining Poor Performance and Success
The most important step when starting to work on a single query is to define what the problem is and what success looks like. This depends on why you’re doing the analysis in the first place and what the requirements are. Some problem/success scenarios are very well defined, such as “It currently takes this query 40 seconds to execute, and it needs to execute in less than 30 seconds”. Others may be more technically rooted, such as “This query returns only 5 rows each time it executes, but it reads over a million – I want it to scan fewer rows”.
Some goals may not be possible, particularly if they are coming from a business requirement. If I want a query to run in 10 seconds that currently runs in five minutes, that may or may not be possible. Depending on the query, it may not be possible to improve performance at all.
The complexity of queries varies vastly. I’ve tuned queries that were two lines and queries that were 10 pages. Even a query that looks simple can be far more complex once the views, stored procedures, or other structures it references are taken into account.
How Long Should Query Analysis Take?
Query analysis can be extremely time consuming. I start with a minimum estimate of 8 hours. Even if a query is short and the recommendations of the index advisor make sense and work well, it still takes time. Complicating factors include:
- Not having a production-sized non-production environment to work in, or not being able to make changes in production
- Training someone else or explaining what you’re looking at while you’re working
- Not having direct access to the production environment where performance is a problem
- Db2 running in a virtual environment – this may add an additional layer to analyze
- Not having access to hardware, virtualization, and OS experts for complicated system issues
- Complex queries – those which do not fit on a screen when views are expanded
- Heavy use of stored procedures and functions
Factors Contributing to Poor Query Performance
When looking into query performance, I start with a long list of questions – some for actual people and some that I’m looking into on the system. These questions include:
- Why are we looking at this specific query? Common reasons include:
  - This query used to perform well, but performance is worse now.
  - This query is going into a future production release and needs to be reviewed first.
  - This query is going into a future production release, but performance in non-production is not acceptable.
  - This query has been identified as a poor performer either via application or database analysis – though performance has not necessarily changed recently.
- Are there concerns with overall database or end-to-end performance, or is it just this one query?
- What are the system details of the environment where this query needs to perform? This often includes:
  - What OS is being used? Has the OS been recently patched or upgraded?
  - What does the storage/CPU/memory configuration look like? Has anything changed at this level?
  - What version/fix pack of Db2 is in use? Is it old? Has it changed recently?
  - What is the runstats strategy? Are runstats current on all the tables involved and generally across the database? (A quick catalog check for this is sketched after this list.)
  - What is the reorg strategy?
  - What does data growth look like? Have there been any recent events that have led to faster data growth than usual?
- Are there any red flags in Db2 key performance indicators?
- Does this query result in any errors or warnings? Are there SQL error codes in the application logs?
- Did the performance for this query change recently?
- Does this query use parameter markers, or static values?
- Does this query perform consistently – both across different timing of execution, and across different values if parameter markers are used?
- If parameter markers are used, and performance varies based on values used, what are the problematic values? Are there any values that perform well? Is there a range of sample values I should be working with?
- How often is this query executed? I have different thresholds and concerns for a query executed 10 times a second than I do for one that is executed once per day.
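When I want a quick check on how current statistics are, a catalog query like the following works. This is just a sketch – MYSCHEMA is a placeholder for the schema(s) the query actually touches, and a CARD of -1 or an old STATS_TIME is a red flag:

-- Sketch: check when statistics were last collected on the tables involved.
-- MYSCHEMA is a placeholder; substitute the schema(s) the query actually uses.
SELECT SUBSTR(TABSCHEMA,1,20) AS TABSCHEMA
    , SUBSTR(TABNAME,1,40) AS TABNAME
    , CARD
    , STATS_TIME
FROM SYSCAT.TABLES
WHERE TABSCHEMA = 'MYSCHEMA'
  AND TYPE = 'T'
ORDER BY STATS_TIME;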
Analysis
Some of those questions I ask immediately, up front, and some come up as I work through the analysis. Once I have a good starting point, I prefer to be able to test and analyze the query in production. This is not always possible. If not, then working in a fully-sized non-production environment works. Working in a smaller, even proportionally sized, non-production environment is less than optimal.
There are a number of things I may work through in the actual analysis. After I’ve gone through the basics from the section above – verifying runstats and looking at overall KPIs for the database – I will do an explain (using db2exfmt) and db2advis first; these are the starting point, and a command sketch follows the list below. Depending on what I see there, I will likely move on to various combinations of the following:
- Analyze indexes recommended by db2advis to see if they make sense in the context of database structure
- Look at the explain plan to see if there are any particularly expensive areas that I may be able to address or that point to specific indexes from the Design Advisor
- Use db2caem or other methods to capture section actuals and see if Db2’s estimates are off, sometimes across multiple sets of potential values
- Dig deeper into the explain plan on problem areas to see if there is anything I can do to reduce them
- Look closely at the SQL to see if there are any efficiencies to be gained or major mistakes being made there – this can be difficult and time consuming, and often involves running and analyzing other versions of the SQL. Some of the things I look for here include:
  - What do the joins look like? Do they all have join conditions so we’re not seeing a Cartesian Product?
  - Are a large number of columns being returned? Do the developers really need all of those columns?
  - In the where clause, are comparisons keeping functions on the “right” side of the comparisons, or are functions being applied to column data? If so, can I flip any of them?
  - Are any subqueries using the “fetch first 1 row only” cop-out?
  - Are the same aggregate functions being applied over and over to the same data? Can I break those out into MQTs?
- Consider if any advanced statistics methods may help, including column group statistics or statistical views
- Consider if any MQTs may help
- Consider if clustering, either via a clustering index or MDC, might make a significant impact
- Discuss with developers whether the output from this query could be cached and reused at the application level
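For reference, that explain and Design Advisor starting point usually looks something like the following. This is only a sketch, assuming the statement is saved in query.sql, the database is named SAMPLE, and the explain tables already exist; adjust file names, schemas, and options for your environment.

# Design Advisor: index (and other) recommendations for the statement in query.sql
db2advis -d SAMPLE -i query.sql -t 5

# Explain without executing, then format the access plan with db2exfmt
db2 connect to SAMPLE
db2 set current explain mode explain
db2 -tvf query.sql
db2 set current explain mode no
db2exfmt -d SAMPLE -1 -o query_explain.txt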
One tool that I am integrating into my analysis is a query to MON_GET_PKG_CACHE_STMT to understand what types of wait time and processing time are bogging this query down. I’m still playing with these queries and understanding everything available, but I use something like the following three queries to look at various metrics on the query:
--overview
SELECT SUBSTR(STMT_TEXT,1,10) as STATEMENT
    , ROWS_READ
    , ROWS_READ/ROWS_RETURNED as STMT_IXREF
    , ROWS_RETURNED/NUM_EXECUTIONS as ROWS_RETURNED_PER_EXEC
    , TOTAL_ACT_TIME as TOTAL_ACT_TIME_MS
    , decimal(float(TOTAL_ACT_WAIT_TIME)/float(TOTAL_ACT_TIME),5,2)*100 as PCT_WAIT
    , decimal(1 - ((float(pool_data_p_reads) + float(pool_xda_p_reads)
            + float(pool_index_p_reads) + float(pool_temp_data_p_reads)
            + float(pool_temp_xda_p_reads) + float(pool_temp_index_p_reads))
        / (float(pool_data_l_reads) + float(pool_xda_l_reads)
            + float(pool_index_l_reads) + float(pool_temp_data_l_reads)
            + float(pool_temp_xda_l_reads) + float(pool_temp_index_l_reads))),5,2) as STMT_BPHR
    , TOTAL_SORTS/NUM_EXECUTIONS as SORTS_PER_EXEC
    , decimal(float(SORT_OVERFLOWS)/float(TOTAL_SORTS),5,2) * 100 as SORT_OVERFLOW_PCT
    , POST_THRESHOLD_SORTS+POST_SHRTHRESHOLD_SORTS as POST_THRESHOLD_SORTS
    , NUM_EXECUTIONS
    , DECIMAL(FLOAT(STMT_EXEC_TIME)/FLOAT(NUM_EXECUTIONS),10,2) AS AVG_EXEC_TIME
    , DEADLOCKS
    , LOCK_TIMEOUTS
    , INSERT_TIMESTAMP
FROM TABLE(MON_GET_PKG_CACHE_STMT('D', x'01000000000000001E1906000000000000000000020020170903070924368243', NULL, -2)) AS T;

--wait time
SELECT decimal(float(TOTAL_ACT_WAIT_TIME)/float(TOTAL_ACT_TIME),9,4) as PCT_WAIT
    , TOTAL_ACT_WAIT_TIME
    , LOCK_WAIT_TIME
    , LOG_BUFFER_WAIT_TIME
    , LOG_DISK_WAIT_TIME
FROM TABLE(MON_GET_PKG_CACHE_STMT('D', x'01000000000000001E1906000000000000000000020020170903070924368243', NULL, -2)) AS T;

--processing time
SELECT TOTAL_ACT_TIME - TOTAL_ACT_WAIT_TIME as TOTAL_ACT_EXECUTING
    , PREP_TIME
    , TOTAL_CPU_TIME
    , POOL_READ_TIME
    , POOL_WRITE_TIME
    , DIRECT_READ_TIME
    , DIRECT_WRITE_TIME
    , TOTAL_SECTION_SORT_TIME
    , TOTAL_SECTION_SORT_PROC_TIME
    , WLM_QUEUE_TIME_TOTAL
    , TOTAL_ROUTINE_TIME
FROM TABLE(MON_GET_PKG_CACHE_STMT('D', x'01000000000000001E1906000000000000000000020020170903070924368243', NULL, -2)) AS T;
Note that these queries require as input the EXECUTABLE_ID of the query. If the query is one that is already being run on the system by applications, it is usually not hard to find this by querying MON_GET_PKG_CACHE_STMT with a LIKE condition on STMT_TEXT. If it is not already being executed, you’ll have to execute it at least once to get the data in the package cache.
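Something like the following works for that lookup – a sketch, with the LIKE pattern as a placeholder for the start of the statement in question:

-- Sketch: find the EXECUTABLE_ID of a statement already in the package cache.
-- The LIKE pattern is a placeholder; match it to the start of the statement.
SELECT EXECUTABLE_ID
    , NUM_EXECUTIONS
    , SUBSTR(STMT_TEXT,1,60) AS STMT_START
FROM TABLE(MON_GET_PKG_CACHE_STMT('D', NULL, NULL, -2)) AS T
WHERE STMT_TEXT LIKE 'SELECT ORDER_NUMBER%'
ORDER BY NUM_EXECUTIONS DESC;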
The three metric queries above can be very helpful in targeting the slow portion of the query. Calculating the buffer pool hit ratio on a query-by-query basis may not be something we do very often, but it can be very telling when that ratio, the sort time, sort overflows, or another metric stands out in these results.
Here is one example of the output from these queries:
STATEMENT   ROWS_READ  STMT_IXREF  ROWS_RETURNED_PER_EXEC  TOTAL_ACT_TIME_MS  PCT_WAIT  STMT_BPHR  SORTS_PER_EXEC  SORT_OVERFLOW_PCT  POST_THRESHOLD_SORTS  NUM_EXECUTIONS  AVG_EXEC_TIME  DEADLOCKS  LOCK_TIMEOUTS  INSERT_TIMESTAMP
----------  ---------  ----------  ----------------------  -----------------  --------  ---------  --------------  -----------------  --------------------  --------------  -------------  ---------  -------------  --------------------------
SELECT COL   15521066           1                       1            1308700      0.00       1.00               1               0.00                     0        15521066           0.08          0              0  2017-09-04-02.32.58.032500

  1 record(s) selected.


PCT_WAIT  TOTAL_ACT_WAIT_TIME  LOCK_WAIT_TIME  LOG_BUFFER_WAIT_TIME  LOG_DISK_WAIT_TIME
--------  -------------------  --------------  --------------------  ------------------
  0.0001                  143               0                     0                   0

  1 record(s) selected.


TOTAL_ACT_EXECUTING  PREP_TIME  TOTAL_CPU_TIME  POOL_READ_TIME  POOL_WRITE_TIME  DIRECT_READ_TIME  DIRECT_WRITE_TIME  TOTAL_SECTION_SORT_TIME  TOTAL_SECTION_SORT_PROC_TIME  WLM_QUEUE_TIME_TOTAL  TOTAL_ROUTINE_TIME
-------------------  ---------  --------------  --------------  ---------------  ----------------  -----------------  -----------------------  ----------------------------  --------------------  ------------------
            1308557         10      1303285656               0                0                 0                  0                   382732                        382606                     0                   0

  1 record(s) selected.
The output above shows excellent metrics overall, and tells us that a significant amount of the execution time is taken up by the one sort the query is doing. This information might help me focus on eliminating the sort, if possible.
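If, for example, the sort came from an ORDER BY on a single column, an index that returns rows in that order might let the optimizer skip the sort entirely. The table and column below are purely hypothetical:

-- Hypothetical example only: an index supporting an ORDER BY on ORDER_DATE so the
-- optimizer can avoid a sort; MYSCHEMA.ORDERS and ORDER_DATE are placeholders.
CREATE INDEX MYSCHEMA.IX_ORDERS_ORDER_DATE
    ON MYSCHEMA.ORDERS (ORDER_DATE)
    ALLOW REVERSE SCANS
    COLLECT DETAILED STATISTICS;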
Another example:
STATEMENT    ROWS_READ  STMT_IXREF  ROWS_RETURNED_PER_EXEC  TOTAL_ACT_TIME_MS  PCT_WAIT  STMT_BPHR  SORTS_PER_EXEC  SORT_OVERFLOW_PCT  POST_THRESHOLD_SORTS  NUM_EXECUTIONS  AVG_EXEC_TIME  DEADLOCKS  LOCK_TIMEOUTS  INSERT_TIMESTAMP
----------  ----------  ----------  ----------------------  -----------------  --------  ---------  --------------  -----------------  --------------------  --------------  -------------  ---------  -------------  --------------------------
SELECT ORD  3755212322     2151984                       1             720416      0.00       1.00               1               0.00                     0            1459         493.77          0              0  2017-09-07-17.21.24.365010

  1 record(s) selected.


SELECT decimal(float(TOTAL_ACT_WAIT_TIME)/float(TOTAL_ACT_TIME),9,4) as PCT_WAIT
    , TOTAL_ACT_WAIT_TIME
    , LOCK_WAIT_TIME
    , LOG_BUFFER_WAIT_TIME
    , LOG_DISK_WAIT_TIME
FROM TABLE(MON_GET_PKG_CACHE_STMT('D', x'0100000000000000BD1D07000000000000000000020020170907172124365010', NULL, -2)) AS T

PCT_WAIT  TOTAL_ACT_WAIT_TIME  LOCK_WAIT_TIME  LOG_BUFFER_WAIT_TIME  LOG_DISK_WAIT_TIME
--------  -------------------  --------------  --------------------  ------------------
  0.0000                    4               0                     0                   0

  1 record(s) selected.


SELECT TOTAL_ACT_TIME - TOTAL_ACT_WAIT_TIME as TOTAL_ACT_EXECUTING
    , PREP_TIME
    , TOTAL_CPU_TIME
    , POOL_READ_TIME
    , POOL_WRITE_TIME
    , DIRECT_READ_TIME
    , DIRECT_WRITE_TIME
    , TOTAL_SECTION_SORT_TIME
    , TOTAL_SECTION_SORT_PROC_TIME
    , WLM_QUEUE_TIME_TOTAL
    , TOTAL_ROUTINE_TIME
FROM TABLE(MON_GET_PKG_CACHE_STMT('D', x'0100000000000000BD1D07000000000000000000020020170907172124365010', NULL, -2)) AS T

TOTAL_ACT_EXECUTING  PREP_TIME  TOTAL_CPU_TIME  POOL_READ_TIME  POOL_WRITE_TIME  DIRECT_READ_TIME  DIRECT_WRITE_TIME  TOTAL_SECTION_SORT_TIME  TOTAL_SECTION_SORT_PROC_TIME  WLM_QUEUE_TIME_TOTAL  TOTAL_ROUTINE_TIME
-------------------  ---------  --------------  --------------  ---------------  ----------------  -----------------  -----------------------  ----------------------------  --------------------  ------------------
             720412          2       719811899               0                0                 0                  0                       70                            70                     0                   0

  1 record(s) selected.
This output, on the other hand, tells me that this particular query is scanning more than two million rows to return a single row. Though all of those are either in or being prefetched into the bufferpool, one or more indexes may be able to help this query significantly.
It Ran Fast Yesterday
While the performance of some queries degrades slowly over time, other queries just fall off a cliff. Even with gradually increasing data volume, there may be a point at which a query that did not overflow the sortheap a minute ago suddenly does – or the same scenario with other memory areas. Maybe a hash join worked with less data, but today there is too much and a nested loop join is used instead. In any case, reactive query performance analysis may end up addressing factors that did not actually change at the moment performance changed.
If a DBA is constantly looking for problem queries and tuning the worst offenders, and has the time to look at key performance indicators on a regular basis, this “fall off a cliff” scenario is a bit less likely to happen, but it is still possible. Queries with a good read efficiency value are more likely to have stable performance than those without, because there are fewer factors likely to lead to a sudden performance change.
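A package cache query like the one below is one way I keep an eye out for those worst offenders by read efficiency. This is a sketch – the ROWS_READ threshold and the 20-row limit are arbitrary examples:

-- Sketch: rank dynamic statements by rows read per row returned (read efficiency).
-- The ROWS_READ threshold and FETCH FIRST limit are arbitrary examples.
SELECT SUBSTR(STMT_TEXT,1,40) AS STMT_START
    , ROWS_READ
    , ROWS_RETURNED
    , CASE WHEN ROWS_RETURNED > 0
           THEN ROWS_READ / ROWS_RETURNED
           ELSE ROWS_READ END AS READ_EFFICIENCY
    , NUM_EXECUTIONS
FROM TABLE(MON_GET_PKG_CACHE_STMT('D', NULL, NULL, -2)) AS T
WHERE ROWS_READ > 100000
ORDER BY READ_EFFICIENCY DESC
FETCH FIRST 20 ROWS ONLY;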
Resolution
I most enjoy working in environments where there is no change control preventing me from trying different solutions to discover what works. Throughout this process, I find it critical to use db2batch to test actual query performance, so that I have exact numbers without the cloudiness added by network and application overhead. Rarely do I get to work in environments that allow me that much leeway, though, so being able to work within the bounds of change control is often essential.
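A db2batch run for that kind of timing can be as simple as the following. This is a sketch – SAMPLE, query.sql, and the output file name are placeholders, and the options shown are just one reasonable combination:

# Sketch: time a statement with db2batch; SAMPLE and query.sql are placeholders.
# '-o r 0' limits the rows printed so the timing output stays readable.
db2batch -d SAMPLE -f query.sql -i complete -r query_timings.out -o r 0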
What tips, tricks, or methodology do you have for analyzing query performance?
1. Get long running SQLs
2. Get top table scans
3. Get top row overflow access
4. Get top execution_id with respective SQL
5. Get db2advis recommendations and the explain plan if the first steps are not enough.
6. Confirm tables and indexes are compressed
7. Provide info to development team