Query Performance Analysis

Some of the most complicated work a DBA does is analyzing a query. Whether the analysis is proactive or in response to a performance problem, many factors go into query performance. Even when a query clearly has a performance problem, there is only occasionally a single, obvious cause for all of the problems.

Defining Poor Performance and Success

The most important step when starting to work on a single query is to define what the problem is and what success looks like. This depends on why you’re doing the analysis in the first place and what the requirements are. Some problem/success scenarios are very well defined, such as “It currently takes this query 40 seconds to execute, and it needs to execute in less than 30 seconds”. Others are more technically rooted, such as “This query returns only 5 rows each time it executes, but it reads over a million – I want it to scan fewer rows”.

Some goals may not be achievable, particularly if they come from a business requirement. If I want a query that currently runs in five minutes to run in 10 seconds, that may or may not be realistic. Depending on the query, it may not be possible to improve performance at all.

The complexity of queries varies vastly. I’ve tuned queries that were two lines and queries that were 10 pages. Even a query that looks simple is often more complex once referenced views, stored procedures, or other structures are taken into account.

How Long Should Query Analysis Take?

Query analysis can be extremely time-consuming. I start with a minimum estimate of 8 hours. Even if a query is short and the recommendations of the index advisor make sense and work well, it still takes time. Complicating factors include:

  • Not having a production-sized non-production environment to work in, or not being able to make changes in production
  • Training someone else or explaining what you’re looking at while you’re working
  • Not having direct access to the production environment where performance is a problem
  • Db2 running in a virtual environment – this may add an additional layer to analyze
  • Not having access to hardware, virtualization, and OS experts for complicated system issues
  • Complex queries – those which do not fit on a screen when views are expanded
  • Heavy use of stored procedures and functions

Factors Contributing to Poor Query Performance

When looking into query performance, I start with a long list of questions – some for actual people and some that I’m looking into on the system. These questions include:

  • Why are we looking at this specific query? Common reasons include:
    • This query used to perform well, but performance is worse now.
    • This query is going into a future production release and needs to be reviewed first.
    • This query is going into a future production release, but performance in non-production is not acceptable.
    • This query has been identified as a poor performer either via app or database analysis – though performance has not necessarily changed recently.
  • Are there concerns with overall database or end-to-end performance, or is it just this one query?
  • What are the System Details where this query needs to perform? This often includes:
    • What OS is being used? Has the OS been recently patched or upgraded?
    • What does the storage/CPU/memory configuration look like? Has anything changed at this level?
    • What version/fixpack of Db2 is in use? Is it old? Has it changed recently?
    • What is the runstats strategy? Are runstats current on all the tables involved and generally across the database? (See the sample catalog query after this list.)
    • What is the reorg strategy?
    • What does data growth look like? Have there been any events recently that have led to faster-than-usual data growth?
    • Are there any red flags in Db2 key performance indicators?
  • Does this query result in any errors or warnings? Are there SQL error codes in the application logs?
  • Did the performance for this query change recently?
  • Does this query use parameter markers, or static values?
  • Does this query perform consistently – both across different times of execution and across different values if parameter markers are used?
  • If parameter markers are used, and performance varies based on values used, what are the problematic values? Are there any values that perform well? Is there a range of sample values I should be working with?
  • How often is this query executed? I have different thresholds and concerns for a query executed 10 times a second than I do for one that is executed once per day.
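
As a sample catalog query for the runstats question above: STATS_TIME in SYSCAT.TABLES shows when runstats last ran on each table. This is a minimal sketch, and the schema and table names are placeholders for whatever tables the query touches:

--check runstats currency for the tables involved
--STATS_TIME is NULL (and CARD is -1) if runstats has never been run
SELECT
        SUBSTR(TABSCHEMA,1,20) as TABSCHEMA
        , SUBSTR(TABNAME,1,30) as TABNAME
        , STATS_TIME
        , CARD
    FROM SYSCAT.TABLES
    WHERE TABSCHEMA = 'MYSCHEMA'
        AND TABNAME IN ('ORDERS','ORDER_ITEMS')
    ORDER BY STATS_TIME
;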

Analysis

Some of those questions I ask immediately, up front, and some I ask as I work through the analysis. Once I have a good starting point, I prefer to be able to test and analyze the query in production. This is not always possible. If it is not, working in a fully-sized non-production environment is the next best thing. Working in a smaller, even proportionally sized, non-production environment is less than optimal.

There are a number of things I may work through in the actual analysis. After I’ve gone through the basics from the section above (verifying runstats and looking at overall KPIs for the database), I run an explain (using db2exfmt) and db2advis first. These are the starting point. Depending on what I see there, I will likely move on to various combinations of the below:

  • Analyze indexes recommended by db2advis to see if they make sense in the context of database structure
  • Look at the explain plan to see if there are any particularly expensive areas that I may be able to address or that point to specific indexes from the Design Advisor
  • Use db2caem or other methods to capture section actuals and see if Db2’s estimates are off, sometimes across multiple sets of potential values
  • Dig deeper into the explain plan on problem areas to see if there is anything I can do to reduce them
  • Look closely at the SQL to see if there are any efficiencies to be gained or major mistakes being made there – this can be difficult and time consuming, and often involves running and analyzing other versions of the SQL. Some of the things I look for here include:
    • What do the joins look like? Do they all have join conditions so we’re not seeing a Cartesian Product?
    • Are a large number of columns being returned? Do the developers really need all of those columns?
    • In the WHERE clause, are functions kept on the “right” side of comparisons, or are functions being applied to column data? If so, can I flip any of them? (A hypothetical before/after rewrite follows this list.)
    • Are any subqueries using the “fetch first 1 row only” cop-out?
    • Are the same aggregate functions being applied over and over to the same data? Can I break those out into MQTs?
  • Consider if any advanced statistics methods may help, including column group statistics or statistical views
  • Consider if any MQTs may help
  • Consider if clustering, either via a clustering index or MDC, might make a significant impact
  • Discuss with developers whether the output from this query could be cached and reused at the application level
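
As a hypothetical before/after for the function placement point above (the table and column names are made up, but the pattern is one of the most common I see):

--a function applied to column data generally blocks index use on that column
SELECT ORDER_ID
    FROM MYSCHEMA.ORDERS
    WHERE YEAR(ORDER_DATE) = 2017
;
--the same logic as a range predicate on the bare column can use an index on ORDER_DATE
SELECT ORDER_ID
    FROM MYSCHEMA.ORDERS
    WHERE ORDER_DATE BETWEEN '2017-01-01' AND '2017-12-31'
;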

One tool that I am integrating into my analysis is a query against MON_GET_PKG_CACHE_STMT to understand what types of wait time and processing time are bogging the query down. I’m still playing with these queries and understanding everything available, but I use something like the following three queries to look at various metrics for the query:

--overview
SELECT
        SUBSTR(STMT_TEXT,1,10) as STATEMENT
        , ROWS_READ
        -- read efficiency; NULLIF avoids a divide-by-zero error when ROWS_RETURNED is 0
        , ROWS_READ/NULLIF(ROWS_RETURNED,0) as STMT_IXREF
        , ROWS_RETURNED/NUM_EXECUTIONS as ROWS_RETURNED_PER_EXEC
        , TOTAL_ACT_TIME as TOTAL_ACT_TIME_MS
        , decimal(float(TOTAL_ACT_WAIT_TIME)/float(TOTAL_ACT_TIME),5,2)*100 as PCT_WAIT
        , decimal(1 - ((float(pool_data_p_reads) + float(pool_xda_p_reads) +
                float(pool_index_p_reads) + float(pool_temp_data_p_reads)
                + float(pool_temp_xda_p_reads) + float(pool_temp_index_p_reads) )
                / (float(pool_data_l_reads) + float(pool_xda_l_reads) + float(pool_index_l_reads) +
                float(pool_temp_data_l_reads) + float(pool_temp_xda_l_reads)
                + float(pool_temp_index_l_reads) )) ,5,2) as stmt_bphr
        , TOTAL_SORTS/NUM_EXECUTIONS as SORTS_PER_EXEC
        -- NULLIF avoids a divide-by-zero error for statements with no sorts
        , decimal(float(SORT_OVERFLOWS)/float(NULLIF(TOTAL_SORTS,0)),5,2) * 100 as SORT_OVERFLOW_PCT
        , POST_THRESHOLD_SORTS+POST_SHRTHRESHOLD_SORTS as POST_THRESHOLD_SORTS
        , NUM_EXECUTIONS
        , DECIMAL(FLOAT(STMT_EXEC_TIME)/FLOAT(NUM_EXECUTIONS),10,2) AS AVG_EXEC_TIME
        , DEADLOCKS
        , LOCK_TIMEOUTS
        , INSERT_TIMESTAMP
    FROM TABLE(MON_GET_PKG_CACHE_STMT ( 'D', x'01000000000000001E1906000000000000000000020020170903070924368243', NULL, -2)) AS T
;
--wait time
SELECT
        decimal(float(TOTAL_ACT_WAIT_TIME)/float(TOTAL_ACT_TIME),9,4) as PCT_WAIT
        , TOTAL_ACT_WAIT_TIME
        , LOCK_WAIT_TIME
        , LOG_BUFFER_WAIT_TIME
        , LOG_DISK_WAIT_TIME
    FROM TABLE(MON_GET_PKG_CACHE_STMT ( 'D', x'01000000000000001E1906000000000000000000020020170903070924368243', NULL, -2)) AS T
;
--processing time
SELECT
        TOTAL_ACT_TIME - TOTAL_ACT_WAIT_TIME as TOTAL_ACT_EXECUTING
        , PREP_TIME
        , TOTAL_CPU_TIME
        , POOL_READ_TIME
        , POOL_WRITE_TIME
        , DIRECT_READ_TIME
        , DIRECT_WRITE_TIME
        , TOTAL_SECTION_SORT_TIME
        , TOTAL_SECTION_SORT_PROC_TIME
        , WLM_QUEUE_TIME_TOTAL
        , TOTAL_ROUTINE_TIME
    FROM TABLE(MON_GET_PKG_CACHE_STMT ( 'D', x'01000000000000001E1906000000000000000000020020170903070924368243', NULL, -2)) AS T
;

Note that these queries require as input the EXECUTABLE_ID of the query. If the query is one that is already being run on the system by applications, it is usually not hard to find this by querying MON_GET_PKG_CACHE_STMT with a LIKE condition on STMT_TEXT. If it is not already being executed, you’ll have to execute it at least once to get the data in the package cache.
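
Something like the following works for that lookup. This is a minimal sketch, and the LIKE pattern is just a placeholder for the start of the statement being hunted for:

--find the EXECUTABLE_ID of a statement already in the package cache
SELECT
        EXECUTABLE_ID
        , NUM_EXECUTIONS
        , SUBSTR(STMT_TEXT,1,60) as STMT_START
    FROM TABLE(MON_GET_PKG_CACHE_STMT ( 'D', NULL, NULL, -2)) AS T
    WHERE STMT_TEXT LIKE 'SELECT ORD%'
    ORDER BY NUM_EXECUTIONS DESC
;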

These queries can be very helpful in targeting the slow portion of the query. Calculating the buffer pool hit ratio on a query-by-query basis may not be something we do very often, but it can be very telling when that ratio, the sort time, sort overflows, or another metric stands out in this output.

Here is one example of the output from these queries:

STATEMENT  ROWS_READ            STMT_IXREF           ROWS_RETURNED_PER_EXEC TOTAL_ACT_TIME_MS    PCT_WAIT           STMT_BPHR SORTS_PER_EXEC       SORT_OVERFLOW_PCT  POST_THRESHOLD_SORTS NUM_EXECUTIONS       AVG_EXEC_TIME DEADLOCKS            LOCK_TIMEOUTS        INSERT_TIMESTAMP
---------- -------------------- -------------------- ---------------------- -------------------- ------------------ --------- -------------------- ------------------ -------------------- -------------------- ------------- -------------------- -------------------- --------------------------
SELECT COL             15521066                    1                      1              1308700               0.00      1.00                    1               0.00                    0             15521066          0.08                    0                    0 2017-09-04-02.32.58.032500

  1 record(s) selected.


PCT_WAIT    TOTAL_ACT_WAIT_TIME  LOCK_WAIT_TIME       LOG_BUFFER_WAIT_TIME LOG_DISK_WAIT_TIME
----------- -------------------- -------------------- -------------------- --------------------
     0.0001                  143                    0                    0                    0

  1 record(s) selected.


TOTAL_ACT_EXECUTING  PREP_TIME            TOTAL_CPU_TIME       POOL_READ_TIME       POOL_WRITE_TIME      DIRECT_READ_TIME     DIRECT_WRITE_TIME    TOTAL_SECTION_SORT_TIME TOTAL_SECTION_SORT_PROC_TIME WLM_QUEUE_TIME_TOTAL TOTAL_ROUTINE_TIME
-------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- ----------------------- ---------------------------- -------------------- --------------------
             1308557                   10           1303285656                    0                    0                    0                    0                  382732                       382606                    0                    0

  1 record(s) selected.

The output above shows excellent metrics overall, and tells us that a significant amount of the execution time is taken up by the one sort the query is doing. This information might help me focus on eliminating the sort, if possible.
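
How to eliminate a sort depends on where it comes from, but as a purely hypothetical sketch: if the sort is driven by an ORDER BY, an index that returns rows already in that order may let Db2 skip the sort entirely. The names below are made up:

--a query whose only sort comes from the ORDER BY
SELECT COL1, COL2
    FROM MYSCHEMA.MYTABLE
    WHERE COL3 = ?
    ORDER BY COL1
;
--an index on the equality predicate plus the sort column may allow the
--optimizer to read rows in order and avoid the sort
CREATE INDEX MYSCHEMA.IX_MYTABLE_COL3_COL1
    ON MYSCHEMA.MYTABLE (COL3, COL1)
;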

Another example:

STATEMENT  ROWS_READ            STMT_IXREF           ROWS_RETURNED_PER_EXEC TOTAL_ACT_TIME_MS    PCT_WAIT           STMT_BPHR SORTS_PER_EXEC       SORT_OVERFLOW_PCT  POST_THRESHOLD_SORTS NUM_EXECUTIONS       AVG_EXEC_TIME DEADLOCKS            LOCK_TIMEOUTS        INSERT_TIMESTAMP
---------- -------------------- -------------------- ---------------------- -------------------- ------------------ --------- -------------------- ------------------ -------------------- -------------------- ------------- -------------------- -------------------- --------------------------
SELECT ORD           3755212322              2151984                      1               720416               0.00      1.00                    1               0.00                    0                 1459        493.77                    0                    0 2017-09-07-17.21.24.365010

  1 record(s) selected.


SELECT decimal(float(TOTAL_ACT_WAIT_TIME)/float(TOTAL_ACT_TIME),9,4) as PCT_WAIT , TOTAL_ACT_WAIT_TIME , LOCK_WAIT_TIME , LOG_BUFFER_WAIT_TIME , LOG_DISK_WAIT_TIME FROM TABLE(MON_GET_PKG_CACHE_STMT ( 'D', x'0100000000000000BD1D07000000000000000000020020170907172124365010', NULL, -2)) AS T

PCT_WAIT    TOTAL_ACT_WAIT_TIME  LOCK_WAIT_TIME       LOG_BUFFER_WAIT_TIME LOG_DISK_WAIT_TIME
----------- -------------------- -------------------- -------------------- --------------------
     0.0000                    4                    0                    0                    0

  1 record(s) selected.


SELECT TOTAL_ACT_TIME - TOTAL_ACT_WAIT_TIME as TOTAL_ACT_EXECUTING , PREP_TIME , TOTAL_CPU_TIME , POOL_READ_TIME , POOL_WRITE_TIME , DIRECT_READ_TIME , DIRECT_WRITE_TIME , TOTAL_SECTION_SORT_TIME , TOTAL_SECTION_SORT_PROC_TIME , WLM_QUEUE_TIME_TOTAL , TOTAL_ROUTINE_TIME FROM TABLE(MON_GET_PKG_CACHE_STMT ( 'D', x'0100000000000000BD1D07000000000000000000020020170907172124365010', NULL, -2)) AS T

TOTAL_ACT_EXECUTING  PREP_TIME            TOTAL_CPU_TIME       POOL_READ_TIME       POOL_WRITE_TIME      DIRECT_READ_TIME     DIRECT_WRITE_TIME    TOTAL_SECTION_SORT_TIME TOTAL_SECTION_SORT_PROC_TIME WLM_QUEUE_TIME_TOTAL TOTAL_ROUTINE_TIME
-------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- ----------------------- ---------------------------- -------------------- --------------------
              720412                    2            719811899                    0                    0                    0                    0                      70                           70                    0                    0

  1 record(s) selected.

This output, on the other hand, tells me that this particular query is scanning more than two million rows to return a single row. Though all of those rows are either already in the bufferpool or being prefetched into it, one or more indexes may be able to help this query significantly.

It Ran Fast Yesterday

While the performance of some queries degrades slowly over time, other queries just fall off a cliff. Even with gradually increasing data volume, there may be a point at which a query that did not overflow the sortheap a minute ago now does. The same can happen with other memory areas. Maybe a hash join worked when there was less data, but today there is too much and a nested loop join is used instead. In any case, reactive query performance analysis often ends up addressing factors that did not themselves change at the moment performance changed.

If a DBA is constantly looking for problem queries and tuning the worst offenders, and has the time to review key performance indicators on a regular basis, this “fall off a cliff” is a bit less likely to happen, but it is still possible. Queries with a good read efficiency (rows read per row returned) are more likely to have stable performance than those without, because there are fewer factors likely to lead to a sudden performance change.
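
When I am hunting for those worst offenders, I use a rough sketch like the following against the package cache; the filter and the FETCH FIRST clause are arbitrary and worth adjusting:

--statements with the worst read efficiency currently in the package cache
SELECT
        SUBSTR(STMT_TEXT,1,60) as STMT_START
        , ROWS_READ
        , ROWS_RETURNED
        , ROWS_READ/ROWS_RETURNED as READ_EFFICIENCY
        , NUM_EXECUTIONS
    FROM TABLE(MON_GET_PKG_CACHE_STMT ( NULL, NULL, NULL, -2)) AS T
    WHERE ROWS_RETURNED > 0
    ORDER BY ROWS_READ/ROWS_RETURNED DESC
    FETCH FIRST 20 ROWS ONLY
;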

Resolution

I most enjoy working in environments where there is no change control to prevent me from trying different solutions to discover what works. Throughout this process, I find it critical to use db2batch to test actual query performance and get exact numbers without the cloudiness added by network and application overhead. Rarely do I get to work in environments that allow me that much leeway, though, so being able to work within the bounds of change control is often just as important.

What tips, tricks, or methodology do you have for analyzing query performance?

Ember Crooks

Ember is always curious and thrives on change. She has built internationally recognized expertise in IBM Db2, spent a year working with high-volume MySQL, and is now learning Snowflake. Ember shares both posts about her core skill sets and her journey learning Snowflake.

Ember lives in Denver and works from home.


One comment

  1. Get long running SQLs
  2. Get top table scans
  3. Get top row overflow access
  4. Get top execution_id with respective SQL
  5. Get db2advis and exec plan if first is not good.
  6. Confirm tables and indexes are compressed
  7. Provide info to development team
