Backup Performance Investigation

While I have tuned backup performance before, my biggest gains have usually come from physical database changes, such as spreading tables across more table spaces. This post is not about backup performance tuning, though; it is about solving a specific backup duration mystery.

The Problem

A database of about 330 GB backs up to Data Domain using a vendor library. This is Db2 10.5, Fix Pack 8, on AIX. Full online backups are taken daily at 1 AM. Monday through Saturday, they run in about 30-40 minutes; on Sundays, they take about 90 minutes.
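For context, the backup is driven by a scheduled command along these lines. This is only a sketch: the database name and the vendor library path are hypothetical placeholders, and the real command uses whatever options the backup vendor and site standards dictate.

db2 "BACKUP DATABASE proddb ONLINE LOAD '/usr/lib/libvendor_dd.a' WITHOUT PROMPTING"

Notice that nothing there specifies buffers or parallelism; Db2 picks those itself, which becomes relevant later. The backup history in SYSIBMADM.DB_HISTORY shows the duration pattern: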

-- duration below is in minutes (timestampdiff interval code 4 = minutes)
select  date(timestamp(start_time)) as start_date 
        , time(timestamp(start_time)) as start_time 
        , timestampdiff ( 4, varchar(timestamp(end_time) - timestamp(start_time)) ) as duration 
        , case when objecttype = 'D' 
            then 'Database' 
                else objecttype end as object 
        , case operationtype 
            when 'D' then 'Delta Offline' 
            when 'E' then 'Delta Online' 
            when 'F' then 'Offline' 
            when 'I' then 'Incremental Offline' 
            when 'N' then 'Online' 
            when 'O' then 'Incremental Online' 
            else operationtype end as Type 
        , devicetype 
        , sqlcode 
from sysibmadm.db_history 
where operation='B' 
    and seqnum=1 
order by start_date
    , start_time 
with ur

START_DATE START_TIME DURATION    OBJECT   TYPE                DEVICETYPE SQLCODE
---------- ---------- ----------- -------- ------------------- ---------- -----------
09/01/2017 01:05:04            35 Database Online              O                    -
09/02/2017 01:05:04            33 Database Online              O                    -
09/03/2017 01:05:04            93 Database Online              O                    -
09/04/2017 01:05:03            28 Database Online              O                    -
09/05/2017 01:05:04            34 Database Online              O                    -
09/06/2017 01:05:02            34 Database Online              O                    -
09/07/2017 01:05:03            36 Database Online              O                    -
09/08/2017 01:05:04            35 Database Online              O                    -
09/09/2017 01:05:04            35 Database Online              O                    -
09/10/2017 01:05:04            93 Database Online              O                    -
09/11/2017 01:05:06            38 Database Online              O                    -
09/12/2017 01:05:02            34 Database Online              O                    -
09/13/2017 01:05:03            36 Database Online              O                    -
09/14/2017 01:05:05            34 Database Online              O                    -
09/15/2017 01:05:03            36 Database Online              O                    -
09/16/2017 01:05:04            36 Database Online              O                    -
09/17/2017 01:05:04            91 Database Online              O                    -
09/18/2017 01:05:05            35 Database Online              O                    -
09/19/2017 01:05:03            34 Database Online              O                    -
09/20/2017 01:05:04            36 Database Online              O                    -
09/21/2017 01:05:03            32 Database Online              O                    -

  21 record(s) selected.
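The Sunday spike is easy to eyeball in a three-week window, but with a longer history I would let SQL summarize it by day of week instead of scanning rows. A rough sketch of that idea from the command line (the averaging truncates to whole minutes, which is fine for this purpose):

db2 "select dayname(timestamp(start_time)) as day_of_week,
            avg(timestampdiff(4, varchar(timestamp(end_time) - timestamp(start_time)))) as avg_minutes
     from sysibmadm.db_history
     where operation = 'B' and seqnum = 1
     group by dayname(timestamp(start_time))"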

Same data, same kind of backup every day. This is a very consistent pattern. I immediately think of some sort of issue with the backup location, but how do I verify that theory?

Backup Statistics

One of my favorite silent feature additions was when Db2 started writing backup statistics to the diagnostic log on backup completion. This became available with a registry variable in 9.7 and became the default in Fix Pack 2 of 10.1. Dale McInnis (@dalemmcinnis) wrote an excellent blog entry on how to read this information. I'm not going to repeat what he says; his write-up is worth a thorough read.
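If you want to find these entries on your own system, a quick search of the diagnostic log is enough. The path below assumes the default diagpath, and I am keying off the 'BM#' header that appears in every one of these statistics blocks (the -B/-A context flags are GNU grep options, so on stock AIX grep you may need to locate the entries another way):

cd ~/sqllib/db2dump
grep -B 12 -A 25 'BM#' db2diag.log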

To compare what is happening with these two backups, I pulled the BAR information from the Db2 diagnostic log for a regular fast backup and a slow Sunday one:

Fast – 9/15/2017

Parallelism       = 2
Number of buffers = 4
Buffer size       = 16781312 (4097 4kB pages)

BM#    Total      I/O      MsgQ      WaitQ      Buffers   kBytes
---  --------  --------  --------  --------    --------  --------
000     32.53      2.83      4.13     25.25          25    316288
001     32.53     12.36      8.83     10.74         168   2710528
---  --------  --------  --------  --------    --------  --------
TOT     65.07     15.20     12.97     36.00         193   3026816

MC#    Total      I/O      MsgQ      WaitQ      Buffers   kBytes
---  --------  --------  --------  --------    --------  --------
000      8.14      4.04      1.05      0.00          24    376928
001     34.95     24.07      9.68      0.72         172   2769608
---  --------  --------  --------  --------    --------  --------
TOT     43.09     28.11     10.73      0.72         196   3146536

Slow – 9/17/2017

Parallelism       = 2
Number of buffers = 4
Buffer size       = 16781312 (4097 4kB pages)

BM#    Total      I/O      MsgQ      WaitQ      Buffers   kBytes
---  --------  --------  --------  --------    --------  --------
000     69.92     11.54     38.03     19.61          96   1465600
001     69.88     11.24     40.66     17.32         115   1868800
---  --------  --------  --------  --------    --------  --------
TOT    139.80     22.78     78.69     36.94         211   3334400

MC#    Total      I/O      MsgQ      WaitQ      Buffers   kBytes
---  --------  --------  --------  --------    --------  --------
000     57.86     43.51      2.41      0.00          94   1524088
001     76.80     53.56      9.47      6.18         120   1917432
---  --------  --------  --------  --------    --------  --------
TOT    134.66     97.07     11.88      6.18         214   3441520

The first thing I verified here was that Db2 was selecting the same buffer size and number of buffers. This ruled out contention in the utility heap as a potential cause.

In this output, the first set of numbers (BM, or Buffer Manager) represents the read side of the backup, and the second set (MC, or Media Controller) represents the write side.

I was not surprised to see that one of the major contributors to my issue is I/O on the write side; that is where I had theorized the problem might be. What did surprise me was the other noticeable contributor: MsgQ on the read side, which represents Db2 waiting for an empty buffer to read into. While I was writing up an email to ask some experts about this, I re-read Dale's blog article and learned it is a normal symptom of waiting on the write I/O: an empty buffer to read into is not available because that buffer is still, very slowly, being written out. Maybe blogging has predisposed me to do my best analysis in writing. I find that about half the time I get ready to reach out to a mentor, I catch the detail I needed while writing up the problem and never have to hit send on the email.
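To put a number on that write-side difference, divide the kilobytes written by the seconds spent in write I/O, using the MC totals from the two outputs above. Quick shell arithmetic:

# Fast backup (9/15): 3,146,536 kB written in 28.11 s of media controller I/O time
echo "scale=1; 3146536 / 28.11 / 1024" | bc    # roughly 109 MB/s

# Slow backup (9/17): 3,441,520 kB written in 97.07 s of media controller I/O time
echo "scale=1; 3441520 / 97.07 / 1024" | bc    # roughly 35 MB/s

That is about a threefold drop in write throughput to the backup target on Sunday, which lines up with the tripled MC I/O time and with the MsgQ waits piling up on the read side.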


5 Comments

  1. So what was the final verdict? Was there another process hitting the backup target disk on Sundays? I’ve had similar customer backup performance issues that I tried to blame on concurrent I/O but was never able to close the loop entirely.

    There’s also a weird situation with the stats from your “good” backup: 24 buffers going to MC#0 and 172 buffers going to MC#1. Those numbers should be roughly similar if the targets are similar. The stats for the “bad” backup are also slightly skewed but not nearly to the same extent.

    • This is still an ongoing discussion with the client, so I don't have a final resolution. A long-running backup once a week is not a severe enough problem to get a lot of attention. This client is also on the smaller side and doesn't always understand that their DBAs are not also experts in the backup tool, so I suspect it may be a while before we get there. I just hope we never have to do a restore at that time.

  2. This was the original SQL. I am glad DBAs are reusing it 🙂

    SELECT DISTINCT char(current server,8) DB_NAME,
    DBPARTITIONNUM NODE,
    CASE OPERATIONTYPE
    WHEN 'D' THEN 'Delta OFFline'
    WHEN 'E' THEN 'Delta ONline'
    WHEN 'F' THEN 'Full OFFline'
    WHEN 'I' THEN 'Incr OFFline'
    WHEN 'N' THEN 'Full ONline'
    WHEN 'O' THEN 'Incr ONline'
    ELSE NULL
    END AS TYPE
    , CHAR(CHAR(NUM_TBSPS),5) "TBSP#"
    -- , TIMESTAMP(END_TIME) - TIMESTAMP(START_TIME) TS_DIFF
    , CHAR(RTRIM(CHAR(( INT((TIMESTAMP(END_TIME) - TIMESTAMP(START_TIME)) /1000000 ) * 24 ) +
    MOD( INT ( (TIMESTAMP(END_TIME) - TIMESTAMP(START_TIME)) /10000 ), 100 )))||'h '||
    RTRIM(CHAR(MOD( INT ( (TIMESTAMP(END_TIME) - TIMESTAMP(START_TIME)) /100 ), 100 )))||'m',10) DUR_TOTAL
    , CHAR(LEFT(DAYNAME(TIMESTAMP(START_TIME)),3),3) DOW
    , '('||CHAR(LEFT(CHAR(INT(current timezone)),2),2)||')' TZ
    ,CHAR(CHAR(TIMESTAMP(START_TIME)),19) START_TIME
    ,CHAR(CHAR(TIMESTAMP(END_TIME)),19) END_TIME
    , CHAR(LOCATION,${LOCCOLWD}) LOCATION
    , SQLCODE, CHAR(SQLERRMC,8) SQLERRMC
    from SYSIBMADM.DB_HISTORY
    where OPERATION = 'B'
    order by START_TIME desc
    -- fetch first 35 rows only
