While I have tuned backup performance before, my greatest results have usually come through physical database changes (spreading tables across more table spaces). This post is not about backup performance tuning, but about solving a specific backup duration mystery.
The Problem
A database of about 330 GB backs up to Data Domain using a vendor library. This is DB2 10.5, Fix Pack 8, on AIX. Full online backups are taken daily at 1 AM. Monday through Saturday they run in about 30-40 minutes, while on Sundays they take about 90 minutes.
select date(timestamp(start_time)) as start_date
    , time(timestamp(start_time)) as start_time
    , timestampdiff(4, varchar(timestamp(end_time) - timestamp(start_time))) as duration
    , case when objecttype = 'D' then 'Database' else objecttype end as object
    , case operationtype
        when 'D' then 'Delta Offline'
        when 'E' then 'Delta Online'
        when 'F' then 'Offline'
        when 'I' then 'Incremental Offline'
        when 'N' then 'Online'
        when 'O' then 'Incremental Online'
        else operationtype
      end as Type
    , devicetype
    , sqlcode
from sysibmadm.db_history
where operation = 'B'
    and seqnum = 1
order by start_date
    , start_time
with ur

START_DATE START_TIME DURATION    OBJECT   TYPE                DEVICETYPE SQLCODE
---------- ---------- ----------- -------- ------------------- ---------- -----------
09/01/2017 01:05:04            35 Database Online              O                    -
09/02/2017 01:05:04            33 Database Online              O                    -
09/03/2017 01:05:04            93 Database Online              O                    -
09/04/2017 01:05:03            28 Database Online              O                    -
09/05/2017 01:05:04            34 Database Online              O                    -
09/06/2017 01:05:02            34 Database Online              O                    -
09/07/2017 01:05:03            36 Database Online              O                    -
09/08/2017 01:05:04            35 Database Online              O                    -
09/09/2017 01:05:04            35 Database Online              O                    -
09/10/2017 01:05:04            93 Database Online              O                    -
09/11/2017 01:05:06            38 Database Online              O                    -
09/12/2017 01:05:02            34 Database Online              O                    -
09/13/2017 01:05:03            36 Database Online              O                    -
09/14/2017 01:05:05            34 Database Online              O                    -
09/15/2017 01:05:03            36 Database Online              O                    -
09/16/2017 01:05:04            36 Database Online              O                    -
09/17/2017 01:05:04            91 Database Online              O                    -
09/18/2017 01:05:05            35 Database Online              O                    -
09/19/2017 01:05:03            34 Database Online              O                    -
09/20/2017 01:05:04            36 Database Online              O                    -
09/21/2017 01:05:03            32 Database Online              O                    -

  21 record(s) selected.
Same data, same kind of backup every day, and a very consistent pattern: the long backups all land on Sundays. I immediately think of some sort of issue with the backup location, but how do I verify that theory?
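To make that pattern jump out without scanning dates, the same history view can be summarized by day of week. This is a minimal sketch using the same SYSIBMADM.DB_HISTORY columns and duration math as the query above:

-- Summarize full backup durations (in minutes) by day of week
-- to confirm that only Sundays run long.
select dayname(timestamp(start_time)) as day_of_week
    , count(*) as backups
    , avg(timestampdiff(4, varchar(timestamp(end_time) - timestamp(start_time)))) as avg_minutes
    , max(timestampdiff(4, varchar(timestamp(end_time) - timestamp(start_time)))) as max_minutes
from sysibmadm.db_history
where operation = 'B'
    and seqnum = 1
group by dayname(timestamp(start_time))
order by avg_minutes desc
with ur

On this data set, Sunday should float to the top with an average near 90 minutes while every other day sits in the low-to-mid 30s.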
Backup Statistics
One of my favorite silent feature additions was when Db2 started writing backup statistics to the diagnostic log on backup completion. This became available via a registry variable in 9.7 and became the default behavior in Fix Pack 2 of 10.1. Dale McInnis (@dalemmcinnis) wrote an excellent blog entry on how to read this information. I'm not going to repeat what he says; his write-up is worth a thorough read.
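If you are not sure whether an older instance has this enabled, the registry can be checked from SQL. A minimal sketch, assuming the variable in question is the backup/restore statistics variable (DB2_BAR_STATS, if memory serves) and that the SYSIBMADM.REG_VARIABLES admin view is available:

-- List any registry variables related to backup/restore (BAR) statistics.
-- An empty result simply means nothing is explicitly set, so the
-- fix-pack default applies.
select reg_var_name
    , reg_var_value
from sysibmadm.reg_variables
where reg_var_name like 'DB2_BAR%'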
To compare what is happening with these two backups, I pulled the BAR information from the Db2 diagnostic log for a regular fast backup and a slow Sunday one:
Fast – 9/15/2017
Parallelism = 2
Number of buffers = 4
Buffer size = 16781312 (4097 4kB pages)

BM#    Total      I/O      MsgQ      WaitQ     Buffers   kBytes
---  --------  --------  --------  --------   --------  --------
000     32.53      2.83      4.13     25.25         25    316288
001     32.53     12.36      8.83     10.74        168   2710528
---  --------  --------  --------  --------   --------  --------
TOT     65.07     15.20     12.97     36.00        193   3026816

MC#    Total      I/O      MsgQ      WaitQ     Buffers   kBytes
---  --------  --------  --------  --------   --------  --------
000      8.14      4.04      1.05      0.00         24    376928
001     34.95     24.07      9.68      0.72        172   2769608
---  --------  --------  --------  --------   --------  --------
TOT     43.09     28.11     10.73      0.72        196   3146536
Slow – 9/17/2017
Parallelism = 2
Number of buffers = 4
Buffer size = 16781312 (4097 4kB pages)

BM#    Total      I/O      MsgQ      WaitQ     Buffers   kBytes
---  --------  --------  --------  --------   --------  --------
000     69.92     11.54     38.03     19.61         96   1465600
001     69.88     11.24     40.66     17.32        115   1868800
---  --------  --------  --------  --------   --------  --------
TOT    139.80     22.78     78.69     36.94        211   3334400

MC#    Total      I/O      MsgQ      WaitQ     Buffers   kBytes
---  --------  --------  --------  --------   --------  --------
000     57.86     43.51      2.41      0.00         94   1524088
001     76.80     53.56      9.47      6.18        120   1917432
---  --------  --------  --------  --------   --------  --------
TOT    134.66     97.07     11.88      6.18        214   3441520
The first thing I verified here was that Db2 was selecting the same buffer size and number of buffers. This ruled out contention in the utility heap as a potential cause.
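If autonomic selection were a suspect, one way to take it off the table would be to pin those values on the backup command itself. This is a hypothetical sketch via ADMIN_CMD; the database name and vendor library path are placeholders, and the values match what Db2 chose above:

-- Run the online backup with an explicitly fixed buffer count, buffer size
-- (in 4 KB pages), and parallelism instead of letting Db2 choose them.
-- 'sample' and the library path are placeholders for this environment.
CALL SYSPROC.ADMIN_CMD(
  'BACKUP DATABASE sample ONLINE
   LOAD /path/to/vendor_library.a
   WITH 4 BUFFERS BUFFER 4097 PARALLELISM 2'
);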
In this output, the first set of numbers (BM, or Buffer Manipulator) represents the read side of the backup. The second set of numbers (MC, or Media Controller) represents the write side of the backup.
I was not surprised to see that one of the major contributors to my issue seems to be I/O on the write side; that's where I had theorized the problem might be. What did surprise me was another noticeable contributor to poor performance: MsgQ, which represents Db2 waiting for an empty buffer to read into. While I was writing up an email to ask some experts about this, I re-read Dale's blog article and learned this is a normal symptom of waiting on write I/O: an empty buffer to read into is not available because that buffer is still being written out, very slowly. Maybe blogging has predisposed me to do my best analysis in writing. I find that 50% of the time I get ready to reach out to a mentor, I catch the detail I needed while writing up the problem and never have to hit send on the email.
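Putting rough numbers on it with the MC totals above (total kBytes written divided by total write I/O seconds) makes the difference stark. This is only back-of-the-envelope arithmetic, not a true throughput figure, since the two media controllers run in parallel, but it works as a relative comparison:

-- Approximate effective write rate (kB per second of media-controller I/O time)
-- for the fast and slow backups, using the MC TOT rows above.
select backup_run
    , kbytes_written / io_seconds as kb_per_io_second
from ( values ('fast 09/15', 3146536, 28.11)
            , ('slow 09/17', 3441520, 97.07)
     ) as t (backup_run, kbytes_written, io_seconds)

That works out to roughly 112,000 kB per second of write I/O on a normal night versus roughly 35,000 on Sunday, so the write target is delivering about a third of its usual rate.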
So what was the final verdict? Was there another process hitting the backup target disk on Sundays? I’ve had similar customer backup performance issues that I tried to blame on concurrent I/O but was never able to close the loop entirely.
There’s also a weird situation in the stats from the “good” backup: 24 buffers going to MC#0 and 172 buffers going to MC#1. Those numbers should be roughly similar if the targets are similar. The stats for the “bad” backup are also slightly skewed, but not nearly to the same extent.
This is still an ongoing discussion with the client, so I don’t have a final resolution. A backup that runs long once a week is not a severe enough problem to get a lot of attention. This client is also on the smaller side and doesn’t always understand that their DBAs are not also experts in the backup tool, so I suspect it may be a while before this is fully resolved. I just hope we never have to do a restore at that time.
Thank you Ember!
This article really helped me out 🙂
This was the original SQL. I am glad DBAs are reusing it 🙂
SELECT DISTINCT char(current server,8) DB_NAME,
    DBPARTITIONNUM NODE,
    CASE OPERATIONTYPE
        WHEN 'D' THEN 'Delta OFFline'
        WHEN 'E' THEN 'Delta ONline'
        WHEN 'F' THEN 'Full OFFline'
        WHEN 'I' THEN 'Incr OFFline'
        WHEN 'N' THEN 'Full ONline'
        WHEN 'O' THEN 'Incr ONline'
        ELSE NULL
    END AS TYPE
    , CHAR(CHAR(NUM_TBSPS),5) "TBSP#"
    -- , TIMESTAMP(END_TIME) - TIMESTAMP(START_TIME) TS_DIFF
    , CHAR(RTRIM(CHAR(( INT((TIMESTAMP(END_TIME) - TIMESTAMP(START_TIME)) /1000000 ) * 24 ) +
        MOD( INT ( (TIMESTAMP(END_TIME) - TIMESTAMP(START_TIME)) /10000 ), 100 )))||'h '||
        RTRIM(CHAR(MOD( INT ( (TIMESTAMP(END_TIME) - TIMESTAMP(START_TIME)) /100 ), 100 )))||'m',10) DUR_TOTAL
    , CHAR(LEFT(DAYNAME(TIMESTAMP(START_TIME)),3),3) DOW
    , '('||CHAR(LEFT(CHAR(INT(current timezone)),2),2)||')' TZ
    , CHAR(CHAR(TIMESTAMP(START_TIME)),19) START_TIME
    , CHAR(CHAR(TIMESTAMP(END_TIME)),19) END_TIME
    , CHAR(LOCATION,${LOCCOLWD}) LOCATION
    , SQLCODE, CHAR(SQLERRMC,8) SQLERRMC
from SYSIBMADM.DB_HISTORY
where OPERATION = 'B'
order by START_TIME desc
-- fetch first 35 rows only
Article from Dale moved here
https://www.idug.org/browse/blogs/blogviewer?BlogKey=63c66993-b043-4ed2-8409-ff3b447d9ffc