I first addressed the new 9.7 methodology for deadlock and locktimeout analysis in my post Analyzing Deadlocks – The New Way. I’ve been playing with it a bit, and wanted to share what I’ve found. First of all, so far, I like it. I can’t speak to overhead yet, but it integrates lock timeout analysis more tightly than it previously was. I went out to take a look at one database I support that generally runs pretty smoothly, thanks in part to a ridiculously oversized database server.
I used the SQL from my own original post on the topic to see what was going on. First:
-bash-3.2$ db2 "select substr(event_type,1,18) as event_type, count(*) as count, sum(dl_conns) sum_involved_connections from DBA.LOCK_EVENT group by event_type with ur" EVENT_TYPE COUNT SUM_INVOLVED_CONNECTIONS ------------------ ----------- ------------------------ LOCKTIMEOUT 600 - 1 record(s) selected.
The good news was that there are no deadlocks over a couple of weeks. The bad news? 600 lock timeouts. Eek, that seems a bit high. So I dug further to see what times we were seeing issues. Was it peak volume or during nighttime maintenance?
-bash-3.2$ db2 "select substr(event_type,1,18) as event_type, year(event_timestamp) as year, month(event_timestamp) as month, day(event_timestamp) as day, hour(event_timestamp) as hour, count(*) as count from DBA.LOCK_EVENT group by year(event_timestamp), month(event_timestamp), day(event_timestamp), hour(event_timestamp), event_type order by year(event_timestamp), month(event_timestamp), day(event_timestamp), hour(event_timestamp), event_type with ur" EVENT_TYPE YEAR MONTH DAY HOUR COUNT ------------------ ----------- ----------- ----------- ----------- ----------- LOCKTIMEOUT 2012 1 21 0 425 LOCKTIMEOUT 2012 1 21 3 1 LOCKTIMEOUT 2012 1 22 3 1 LOCKTIMEOUT 2012 1 23 3 1 LOCKTIMEOUT 2012 1 24 3 1 LOCKTIMEOUT 2012 1 25 3 1 LOCKTIMEOUT 2012 1 26 3 1 LOCKTIMEOUT 2012 1 27 3 1 LOCKTIMEOUT 2012 1 28 0 138 LOCKTIMEOUT 2012 1 28 3 1 LOCKTIMEOUT 2012 1 29 3 1 LOCKTIMEOUT 2012 1 30 3 1 LOCKTIMEOUT 2012 1 31 3 1 LOCKTIMEOUT 2012 2 1 3 1 LOCKTIMEOUT 2012 2 2 3 1 LOCKTIMEOUT 2012 2 3 3 1 LOCKTIMEOUT 2012 2 4 0 7 LOCKTIMEOUT 2012 2 4 3 1 LOCKTIMEOUT 2012 2 5 3 1 LOCKTIMEOUT 2012 2 6 3 1 LOCKTIMEOUT 2012 2 7 3 1 LOCKTIMEOUT 2012 2 8 3 1 LOCKTIMEOUT 2012 2 9 3 1 LOCKTIMEOUT 2012 2 10 3 1 LOCKTIMEOUT 2012 2 11 0 9 25 record(s) selected.
That information immediately made me feel better. The Spikes are Saturday mornings in the midnight hour. That’s when I do my runstats and reorgs, along with an rbind. To confirm my suspicion of the connection, I took a look at the tables involved.
-bash-3.2$ db2 "select substr(lp.table_schema,1,18) as table_schema, substr(lp.table_name,1,30) as table_name, substr(le.event_type,1,18) as lock_event, count(*)/2 as count from DBA.LOCK_PARTICIPANTS lp, DBA.LOCK_EVENT le where lp.xmlid=le.xmlid group by lp.table_schema, lp.table_name, le.event_type order by lp.table_schema, lp.table_name, le.event_type with ur"
TABLE_SCHEMA TABLE_NAME LOCK_EVENT COUNT ------------------ ------------------------------ ------------------ ----------- LOCKTIMEOUT 279 DBA ROS_LOCKS LOCKTIMEOUT 10 SYSIBM SYSTABLES LOCKTIMEOUT 2 WSCOMUSR KEYS LOCKTIMEOUT 1 WSCOMUSR ORDERS LOCKTIMEOUT 2 WSCOMUSR USERS LOCKTIMEOUT 4 - - LOCKTIMEOUT 300 7 record(s) selected.
The row with dashes for table_name and table_schema, I was expecting. Half of the locktimeout is always reported this way. However, we should be getting values for the other side of those. So I decide to take a look at the statements involved, and sure enough I find this:
-bash-3.2$ db2 "with t1 as (select STMT_PKGCACHE_ID as STMT_PKGCACHE_ID, count(*) as stmt_count from dba.lock_participant_activities group by STMT_PKGCACHE_ID) select t1.stmt_count, (select substr(STMT_TEXT,1,100) as stmt_text from dba.lock_participant_activities a1 where a1.STMT_PKGCACHE_ID=t1.STMT_PKGCACHE_ID fetch first 1 row only) from t1 order by t1.stmt_count desc with ur" STMT_COUNT STMT_TEXT ----------- ---------------------------------------------------------------------------------------------------- 425 SELECT NAME,LENGTH(NAME), CREATOR, LENGTH(CREATOR), PKGVERSION, LENGTH(PKGVERSION),VALID,UNIQUE_ID F 138 SELECT NAME,LENGTH(NAME), CREATOR, LENGTH(CREATOR), PKGVERSION, LENGTH(PKGVERSION),VALID,UNIQUE_ID F 2 SELECT T1.KEYS_ID, T1.TABLENAME, T1.LOWERBOUND, T1.UPPERBOUND, T1.COUNTER, T1.PREFETCHSIZE, T1.COLUM 1 UPDATE KEYS SET TABLENAME = ?, LOWERBOUND = ?, UPPERBOUND = ?, COUNTER = ?, PREFETCHSIZE = ?, COLUM 4 record(s) selected.
Clearly there are just two statements that account for just about everything. Expanding that, I see:
-bash-3.2$ db2 "with t1 as (select STMT_PKGCACHE_ID as STMT_PKGCACHE_ID, count(*) as stmt_count from dba.lock_participant_activities group by STMT_PKGCACHE_ID) select t1.stmt_count, (select substr(STMT_TEXT,1,256) as stmt_text from dba.lock_participant_activities a1 where a1.STMT_PKGCACHE_ID=t1.STMT_PKGCACHE_ID fetch first 1 row only) from t1 order by t1.stmt_count desc with ur" STMT_COUNT STMT_TEXT ----------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 425 SELECT NAME,LENGTH(NAME), CREATOR, LENGTH(CREATOR), PKGVERSION, LENGTH(PKGVERSION),VALID,UNIQUE_ID FROM SYSIBM.SYSPLAN WHERE NOT ((CREATOR = 'NULLID ') AND (NAME = 'SQLE1E03') AND (PKGVERSION = '')) ORDER BY CREATOR ASC, NAME ASC, PKGVERSION ASC 138 SELECT NAME,LENGTH(NAME), CREATOR, LENGTH(CREATOR), PKGVERSION, LENGTH(PKGVERSION),VALID,UNIQUE_ID FROM SYSIBM.SYSPLAN WHERE NOT ((CREATOR = 'NULLID ') AND (NAME = 'SQLE1E03') AND (PKGVERSION = '')) ORDER BY CREATOR ASC, NAME ASC, PKGVERSION ASC 2 SELECT T1.KEYS_ID, T1.TABLENAME, T1.LOWERBOUND, T1.UPPERBOUND, T1.COUNTER, T1.PREFETCHSIZE, T1.COLUMNNAME, T1.OPTCOUNTER FROM KEYS T1 WHERE T1.tablename = ? FOR UPDATE WITH RS 1 UPDATE KEYS SET TABLENAME = ?, LOWERBOUND = ?, UPPERBOUND = ?, COUNTER = ?, PREFETCHSIZE = ?, COLUMNNAME = ?, OPTCOUNTER = ? WHERE KEYS_ID = ? AND OPTCOUNTER = ? 4 record(s) selected.
Looking at the SQL, both of those statements are related to SYSIBM.SYSPLAN. My conclusion is that it’s probably the db2rbind that is done during the process, and that the locks are on internal plans, and that’s why the table wasn’t showing up in our list of tables.
Gathering this same information would not have been as easy with the old method of monitoring for locking issues. I would have had to parse through the files created for the purpose in the DIAGPATH and found the commonality that way.
This is really nice!.. I like the way you take a top-down approach in diagnosing problems.. Very very nice!..
– Saurabh
[…] Fun With 9.7 Locktimout Analysis Covers a real-world investigation into lock timeouts using the new method, and goes into greater detail than I had time for in the DB2′s Got talent presentation. It includes specific SQL and my thought process while analyzing a situation. […]