I work on a team of DBAs, and we will frequently talk through an issue together to find the root cause. We are all capable of getting there on our own, but sometimes we realize that we’re not seeing something or think one of the others might see an issue faster.
This post is based on a real issue one of my fellow DBAs saw and that we talked through to find the root cause after the immediate issue had been resolved.
Client reports site is very slow to the point of being non-functional. Our DBA logs in and sees that there are 200 applications in a lock-wait status, and analyzing a bit further discovers they all appear to be hung up waiting on one other application. She takes an application snapshot and a lock snapshot and forces off the offending application. It takes about 5 minutes to rollback, and the immediate problem is resolved – no more applications in lock-wait, and no more site slowness.
We know our application (WebSphere Commerce) well enough to know that forcing off applications in this manner is safe as long as we’re not forcing off too many, and when we are dealing with a site-down situation in the middle of a business day, it is frequently most important to resolve the outage quickly, and then figure out what went wrong. This approach also requires gathering some data before resolving the immediate issue, so that data is not lost.
Analyzing What Really Happened
Upon looking at the application snapshot for the forced off application, we find this statement was the last statement executed:
SELECT T1.CALCODE_ID FROM CALCODE T1 WHERE T1.PUBLISHED=1 AND T1.STOREENT_ID IN(?,?,?) AND T1.CALUSAGE_ID=? AND ((T1.CALCODE_ID NOT IN (SELECT T4.CALCODE_ID FROM CALC ODEMGP T4)) OR (T1.CALCODE_ID IN (SELECT T5.CALCODE_ID FROM CALCODEMGP T5 WHERE T5.MBRGRP_ID IN (?)))) AND (T1.CALCODE_ID IN (SELECT T10.CALCODE_ID FROM CLCDPROMO T10 WHERE T10.PX_PROMOTION_ID IN (SELECT T11.PX_PROMOTION_ID FROM PX_PROMOTION T11 WHERE T11.CDREQUIRED=0))) AND T1.CALCODE_ID IN (SELECT T2.CALCODE_ID FROM CATENCALCD T 2 WHERE (T2.CATENTRY_ID = ? OR T2.CATENTRY_ID = (SELECT T3.CATENTRY_ID_PARENT FROM CATENTREL T3 WHERE T3.CATENTRY_ID_CHILD=? AND T3.CATRELTYPE_ID='PRODUCT_ITEM')) AND T2.STORE_ID IN(?,?) AND (T2.TRADING_ID IS NULL OR T2.TRADING_ID IN (?))) UNION SELECT T7.CALCODE_ID FROM CATGPCALCD T7 WHERE T7.CATGROUP_ID IN (?,?,?,?,?,?) AND ( ( T7.CALCODE_ID NOT IN (SELECT T8.CALCODE_ID FROM CALCODEMGP T8)) OR (T7.CALCODE_ID IN (SELECT T5.CALCODE_ID FROM CALCODEMGP T5 WHERE T5.MBRGRP_ID IN (?))) ) AND ( T7.C ALCODE_ID IN (SELECT T9.CALCODE_ID FROM CALCODE T9 WHERE T9.PUBLISHED=1 AND T9.STOREENT_ID IN(?,?,?) AND T9.CALUSAGE_ID=?)) AND (T7.CALCODE_ID IN (SELECT T13.CALCODE_ ID FROM CLCDPROMO T13 WHERE T13.PX_PROMOTION_ID IN (SELECT T12.PX_PROMOTION_ID FROM PX_PROMOTION T12 WHERE T12.CDREQUIRED=0)))
That’s a base-Commerce SQL statement and not a custom one.
However, the statements from the applications that were in lock wait on the table were pretty much all:
INSERT INTO DMEXPLOG (DMEXPLOG_ID, STOREENT_ID, EXPERIMENT_ID, TESTELEMENT_ID, PERSONALIZATIONID, CREATED, CLICKED, VALUE, CONTENTTYPE, OPTCOUNTER) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
The lock-waits were all on the DMEXPLOG table. My fellow DBA puzzled over this for a bit because the two statements don’t really look like they should conflict. It is important to remember that the the SQL statement supplied in the application snapshot is really the LAST statement executed, and not the one that is CURRENTLY executing, and this is one of the cases where that shows up. So how do we get what the current statement executing is? We cannot, really without a statement event monitor, which we were not running at the time. If it’s a long-running statement, we might be able to catch it with db2top.
I don’t recommend running a statement event monitor on a running production e-commerce database unless you are really desperate and have a repeating problem, and even then only for a very short duration. The large numbers of statements executed in e-commerce databases generate a large amount of data in a very short time.
So the next question we considered is why weren’t the locks timing out. We went and verified that our LOCKTIMEOUT variable was set properly using:
db2 "get db cfg for sample" |grep LOCKTIMEOUT Lock timeout (sec) (LOCKTIMEOUT) = 45
OK, so that’s set properly. Probably what was happening was that the locks were timing out, but then the application was simply re-connecting or picking up another existing connection and trying again – good application logic when it’s a step in the transaction that cannot be skipped.
While we quickly ruled it out, we also considered that a reorg might be involved. Reorgs do not respect LOCKTIMEOUT in many situations, so can be problematic in lock chaining issues. However we’re smart enough not to run reorgs in the middle of the day, and we verified that none were unexpectedly running here.
We next looked at the snapshot information we had about the locks:
Lock object type = Table Lock mode = Exclusive Lock Lock mode requested = Intention Exclusive Lock (IX) Name of tablespace holding lock = USERSPACE1 Schema of table holding lock = WSCOMUSR Name of table holding lock = DMEXPLOG
Hang on, is it saying there’s an Exclusive(X) lock on the entire table?!?!?
The insert needs an Intent Exclusive(IX) lock on the table and will acquire exclusive locks on individual rows. X locks and IX locks are not compatible, and so the two cannot access the table at the same time. See the lock compatibility chart in the Info Center if you need to check the compatibility of two (or more) locks.
In an e-commerce database during regular operation, we NEVER expect to see an exclusive table lock. That immediately made me think of Lock Escalation, and so looking at db2diag.log, we found:
2013-01-18-12.16.34.057220-360 E82054608E917 LEVEL: Warning PID : 32104 TID : 47614489782592PROC : db2sysc 0 INSTANCE: db2inst1 NODE : 000 DB : SAMPLE APPHDL : 0-50232 APPID: 126.96.36.199.59586.13011817552 AUTHID : WSCOMUSR EDUID : 8208 EDUNAME: db2agent (SAMPLE) 0 FUNCTION: DB2 UDB, data management, sqldEscalateLocks, probe:1 MESSAGE : ADM5501I DB2 is performing lock escalation. The affected application is named "db2jcc_application", and is associated with the workload name "SYSDEFAULTUSERWORKLOAD" and application ID "188.8.131.52.59586.13011817552" at member "0". The total number of locks currently held is "1249935", and the target number of locks to hold is "624967". The current statement being executed is "delete from DMEXPLOG where CREATED < ?".
There it is, our smoking gun. A delete was doing a table scan and locking up the whole of a table that every single transaction touches.
That delete is actually out-of-the-box Commerce code too.
WebSphere Commerce is actually pretty good at leaving deletes to be done at another time and simply marking rows for deletion. So our ongoing investigation will be in two directions:
- Why does this delete need to be done in real time? Can we work it into our dbclean/pruning process instead?
- There’s an index that will prevent the tablescan during the delete. We probably want to implement that index, though since this is an 88 million record table that every transaction seems to hit, we’re actually going to use a performance testing environment to test adding the index during a load test to see if adding the index itself causes any performance problems before we attempt it on production
Solving this problem was three major steps:
- Gather info that will disappear and then get the database functioning normally again
- Find the root cause of the problem
- Work on a long-term solution for the true problem
Note that the problem here is not one that is caused by DB2 itself or the DBMS. This is actually an application problem that manifests in the database. The application should probably not be doing that delete as a part of normal operation, and if it must, then an index needs to be added to support the delete.