There have been a few times in my career when DB2 has blown me away with its ability to stay up when I would have given really good odds that it should be unavailable. It had been a few years since the last time that happened, and this occasion turned out to be an interesting lesson in transaction logging.
The Scenario
My teammates tend to come to me with the tough problems after they’ve applied their considerable experience, which means that the problems they bring me are often quite interesting. In this case, we were bringing in a new client and working on a problem in a staging environment. The system we were looking at that day had an active transaction log filesystem that had been full for about a week. Since system availability was not currently an issue, we were reviewing the transaction log requirements for about 10 databases across two DB2 instances (9.5 and 9.7). For one of the databases, we saw that LOGPRIMARY was set to 35, yet there were only 21 active log files in the active log location. The database was connectable, and select statements worked. Since there was no transaction log space, we didn’t test further than that. This had us scratching our heads as to how it could possibly be.
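If you want to run the same sanity check yourself, it only takes two commands. A minimal sketch, assuming a database named SAMPLE (the name that appears in the diagnostic log excerpts below) and a made-up active log path:

db2 get db cfg for SAMPLE | grep -iE 'LOGPRIMARY|Path to log files'
ls /db2logs/SAMPLE/NODE0000/ | grep -c '\.LOG$'

The first command reports the configured number of primary logs and where they live; the second counts the log files actually sitting in that path. In our case the two numbers simply did not match.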
When a DB2 database is activated, there must be enough space available in the path to log files to hold every primary log file at once (LOGPRIMARY * LOGFILSIZ). If that space is not available, database activation/connection will fail. Nearly every DB2 DBA has experienced this at some point in their career. This is also why you don’t want larger-than-necessary log files: allocating those files takes time.
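To put numbers on that using this database’s settings (shown in the config output below): LOGFILSIZ is 80,000 pages of 4 KB, so each log file is 80,000 x 4 KB, roughly 312 MB, and with LOGPRIMARY at 35, DB2 needs about 35 x 312 MB, or roughly 10.7 GB, free in the active log path before it will even finish activating the database.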
What Must Have Happened
The database had last been activated, via a connection, 10 days ago. This means that at the time of database activation, DB2 would have been able to allocate all 35 files. Looking at historical data, we were able to confirm that the filesystem had not reached 100% until 8 days ago. But what happened to the other 14 files?
The first thing we checked was whether there was a deferred change to LOGPRIMARY: maybe it had been set at 21, and someone had changed it to 35 without deactivating and reactivating the database. This was easy to check with the SHOW DETAIL option on the GET DB CFG command. The relevant portion showed us this:
Log file size (4KB)                         (LOGFILSIZ) = 80000        80000
Number of primary log files                (LOGPRIMARY) = 35           35
Number of secondary log files               (LOGSECOND) = 45           45
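For reference, this is the kind of command that produces that output; a sketch, again assuming the database is named SAMPLE. With SHOW DETAIL, the first column of values is the current value and the second is the delayed value, so matching columns mean no change is pending:

db2 connect to SAMPLE
db2 get db cfg for SAMPLE show detail | grep -iE 'LOGFILSIZ|LOGPRIMARY|LOGSECOND'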
An additional possibility considered was that there was indeed a deferred change pending, but that it wasn’t showing up in the output. One of the DBAs on the team had seen a bug like that in action.
As we looked at the DB2 diagnostic log, though, we saw these error messages:
2016-01-16-17.37.39.438128-300 I22254390A416      LEVEL: Error
PID     : 27322                TID : 433231       PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000         DB   : SAMPLE
EDUID   : 433231               EDUNAME: db2loggr (SAMPLE) 0
FUNCTION: DB2 UDB, data protection services, sqlpgadf, probe:1970
RETCODE : ZRC=0x850F000C=-2062614516=SQLO_DISK "Disk full."
          DIA8312C Disk was full.

2016-01-16-17.37.39.438237-300 I22254807A388      LEVEL: Info
PID     : 27322                TID : 433231       PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000         DB   : SAMPLE
EDUID   : 433231               EDUNAME: db2loggr (SAMPLE) 0
FUNCTION: DB2 UDB, data protection services, sqlpgadf, probe:1970
DATA #1 : Last log file not required: S0063128.LOG.

2016-01-16-17.37.39.438331-300 I22255196A395      LEVEL: Info
PID     : 27322                TID : 433231       PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000         DB   : SAMPLE
EDUID   : 433231               EDUNAME: db2loggr (SAMPLE) 0
FUNCTION: DB2 UDB, data protection services, sqlpgasn, probe:190
DATA #1 : Error -2062614516 from add active log is ignored.
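As an aside, the easiest way to fish entries like these out of a busy db2diag.log is the db2diag tool’s filtering options; something along these lines, pulling error-level records from the last 10 days (check the filter syntax against your version before trusting my memory of it):

db2diag -g level=Error -H 10d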
So here is my theory on this issue:
- DB2 was activated with the full complement of active transaction log files 10 days ago
- At some point DB2 went to archive a transaction log file while the filesystem was full
- Because other databases are using the same filesystem for transaction logging, the very instant any space became available, it started to be used by those other databases
- Creating a new transaction log file is actually an mv operation where an older transaction log file is renamed to make the new one
- Based on past experience, I’m not sure that HP-UX’s mv is atomic, even when source and target are on the same filesystem
- During the process of that rename, the filesystem became too full for that log file to exist
- The old log file disappeared, but the new one was not created.
- Instead of crashing, DB2 miraculously stayed up.
Now this is a theory. I would have sworn before this that mv is atomic and wouldn’t allow this to happen, and that if it somehow did manage to happen, the DB2 database would either crash or be unable to do anything without a full complement of primary active transaction log files.
This is guessing, though; I have no other idea of how this could possibly have happened.
The reason I was previously unaware of this behavior is that my active log filesystems rarely fill up, or stay full, without me being alerted. Had this not been a non-production system for a client for whom we were just implementing monitoring, it never would have gotten to this point. Also, there may be an OS element to this issue: if mv were atomic, I’m not sure how this would happen, and we see HP-UX far less often than pretty much any of the other supported LUW operating systems.
Does anyone have theories to share, or details at the OS level or about how DB2 allocates new transaction log files and renames old ones, that would help illuminate this situation? I can’t find details on whether mv is atomic on HP-UX or not.
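One thing that might help, if anyone has an HP-UX box handy, is to simply trace the system calls mv makes. A minimal sketch of the experiment on Linux (I have not verified the equivalent tracer on HP-UX), using throwaway file names:

# Watch which system calls mv makes when source and target are on the same filesystem
touch /tmp/oldfile
strace mv /tmp/oldfile /tmp/newfile 2>&1 | grep -i rename
# A single rename()/renameat2() call is atomic per POSIX; a copy-then-delete
# would show up as open/read/write/unlink instead, and would not be atomic.

If HP-UX’s mv ever falls back to copy-and-delete, the window this theory needs would be real.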
Lessons Learned
- DB2’s ability to stay up is sometimes really astonishing
- This client should buy a lottery ticket: if the database had been deactivated, it could not have been activated again without first resolving the disk space issue
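That failure mode is easy to see for yourself on a sandbox: fill the active log filesystem, deactivate, and try to come back. A sketch, with the database name and the error number recalled from memory rather than tested here:

db2 deactivate db SAMPLE
db2 activate db SAMPLE
# With the log filesystem still full, activation should fail -- likely with
# SQL1762N, "Unable to connect to database because there is not enough space
# to allocate active log files."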
I believe DB2 only checks for log space, and the number of primary logs, during activation, not while it is running. Except, of course, when a transaction is running and spans multiple log files. When a query is running and it fills up the log files, the transaction will fail with SQL0968C, but DB2 does not crash. In this case, though, the SQL0968C was returned during the rename phase.
I suspect what happened is that when the log file could not be renamed, the “sqlpgasn” function simply failed and an Informational-level message was written to the db2diag.log. Once the most recent log file was archived successfully, the history file was updated and DB2 moved on. DB2 might have crashed once transactions reached the log file that could not be renamed. I honestly do not know whether DB2 will attempt to assign that log file number again after that function has failed. Reproducing this would be tricky, but it would be interesting to see a trace file on it.
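For anyone who does try to reproduce it, a rough sketch of how that trace could be captured, run while forcing a log archive against a nearly full log filesystem (the database name and buffer size are just examples):

db2trc on -l 8m
db2 archive log for db SAMPLE
db2trc dmp logassign.dmp
db2trc off
db2trc flw logassign.dmp logassign.flw
db2trc fmt logassign.dmp logassign.fmt

The formatted output would show whether sqlpgasn retries the rename or simply skips that log file number.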
More research is needed.
Thanks for sharing your interesting find!
-Eric S