Three HADR Failures

When you move DB2 servers from one hosting provider to another, you have to re-do HADR. Along with another DBA, I had six production databases across three instances to move last Saturday night, along with their HADR standby databases. The entire change had been tested ahead of time at least once, and probably more like three times, with no hiccups. But when we went to do it for real, we ran into three different HADR failures, so I thought I'd share them with my readers. Our solutions worked, but there may have been other options.

HADR isn’t “configured” properly

This is actually a common error message when you're first setting up HADR. In that case, it most often means you have a typo in one of the HADR parameters in the database configuration. After HADR appears to have started successfully, it stops, and then you get errors like this in the db2diag.log:

2012-09-30- I33173537E425      LEVEL: Error
PID     : 9586                 TID  : 182959892064  PROC : db2hadrs (PWPTST) 0
INSTANCE: db2wps1              NODE : 000         DB   : PWPTST
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduS, probe:20750
          "One or both databases of the HADR pair is configured incorrectly"

In this case, I checked the HADR configuration and found it was all correct.


On srv-db2p05:

$ db2 get db cfg for PWPTST |grep HADR
 HADR database role                                      = STANDARD
 HADR local host name                  (HADR_LOCAL_HOST) = srv-db2p05
 HADR local service name                (HADR_LOCAL_SVC) = 18950
 HADR remote host name                (HADR_REMOTE_HOST) = srv-db2p08
 HADR remote service name              (HADR_REMOTE_SVC) = 18951
 HADR instance name of remote server  (HADR_REMOTE_INST) = db2wps1
 HADR timeout value                       (HADR_TIMEOUT) = 120
 HADR log write synchronization mode     (HADR_SYNCMODE) = NEARSYNC


On srv-db2p08:

$ db2 get db cfg for PWPTST |grep HADR
 HADR database role                                      = STANDARD
 HADR local host name                  (HADR_LOCAL_HOST) = srv-db2p08
 HADR local service name                (HADR_LOCAL_SVC) = 18951
 HADR remote host name                (HADR_REMOTE_HOST) = srv-db2p05
 HADR remote service name              (HADR_REMOTE_SVC) = 18950
 HADR instance name of remote server  (HADR_REMOTE_INST) = db2wps1
 HADR timeout value                       (HADR_TIMEOUT) = 120
 HADR log write synchronization mode     (HADR_SYNCMODE) = NEARSYNC

But I've seen these before, so I knew to check /etc/hosts. My HADR configuration uses host names, and DB2 resolves those host names to IP addresses using the hosts file (/etc/hosts). In this case, /etc/hosts was correct on the standby server, but incorrect on the primary server. If you didn't already know, this is confirmation that the standby talks to the primary even before you get HADR started on the primary; I went through all of this before even issuing the command to start HADR on the primary server.

As soon as the hosts file was corrected, HADR stayed up on the standby, and I was able to start HADR on the primary and watch them sync up.
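A quick way to catch this class of problem before starting HADR is to confirm that both HADR host names appear in the hosts file on each server. A minimal sketch (the host names are the ones from the configuration shown above; the HOSTS_FILE variable is my own addition so the check can be pointed at a copy of the file):

```shell
# Check that both HADR host names appear in the hosts file.
# HOSTS_FILE defaults to /etc/hosts; the host names below are
# the ones from the HADR configuration shown earlier.
HOSTS_FILE="${HOSTS_FILE:-/etc/hosts}"
for h in srv-db2p05 srv-db2p08; do
    if grep -qw "$h" "$HOSTS_FILE"; then
        echo "$h: found in $HOSTS_FILE"
    else
        echo "$h: MISSING from $HOSTS_FILE"
    fi
done
```

Comparing the resolved addresses on the primary and the standby (for example with getent hosts srv-db2p05 on each server) will also catch an entry that exists but points at the wrong address, which is what bit us here.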

An Odd One

This failure was the oddest of the three. When the command was issued to start HADR on the standby database, the following error was returned:

SQL1224N A database agent could not be started to service a request, or was terminated as a result of a database system shutdown or a force command.

My first thoughts on this error were to try the command again, to try a db2 terminate, and to try stopping and starting the instance (not a big deal, since this is a standby). But even after a db2stop/db2start, this message was still returned, and there were no errors in the db2diag.log. We then fell back to our general HADR setup bag of tricks. First, we took a new backup and tried restoring that to see if it would work. It didn't. Finally, we resorted to dropping the database on the standby server, restoring the same image we had been using, and re-configuring HADR. Note that if you drop your HADR standby database, you lose the HADR settings and have to set them again. That was what finally worked.
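For anyone following along, the sequence on the standby looked roughly like this. This is a hedged sketch, not a script: the backup path is an example, and I'm reusing the PWPTST database and HADR parameters from the first section as stand-ins; substitute your own database name, image location, and values.

```
db2 deactivate db PWPTST
db2 drop db PWPTST
db2 "restore db PWPTST from /path/to/backup/image replace history file"
db2 update db cfg for PWPTST using HADR_LOCAL_HOST srv-db2p08 HADR_LOCAL_SVC 18951 HADR_REMOTE_HOST srv-db2p05 HADR_REMOTE_SVC 18950 HADR_REMOTE_INST db2wps1 HADR_TIMEOUT 120 HADR_SYNCMODE NEARSYNC
db2 start hadr on db PWPTST as standby
```

The db cfg update is the step that is easy to forget: it is only needed because dropping the database wiped the HADR settings that the earlier restores had preserved.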

Bad Log File

The third failure we saw on Saturday night was a bad log file. After a restore, DB2 should delete all log files in the active log path, and it usually gives you a warning message that it is doing this. But occasionally, for whatever reason, it doesn't. In this case, HADR again looked like it started successfully, but then went down with the following in the db2diag.log:

2012-09-30- I16080925E381      LEVEL: Error
PID     : 4826                 TID  : 182959892064  PROC : db2hadrs (TS256P01) 0
INSTANCE: db2inst1             NODE : 000         DB   : TS256P01
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduEntry, probe:21150
RETCODE : ZRC=0x87800148=-2021654200=HDR_ZRC_BAD_LOG
          "HADR standby found bad log"

In this case, at least, the solution was simple. We deleted all the logs on the standby, did the restore again, and it worked. As much as I hate deleting log files (especially from the active log path), it was necessary.
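In command terms, the fix looked something like this. The log and backup paths here are illustrative, not what we actually used; always confirm the active log path from the database configuration before deleting anything, and only do this on a standby you are about to re-restore anyway.

```
db2 get db cfg for TS256P01 | grep "Path to log files"
db2 deactivate db TS256P01
rm /path/to/active/logs/S*.LOG
db2 "restore db TS256P01 from /path/to/backup/image replace history file"
db2 start hadr on db TS256P01 as standby
```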


No matter how many times you test a change, you can still run into issues when you execute it in production. It is important to be prepared to try different solutions. Sometimes you never learn the true root cause; you simply have a set of solutions that have worked in the past for a given problem, and you work through them until something works.

Special thanks to Jim Reutener, whom I was assisting in this situation.

Ember Crooks

Ember is always curious and thrives on change. She has built internationally recognized expertise in IBM Db2, spent a year working with high-volume MySQL, and is now learning Snowflake. Ember shares both posts about her core skill sets and her journey learning Snowflake.

Ember lives in Denver and works from home.



  1. Hello Ember,
    I have an issue with the BAD logs on standby, we restored the database from offline backup, but the HADR breaks every night at midnight:

    2022-08-15- I15975827A499 LEVEL: Info
    PID : 18874738 TID : 69058 PROC : db2sysc 0
    INSTANCE: bcsdist1 NODE : 000 DB : AUDITDB
    APPHDL : 0-8164 APPID: *LOCAL.DB2.220813131842
    HOSTNAME: utinip2s
    EDUID : 69058 EDUNAME: db2shred (AUDITDB) 0
    FUNCTION: DB2 UDB, recovery manager, sqlpshrEdu, probe:45900
    DATA #1 :
    Partial log record in log stream 0 has been wiped at lso 4579070843608

    2022-08-15- I15976327A685 LEVEL: Warning
    PID : 18874738 TID : 35403 PROC : db2sysc 0
    INSTANCE: ###### NODE : 000 DB : #####
    HOSTNAME: #####
    EDUID : 35403 EDUNAME: db2logmgr (#####) 0
    FUNCTION: DB2 UDB, data protection services, sqpLogMgrEdu::sqlpgRetrieveLogFile, probe:4062
    MESSAGE : ZRC=0x8010019D=-2146434659=SQLP_LOG_NOT_IN_ARCHIVE
    “Log extent not found in archive.”
    DATA #1 : String, 38 bytes
    Search on disk did not find any chain.
    DATA #2 : db2LogStreamIDType, PD_TYPE_DB2_LOG_STREAM_ID, 2 bytes

    • It sounds like Db2 is trying to retrieve logs from the archive and is not able to do so. I wonder if the timing might be connected to a backup that archives a bunch of logs? The most correct solution would be to make sure that the standby can access the archive log location.
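One way to start diagnosing that is to confirm how log archiving and retrieval are configured on the standby, and then watch whether retrieval succeeds. A sketch (AUDITDB is the database name from the diag entry above; the parameters to inspect are examples of the usual suspects):

```
db2 get db cfg for AUDITDB | grep -E "LOGARCHMETH1|LOGARCHMETH2|OVERFLOWLOGPATH"
db2pd -db AUDITDB -hadr
```

If LOGARCHMETH1 points at a location the standby server cannot reach (a disk path that only exists on the primary, or a TSM/vendor configuration that is missing on the standby), the retrieve will fail exactly as shown in the SQLP_LOG_NOT_IN_ARCHIVE message.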
