When Both HADR Servers are Rebooted at the Same Time, DB2 Won’t Work

I have a client that has been having some issues that cause both HADR servers to go down at the same time. Ignoring for the moment that it’s a bad idea to share resources in a way that can cause this to happen, they seem frustrated that when both servers come back up, DB2 does not automatically come back up. If even one of my clients sees this as a basic failing of DB2 HADR, there must be others out there who do too, so I thought I would explain it.

When Both HADR Servers are Rebooted (or crash) at Once

IBM clearly did not design HADR to protect against failures of both servers at once. HADR protects against the failure of one server at a time, and with TSA (configured through db2haicu) it can also automate the takeover. What looks like a failure of both servers at once can actually be a network issue, so DB2 guards against the possibility of two database servers actively taking transactions that cannot later be reconciled (a nightmarish condition called ‘Split-Brain’). That protection prevents the primary database from being started until one of the following happens:

  1. HADR starts on the standby server and the primary server is able to talk to it:
    (on standby) db2 "START HADR ON DATABASE SAMPLE AS STANDBY"

  OR

  2. A start command is issued with a “by force” keyword, potentially breaking HADR and introducing the possibility of Split-Brain:
    (on primary) db2 "START HADR ON DATABASE SAMPLE AS PRIMARY BY FORCE"

Please use the “by force” option with extreme care.

I happen to agree with IBM on the approach they’ve taken here.

The problem comes in with the next part:
When DB2 starts on the Primary, HADR is also automatically started. BUT when DB2 starts on the standby, HADR DOES NOT START, and MUST BE STARTED MANUALLY.

Now that is the effect. What actually happens is that HADR is started on either database when the database is activated. On the Primary, connections from the application activate the database pretty immediately. On the standby, the only thing that can activate the database is an explicit “activate database” command. So one solution here may be to add explicit database activation to your startup scripts.
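One way to add explicit activation to a startup script is sketched below. Treat it as a minimal sketch rather than a tested production script. It assumes it runs as the instance owner after db2start has completed, it uses the SAMPLE database name from the commands above, and the function name activate_hadr_db is made up for illustration. The return-code check relies on the db2 command line processor returning 4 or higher for errors and lower values for success or warnings (for example, when the database is already active).

```shell
#!/bin/sh
# Sketch: explicitly activate a database so that HADR starts along with it.
# Assumptions: runs as the instance owner, DB2 environment is loaded,
# and the instance itself has already been started.
# activate_hadr_db is a hypothetical helper name.

activate_hadr_db() {
    dbname="$1"
    rc=0
    db2 activate database "$dbname" || rc=$?
    # CLP return codes: 0 = success, 1 and 2 = warnings, 4 and 8 = errors
    if [ "$rc" -lt 4 ]; then
        echo "activated $dbname"
    else
        echo "activation of $dbname failed (rc=$rc); check db2diag.log" >&2
        return 1
    fi
}

# Usage from a boot-time script, once the instance is up:
#   activate_hadr_db SAMPLE
```

Keep in mind that on the primary, activation can only succeed once HADR on the standby is up and reachable, so in a double-reboot scenario the standby's script has to have done its work first.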

This difference in how the two databases get activated is what causes the problem. If HADR started automatically on the standby, then there would be no problem in a double-reboot scenario. But again, HADR is not designed to protect against double-outage scenarios, so it makes sense to me that an actual person has to be engaged and involved in these situations.

If Only the Standby Crashes

This is one of the failure scenarios HADR is designed to deal with. If only the standby crashes, there is no outage. However, when the standby comes back up, HADR itself will be down. This is one of the reasons why it is critical to monitor HADR itself. I treat an HADR outage as if it were a sev 1, to be fixed day or night, at least for systems where I expect takeover to occur in a few minutes or less. If it is caught soon enough, only a single command has to be issued to manually start HADR on the standby database. If HADR has been down too long (‘too long’ depends heavily on your transaction volume), then you have to take a backup of the primary and restore it on the standby before HADR will start.
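As an illustration of what monitoring HADR itself can look like, here is a small sketch that reads the output of db2pd -db SAMPLE -hadr and raises an alert when the state is not PEER. It assumes the "NAME = VALUE" output format that db2pd prints in DB2 10.1 and later (older versions lay the output out differently), and the function names hadr_state and check_hadr are invented for this example.

```shell
#!/bin/sh
# Sketch: flag an HADR problem from `db2pd -db <name> -hadr` output.
# Assumes the "NAME = VALUE" layout of DB2 10.1 and later.
# hadr_state and check_hadr are hypothetical helper names.

# Read db2pd output on stdin and print the value of the first HADR_STATE line.
hadr_state() {
    awk -F= '$1 ~ /^[[:space:]]*HADR_STATE[[:space:]]*$/ {
        gsub(/[[:space:]]/, "", $2); print $2; exit
    }'
}

# Read db2pd output on stdin; succeed only when the state is PEER.
check_hadr() {
    state=$(hadr_state)
    if [ "$state" = "PEER" ]; then
        echo "OK: HADR state is PEER"
    else
        echo "ALERT: HADR state is ${state:-unknown}"
        return 1
    fi
}

# Usage, for example from cron on either server:
#   db2pd -db SAMPLE -hadr | check_hadr || <send a page or email>
```

PEER is the healthy state for the synchronous modes. A database running in SUPERASYNC mode never reaches PEER, so the check would need adjusting there.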

If Only the Primary Crashes

This is the failure HADR is meant to protect against. When using db2haicu/TSA, if the Primary database server crashes or is rebooted, the database will fail over to the standby. The time it takes to fail over depends on two settings in the DB cfg and the transaction volume on the database. When the primary database server comes back up, it checks with the standby before it starts, learns that it is now the new standby, and starts itself as a standby.

TSA States for Unexpected Failures

TSA copes with many different failure scenarios most of the time, but some failures can leave it stuck in a “pending online” or other unhealthy state. When restarting HADR, always check TSA status using the lssam command or db2pd -ha.
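To make that check less dependent on eyeballing the output, a sketch like the following can pick out suspect lines from lssam. The list of state names in the pattern is my assumption about what lssam reports for unhealthy resources, and the function name tsa_unhealthy is made up for this example.

```shell
#!/bin/sh
# Sketch: print lssam lines that show a resource in a suspect state.
# The state names matched here are an assumption about lssam's wording.
# tsa_unhealthy is a hypothetical helper name.

# Reads lssam output on stdin. Prints the suspect lines and returns 0 when
# at least one is found; returns non-zero when everything looks settled.
tsa_unhealthy() {
    grep -E 'Pending (online|offline)|Failed offline|Stuck online|Unknown'
}

# Usage:
#   if lssam | tsa_unhealthy; then
#       echo "TSA needs attention before HADR is restarted"
#   fi
```

A resource that is simply Offline is not flagged, since one side of an HADR pair is normally offline in lssam output.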

How to Properly Stop TSA if you’re Stopping Things on Purpose

It is best for TSA if you stop it fully before a planned outage of one or more servers. Details on how to do that are in this document: https://datageek.blog/wp-content/uploads/2012/01/Shutdown-Procedure-for-an-Automated-HADR-Environment_11152010.pdf

Simply put, DB2 is not a product you can set up and expect to run without human involvement. It requires frequent attention from a DBA, especially in outage scenarios.

Ember is always curious and thrives on change. She has built internationally recognized expertise in IBM Db2, and is now pivoting to focus on learning MySQL. Ember shares both posts about her core skill set and her journey learning MySQL. Ember lives in Denver and works from home.

9 comments

  1. If my primary database (P) goes down and the standby database (S) starts working as the primary (per the HADR configuration), all applications store their data in that database.
    My primary DB (P) comes back up after 2 days, so:

    1. What is the procedure to make it primary again? Will it be managed by HADR and TSAMP?
    2. What about the data? Will both databases sync again automatically, or do I need to take a backup and restore?

    Thanks & Regards:

    1. Assuming the failover was clean or not too bad, the only thing you have to do to make P primary again is a “TAKEOVER HADR” command, though you’ll want to first make sure that HADR came back up and is in sync, and that TSAMP states are all good. You may need to do a db2haicu -disable, and then use db2haicu to re-enable it after HADR is in a good place.

      If the above conditions are not true, you have to take a backup of your current primary and restore it into the current standby. Not too painful.

  2. Hi Ember
    What happens in the following scenario
    1. Network outage on the primary
    2. Standby is issued take over hadr instruction
    3. Network restored on Primary

    I am thinking the database activation on the primary would be the point of protection, but what happens if an explicit activation was issued?

    Thanks
    John

    1. The takeover in #2 will have to use the “by force” syntax, otherwise it will fail.

      On #3, likely the original primary would communicate with the original standby and my guess is that both databases would go down with a “poison pill” message. If not that then the database on the original primary would detect that the other one was primary and go down. It is fairly likely you would have to restore the database on the original primary in order to bring the databases back in sync.

      1. Hi thanks for quick reply –
        I am concerned about the scenario where I lose connectivity to the Data Centre due to some network fault beyond my control, but I can simulate a network outage on the VM and use the console to test scenarios such as

        1. Ensure original primary DB is deactivated prior to network restore and subsequent activation.
        2. Ensure original primary DB is active prior to network restore and subsequent activation.

        I will let you know how it goes.

        Thanks
        John

  3. Hi Ember,
    When I reboot the primary, the standby server also reboots. Have you seen this strange situation before?
    HADR is working, TSA is configured.
    Thanks,

  4. Hi Ember, like your tips.
    Question: I need to do some maintenance on the DB2 primary database.
    Can I just stop HADR on the primary DB, do my table reorg, and then restart the primary?
    That way I wouldn’t have to stop HADR on the standby DB.

    Regards,

    Ray S

    1. I believe you could, but it has been a while since I’ve tried it, and I don’t have a sandbox ready to test on. Worst case, HADR would also stop on the standby, and you’d have to restart it, but I’m fairly sure it would stay up. I would recommend testing it.
