Adding an Auxilliary HADR Standby to an Existing Two-Server Cluster

Mike guest blogged for me about three and four server HADR clusters before, but I want to blog about it from the perspective of adding a third server into an existing two-server cluster. Please go visit Mike’s article for a more general view of multi-standby HADR.

In my case, the existing two-server implementation is in the same data center and is used for high availability. We’re adding a third server in another data center for DR purposes.

Terminology

Primary

The server where the database is currently up and applications are able to connect and make changes to data. This can rotate among all servers in an HADR cluster. There can be one and only one Primary server at a time.

Principal Standby

The standby HADR server that is currently the target for the defined SYNCMODE from the Primary. Only one standby may be the principal standby at a time. This can rotate among all servers in an HADR cluster.

Auxiliary Standby

Any standby HADR server that is not the principal. The only syncmode supported for auxiliary standbys is SUPERASYNC. Any server in the HADR cluster beyond the first two is an Auxiliary Standby.

Design

Before making changes, it is important to have a clear picture of exactly what your HADR cluster will look like if each of the servers is primary. This is really easy for your primary. In my case, the primary will look like this:
HADR1
Even the diagram for when the principal standby becomes the primary is fairly straight forward:
HADR2
The one that gets a bit different is when you go to another data center.
HADR3
Note that the only syncmode used when my primaries are in the other data center is SUPERASYNC. This means I would be at a greater risk of data loss. Often we use four servers for this reason, two in each data center, and often with the fourth on a delay to protect against human error.
Note also that any one of the four databases could fail over individually, in which case the others would not automatically join it. I show them all together because for the app in question we would never fail over one without failing over all of them

TSAMP and Automatic Failover

I am mostly a fan of using TSAMP and db2haicu to automatically fail over databases on Linux and UNIX systems. However, if you’re talking about more than two servers, you’re only going to be using TSAMP between the primary and principal standby, in the same data center. This fits with the real world, as often when we do a DR failover it is a human choice – we often don’t want this kind of failover happening automatically. This does mean that you must disable TSAMP before issuing a takeover from an auxiliary standby.

Restoring the Databases

To set up the third server, a backup and restore are required, or initializing a split mirror, leaving the database in a rollforward state.

Actually Making the Changes

If you make your own diagram, it won’t be hard to come up with the parameter and other changes.

Services File

You’ll want to define a port for each instance to use for HADR. I like to define different ports for each server. It doesn’t really matter what port numbers you choose, as long as they’re not used by anything else. They must be different than the port that applications use to connect into the database. Once you’ve decided on these, they should be placed in /etc/services or the services file on Windows. This reserves them so that future applications have a chance of noticing a conflict before it happens. I usually use something like this:

db2hadr_sama1   60061/tcp # Reserved for DB2 hadr failover
db2hadr_sama2   60062/tcp # Reserved for DB2 hadr failover
db2hadr_sama3   60063/tcp # Reserved for DB2 hadr failover
db2hadr_samb1   60071/tcp # Reserved for DB2 hadr failover
db2hadr_samb2   60072/tcp # Reserved for DB2 hadr failover
db2hadr_samb3   60073/tcp # Reserved for DB2 hadr failover
db2hadr_samc1   60081/tcp # Reserved for DB2 hadr failover
db2hadr_samc2   60082/tcp # Reserved for DB2 hadr failover
db2hadr_samc3   60083/tcp # Reserved for DB2 hadr failover
db2hadr_samd1   60091/tcp # Reserved for DB2 hadr failover
db2hadr_samd2   60092/tcp # Reserved for DB2 hadr failover
db2hadr_samd3   60093/tcp # Reserved for DB2 hadr failover

HADR_TARGET_LIST

HADR_TARGET_LIST is a newer parameter. It is taking over some of the older parameters we used to use for two-server HADR, so even in a two-server implementation, it should be set. There is also a trick with this parameter. It can be dynamically updated AS LONG AS IT IS ALREADY SET TO SOMETHING. This means you can dynamically bring auxiliary standbys or change the principal standby, but only if you already set the parameter for a two-server implementation. I’ve also heard that with DB2 11.1, it is no longer optional and must be set, even for a two-server implementation. If you have not set this up for a two-server implementation, you must stop and start the databases for the new value to be picked up. This means that whenever I have a two server implementation down for an upgrade or other work, I also make sure this parameter is set

The format for the HADR target list is simple. It defines the list of primaries if the server it was set on is the primary. The first server listed is the one that would be the principal standby if that server were the primary. Each server is defined by a hostname (or IP address) and a port number. Each server entry is separated by a pipe. You can mix hostnames and IP addresses, but all servers must be on the same network – you cannot use hostnames that refer to one network for some servers and a different network for other servers. If server1 has HADR_LOCAL_HOST set to ‘server1’, then an auxiliary standby cannot refer to it by a hostname that does not resolve to the same IP address as ‘server1’ does, even if that is a valid alternate IP address for server. The settings for my three-server setup will look something like this:
server1:

update db cfg for SAMPLEA using HADR_TARGET_LIST server2:60062|server3:60063 immediate

server2:

update db cfg for SAMPLEA using HADR_TARGET_LIST server1:60061|server3:60063 immediate

server3:

update db cfg for SAMPLEA using HADR_TARGET_LIST server1:60061|server2:60062 immediate

Other HADR parameters

We still need three parameters set on each database, and it’s easiest if we also set the other three, though they’re technically no longer required.
On server 3 (these are already set on the other two servers, since HADR is up and running there):

HADR_SYNCMODE SUPERASYNC 
HADR_LOCAL_HOST server3
HADR_LOCAL_SVC 60063 
HADR_REMOTE_HOST server1 
HADR_REMOTE_SVC 60061
HADR_REMOTE_INST db2inst1

Starting HADR

If you’ve planned correctly, the above changes will take place without a database change, however, you will have to stop hadr, connect to each database on the primary, issue the changes on all servers, and then start HADR for them to take effect. I have also had some issues getting the changes to take effect without a database deactivation when also changing the HADR_LOCAL_HOST on the primary.

The correct order for starting HADR in this configuration is to start HADR on the principal standby first, and then start it on the primary, and then finally start it on the auxiliary standby(s).

Monitoring

It is critical with HADR to let it run for at least 5 minutes after a change and check the state again. It is also critical to wait for it to catch up before calling any change good, in case there’s a problem.

The main difference in monitoring a three-server HADR implementation is that db2pd with the -hadr option will now show a section of output for each of the standbys. Each standby continues to only show the details of the relationship it has with the primary, without any input about the other standbys. This is what the db2pd output looks like on the primary:


Database Member 0 -- Database SAMPLEA -- Active -- Up 0 days 00:34:59 -- Date 2016-11-06-18.33.42.382822

                            HADR_ROLE = PRIMARY
                          REPLAY_TYPE = PHYSICAL
                        HADR_SYNCMODE = NEARSYNC
                           STANDBY_ID = 1
                        LOG_STREAM_ID = 0
                           HADR_STATE = PEER
                           HADR_FLAGS =
                  PRIMARY_MEMBER_HOST = server1
                     PRIMARY_INSTANCE = db2inst1
                       PRIMARY_MEMBER = 0
                  STANDBY_MEMBER_HOST = server2
                     STANDBY_INSTANCE = db2inst1
                       STANDBY_MEMBER = 0
                  HADR_CONNECT_STATUS = CONNECTED
             HADR_CONNECT_STATUS_TIME = 11/06/2016 17:58:45.856894 (1478473125)
          HEARTBEAT_INTERVAL(seconds) = 30
                     HEARTBEAT_MISSED = 0
                   HEARTBEAT_EXPECTED = 69
                HADR_TIMEOUT(seconds) = 120
        TIME_SINCE_LAST_RECV(seconds) = 26
             PEER_WAIT_LIMIT(seconds) = 0
           LOG_HADR_WAIT_CUR(seconds) = 0.000
    LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000032
   LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.009
                  LOG_HADR_WAIT_COUNT = 255
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 262088
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 262088
            PRIMARY_LOG_FILE,PAGE,POS = S0000453.LOG, 34, 18505181955
            STANDBY_LOG_FILE,PAGE,POS = S0000453.LOG, 34, 18505181955
                  HADR_LOG_GAP(bytes) = 0
     STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000453.LOG, 34, 18505181955
       STANDBY_RECV_REPLAY_GAP(bytes) = 0
                     PRIMARY_LOG_TIME = 11/06/2016 18:30:17.000000 (1478475017)
                     STANDBY_LOG_TIME = 11/06/2016 18:30:17.000000 (1478475017)
              STANDBY_REPLAY_LOG_TIME = 11/06/2016 18:30:17.000000 (1478475017)
         STANDBY_RECV_BUF_SIZE(pages) = 4096
             STANDBY_RECV_BUF_PERCENT = 0
           STANDBY_SPOOL_LIMIT(pages) = 430000
                STANDBY_SPOOL_PERCENT = 0
                   STANDBY_ERROR_TIME = NULL
                 PEER_WINDOW(seconds) = 280
                      PEER_WINDOW_END = 11/06/2016 18:37:57.000000 (1478475477)
             READS_ON_STANDBY_ENABLED = N

                            HADR_ROLE = PRIMARY
                          REPLAY_TYPE = PHYSICAL
                        HADR_SYNCMODE = SUPERASYNC
                           STANDBY_ID = 2
                        LOG_STREAM_ID = 0
                           HADR_STATE = REMOTE_CATCHUP
                           HADR_FLAGS =
                  PRIMARY_MEMBER_HOST = server1
                     PRIMARY_INSTANCE = db2inst1
                       PRIMARY_MEMBER = 0
                  STANDBY_MEMBER_HOST = server3
                     STANDBY_INSTANCE = db2inst1
                       STANDBY_MEMBER = 0
                  HADR_CONNECT_STATUS = CONNECTED
             HADR_CONNECT_STATUS_TIME = 11/06/2016 17:59:24.324558 (1478473164)
          HEARTBEAT_INTERVAL(seconds) = 30
                     HEARTBEAT_MISSED = 0
                   HEARTBEAT_EXPECTED = 68
                HADR_TIMEOUT(seconds) = 120
        TIME_SINCE_LAST_RECV(seconds) = 19
             PEER_WAIT_LIMIT(seconds) = 0
           LOG_HADR_WAIT_CUR(seconds) = 0.000
    LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000035
   LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.009
                  LOG_HADR_WAIT_COUNT = 255
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 262800
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 262800
            PRIMARY_LOG_FILE,PAGE,POS = S0000453.LOG, 34, 18505181955
            STANDBY_LOG_FILE,PAGE,POS = S0000453.LOG, 34, 18505181955
                  HADR_LOG_GAP(bytes) = 0
     STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000453.LOG, 34, 18505181955
       STANDBY_RECV_REPLAY_GAP(bytes) = 0
                     PRIMARY_LOG_TIME = 11/06/2016 18:30:17.000000 (1478475017)
                     STANDBY_LOG_TIME = 11/06/2016 18:30:17.000000 (1478475017)
              STANDBY_REPLAY_LOG_TIME = 11/06/2016 18:30:17.000000 (1478475017)
         STANDBY_RECV_BUF_SIZE(pages) = 4096
             STANDBY_RECV_BUF_PERCENT = 0
           STANDBY_SPOOL_LIMIT(pages) = 430000
                STANDBY_SPOOL_PERCENT = 0
                   STANDBY_ERROR_TIME = NULL
                 PEER_WINDOW(seconds) = 0
             READS_ON_STANDBY_ENABLED = N

Restrictions and Problems

It is important to realize what can and cannot be done on standby servers. I had a client who wanted an auxiliary server to be used as a test environment, and that’s not really possible without breaking HADR and re-initializing it with a backup/restore when done. Backups are also not possible. Any action which takes the standby database out of a rollforward pending state will require re-initialization.

You can enable reads on standby on an auxiliary database, but there are a number of restrictions on reads on standby – and a standby with reads enabled must be fully licensed.

Ember Crooks
Ember Crooks

Ember is always curious and thrives on change. She has built internationally recognized expertise in IBM Db2, spent a year working with high-volume MySQL, and is now learning Snowflake. Ember shares both posts about her core skill sets and her journey learning Snowflake.

Ember lives in Denver and work from home

Articles: 545

2 Comments

    • The reason I do it in this order is because I want the main pair up and working first, and don’t want to mess with the auxiliaries until I’m sure that main pair is working first. In my order, you may have the aux standbys become temporarily disconnected, but if you need to change back quickly, you’ll have some server somewhere up to date while you also revert the change on the principal standby. That’s my reasoning. I think either order functions.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.