Mike guest blogged for me about three and four server HADR clusters before, but I want to blog about it from the perspective of adding a third server into an existing two-server cluster. Please go visit Mike’s article for a more general view of multi-standby HADR.
In my case, the existing two-server implementation is in the same data center and is used for high availability. We’re adding a third server in another data center for DR purposes.
Terminology
Primary
The server where the database is currently up and applications are able to connect and make changes to data. This can rotate among all servers in an HADR cluster. There can be one and only one Primary server at a time.
Principal Standby
The standby HADR server that is currently the target for the defined SYNCMODE from the Primary. Only one standby may be the principal standby at a time. This can rotate among all servers in an HADR cluster.
Auxiliary Standby
Any standby HADR server that is not the principal. The only syncmode supported for auxiliary standbys is SUPERASYNC. Any server in the HADR cluster beyond the first two is an Auxiliary Standby.
Design
Before making changes, it is important to have a clear picture of exactly what your HADR cluster will look like if each of the servers is primary. This is really easy for your primary. In my case, the primary will look like this:
Even the diagram for when the principal standby becomes the primary is fairly straight forward:
The one that gets a bit different is when you go to another data center.
Note that the only syncmode used when my primaries are in the other data center is SUPERASYNC. This means I would be at a greater risk of data loss. Often we use four servers for this reason, two in each data center, and often with the fourth on a delay to protect against human error.
Note also that any one of the four databases could fail over individually, in which case the others would not automatically join it. I show them all together because for the app in question we would never fail over one without failing over all of them
TSAMP and Automatic Failover
I am mostly a fan of using TSAMP and db2haicu to automatically fail over databases on Linux and UNIX systems. However, if you’re talking about more than two servers, you’re only going to be using TSAMP between the primary and principal standby, in the same data center. This fits with the real world, as often when we do a DR failover it is a human choice – we often don’t want this kind of failover happening automatically. This does mean that you must disable TSAMP before issuing a takeover from an auxiliary standby.
Restoring the Databases
To set up the third server, a backup and restore are required, or initializing a split mirror, leaving the database in a rollforward state.
Actually Making the Changes
If you make your own diagram, it won’t be hard to come up with the parameter and other changes.
Services File
You’ll want to define a port for each instance to use for HADR. I like to define different ports for each server. It doesn’t really matter what port numbers you choose, as long as they’re not used by anything else. They must be different than the port that applications use to connect into the database. Once you’ve decided on these, they should be placed in /etc/services or the services file on Windows. This reserves them so that future applications have a chance of noticing a conflict before it happens. I usually use something like this:
db2hadr_sama1 60061/tcp # Reserved for DB2 hadr failover db2hadr_sama2 60062/tcp # Reserved for DB2 hadr failover db2hadr_sama3 60063/tcp # Reserved for DB2 hadr failover db2hadr_samb1 60071/tcp # Reserved for DB2 hadr failover db2hadr_samb2 60072/tcp # Reserved for DB2 hadr failover db2hadr_samb3 60073/tcp # Reserved for DB2 hadr failover db2hadr_samc1 60081/tcp # Reserved for DB2 hadr failover db2hadr_samc2 60082/tcp # Reserved for DB2 hadr failover db2hadr_samc3 60083/tcp # Reserved for DB2 hadr failover db2hadr_samd1 60091/tcp # Reserved for DB2 hadr failover db2hadr_samd2 60092/tcp # Reserved for DB2 hadr failover db2hadr_samd3 60093/tcp # Reserved for DB2 hadr failover
HADR_TARGET_LIST
HADR_TARGET_LIST is a newer parameter. It is taking over some of the older parameters we used to use for two-server HADR, so even in a two-server implementation, it should be set. There is also a trick with this parameter. It can be dynamically updated AS LONG AS IT IS ALREADY SET TO SOMETHING. This means you can dynamically bring auxiliary standbys or change the principal standby, but only if you already set the parameter for a two-server implementation. I’ve also heard that with DB2 11.1, it is no longer optional and must be set, even for a two-server implementation. If you have not set this up for a two-server implementation, you must stop and start the databases for the new value to be picked up. This means that whenever I have a two server implementation down for an upgrade or other work, I also make sure this parameter is set
The format for the HADR target list is simple. It defines the list of primaries if the server it was set on is the primary. The first server listed is the one that would be the principal standby if that server were the primary. Each server is defined by a hostname (or IP address) and a port number. Each server entry is separated by a pipe. You can mix hostnames and IP addresses, but all servers must be on the same network – you cannot use hostnames that refer to one network for some servers and a different network for other servers. If server1 has HADR_LOCAL_HOST set to ‘server1’, then an auxiliary standby cannot refer to it by a hostname that does not resolve to the same IP address as ‘server1’ does, even if that is a valid alternate IP address for server. The settings for my three-server setup will look something like this:
server1:
update db cfg for SAMPLEA using HADR_TARGET_LIST server2:60062|server3:60063 immediate
server2:
update db cfg for SAMPLEA using HADR_TARGET_LIST server1:60061|server3:60063 immediate
server3:
update db cfg for SAMPLEA using HADR_TARGET_LIST server1:60061|server2:60062 immediate
Other HADR parameters
We still need three parameters set on each database, and it’s easiest if we also set the other three, though they’re technically no longer required.
On server 3 (these are already set on the other two servers, since HADR is up and running there):
HADR_SYNCMODE SUPERASYNC HADR_LOCAL_HOST server3 HADR_LOCAL_SVC 60063 HADR_REMOTE_HOST server1 HADR_REMOTE_SVC 60061 HADR_REMOTE_INST db2inst1
Starting HADR
If you’ve planned correctly, the above changes will take place without a database change, however, you will have to stop hadr, connect to each database on the primary, issue the changes on all servers, and then start HADR for them to take effect. I have also had some issues getting the changes to take effect without a database deactivation when also changing the HADR_LOCAL_HOST on the primary.
The correct order for starting HADR in this configuration is to start HADR on the principal standby first, and then start it on the primary, and then finally start it on the auxiliary standby(s).
Monitoring
It is critical with HADR to let it run for at least 5 minutes after a change and check the state again. It is also critical to wait for it to catch up before calling any change good, in case there’s a problem.
The main difference in monitoring a three-server HADR implementation is that db2pd with the -hadr option will now show a section of output for each of the standbys. Each standby continues to only show the details of the relationship it has with the primary, without any input about the other standbys. This is what the db2pd output looks like on the primary:
Database Member 0 -- Database SAMPLEA -- Active -- Up 0 days 00:34:59 -- Date 2016-11-06-18.33.42.382822 HADR_ROLE = PRIMARY REPLAY_TYPE = PHYSICAL HADR_SYNCMODE = NEARSYNC STANDBY_ID = 1 LOG_STREAM_ID = 0 HADR_STATE = PEER HADR_FLAGS = PRIMARY_MEMBER_HOST = server1 PRIMARY_INSTANCE = db2inst1 PRIMARY_MEMBER = 0 STANDBY_MEMBER_HOST = server2 STANDBY_INSTANCE = db2inst1 STANDBY_MEMBER = 0 HADR_CONNECT_STATUS = CONNECTED HADR_CONNECT_STATUS_TIME = 11/06/2016 17:58:45.856894 (1478473125) HEARTBEAT_INTERVAL(seconds) = 30 HEARTBEAT_MISSED = 0 HEARTBEAT_EXPECTED = 69 HADR_TIMEOUT(seconds) = 120 TIME_SINCE_LAST_RECV(seconds) = 26 PEER_WAIT_LIMIT(seconds) = 0 LOG_HADR_WAIT_CUR(seconds) = 0.000 LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000032 LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.009 LOG_HADR_WAIT_COUNT = 255 SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 262088 SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 262088 PRIMARY_LOG_FILE,PAGE,POS = S0000453.LOG, 34, 18505181955 STANDBY_LOG_FILE,PAGE,POS = S0000453.LOG, 34, 18505181955 HADR_LOG_GAP(bytes) = 0 STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000453.LOG, 34, 18505181955 STANDBY_RECV_REPLAY_GAP(bytes) = 0 PRIMARY_LOG_TIME = 11/06/2016 18:30:17.000000 (1478475017) STANDBY_LOG_TIME = 11/06/2016 18:30:17.000000 (1478475017) STANDBY_REPLAY_LOG_TIME = 11/06/2016 18:30:17.000000 (1478475017) STANDBY_RECV_BUF_SIZE(pages) = 4096 STANDBY_RECV_BUF_PERCENT = 0 STANDBY_SPOOL_LIMIT(pages) = 430000 STANDBY_SPOOL_PERCENT = 0 STANDBY_ERROR_TIME = NULL PEER_WINDOW(seconds) = 280 PEER_WINDOW_END = 11/06/2016 18:37:57.000000 (1478475477) READS_ON_STANDBY_ENABLED = N HADR_ROLE = PRIMARY REPLAY_TYPE = PHYSICAL HADR_SYNCMODE = SUPERASYNC STANDBY_ID = 2 LOG_STREAM_ID = 0 HADR_STATE = REMOTE_CATCHUP HADR_FLAGS = PRIMARY_MEMBER_HOST = server1 PRIMARY_INSTANCE = db2inst1 PRIMARY_MEMBER = 0 STANDBY_MEMBER_HOST = server3 STANDBY_INSTANCE = db2inst1 STANDBY_MEMBER = 0 HADR_CONNECT_STATUS = CONNECTED HADR_CONNECT_STATUS_TIME = 11/06/2016 17:59:24.324558 (1478473164) HEARTBEAT_INTERVAL(seconds) = 30 HEARTBEAT_MISSED = 0 HEARTBEAT_EXPECTED = 68 HADR_TIMEOUT(seconds) = 120 TIME_SINCE_LAST_RECV(seconds) = 19 PEER_WAIT_LIMIT(seconds) = 0 LOG_HADR_WAIT_CUR(seconds) = 0.000 LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000035 LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.009 LOG_HADR_WAIT_COUNT = 255 SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 262800 SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 262800 PRIMARY_LOG_FILE,PAGE,POS = S0000453.LOG, 34, 18505181955 STANDBY_LOG_FILE,PAGE,POS = S0000453.LOG, 34, 18505181955 HADR_LOG_GAP(bytes) = 0 STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000453.LOG, 34, 18505181955 STANDBY_RECV_REPLAY_GAP(bytes) = 0 PRIMARY_LOG_TIME = 11/06/2016 18:30:17.000000 (1478475017) STANDBY_LOG_TIME = 11/06/2016 18:30:17.000000 (1478475017) STANDBY_REPLAY_LOG_TIME = 11/06/2016 18:30:17.000000 (1478475017) STANDBY_RECV_BUF_SIZE(pages) = 4096 STANDBY_RECV_BUF_PERCENT = 0 STANDBY_SPOOL_LIMIT(pages) = 430000 STANDBY_SPOOL_PERCENT = 0 STANDBY_ERROR_TIME = NULL PEER_WINDOW(seconds) = 0 READS_ON_STANDBY_ENABLED = N
Restrictions and Problems
It is important to realize what can and cannot be done on standby servers. I had a client who wanted an auxiliary server to be used as a test environment, and that’s not really possible without breaking HADR and re-initializing it with a backup/restore when done. Backups are also not possible. Any action which takes the standby database out of a rollforward pending state will require re-initialization.
You can enable reads on standby on an auxiliary database, but there are a number of restrictions on reads on standby – and a standby with reads enabled must be fully licensed.
In IBM doc:
https://www.ibm.com/support/pages/steps-adding-new-auxiliary-standby-existing-db2-hadr-pair
It states
“The correct order for starting HADR is to start HADR on the principal standby first, then the Aux standby, and then start it on the primary”,
which is different from yours.
Apparently yours is correct. Would you elaborate.
The reason I do it in this order is because I want the main pair up and working first, and don’t want to mess with the auxiliaries until I’m sure that main pair is working first. In my order, you may have the aux standbys become temporarily disconnected, but if you need to change back quickly, you’ll have some server somewhere up to date while you also revert the change on the principal standby. That’s my reasoning. I think either order functions.