Part 3 in this series is a bit overdue. Parts 1 and 2 were back in April. This is a complicated topic. Please use any procedures here with extreme care, and keep in mind that if you have anything other than the standard two-server HADR-only TSA implementation, these procedures probably aren’t the best idea, as they could break other things. There will also be a Part 4 – dealing with problems after set-up.
I’m not saying I’m covering every possible failure scenario, but I’ve seen a number of different issues and wanted to share some strategies for dealing with them.
Testing automated failover
First of all, it is absolutely critical that you test your failover. The more tests you can manage, the better. I try to set up HADR, set up failover using TSA/db2haicu, and test all in the same week to keep things from getting missed.
The absolute minimum tests you should do are:
- Manual takeover, verifying the database
- Manual takeover, verifying the Commerce (or other) application
- Hard failure with inability to start (renaming executable)
If at all possible, also do the following tests:
- Power-off tests on each node
- db2_kill test on each node (with caution)
- Manual takeover by force on each node
- Network failure test on each node
- Failover under load (during load test)
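A minimal sketch of the command sequence for the manual takeover tests, assuming a placeholder database alias of SAMPLE and that the commands are run as the instance owner on the current standby. The helper name `takeover_plan` is made up for illustration, and it only prints the commands so they can be reviewed before anything is run:

```shell
#!/bin/sh
# Hypothetical helper: prints (does not execute) the commands for a
# manual takeover test. $1 = database alias, $2 = "force" for the
# takeover-by-force variant (optional).
takeover_plan() {
    db=$1
    mode=${2:-graceful}
    # HADR role and state before the takeover
    echo "db2pd -db $db -hadr"
    if [ "$mode" = "force" ]; then
        echo "db2 takeover hadr on database $db by force"
    else
        echo "db2 takeover hadr on database $db"
    fi
    # HADR role and state after the takeover
    echo "db2pd -db $db -hadr"
    # TSA's view of the resource groups
    echo "lssam"
}
```

Calling `takeover_plan SAMPLE force` prints the by-force variant. Looking at `db2pd` and `lssam` both before and after makes it easier to confirm that the roles really switched.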
See section 6 of this document for some really excellent details on testing: http://download.boulder.ibm.com/ibmdl/pub/software/dw/data/dm-0908hadrdb2haicu/HADR_db2haicu.pdf
If you just assume it will work, it probably will not. On at least three occasions, I’ve caught issues while testing failover.
Although I’ve caught three issues while testing failover, I’ve had at least twice that many during the setup process. The most common cause of failure that I’ve seen is missed steps during preparation. For nearly every problem or issue I’ve seen, I’ve gone back and added to that preparation post. The first few times I set up TSA with HADR, my preparation was mostly just gathering inputs. Then, one by one, as I saw failures, I added to the prep work. I’m still going to talk about what those missed prep work errors look like, because it’s easy to miss something. I always say that the best DBA is a detail-oriented control freak, and this is one area where that’s certainly true. If you’re having problems, literally go through the preparation post line by line on each server and see if you missed anything. Seriously, for any failure prior to testing, go through the preparation items with a fine-tooth comb on both servers.
If you go to do the preprpnode preparation step, and you get a failure like this:
# preprpnode server1.domain.com server2.domain.com
-bash: preprpnode: command not found
This likely means that your SAM installation was not completed successfully. See https://datageek.blog/2012/04/09/using-tsadb2haicu-to-automate-failover-part-1-the-preparation/ – the section called “Software Installed” – for details on how to do that.
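A quick way to spot this before starting is to check that the TSA/RSCT commands are on the PATH at all. This is a minimal sketch; the function name `missing_cmds` is made up for illustration, and the list of commands to pass in is an assumption about what your setup needs:

```shell
#!/bin/sh
# Hypothetical helper: prints the names of any commands given as
# arguments that cannot be found on the PATH (empty output = all found).
missing_cmds() {
    missing=""
    for c in "$@"; do
        if ! command -v "$c" >/dev/null 2>&1; then
            missing="$missing $c"
        fi
    done
    # Strip the leading space left by the loop
    echo "${missing# }"
}
```

For example, running `missing_cmds preprpnode lsrpdomain lssam` as root on each server and getting any output back suggests the SAM installation needs another look.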
Failure on Creating the Domain
What this looks like
> db2haicu
...
Create the domain now?
  1. Yes
  2. No
1
Creating domain prod_db2ha in the cluster ...
Creating domain failed. Refer to db2diag.log and the DB2 Information Center for details.
I don’t have excerpts from the db2diag log at this point – if anyone does, please share.
This usually means you didn’t do the preprpnode or you didn’t do it properly. Remember that the preprpnode must be done as root on both servers in this format:
# preprpnode server1.domain.com server2.domain.com
db2haicu Fails Near the End of the Setup for the Standby Server
What This Looks Like
> db2haicu
...
Retrieving high availability configuration parameter for instance db2inst1 ...
The cluster manager name configuration parameter (high availability configuration parameter) is not set. For more information, see the topic "cluster_mgr - Cluster manager name configuration parameter" in the DB2 Information Center. Do you want to set the high availability configuration parameter?
The following are valid settings for the high availability configuration parameter:
  1.TSA
  2.Vendor
Enter a value for the high availability configuration parameter:
1
Setting a high availability configuration parameter for instance db2inst1 to TSA.
Adding DB2 database partition 0 to the cluster ...
There was an error with one of the issued cluster manager commands. Refer to db2diag.log and the DB2 Information Center for details.
In the case where I most recently saw this particular failure, I had set up HADR with IP addresses. TSA/db2haicu does not seem to like or allow the use of just IP addresses, so I had to go back and redo the HADR setup using host names. I believe I’ve also seen failures here due to incorrect formatting of the hosts file or incorrect entries in db2nodes.cfg (yes, even for single-node implementations). Basically, a failure at this point most frequently means that you missed some part of the preparation steps. See https://datageek.blog/2012/04/09/using-tsadb2haicu-to-automate-failover-part-1-the-preparation/.
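The IP-address case is easy to screen for. A minimal sketch, assuming you copy the HADR_LOCAL_HOST and HADR_REMOTE_HOST values out of the `db2 get db cfg` output by hand; the function name `host_kind` is made up for illustration and only recognizes dotted IPv4 addresses:

```shell
#!/bin/sh
# Hypothetical helper: prints "ip" if the argument looks like a dotted
# IPv4 address, otherwise "name". It does not validate octet ranges.
host_kind() {
    if printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'; then
        echo ip
    else
        echo name
    fi
}
```

If `host_kind` reports `ip` for either HADR host value, that is a sign the HADR configuration should be redone with host names before running db2haicu.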
Failure on failover test while testing the Application
This one seems a bit dumb in retrospect, but I was working with someone I don’t normally, and made some assumptions that I shouldn’t have. Essentially what happened was that when we tested the failover, we saw the database come up fine every time, but the application never seemed to re-establish connections. After a couple of hours of troubleshooting, we realized that the application’s ID did not exist on the standby server, and when it was created and the passwords synced, the problem immediately went away. This holds true for just standard HADR, even if you’re not using TSA: ensure that your user ids and passwords are identical between your primary and your standby database servers.
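A minimal sketch of the existence half of that check, to be run on both the primary and the standby. The function name `user_status` is made up for illustration, and the sketch only confirms that the operating system knows the ID; it says nothing about whether the passwords match:

```shell
#!/bin/sh
# Hypothetical helper: prints "present" if the operating system can
# resolve the given user name, otherwise "missing".
user_status() {
    if id -u "$1" >/dev/null 2>&1; then
        echo present
    else
        echo missing
    fi
}
```

Running `user_status` with the application's ID on each server and comparing the two answers would have caught the problem above before the failover test.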
TSA Installation Issues
We normally install DB2 from base code, and then apply the latest FixPack (well, as long as it has been out for a month or so). On RedHat, we’ve seen issues where the version of RedHat we’re using doesn’t support the version of TSA that comes with the base code. So when we install DB2, it gives an error message that the TSA/SAM component could not be installed. Luckily, the version of TSA that comes with FixPack 4 and later is supported with that version of RedHat. But the FixPack does not automatically install it, of course. So for servers where we want to use TSA, we have to install the DB2 base code, install the FixPack, and then install the TSA/SAM component from the FixPack code using this procedure: https://datageek.blog/2012/04/09/using-tsadb2haicu-to-automate-failover-part-1-the-preparation/ – the section called “Software Installed”
Ultimately, I know that I don’t fully understand at least half of the failures I’ve seen. I need to see what information I can find on pure TSA so that I really understand what to do and all of the states. I would love it if there were some education offered for this at the conference or even just in a webcast. So what I really have are a series of things that I try when a failure occurs. Some I’ve already mentioned above.
- Go through the prep work with a fine tooth comb: https://datageek.blog/2012/04/09/using-tsadb2haicu-to-automate-failover-part-1-the-preparation/. This includes:
- Double and triple check that you have picked either the server’s short name or the server’s long name and are using it consistently in each of:
- HADR configuration parameters in db cfg
- db2nodes.cfg (in $HOME/sqllib)
- Results of the ‘hostname’ command
- Double check that you successfully executed the preprpnode command on both hosts
- Double check that you successfully executed the db2cptsa command on both hosts
- Start Over. Delete your TSA work using the -delete option on db2haicu and start over with db2haicu fresh
[db2inst1@403238-Prod-db2 ~]$ db2haicu -delete
Welcome to the DB2 High Availability Instance Configuration Utility (db2haicu).
You can find detailed diagnostic information in the DB2 server diagnostic log file called db2diag.log. Also, you can use the utility called db2pd to query the status of the cluster domains you create.
For more information about configuring your clustered environment using db2haicu, see the topic called 'DB2 High Availability Instance Configuration Utility (db2haicu)' in the DB2 Information Center.
db2haicu determined the current DB2 database manager instance is db2inst1. The cluster configuration that follows will apply to this instance.
When you use db2haicu to configure your clustered environment, you create cluster domains. For more information, see the topic 'Creating a cluster domain with db2haicu' in the DB2 Information Center. db2haicu is searching the current machine for an existing active cluster domain ...
db2haicu found a cluster domain called prod_db2ha on this machine. The cluster configuration that follows will apply to this domain.
Deleting the domain prod_db2ha from the cluster ...
Deleting the domain prod_db2ha from the cluster was successful.
All cluster configurations have been completed successfully. db2haicu exiting ...
- Try uninstalling and re-installing the TSA/SAM component
- Uninstalling looks like this:
/db2/linuxamd64/tsamp [root@server1]# ./uninstallSAM
uninstallSAM: Uninstalling System Automation on platform: x86_64
uninstallSAM: Package is not installed: sam.sappolicy
uninstallSAM: Uninstalling sam.adapter-220.127.116.11-08261.i386
uninstallSAM: Uninstalling sam.msg.de_DE-18.104.22.168-0.i386 sam.msg.de_DE.ISO-8859-1-22.214.171.124-0.i386 sam.msg.de_DE@euro-126.96.36.199-0.i386 sam.msg.de_DE.UTF-8-188.8.131.52-0.i386
uninstallSAM: Uninstalling sam.msg.es_ES-184.108.40.206-0.i386 sam.msg.es_ES.ISO-8859-1-220.127.116.11-0.i386 sam.msg.es_ES@euro-18.104.22.168-0.i386 sam.msg.es_ES.UTF-8-22.214.171.124-0.i386
uninstallSAM: Uninstalling sam.msg.fr_FR-126.96.36.199-0.i386 sam.msg.fr_FR.ISO-8859-1-188.8.131.52-0.i386 sam.msg.fr_FR@euro-184.108.40.206-0.i386 sam.msg.fr_FR.UTF-8-220.127.116.11-0.i386
uninstallSAM: Uninstalling sam.msg.it_IT-18.104.22.168-0.i386 sam.msg.it_IT.ISO-8859-1-22.214.171.124-0.i386 sam.msg.it_IT@euro-126.96.36.199-0.i386 sam.msg.it_IT.UTF-8-188.8.131.52-0.i386
uninstallSAM: Uninstalling sam.msg.ja_JP.eucJP-184.108.40.206-0.i386 sam.msg.ja_JP.UTF-8-220.127.116.11-0.i386
uninstallSAM: Uninstalling sam.msg.ko_KR.eucKR-18.104.22.168-0.i386 sam.msg.ko_KR.UTF-8-22.214.171.124-0.i386
uninstallSAM: Uninstalling sam.msg.pt_BR-126.96.36.199-0.i386 sam.msg.pt_BR.UTF-8-188.8.131.52-0.i386
uninstallSAM: Uninstalling sam.msg.zh_CN.GB2312-184.108.40.206-0.i386 sam.msg.zh_CN.GB18030-220.127.116.11-0.i386 sam.msg.zh_CN.GBK-18.104.22.168-0.i386 sam.msg.zh_CN.UTF-8-22.214.171.124-0.i386
uninstallSAM: Uninstalling sam.msg.zh_TW-126.96.36.199-0.i386 sam.msg.zh_TW.Big5-188.8.131.52-0.i386 sam.msg.zh_TW.eucTW-184.108.40.206-0.i386 sam.msg.zh_TW.UTF-8-220.127.116.11-0.i386
uninstallSAM: Uninstalling sam-18.104.22.168-08261.i386
uninstallSAM: Uninstalling rsct.opt.storagerm-22.214.171.124-08249.i386
uninstallSAM: Uninstalling rsct.64bit-126.96.36.199-08249.x86_64
uninstallSAM: Uninstalling rsct.basic.msg.de_DE-188.8.131.52-0.i386 rsct.basic.msg.de_DE.ISO-8859-1-184.108.40.206-0.i386 rsct.basic.msg.de_DE@euro-220.127.116.11-0.i386 rsct.basic.msg.de_DE.UTF-8-18.104.22.168-0.i386
uninstallSAM: Uninstalling rsct.basic.msg.es_ES-22.214.171.124-0.i386 rsct.basic.msg.es_ES.ISO-8859-1-126.96.36.199-0.i386 rsct.basic.msg.es_ES@euro-188.8.131.52-0.i386 rsct.basic.msg.es_ES.UTF-8-184.108.40.206-0.i386
uninstallSAM: Uninstalling rsct.basic.msg.fr_FR-220.127.116.11-0.i386 rsct.basic.msg.fr_FR.ISO-8859-1-18.104.22.168-0.i386 rsct.basic.msg.fr_FR@euro-22.214.171.124-0.i386 rsct.basic.msg.fr_FR.UTF-8-126.96.36.199-0.i386
uninstallSAM: Uninstalling rsct.basic.msg.it_IT-188.8.131.52-0.i386 rsct.basic.msg.it_IT.ISO-8859-1-184.108.40.206-0.i386 rsct.basic.msg.it_IT@euro-220.127.116.11-0.i386 rsct.basic.msg.it_IT.UTF-8-18.104.22.168-0.i386
uninstallSAM: Uninstalling rsct.basic.msg.ja_JP.eucJP-22.214.171.124-0.i386 rsct.basic.msg.ja_JP.UTF-8-126.96.36.199-0.i386
uninstallSAM: Uninstalling rsct.basic.msg.ko_KR.eucKR-188.8.131.52-0.i386 rsct.basic.msg.ko_KR.UTF-8-184.108.40.206-0.i386
uninstallSAM: Uninstalling rsct.basic.msg.pt_BR-220.127.116.11-0.i386 rsct.basic.msg.pt_BR.UTF-8-18.104.22.168-0.i386
uninstallSAM: Uninstalling rsct.basic.msg.zh_CN.GB2312-22.214.171.124-0.i386 rsct.basic.msg.zh_CN.GB18030-126.96.36.199-0.i386 rsct.basic.msg.zh_CN.GBK-188.8.131.52-0.i386 rsct.basic.msg.zh_CN.UTF-8-184.108.40.206-0.i386
uninstallSAM: Uninstalling rsct.basic.msg.zh_TW-220.127.116.11-0.i386 rsct.basic.msg.zh_TW.Big5-18.104.22.168-0.i386 rsct.basic.msg.zh_TW.eucTW-22.214.171.124-0.i386 rsct.basic.msg.zh_TW.UTF-8-126.96.36.199-0.i386
uninstallSAM: Uninstalling rsct.basic-188.8.131.52-08249.i386
uninstallSAM: Uninstalling rsct.core.msg.de_DE-184.108.40.206-0.i386 rsct.core.msg.de_DE.ISO-8859-1-220.127.116.11-0.i386 rsct.core.msg.de_DE@euro-18.104.22.168-0.i386 rsct.core.msg.de_DE.UTF-8-22.214.171.124-0.i386
uninstallSAM: Uninstalling rsct.core.msg.es_ES-126.96.36.199-0.i386 rsct.core.msg.es_ES.ISO-8859-1-188.8.131.52-0.i386 rsct.core.msg.es_ES@euro-184.108.40.206-0.i386 rsct.core.msg.es_ES.UTF-8-220.127.116.11-0.i386
uninstallSAM: Uninstalling rsct.core.msg.fr_FR-18.104.22.168-0.i386 rsct.core.msg.fr_FR.ISO-8859-1-22.214.171.124-0.i386 rsct.core.msg.fr_FR@euro-126.96.36.199-0.i386 rsct.core.msg.fr_FR.UTF-8-188.8.131.52-0.i386
uninstallSAM: Uninstalling rsct.core.msg.it_IT-184.108.40.206-0.i386 rsct.core.msg.it_IT.ISO-8859-1-220.127.116.11-0.i386 rsct.core.msg.it_IT@euro-18.104.22.168-0.i386 rsct.core.msg.it_IT.UTF-8-22.214.171.124-0.i386
uninstallSAM: Uninstalling rsct.core.msg.ja_JP.eucJP-126.96.36.199-0.i386 rsct.core.msg.ja_JP.UTF-8-188.8.131.52-0.i386
uninstallSAM: Uninstalling rsct.core.msg.ko_KR.eucKR-184.108.40.206-0.i386 rsct.core.msg.ko_KR.UTF-8-220.127.116.11-0.i386
uninstallSAM: Uninstalling rsct.core.msg.pt_BR-18.104.22.168-0.i386 rsct.core.msg.pt_BR.UTF-8-22.214.171.124-0.i386
uninstallSAM: Uninstalling rsct.core.msg.zh_CN.GB2312-126.96.36.199-0.i386 rsct.core.msg.zh_CN.GB18030-188.8.131.52-0.i386 rsct.core.msg.zh_CN.GBK-184.108.40.206-0.i386 rsct.core.msg.zh_CN.UTF-8-220.127.116.11-0.i386
uninstallSAM: Uninstalling rsct.core.msg.zh_TW-18.104.22.168-0.i386 rsct.core.msg.zh_TW.Big5-22.214.171.124-0.i386 rsct.core.msg.zh_TW.eucTW-126.96.36.199-0.i386 rsct.core.msg.zh_TW.UTF-8-188.8.131.52-0.i386
uninstallSAM: Uninstalling rsct.core-184.108.40.206-08249.i386
uninstallSAM: Uninstalling rsct.core.utils.msg.de_DE-220.127.116.11-0.i386 rsct.core.utils.msg.de_DE.ISO-8859-1-18.104.22.168-0.i386 rsct.core.utils.msg.de_DE@euro-22.214.171.124-0.i386 rsct.core.utils.msg.de_DE.UTF-8-126.96.36.199-0.i386
uninstallSAM: Uninstalling rsct.core.utils.msg.es_ES-188.8.131.52-0.i386 rsct.core.utils.msg.es_ES.ISO-8859-1-184.108.40.206-0.i386 rsct.core.utils.msg.es_ES@euro-220.127.116.11-0.i386 rsct.core.utils.msg.es_ES.UTF-8-18.104.22.168-0.i386
uninstallSAM: Uninstalling rsct.core.utils.msg.fr_FR-22.214.171.124-0.i386 rsct.core.utils.msg.fr_FR.ISO-8859-1-126.96.36.199-0.i386 rsct.core.utils.msg.fr_FR@euro-188.8.131.52-0.i386 rsct.core.utils.msg.fr_FR.UTF-8-184.108.40.206-0.i386
uninstallSAM: Uninstalling rsct.core.utils.msg.it_IT-220.127.116.11-0.i386 rsct.core.utils.msg.it_IT.ISO-8859-1-18.104.22.168-0.i386 rsct.core.utils.msg.it_IT@euro-22.214.171.124-0.i386 rsct.core.utils.msg.it_IT.UTF-8-126.96.36.199-0.i386
uninstallSAM: Uninstalling rsct.core.utils.msg.ja_JP.eucJP-188.8.131.52-0.i386 rsct.core.utils.msg.ja_JP.UTF-8-184.108.40.206-0.i386
uninstallSAM: Uninstalling rsct.core.utils.msg.ko_KR.eucKR-220.127.116.11-0.i386 rsct.core.utils.msg.ko_KR.UTF-8-18.104.22.168-0.i386
uninstallSAM: Uninstalling rsct.core.utils.msg.pt_BR-22.214.171.124-0.i386 rsct.core.utils.msg.pt_BR.UTF-8-126.96.36.199-0.i386
uninstallSAM: Uninstalling rsct.core.utils.msg.zh_CN.GB2312-188.8.131.52-0.i386 rsct.core.utils.msg.zh_CN.GB18030-184.108.40.206-0.i386 rsct.core.utils.msg.zh_CN.GBK-220.127.116.11-0.i386 rsct.core.utils.msg.zh_CN.UTF-8-18.104.22.168-0.i386
uninstallSAM: Uninstalling rsct.core.utils.msg.zh_TW-22.214.171.124-0.i386 rsct.core.utils.msg.zh_TW.Big5-126.96.36.199-0.i386 rsct.core.utils.msg.zh_TW.eucTW-188.8.131.52-0.i386 rsct.core.utils.msg.zh_TW.UTF-8-184.108.40.206-0.i386
uninstallSAM: Uninstalling rsct.core.utils-220.127.116.11-08249.i386
uninstallSAM: Uninstalling src.msg.de_DE-18.104.22.168-0.i386 src.msg.de_DE.ISO-8859-1-22.214.171.124-0.i386 src.msg.de_DE@euro-126.96.36.199-0.i386 src.msg.de_DE.UTF-8-188.8.131.52-0.i386
uninstallSAM: Uninstalling src.msg.es_ES-184.108.40.206-0.i386 src.msg.es_ES.ISO-8859-1-220.127.116.11-0.i386 src.msg.es_ES@euro-18.104.22.168-0.i386 src.msg.es_ES.UTF-8-22.214.171.124-0.i386
uninstallSAM: Uninstalling src.msg.fr_FR-126.96.36.199-0.i386 src.msg.fr_FR.ISO-8859-1-188.8.131.52-0.i386 src.msg.fr_FR@euro-184.108.40.206-0.i386 src.msg.fr_FR.UTF-8-220.127.116.11-0.i386
uninstallSAM: Uninstalling src.msg.it_IT-18.104.22.168-0.i386 src.msg.it_IT.ISO-8859-1-22.214.171.124-0.i386 src.msg.it_IT@euro-126.96.36.199-0.i386 src.msg.it_IT.UTF-8-188.8.131.52-0.i386
uninstallSAM: Uninstalling src.msg.ja_JP.eucJP-184.108.40.206-0.i386 src.msg.ja_JP.UTF-8-220.127.116.11-0.i386
uninstallSAM: Uninstalling src.msg.ko_KR.eucKR-18.104.22.168-0.i386 src.msg.ko_KR.UTF-8-22.214.171.124-0.i386
uninstallSAM: Uninstalling src.msg.pt_BR-126.96.36.199-0.i386 src.msg.pt_BR.UTF-8-188.8.131.52-0.i386
uninstallSAM: Uninstalling src.msg.zh_CN.GB2312-184.108.40.206-0.i386 src.msg.zh_CN.GB18030-220.127.116.11-0.i386 src.msg.zh_CN.GBK-18.104.22.168-0.i386 src.msg.zh_CN.UTF-8-22.214.171.124-0.i386
uninstallSAM: Uninstalling src.msg.zh_TW-126.96.36.199-0.i386 src.msg.zh_TW.Big5-188.8.131.52-0.i386 src.msg.zh_TW.eucTW-184.108.40.206-0.i386 src.msg.zh_TW.UTF-8-220.127.116.11-0.i386
uninstallSAM: Uninstalling src-18.104.22.168-08249.i386
- For re-installing see: https://datageek.blog/2012/04/09/using-tsadb2haicu-to-automate-failover-part-1-the-preparation/ – the section called “Software Installed”
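The name-consistency item in the checklist above can be sketched as a simple comparison. The function name `names_consistent` is made up for illustration, and gathering the values (from `hostname`, db2nodes.cfg, and the HADR parameters in the db cfg) is left to you, since how you extract them depends on your environment:

```shell
#!/bin/sh
# Hypothetical helper: prints "consistent" if every argument is exactly
# the same string, otherwise "mismatch". Intended for comparing the
# server name as it appears in each place on one server.
names_consistent() {
    first=$1
    for n in "$@"; do
        if [ "$n" != "$first" ]; then
            echo mismatch
            return 0
        fi
    done
    echo consistent
}
```

For a single-partition instance, something like `names_consistent "$(hostname)" "$(awk '{print $2}' "$HOME/sqllib/db2nodes.cfg")" server1.domain.com` would compare the hostname, the db2nodes.cfg entry, and a hand-copied HADR_LOCAL_HOST value. The comparison is deliberately strict, so a short name in one place and a long name in another counts as a mismatch.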
Now that I have my prep work figured out, I can get a clean setup on the first try about 50-75% of the time. The rest of the time, I still have some sort of issue that I have to troubleshoot and deal with on setup or testing. So don’t be discouraged – just work through the issues. I hope this post can provide you with a good toolbox of things to try. Please comment or contact me if you have additional issues that you have seen and solved so others can benefit from your pain.
Other Posts In This Series
This series consists of four posts:
Using TSA/db2haicu to automate failover – Part 1: The Preparation
Using TSA/db2haicu to automate failover – Part 2: How it looks if it goes smoothly
Using TSA/db2haicu to Automate Failover Part 3: Testing, Ways Setup can go Wrong and What to do.
Using TSA/db2haicu to automate failover Part 4: Dealing with Problems After Setup
Search this blog on “TSA” for other posts on TSA issues and tips.