This is my second post in my TSA/db2haicu series. The first post, Using TSA/db2haicu to automate failover – Part 1: The Preparation, is a must-read before trying anything outlined in this post.
What You Should Have Ready After Part 1 of the Series
If you’ve done the preparation properly, you have the following already defined/done:
- HADR is set up and running, using hostnames (either fully qualified or short names)
- Properly configured Hosts file
- preprpnode run on both servers
- Public IP addresses for both servers
- Fully qualified host names for both servers
- IP Address of the Quorum Device
- (Optional) Virtual IP address and Subnet mask of that VIP
- (Optional) Private IP addresses if they’re being used
In our example, I’ll be using:
- Fully qualified host names for both servers:
174.13.101.192 spp05db01r 4032312-Prod-db1.adomainl.com
174.13.101.193 spp05db02r 4032313-Prod-db2.adomainl.com
- IP Address of the Quorum Device
174.13.101.2
- Virtual IP
174.13.101.231
- We are not using a Private Network
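Before starting db2haicu, it can help to sanity-check the prepared values. The snippet below is a minimal sketch, assuming the VIP is expected to sit on the same subnet as the servers' public addresses (that is the case in this example, but it is my assumption, not something db2haicu states). The function name is made up for illustration; the addresses and the 255.255.255.0 mask are the ones used in this post.

```python
import ipaddress

def vip_on_same_subnet(vip, netmask, server_ips):
    """Return True if every server IP falls in the subnet implied by vip/netmask."""
    # strict=False lets us pass a host address (the VIP) rather than a network address.
    network = ipaddress.ip_network(f"{vip}/{netmask}", strict=False)
    return all(ipaddress.ip_address(ip) in network for ip in server_ips)

# Values from this example setup.
servers = ["174.13.101.192", "174.13.101.193"]
print(vip_on_same_subnet("174.13.101.231", "255.255.255.0", servers))
```

If this prints False for your own values, it is worth double-checking the VIP and mask with your network team before answering the db2haicu prompts.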
Actually Running db2haicu on the Standby
The idea here is that db2haicu asks you questions, and you answer them using the information you’ve prepared ahead of time. This seems like a simple approach once you’ve done it a time or two, but can be kind of intimidating the first time.
OK, so you start out by running db2haicu on the standby database server. What it looks like is below, with notes. Inputs are highlighted in red, because they can be hard to pick out. I’ve taken this output from a real setup that I did, but may have changed IP addresses and host names to protect the innocent:
[db2inst1@4032313-Prod-db2 ~]$ db2haicu
Welcome to the DB2 High Availability Instance Configuration Utility (db2haicu).
You can find detailed diagnostic information in the DB2 server diagnostic log file called db2diag.log. Also, you can use the utility called db2pd to query the status of the cluster domains you create.
For more information about configuring your clustered environment using db2haicu, see the topic called 'DB2 High Availability Instance Configuration Utility (db2haicu)' in the DB2 Information Center.
db2haicu determined the current DB2 database manager instance is db2inst1. The cluster configuration that follows will apply to this instance.
db2haicu is collecting information on your current setup. This step may take some time as db2haicu will need to activate all databases for the instance to discover all paths ...
When you use db2haicu to configure your clustered environment, you create cluster domains. For more information, see the topic 'Creating a cluster domain with db2haicu' in the DB2 Information Center. db2haicu is searching the current machine for an existing active cluster domain ...
db2haicu did not find a cluster domain on this machine. db2haicu will now query the system for information about cluster nodes to create a new cluster domain ...
db2haicu did not find a cluster domain on this machine. To continue configuring your clustered environment for high availability, you must create a cluster domain; otherwise, db2haicu will exit.
Create a domain and continue? [1]
1. Yes
2. No
1
Create a unique name for the new domain:
prod_db2ha
Nodes must now be added to the new domain.
How many cluster nodes will the domain prod_db2ha contain?
2
Enter the host name of a machine to add to the domain:
4032312-Prod-db1.adomainl.com
Enter the host name of a machine to add to the domain:
4032313-Prod-db2.adomainl.com
db2haicu can now create a new domain containing the 2 machines that you specified. If you choose not to create a domain now, db2haicu will exit.
Create the domain now? [1]
1. Yes
2. No
1
Creating domain prod_db2ha in the cluster ...
Creating domain prod_db2ha in the cluster was successful.
You can now configure a quorum device for the domain. For more information, see the topic "Quorum devices" in the DB2 Information Center. If you do not configure a quorum device for the domain, then a human operator will have to manually intervene if subsets of machines in the cluster lose connectivity.
Configure a quorum device for the domain called prod_db2ha? [1]
1. Yes
2. No
1
The following is a list of supported quorum device types:
1. Network Quorum
Enter the number corresponding to the quorum device type to be used: [1]
1
Specify the network address of the quorum device:
174.13.101.2
Configuring quorum device for domain prod_db2ha ...
Configuring quorum device for domain prod_db2ha was successful.
The one thing that’s out of the ordinary with this setup shows up in the next section. There are two network cards on each of these servers. We will not be using the ones called ‘eth2’, only the ones called ‘bond0’. To do this through db2haicu, we have to initially say “yes” to adding the network card to a network we’re configuring, and then say “no” to the confirmation. Notice there’s no “no” option in the first place. This is a bit counter-intuitive, but that’s how it works.
The cluster manager found 4 network interface cards on the machines in the domain. You can use db2haicu to create networks for these network interface cards. For more information, see the topic 'Creating networks with db2haicu' in the DB2 Information Center.
Create networks for these network interface cards? [1]
1. Yes
2. No
1
Enter the name of the network for the network interface card: eth2 on cluster node: spp05db01r
1. Create a new public network for this network interface card.
2. Create a new private network for this network interface card.
Enter selection:
1
Are you sure you want to add the network interface card eth2 on cluster node spp05db01r to the network db2_public_network_0? [1]
1. Yes
2. No
2
Enter the name of the network for the network interface card: bond0 on cluster node: spp05db02r
1. Create a new public network for this network interface card.
2. Create a new private network for this network interface card.
Enter selection:
1
Are you sure you want to add the network interface card bond0 on cluster node spp05db02r to the network db2_public_network_0? [1]
1. Yes
2. No
1
Adding network interface card bond0 on cluster node spp05db02r to the network db2_public_network_0 ...
Adding network interface card bond0 on cluster node spp05db02r to the network db2_public_network_0 was successful.
Enter the name of the network for the network interface card: eth2 on cluster node: spp05db02r
1. db2_public_network_0
2. Create a new public network for this network interface card.
3. Create a new private network for this network interface card.
Enter selection:
2
Are you sure you want to add the network interface card eth2 on cluster node spp05db02r to the network db2_public_network_1? [1]
1. Yes
2. No
2
Enter the name of the network for the network interface card: bond0 on cluster node: spp05db01r
1. db2_public_network_0
2. Create a new public network for this network interface card.
3. Create a new private network for this network interface card.
Enter selection:
1
Are you sure you want to add the network interface card bond0 on cluster node spp05db01r to the network db2_public_network_0? [1]
1. Yes
2. No
1
Adding network interface card bond0 on cluster node spp05db01r to the network db2_public_network_0 ...
Adding network interface card bond0 on cluster node spp05db01r to the network db2_public_network_0 was successful.
Retrieving high availability configuration parameter for instance db2inst1 ...
The cluster manager name configuration parameter (high availability configuration parameter) is not set. For more information, see the topic "cluster_mgr - Cluster manager name configuration parameter" in the DB2 Information Center. Do you want to set the high availability configuration parameter?
The following are valid settings for the high availability configuration parameter:
1.TSA
2.Vendor
Enter a value for the high availability configuration parameter: [1]
1
Setting a high availability configuration parameter for instance db2inst1 to TSA.
Adding DB2 database partition 0 to the cluster ...
Adding DB2 database partition 0 to the cluster was successful.
Do you want to validate and automate HADR failover for the HADR database WCSP01? [1]
1. Yes
2. No
1
Adding HADR database WCSP01 to the domain ...
The HADR database WCSP01 has been determined to be valid for high availability. However, the database cannot be added to the cluster from this node because db2haicu detected this node is the standby for the HADR database WCSP01. Run db2haicu on the primary for the HADR database WCSP01 to configure the database for automated failover.
All cluster configurations have been completed successfully. db2haicu exiting ...
At least once, I’ve seen this final message and thought that there was a failure. This is the message we expect to see.
Actually Running db2haicu on the Primary
Once you’ve gotten db2haicu to run successfully on the Standby, you also need to run it on the primary. Here’s what that looks like:
[db2inst1@4032312-Prod-db1 ~]$ db2haicu
Welcome to the DB2 High Availability Instance Configuration Utility (db2haicu).
You can find detailed diagnostic information in the DB2 server diagnostic log file called db2diag.log. Also, you can use the utility called db2pd to query the status of the cluster domains you create.
For more information about configuring your clustered environment using db2haicu, see the topic called 'DB2 High Availability Instance Configuration Utility (db2haicu)' in the DB2 Information Center.
db2haicu determined the current DB2 database manager instance is db2inst1. The cluster configuration that follows will apply to this instance.
db2haicu is collecting information on your current setup. This step may take some time as db2haicu will need to activate all databases for the instance to discover all paths ...
When you use db2haicu to configure your clustered environment, you create cluster domains. For more information, see the topic 'Creating a cluster domain with db2haicu' in the DB2 Information Center. db2haicu is searching the current machine for an existing active cluster domain ...
db2haicu found a cluster domain called prod_db2ha on this machine. The cluster configuration that follows will apply to this domain.
Retrieving high availability configuration parameter for instance db2inst1 ...
The cluster manager name configuration parameter (high availability configuration parameter) is not set. For more information, see the topic "cluster_mgr - Cluster manager name configuration parameter" in the DB2 Information Center. Do you want to set the high availability configuration parameter?
The following are valid settings for the high availability configuration parameter:
1.TSA
2.Vendor
Enter a value for the high availability configuration parameter: [1]
1
Setting a high availability configuration parameter for instance db2inst1 to TSA.
Adding DB2 database partition 0 to the cluster ...
Adding DB2 database partition 0 to the cluster was successful.
Do you want to validate and automate HADR failover for the HADR database WCSP01? [1]
1. Yes
2. No
1
Adding HADR database WCSP01 to the domain ...
Adding HADR database WCSP01 to the domain was successful.
Do you want to configure a virtual IP address for the HADR database WCSP01? [1]
1. Yes
2. No
1
Enter the virtual IP address:
174.13.101.231
Enter the subnet mask for the virtual IP address 174.13.101.231: [255.255.255.0]
255.255.255.0
Select the network for the virtual IP 174.13.101.231:
1. db2_public_network_0
Enter selection:
1
Adding virtual IP address 174.13.101.231 to the domain ...
Adding virtual IP address 174.13.101.231 to the domain was successful.
All cluster configurations have been completed successfully. db2haicu exiting ...
Verification After Running db2haicu
There are several ways to see how TSA is functioning. You’ll get intimately familiar with these if you don’t follow the proper procedures for stopping/starting a db2haicu/TSA cluster (as defined in section 7.4 of this white paper: http://download.boulder.ibm.com/ibmdl/pub/software/dw/data/dm-0908hadrdb2haicu/HADR_db2haicu.pdf). But when I’m starting TSA/db2haicu up for the first time, I always copy the output of these to my build document so that if there are problems later, I can go back and see if they were always there or if they were introduced later. First is my favorite method of looking at things, though it does require root:
[root@4032312-Prod-db1 ~]# lssam
Online IBM.ResourceGroup:db2_db2inst1_4032312-Prod-db1.adomainl.com_0-rg Nominal=Online
    '- Online IBM.Application:db2_db2inst1_4032312-Prod-db1.adomainl.com_0-rs
        '- Online IBM.Application:db2_db2inst1_4032312-Prod-db1.adomainl.com_0-rs:spp05db01r
Online IBM.ResourceGroup:db2_db2inst1_4032313-Prod-db2.adomainl.com_0-rg Nominal=Online
    '- Online IBM.Application:db2_db2inst1_4032313-Prod-db2.adomainl.com_0-rs
        '- Online IBM.Application:db2_db2inst1_4032313-Prod-db2.adomainl.com_0-rs:spp05db02r
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCSP01-rg Nominal=Online
    |- Online IBM.Application:db2_db2inst1_db2inst1_WCSP01-rs
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCSP01-rs:spp05db01r
        '- Offline IBM.Application:db2_db2inst1_db2inst1_WCSP01-rs:spp05db02r
    '- Online IBM.ServiceIP:db2ip_174_13_101_231-rs
        |- Online IBM.ServiceIP:db2ip_174_13_101_231-rs:spp05db01r
        '- Offline IBM.ServiceIP:db2ip_174_13_101_231-rs:spp05db02r
One of the nice things about this method is that, assuming you’re on Linux, the output is color-coded on your screen, with problems showing up in yellow or red. That is helpful at a glance, and while you’re getting to know this stuff. If you have something showing “Pending-Online” status, you may be in trouble. I plan to write a complete post on this end of things in this series, so be on the lookout for that. The above is how it should look if things are OK.
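If you save lssam output to your build document anyway, the at-a-glance check can be roughly automated. This is only a sketch under assumptions: it expects one resource per line in the form shown above (optional tree characters, a state, then a name starting with IBM.), and it treats any state other than plain Online or Offline as worth a look. That is a simplification, since an Offline instance resource would also be a problem. The function name is made up for illustration.

```python
import re

# Optional tree characters, then a state (one or more words), then the resource name.
LINE = re.compile(r"^[\s|'\-]*(?P<state>[A-Za-z][A-Za-z ]*?)\s+(?P<resource>IBM\.\S+)")

def unexpected_states(lssam_text):
    """Return (state, resource) pairs for lines whose state is not Online or Offline."""
    problems = []
    for line in lssam_text.splitlines():
        match = LINE.match(line)
        if match and match.group("state") not in ("Online", "Offline"):
            problems.append((match.group("state"), match.group("resource")))
    return problems

# A few lines taken from the healthy output above.
sample = """Online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCSP01-rg Nominal=Online
    |- Online IBM.Application:db2_db2inst1_db2inst1_WCSP01-rs
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCSP01-rs:spp05db01r
        '- Offline IBM.Application:db2_db2inst1_db2inst1_WCSP01-rs:spp05db02r"""

print(unexpected_states(sample))
```

For the healthy sample nothing is flagged; a line with a pending or failed state would come back in the list along with its resource name.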
Another way to check, which I have yet to get used to reading and interpreting, can be run from the DB2 instance owner’s command line:
[db2inst1@4032312-Prod-db1 ~]$ db2pd -ha
DB2 HA Status
Instance Information:
Instance Name = db2inst1
Number Of Domains = 1
Number Of RGs for instance = 2
Domain Information:
Domain Name = prod_db2ha
Cluster Version = 2.5.1.4
Cluster State = Online
Number of nodes = 2
Node Information:
Node Name                        State
---------------------            -------------------
4032313-Prod-db2.adomainl.com    Online
4032312-Prod-db1.adomainl.com    Online
Resource Group Information:
Resource Group Name = db2_db2inst1_db2inst1_WCSP01-rg
Resource Group LockState = Unlocked
Resource Group OpState = Online
Resource Group Nominal OpState = Online
Number of Group Resources = 2
Number of Allowed Nodes = 2
Allowed Nodes
-------------
4032312-Prod-db1.adomainl.com
4032313-Prod-db2.adomainl.com
Member Resource Information:
Resource Name = db2_db2inst1_db2inst1_WCSP01-rs
Resource State = Online
Resource Type = HADR
HADR Primary Instance = db2inst1
HADR Secondary Instance = db2inst1
HADR DB Name = WCSP01
HADR Primary Node = 4032312-Prod-db1.adomainl.com
HADR Secondary Node = 4032313-Prod-db2.adomainl.com
Resource Name = db2ip_174_13_101_231-rs
Resource State = Online
Resource Type = IP
Resource Group Name = db2_db2inst1_4032312-Prod-db1.adomainl.com_0-rg
Resource Group LockState = Unlocked
Resource Group OpState = Online
Resource Group Nominal OpState = Online
Number of Group Resources = 1
Number of Allowed Nodes = 1
Allowed Nodes
-------------
4032312-Prod-db1.adomainl.com
Member Resource Information:
Resource Name = db2_db2inst1_4032312-Prod-db1.adomainl.com_0-rs
Resource State = Online
Resource Type = DB2 Partition
DB2 Partition Number = 0
Number of Allowed Nodes = 1
Allowed Nodes
-------------
4032312-Prod-db1.adomainl.com
Network Information:
Network Name               Number of Adapters
-----------------------    ------------------
db2_public_network_0       2
Node Name                  Adapter Name
-----------------------    ------------------
spp05db02r                 bond0
spp05db01r                 bond0
Quorum Information:
Quorum Name                                 Quorum State
------------------------------------        --------------------
Fail                                        Offline
db2_Quorum_Network_174_13_101_2:16_28_43    Online
Operator                                    Offline
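Since I find this output harder to read by eye, pulling out just the fields I care about can help. The sketch below assumes db2pd -ha prints one "Key = Value" pair per line, as it does on a terminal; the helper name is hypothetical and it does nothing clever with the tabular sections (nodes, networks, quorum).

```python
def values_for(db2pd_text, key):
    """Collect the value of every 'key = value' line whose key matches exactly."""
    values = []
    for line in db2pd_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            values.append(value.strip())
    return values

# A fragment of the output above.
sample = """Cluster State = Online
Resource Group Name = db2_db2inst1_db2inst1_WCSP01-rg
Resource Group LockState = Unlocked
Resource Group OpState = Online
Resource Group Nominal OpState = Online"""

# Resource groups whose operational state differs from Online would stand out here.
print(values_for(sample, "Resource Group OpState"))
print(values_for(sample, "Resource Group LockState"))
```

Comparing the list for "Resource Group OpState" against the one for "Resource Group Nominal OpState" is a quick way to spot a group that is not in the state TSA wants it to be in.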
My next post in this series will cover some of the errors and some different things to try.
Other Posts In This Series
This series consists of four posts:
Using TSA/db2haicu to automate failover – Part 1: The Preparation
Using TSA/db2haicu to automate failover – Part 2: How it looks if it goes smoothly
Using TSA/db2haicu to Automate Failover Part 3: Testing, Ways Setup can go Wrong and What to do.
Using TSA/db2haicu to automate failover Part 4: Dealing with Problems After Setup
Search this blog on “TSA” for other posts on TSA issues and tips.
Hi
Excellent article, congratulations!!
Best regards from Mexico
Expecting your part 3. (I wrote the db2cptsa:-))
Funny, I was just working on it yesterday after months of neglect. It is coming. It seems that about every 3-4 months I end up doing or helping with about 3 HADR/TSA setups in less than a month, and then nothing for another 3-4 months. Strange pattern.
Hi Ember,
when will a TSA troubleshooting part be available?
Regards
Dieter
I hope to finish writing it the last week of August. I’m not quite keeping up due to a vacation that I start tomorrow.
I too look forward to part 3. I’m doing a HADR/TSA POC (first time trying TSA) and it’s fighting me every step! Latest issue is that the db2haicu just stops here:
Create the domain now? [1]
1. Yes
2. No
1
Creating domain hadr_melaitludbpp01_domain in the cluster …
Creating domain hadr_melaitludbpp01_domain in the cluster was successful.
Without progressing to the Quorum.
If I try to rerun it, it errors, and if I try db2haicu -delete, it errors a la…
FUNCTION: DB2 UDB, oper system services, sqloInvokeVendorFunction, probe:50
MESSAGE : ZRC=0x870F0009=-2029060087=SQLO_EOF “the data does not exist”
DIA8506C Unexpected end of file was reached.
DATA #1 : String, 46 bytes
Pipe read from vendor process was interrupted.
I hope to finish writing it the last week of August. I’m a bit behind at the moment. Double check in excruciating detail the prep work that I outlined in part 1. There are scripts that come with each fixpack and the base code for uninstalling and reinstalling sam – I’ve had good luck with them.
Hi,
I have set up HADR and TSA in the same way as in your article. Everything seems fine, but I am unable to perform a takeover. I tried using db2 takeover and rgreq, but both failed. Actually, db2 takeover at least tries to switch roles, but after switching it shows an error and the primary comes back to the previous node.
Takeover was working fine before configuring TSA. During configuration of TSA takeover was successful as well, but that was the last time….
Do you have any suggestions?
If you’re on 10.5, make sure you’re on Fixpack 5. There were some issues in earlier fixpacks.
Also, call IBM support – they should be able to help you. I have seen this issue before at my old employer, but left before it was resolved. I think it was attributed to Fixpack 3 or 4 on 10.5 in that scenario.
Ember: In part 1, you write that the hosts file can be either in the form “IP Shortname Longname” or “IP Longname Shortname”. You then write “You need to have HADR set up with whatever comes first – shortname or long name.” In the above configuration, the hosts file has the Shortname first. I expected the configuration set up by db2haicu to use this name. However, the Longname was used. Can you please explain? Is DB2 still this rigid on HADR setup naming? In the original document, it is written “IP addresses don’t work.” I have seen IBM docs (including the Knowledge Center) use IP addresses (and I have used them) with no issues. Thank you for your excellent articles.
I have now seen IP addresses in use, so I would agree that they now work. I had trouble getting them to work at the time of this article.
I’m not sure what you mean by the longname being used by db2haicu – YOU specify what is used. If something other than the first entry in /etc/hosts works, that’s great. It is my experience that this doesn’t always work.
Ember: Thank you for the extremely prompt response. I do appreciate it, as well as the documentation which you have written.
Hi Ember,
During my initial configuration of TSA, I chose not to configure a VIP address with HADR. Now I need to use a VIP address, and I am curious to know whether I can add it now without having to reconfigure TSA from scratch.
Below I copied the step at which I answered NO and where I now need YES 🙂
Do you want to configure a virtual IP address for the HADR database WCSP01? [1]
1. Yes
2. No
1
Enter the virtual IP address:
174.13.101.231
Enter the subnet mask for the virtual IP address 174.13.101.231: [255.255.255.0]
255.255.255.0
Thanks in advance,
Milos
Yes, you can easily use db2haicu’s maintenance mode, without reconfiguring or an outage. Just type db2haicu, and select the option to add a VIP.
Hi Ember,
While setting up HA, I’m facing this error. Could you kindly help me here?
2019-09-27-12.54.39.439173-300 E358380280E943 LEVEL: Error
PID : 8066 TID : 139706527115136 PROC : db2haicu
INSTANCE: db2inst1 NODE : 000
HOSTNAME: HOST1
FUNCTION: DB2 UDB, high avail services, sqlhaUpdateResource2, probe:450
MESSAGE : ECF=0x90000555=-1879046827=ECF_SQLHA_UPDATE_ATTR_FAILED
Attribute update failed
DATA #1 : String, 35 bytes
Error during vendor call invocation
DATA #2 : SQLHA Cluster Session Handle, PD_TYPE_SQLHA_CLUSTER_HANDLE, 4120 bytes
sqlhaClusterHandle->clusterHandle: 1
sqlhaClusterHandle->clusterFlags: 0
sqlhaClusterHandle->clusterErrorNum: 0
sqlhaClusterHandle->errorMessage: Line # : 16222—Error Number: 33—2632-072 The operation cannot be performed because a majority of quorum nodes or configuration daemons is not currently active in the domain, or because the quorum of the domain is not currently satisfied.
I wish I had an answer for you. I haven’t seen that one.
Hi Praveen/Ember,
Do we know the solution for this issue?
I am facing the same error today; if you know of any solution, please share.
If there is no solution, I may open a PMR and share the details with you.
Thanks,
I don’t have an answer. Would love it if you’re willing to share your findings.
I was facing the same error when I tried to execute ‘db2haicu’ from the primary side, but when I tried it from the standby side the error changed. To resolve it, I found the link below, which says that this issue arises when the IP address of the server changes. As I was working on my own VM, its IP address changed after restarting the machine. So I executed the given solution, and now it is running fine.
https://www.ibm.com/support/pages/user-may-get-2632-040-node-host1-cannot-be-pinged-and-therefore-not-reachable-db2haicu-processing
This is a GREAT series of posts, again. And saved me many times from “certain death by madness”.
I would like to add some useful information that I had the pleasure to find working with this damn thing called TSAMP and Db2.
1. I can’t understand why the heck, but it seems (at least on Red Hat servers) that db2haicu ignores /etc/hosts file entries and performs low-level calls on the OS API directly to the current DNS server. This means that if you have a server called db2.server.com, you should ensure your DNS server is able to translate this FQDN into the right IP address. Check /etc/resolv.conf to ensure you’re using the correct DNS servers.
2. Ensure the /etc/hostname is filled with the correct DNS FQDN for the server
3. Ensure the instance’s ~/sqllib/db2nodes.cfg has the correct FQDN found in /etc/hostname. Many times when using an alias entry in /etc/hosts on Red Hat 7.8+ and SUSE 11+, db2haicu fails during the node configuration, saying that it couldn’t find the server. Oddly enough, it doesn’t always happen (on AIX it never happened). So, to avoid any problems, always use the FQDN that is already in /etc/hostname in the db2nodes.cfg file (you don’t need to change anything if /etc/hostname is using your DNS FQDN).
4. One of the most important requirements, and one that I, as a DBA, was ignoring completely: you’ll need to open the RIGHT TCP/UDP ports to make TSAMP/RSCT work! If you don’t open them correctly you’ll probably run into errors like “16222—Error Number: 33—2632-072 The operation cannot be performed because a majority of quorum nodes or configuration daemons is not currently active in the domain, or because the quorum of the domain is not currently satisfied”.
The default ports (if you are using default/common Db2 installs) that must be opened both ways (in/out) among all cluster server nodes are these (replace the ports on the first three lines with the ones you have already defined for Db2):
# – 60000/tcp plain connections TAG = Db2
# – 61000/tcp SSL connections TAG = Db2 + SSL
# – 62000/tcp HADR connections TAG = Db2 + SSL + HADR
# – 657/tcp RMC Daemon for TSA TAG = Db2
# – 657/udp RMC Daemon for TSA TAG = Db2
# – 1191/tcp mmfsd and mmsdrserv
# – 12347/udp cthats
# – 12348/udp cthags
5. Remember to open the firewall ports on your server (firewall-cmd in Red Hat Linux, for example) and on your network firewall appliance (Palo Alto, pfSense, etc.).
6. Last but not least: be sure to buy and apply a TSAMP license, because it will work for only 90 days after everything is set up with Db2 embedded installs!
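A quick way to test the TCP side of the port list above from one cluster node to the other is sketched below. It is a rough check under assumptions: the function names are made up, the port numbers are simply the TCP entries from the list (the first three are Db2 ports that will differ per install), and the UDP ports (657, 12347, 12348) cannot be verified this way.

```python
import socket

def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

def closed_tcp_ports(host, ports, timeout=2.0):
    """Return the ports on host that did not accept a TCP connection."""
    return [port for port in ports if not tcp_port_open(host, port, timeout)]

# TCP ports from the list above; adjust the first three to your own Db2 ports.
TCP_PORTS = [60000, 61000, 62000, 657, 1191]

# Example, run from one cluster node against the other:
# print(closed_tcp_ports("4032313-Prod-db2.adomainl.com", TCP_PORTS))
```

Keep in mind that a port also shows up as closed when nothing is listening on it yet, so a failure here points at either the firewall or a service that is not running.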
Awesome input, thanks!
On point 6, Db2 mostly comes with the license to use TSAMP for the purposes listed here. That may change as IBM moves to Pacemaker/Corosync for automating failover.