As always, when I feel pain, I share the knowledge that pain gained me with my blog readers. Man, that was a painful fixpack. I was upgrading an AIX HADR pair from 10.5 fixpack 3 to 10.5 fixpack 5. My experience has generally been that TSA is painful when patching DB2.
Problem Description
In this case, I was not patching TSAMP; a TSAMP update was not part of this upgrade. However, somehow after I had applied the DB2 fixpack to both servers, one server was missing all TSAMP objects below the domain level. This is what that looked like:
root@server2 # lsrpdomain
Name        OpState RSCTActiveVersion MixedVersions TSPort GSPort
qab2b_db2ha Online  3.2.0.0           No            12347  12348
root@server2 # lssam
lssam: 2622-541 No resource groups defined or cluster is offline!
A true WTF moment, and with less experience I would have called IBM for support. However, I know it takes at least two days to get anything useful out of IBM, so I moved forward, applying everything I knew to try to solve it myself.
Steps Tried
Being generally a prepared person, I had exported the TSAMP automation policy prior to the fixpack using this command:
root@server1 # sampolicy -s hadr_policy_YYYYMMDD.xml
However, when I tried to apply this file, it did not solve my problem:
root@server2 # sampolicy -a hadr_policy_20160111.xml
.........................................
SAMP0002E: The specified policy hadr_policy_20160111.xml is not valid.
EXPLANATION: The policy is not valid. You cannot perform any task with this policy.
USER ACTION: Try to make the policy valid by analyzing the error messages following this message. Then resubmit the command.
The following policy errors were found:
1.) SAMP0037E: The connection to the backend failed because the following exception occurred: EEZAdapterException: Return code: 0
com.ibm.eez.sdk.exceptions.EEZAdapterException: SAMA0037E No domain that is online was detected. Automation adapter is stopped.
Contains no original Exception
EXPLANATION: An exception occurred when trying to perform an operation on the backend.
USER ACTION: Analyze the exception description and try to correct the problem.
Policy could not be verified.
SAMP1005I: The activation task ends.
EXPLANATION: The automation policy could not be activated.
USER ACTION: No action required.
I also tried running db2haicu, and later db2haicu -delete. Both failed in a variety of ways depending on whether the remaining domain was up, down, or deleted. db2haicu -delete did work on the other server, where the objects had not mysteriously disappeared. I therefore resorted to killing and deleting all TSAMP resources in a harsh way. Was there a better way to manually add the missing resource groups and resources back in? Probably. But db2haicu setup is pretty quick and easy for me at this point, so I chose to kill everything and start over, since I can do that in less than an hour.
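For reference, the cleanup on the healthy node was just the standard command, run as the instance owner:

db2inst1@server1> db2haicu -delete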
Information Needed to Reconfigure TSAMP
Note that it is best to keep documentation for TSAMP, including EVERY value you need to set it up exactly the way it is. If you still have at least one server fully configured, you can get most of that information from lssam output. You cannot get the names of the network interface cards from there, so you need to have those documented or be able to figure them out. The IP of the quorum device is also not there, but it can be obtained using this command:
lsrsrc -c -Ab IBM.PeerNode
The output looks something like this:
Resource Class Persistent and Dynamic Attributes for IBM.PeerNode
resource 1:
        CommittedRSCTVersion  = ""
        ActiveVersionChanging = 0
        OpQuorumOverride      = 0
        CritRsrcProtMethod    = 1
        OpQuorumTieBreaker    = "db2_Quorum_Network_000_00_000_0:12_16_0"
        QuorumType            = 0
        QuorumGroupName       = ""
        Fanout                = 32
        OpFenceGroup          = ""
In the above, 000_00_000_0 represents the IP address of the quorum device, with underscores in place of dots. This assumes you are using a network quorum device, which is the only option that db2haicu supports, though you can configure other options manually.
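Putting that together, here is a minimal sketch of a pre-change snapshot that would capture most of what db2haicu needs. The commands are the same TSAMP/RSCT commands used above; the output file names are just examples, and the sed pattern assumes the db2_Quorum_Network naming convention shown in the lsrsrc output:

DATE=$(date +%Y%m%d)
lssam                      > lssam_${DATE}.out        # resource groups, resources, relationships
lsrpdomain                 > lsrpdomain_${DATE}.out   # domain name, state, and ports
lsrpnode                   > lsrpnode_${DATE}.out     # node names
lsrsrc -c -Ab IBM.PeerNode > peernode_${DATE}.out     # includes the quorum tiebreaker

# Recover just the quorum IP by reversing the
# db2_Quorum_Network_<ip_with_underscores> convention:
grep OpQuorumTieBreaker peernode_${DATE}.out \
  | sed -e 's/.*db2_Quorum_Network_//' -e 's/:.*//' -e 's/_/./g'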
Killing TSA Everywhere with Fire
DO NOT TRY THIS AT HOME, THIS WILL KILL YOUR TSA IMPLEMENTATION. Bad things will happen, and there is no guarantee this will work or solve your problem. Seriously, do not use this process, call IBM instead.
Here’s what I did to fully delete everything on the problem server:
root@server2 # stoprpdomain -f qab2b_db2ha
That command stops the domain that still remained in my case.
Next, I need to actually delete that domain. This command is dangerous and will delete your TSA domain:
root@server2 # rmrpdomain -f qab2b_db2ha
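For reference, the whole sequence on the broken node looks something like this sketch; the lsrpdomain verification steps are my own additions, and qab2b_db2ha is the domain name from my environment:

lsrpdomain                    # confirm the domain name and that it is Online
stoprpdomain -f qab2b_db2ha   # force-stop the remaining domain
lsrpdomain                    # wait for OpState to show Offline
rmrpdomain -f qab2b_db2ha     # destroy the domain definition (dangerous!)
lsrpdomain                    # should now return nothing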
After that, and the db2haicu -delete on the other node, I was able to actually run db2haicu and get nearly everything set up properly. The part that did not work was the virtual IP addresses: the teardown had not removed the old aliases, so they could not be re-added. There were 4 databases on this HADR pair, and per IBM's recommended configuration, each database had its own virtual IP address. When I got to that part of the db2haicu setup, it looked like this:
Select an administrative task by number from the list below:
  1. Add or remove cluster nodes.
  2. Add or remove a network interface.
  3. Add or remove HADR databases.
  4. Add or remove an IP address.
  5. Move DB2 database partitions and HADR databases for scheduled maintenance.
  6. Create a new quorum device for the domain.
  7. Destroy the domain.
  8. Exit.
Enter your selection:
4
Do you want to add or remove IP addresses to or from the cluster? [1]
1. Add
2. Remove
1
Which HADR database do you want this IP address to be associated with?
SAMPLE
Enter the virtual IP address:
192.0.2.0
The IP address 192.0.2.0 is already in use. Enter an IP address that is not used anywhere on the network for your high availability setup, or deactivate the alias and try again. db2haicu restricts active IP aliases from being added to the network to avoid IP duplication and subsequent routing issues.
Enter the virtual IP address:
192.0.2.0
The IP address 192.0.2.0 is already in use. Enter an IP address that is not used anywhere on the network for your high availability setup, or deactivate the alias and try again. db2haicu restricts active IP aliases from being added to the network to avoid IP duplication and subsequent routing issues.
Enter the virtual IP address:
So here is the neat thing about that. Even when TSAMP fails spectacularly, in every case that I have seen, the VIP addresses continue to work on whatever node was the primary at the time of the failure. This is a really, really, really, really good thing and has saved many DBAs (including myself) many, many, many times.
However, it also means that a DBA has to know how to clear them manually when they need to be reused. The IP addresses are not defined in any permanent way, so chdev is not your friend. Nor is SMIT, at least in the very limited way I know how to use it for IP addresses. Here is what worked for me:
db2inst1@server2> ifconfig -a
en2: flags=1e084863,c0
        inet 192.0.2.0 netmask 0xffffff00 broadcast 192.0.2.255
        inet 198.51.100.0 netmask 0xffffff00 broadcast 198.51.100.255
        tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
lo0: flags=e08084b,c0
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
        inet6 ::1%1/0
        tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1
This output allows you to verify that the VIP (192.0.2.0 in this example) is indeed defined at the interface level. In this example, the server's own address was actually 198.51.100.0. As with all the examples on db2commerce.com, the addresses have been changed in a consistent way to protect the innocent.
In order to remove the alias created, I used commands like this:
ifconfig en2 192.0.2.0 netmask 255.255.255.0 delete
This removed the IP address at that level, and I was then able to finish running db2haicu to re-define all of the virtual IP addresses. Note that there was a brief outage for new incoming connections between the time I ran that ifconfig delete command to remove the IP and the time I ran db2haicu to add the VIP back.
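Putting the verification and removal together, the per-VIP sequence looks something like this sketch; en2 and the addresses are from the example above, so substitute your own interface and VIP:

ifconfig en2 | grep 192.0.2.0                        # confirm the stale alias is present
ifconfig en2 192.0.2.0 netmask 255.255.255.0 delete  # remove the alias (the brief outage starts here)
ifconfig en2 | grep 192.0.2.0                        # should now return nothing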
Summary
I am not recommending that anyone try any of the steps in this post. But with these steps, I was able to resolve my issue in an hour or two, instead of the days that it would have taken me with support to get to the perfect solution.
Great information! Thanks for sharing.
Hi Ember, just want to add my experience with TSAMP config deletion, maybe it helps somebody else. In my case I had two DB2 servers set up with TSAMP, and because of some misunderstandings the servers were reconfigured to new IPs (a different range) while TSAMP was active, and then restarted. It was not a production system, so no big loss, but the problem was getting TSAMP back on. It was not possible to start the standby database or to delete the TSAMP config with db2haicu. I stopped and removed the nodes and domain using RSCT commands. On the new db2haicu config attempt I was getting an error that the second server was not reachable (it probably had an old IP stuck somewhere), so I additionally needed to clear the cluster config (/usr/sbin/rsct/install/bin/recfgct) on the nodes and redo the preprpnode command. After that, the db2haicu config worked OK. Also described here: http://www-01.ibm.com/support/docview.wss?uid=swg21385581 (my version is 10.1.5). Thank you for the effort and keep up the good work.
Wow, this is handy; I ran into the exact same issue. TSA sucks to configure, but troubleshooting it uncovers so many things I didn't know. It was good learning, but learned through pain 🙂
We set up HADR/TSA for one of our production servers, but we have an issue, described below.
Environment background
Applications : SAP
Database : DB2 11.1 in HADR with TSAMP and a virtual IP setup
OS : SUSE LINUX 12
The above setup is on the Microsoft Azure cloud.
.31 – is the primary db server
.32 – is the standby db server
Problem description
A failover issue is seen, as described below, when .32 becomes the primary.
Load balancer part of the issue
We are currently using a Microsoft Azure load balancer, which does heartbeat checks against the DB2 HADR port 55001 (HADR_LOCAL_SVC) on the .31 server. When a DB2 failover happens (.32 becoming the primary), the expectation is that port 55001 should stop listening on .31 and start listening on the .32 server, so the load balancer fails the SAP application connections over.
This is not working as per the expectation stated.
Observations made at the database during the above failover test
When there is a takeover to .32, SAP application connections do not fail over from the .31 system to the .32 database. We notice in our db2diag.log that the application connections keep trying to connect to the .31 server even after the database failover was successful.
Expectation
We are looking for transparent failover from the primary to the standby server for the SAP applications.
So, as you can observe, the configured VIP (which is the load balancer IP) still points to the old primary database even though the failover happens successfully and the standby becomes the primary. Please suggest if anything needs to be done.
Are you even using TSA? The problem is clearly with your load balancer – I’m not familiar with Azure’s load balancers or how to configure them with Db2 properly. Usually there is a lot of scripting involved with using something other than TSAMP for the heartbeat/failover.
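To give a rough idea of the scripting involved, here is a minimal, untested sketch of a role check that a custom health probe could call. It assumes the probe runs it as the instance owner, and the database name SAMPLE is a placeholder:

#!/bin/sh
# Hypothetical probe helper: exit 0 only when this node is the HADR primary,
# so the load balancer marks only the primary as healthy.
DB=SAMPLE   # placeholder database name
ROLE=$(db2pd -db ${DB} -hadr | awk -F'= *' '/HADR_ROLE/ {print $2; exit}')
if [ "$ROLE" = "PRIMARY" ]; then
    exit 0   # this node should receive application connections
else
    exit 1   # standby or unknown: fail the probe so traffic moves away
fi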