Some of my clients, instead of engaging me for day-to-day support, engage me for expert assistance only when it all really hits the fan. This issue occurred for one of those clients, who had other support performing the HADR failovers while the Linux kernel was upgraded. The version of Red Hat did not change, but the kernel did.
The kernel upgrade was entirely needed. Two servers of the four in an HADR cluster were randomly crashing every month or two, and the kernel upgrade is supposed to fix this according to some very smart Linux people.
So I got the call at 1:30 in the morning. The client tells me they have been fully down for a full hour. It was described to me that things happened in this order:
- The kernel was patched on the principal standby (aux standbys were done last week)
- Takeover was done, and the database and apps were running just fine on the former principal standby
- The kernel was patched on the original primary
- Takeover was performed to move the database back to the original primary
- The database takeover was successful, but the virtual IP addresses (VIPs) disappeared entirely. The apps could not connect to the database, causing a full outage.
I would hope that they actually disabled TSAMP during kernel patching, but I’m not 100% sure.
When I logged in, the first thing I saw was that HADR was not in a correct state. That’s a prerequisite for TSAMP working, so I investigated the problem there, to find that the DIAGPATH had become 100% full on the principal standby. We cleared up some space there, and started HADR, and HADR was then fine.
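Both conditions are easy to check before a takeover. What follows is a minimal sketch rather than a tested procedure: fs_used_pct and hadr_states are hypothetical helper names, the path and database name in the usage comments are placeholders, and the parsing assumes db2pd -hadr prints lines of the form HADR_STATE = PEER.

```shell
# Print the used percentage (without the % sign) of the filesystem
# that holds the given path, using POSIX-format df output.
fs_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print every HADR_STATE value found in db2pd -hadr style output on stdin.
# On a primary with several standbys this prints one line per standby.
hadr_states() {
    awk '$1 == "HADR_STATE" && $2 == "=" { print $3 }'
}

# Example usage (placeholders, adjust to your instance):
#   fs_used_pct /db2/diagpath
#   db2pd -db SAMPLE -hadr | hadr_states
```

Anything other than PEER from the second helper, or a DIAGPATH filesystem close to 100, is worth fixing before touching TSAMP.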
Initial Investigation and Details
Investigating TSAMP showed nothing other than a domain. lssam returned nothing. lsrpdomain returned only a domain that could not be started:
# lsrpdomain
Name    OpState RSCTActiveVersion MixedVersions TSPort GSPort
HADR_DO Offline 220.127.116.11    No            12347  12348
I’ve often found that the fastest solution for TSAMP problems when things are already down is to delete everything and set it up all over again. So after I made sure that the primary support had all the details, that’s what I did here. First, I ran db2haicu -delete on both the primary and the principal standby. That did not remove the domain, so I forcibly removed the domain like this (use extreme caution with commands like these!):
# rmrpdomain -f HADR_DO
# lsrpdomain
Name    OpState         RSCTActiveVersion MixedVersions TSPort GSPort
HADR_DO Pending offline 18.104.22.168     No            12347  12348
# lsrpdomain
This took some patience. It took minutes from issuing the rmrpdomain command until the domain was gone. Running lsrpdomain told me when it was done: once the domain was really gone, it returned nothing. This also had to be done on both the primary and the principal standby.
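The waiting can be scripted instead of re-running lsrpdomain by hand. This is a minimal sketch, assuming lsrpdomain prints a header line followed by one line per domain and exits 0 with no output once no domain is defined. domain_listed and wait_for_domain_removal are hypothetical helper names, not part of RSCT.

```shell
# Succeed if the named domain appears in lsrpdomain-style output on stdin.
# The first line is assumed to be the column header and is skipped.
domain_listed() {
    awk -v name="$1" 'NR > 1 && $1 == name { found = 1 } END { exit (found ? 0 : 1) }'
}

# Poll until the domain is gone. Arguments: domain name, seconds between
# checks (default 10), maximum number of checks (default 60).
# Returns 0 when the domain is gone, 1 on timeout, 2 if lsrpdomain fails.
wait_for_domain_removal() {
    local name="$1" interval="${2:-10}" max_tries="${3:-60}" i=0 out
    while [ "$i" -lt "$max_tries" ]; do
        out=$(lsrpdomain) || return 2
        if ! printf '%s\n' "$out" | domain_listed "$name"; then
            return 0
        fi
        sleep "$interval"
        i=$((i + 1))
    done
    return 1
}

# Example usage, as root on each node after rmrpdomain:
#   wait_for_domain_removal HADR_DO && echo "domain removed"
```

The timeout matters: if the domain never leaves the pending state, the loop gives up and returns 1 rather than hanging the session.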
When I went to re-configure TSAMP using db2haicu, it gave me this error trying to create the domain:
...
Create the domain now?
1. Yes
2. No
1
Creating domain 'HADR_DO' in the cluster ...
Updating attribute failed. Refer to db2diag.log for the current member and the DB2 Information Center for details.
The Real Issue
It was at this point that I realized something bigger was wrong. The client configured the VIPs at the server level on the primary server to make the database available to the applications again. I called IBM Support and got a call back from an excellent service person who knew immediately what the issue was.
I had seen tweets about this issue earlier in the week, but I didn’t realize that a kernel update would trigger it. This is the issue: https://developer.ibm.com/answers/questions/406023/tsamp-issues-after-updating-to-red-hat-74.html
Messages like these in /var/log/messages indicate this issue may be the problem:
Mar 24 06:47:25 server1 hatsd: hadms: Loading watchdog softdog, timeout = 8000 ms.
Mar 24 06:47:25 server1 hatsd: hadms: remove_watchdogs(): Call to glob() returned with value 3
Mar 24 06:47:25 server1 hatsd: hadms: remove_watchdogs(): Call to glob() returned with value 3
Mar 24 06:47:25 server1 kernel: traps: hatsd general protection ip:xxxx sp:xxxx error:0 in libc-2.17.so[7ffff617d000+1b8000]
Mar 24 06:47:25 server1 hatsd: hadms: remove_watchdogs(): Call to glob() returned with value 3
Mar 24 06:47:25 server1 hatsd: hadms: remove_watchdogs(): Call to glob() returned with value 3
Mar 24 06:47:25 server1 cthags: (Recorded using libct_ffdc.a cv 2):::Error ID: 822....xmWhO/lcK.867B3....................:::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,PMClient.C,1.117,1483 :::GS_TS_RETCODE_ER#012Connection failure between Group Services and Topology Services#012DIAGNOSTIC EXPLANATION#012hats subsystem died with hb_errno = 16, cthags will also exit.
Mar 24 06:47:25 server1 srcmstr: src_error=-9017, errno=0, module='srchevn.c'@line:'418', 0513-017 The cthats Subsystem ended abnormally. You must manually restart it.
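To check a server for the same signature, something like the following could be used. It is a minimal sketch: has_watchdog_crash_signature is a hypothetical helper name, and the three fragments it looks for are taken only from the excerpt above, so a match is a hint rather than proof.

```shell
# Succeed if a syslog file contains any of the message fragments from the
# excerpt above. Matching uses fixed strings, so timestamps and host names
# do not matter.
has_watchdog_crash_signature() {
    grep -q -F \
        -e 'remove_watchdogs(): Call to glob() returned' \
        -e 'hatsd general protection' \
        -e 'The cthats Subsystem ended abnormally' \
        "$1"
}

# Example usage (reading /var/log/messages usually requires root):
#   has_watchdog_crash_signature /var/log/messages && echo "possible match"
```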
The fix was astonishingly easy to apply. I had to download the efix for APAR IJ00283. This was a small file. After transferring it to the server and uncompressing and untarring it, I just had to execute install.sh as root:
[root@server1 IJ00283.x86_64]# ./install.sh
Install completed
After this, I was able to easily re-establish all of my TSAMP settings using db2haicu. Had I been proactive, I would have had to take the domain down to install the fix, but it was done without an outage to Db2 itself.
The biggest lesson learned here is that even though you don’t need TSAMP in a non-production environment, you need a way to test changes like this with TSAMP. This client will be adding TSAMP to a load-test environment to have a way to test this sort of change in the future.