RSCT APAR Affecting TSAMP

Some of my clients, instead of engaging me for day-to-day support, engage me for expert assistance only when it all really hits the fan. This issue occurred for one of those clients, who had other support performing the HADR failovers while the Linux kernel was upgraded. The version of Red Hat did not change, but the kernel did.

The kernel upgrade was entirely needed. Two servers of the four in an HADR cluster were randomly crashing every month or two, and the kernel upgrade is supposed to fix this according to some very smart Linux people.

So I got the call at 1:30 in the morning. The client tells me they have been fully down for a full hour. It was described to me that things happened in this order:

  1. The kernel was patched on the principal standby (aux standbys were done last week)
  2. Takeover was done, and the database and apps were running just fine on the former principal standby
  3. The kernel was patched on the original primary
  4. Takeover was performed to move the database back to the original primary
  5. The database takeover was successful, but the virtual IP addresses disappeared entirely. The apps could not connect to the database, causing a full outage.

I would hope that they actually disabled TSAMP during kernel patching, but I’m not 100% sure.

HADR

When I logged in, the first thing I saw was that HADR was not in a correct state. That’s a prerequisite for TSAMP working, so I investigated the problem there, to find that the DIAGPATH had become 100% full on the principal standby. We cleared up some space there, and started HADR, and HADR was then fine.
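
A full diagnostic path is easy to catch ahead of time. The following is a minimal sketch of such a check, not what was run during this incident: the function names, the example path /db2/diag, and the 90 percent default threshold are all placeholders for illustration.

```shell
#!/bin/sh
# Print the usage percentage (digits only) of the file system holding a path.
# Uses POSIX `df -P` so each file system is reported on a single line.
fs_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when usage is at or above a threshold (default 90, an arbitrary choice).
# Returns 0 when below the threshold, 1 when at or above it, 2 on lookup failure.
check_diagpath() {
    path=$1
    limit=${2:-90}
    pct=$(fs_usage_pct "$path")
    case $pct in
        ''|*[!0-9]*)
            echo "Could not determine usage for $path" >&2
            return 2
            ;;
    esac
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $path is ${pct}% full"
        return 1
    fi
    echo "OK: $path is ${pct}% full"
}

# Example (hypothetical path): check_diagpath /db2/diag 90
```

Run from cron on every HADR server, a check like this would flag the problem before HADR stops logging.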

TSAMP

Initial Investigation and Details

Investigating TSAMP showed nothing other than a domain. lssam returned nothing. lsrpdomain returned only a domain that could not be started:

# lsrpdomain
Name    OpState RSCTActiveVersion MixedVersions TSPort GSPort
HADR_DO Offline 3.2.1.2           No            12347  12348

I’ve often found that the fastest solution for TSAMP problems when things are already down is to delete everything and set it up all over again. So after I made sure that the primary support had all the details, that’s what I did here. First, I ran db2haicu -delete on both the primary and the principal standby. That did not remove the domain, so I forcibly removed the domain like this (use extreme caution with commands like these!):

# rmrpdomain -f HADR_DO
# lsrpdomain
Name    OpState         RSCTActiveVersion MixedVersions TSPort GSPort
HADR_DO Pending offline 3.2.1.2           No            12347  12348
# lsrpdomain

This took some patience. It was minutes from issuing the rmrpdomain command until the domain was gone. Running the lsrpdomain command told me when it was done. When the domain was really gone, it would return nothing. This also had to be done on both the primary and the principal standby.
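
Rather than re-running lsrpdomain by hand, the wait can be scripted. This is a rough sketch and not what I ran that night: the function names, the 10-second polling interval, and the 600-second timeout are arbitrary choices, and it assumes lsrpdomain prints nothing for a domain that no longer exists, as described above.

```shell
#!/bin/sh
# Succeeds if the lsrpdomain output on stdin still lists the named domain
# in its first column.
domain_listed() {
    awk -v d="$1" '$1 == d { found = 1 } END { exit !found }'
}

# Poll lsrpdomain until the domain is gone or the timeout (seconds) expires.
wait_for_domain_removal() {
    domain=$1
    timeout=${2:-600}
    interval=${3:-10}
    waited=0
    while lsrpdomain | domain_listed "$domain"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Domain $domain still present after ${timeout}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "Domain $domain is gone"
}

# Example: wait_for_domain_removal HADR_DO
```

As with the manual approach, this would need to be run on both the primary and the principal standby.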

When I went to re-configure TSAMP using db2haicu, it gave me this error trying to create the domain:

...
Create the domain now? [1]
1. Yes
2. No
1
Creating domain 'HADR_DO' in the cluster ...
Updating attribute failed. Refer to db2diag.log for the current member and the DB2 Information Center for details.

The Real Issue

It was at this point I realized that something bigger was wrong. The client at this point configured the VIPs at the server level on the primary server to get the database available to the applications again. I called IBM Support. I got a call back from an excellent service person who knew immediately what the issue was.

I had seen tweets about this issue earlier in the week, but I didn’t realize that a kernel update would trigger it. This is the issue: https://developer.ibm.com/answers/questions/406023/tsamp-issues-after-updating-to-red-hat-74.html

Messages like these in /var/log/messages indicate this issue may be the problem:

Mar 24 06:47:25 server1 hatsd[7952]: hadms: Loading watchdog softdog, timeout = 8000 ms.
Mar 24 06:47:25 server1 hatsd[7952]: hadms: remove_watchdogs(): Call to glob() returned with value 3
Mar 24 06:47:25 server1 hatsd[7952]: hadms: remove_watchdogs(): Call to glob() returned with value 3
Mar 24 06:47:25 server1 kernel: traps: hatsd[7952] general protection ip:xxxx sp:xxxx error:0 in libc-2.17.so[7ffff617d000+1b8000]
Mar 24 06:47:25 server1 hatsd[7952]: hadms: remove_watchdogs(): Call to glob() returned with value 3
Mar 24 06:47:25 server1 hatsd[7952]: hadms: remove_watchdogs(): Call to glob() returned with value 3
Mar 24 06:47:25 server1 cthags[8278]: (Recorded using libct_ffdc.a cv 2):::Error ID: 822....xmWhO/lcK.867B3....................:::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,PMClient.C,1.117,1483                    :::GS_TS_RETCODE_ER#012Connection failure between Group Services and Topology Services#012DIAGNOSTIC EXPLANATION#012hats subsystem died with hb_errno = 16, cthags will also exit.
Mar 24 06:47:25 server1 srcmstr: src_error=-9017, errno=0, module='srchevn.c'@line:'418', 0513-017 The cthats Subsystem ended abnormally. You must manually restart it.
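
To check a server for this signature quickly, something like the following sketch can help. The two patterns are taken from the messages above. The function name is made up for illustration, and a non-zero count only suggests the APAR may apply; it does not prove it.

```shell
#!/bin/sh
# Count lines in a syslog file that match the crash signature shown above:
# hatsd failing in remove_watchdogs(), or the kernel reporting a
# general protection fault in hatsd.
# Note: grep -c prints 0 and exits non-zero when nothing matches.
count_hatsd_crash_lines() {
    grep -c -E 'hatsd\[[0-9]+\]: hadms: remove_watchdogs\(\): Call to glob\(\) returned|traps: hatsd\[[0-9]+\] general protection' "$1"
}

# Example: count_hatsd_crash_lines /var/log/messages
```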

The Fix

The fix was astonishingly easy to apply. I had to download the efix for APAR IJ00283. This was a small file. After transferring it to the server and uncompressing and untarring it, I just had to execute install.sh as root:

[root@server1 IJ00283.x86_64]# ./install.sh
Install completed

After this, I was able to easily re-establish all of my TSAMP settings using db2haicu. Had I been proactive, I would have had to take the domain down to install the fix, but it was done without an outage to Db2 itself.

Lessons Learned

The biggest lesson learned here is that even though you don’t need TSAMP in a non-production environment, you need a way to test changes like this with TSAMP. This client will be adding TSAMP to a load-test environment to have a way to test this sort of change in the future.

Lead Db2 Database Engineer and Service Delivery Manager, XTIVIA
Ember is always curious and thrives on change. Working in IT provides a lot of that change, but after 17 years developing a top-level expertise on Db2 for mid-range servers and more than 7 years blogging about it, Ember is hungry for new challenges and looks to expand her skill set to the Data Engineering role for Data Science. With in-depth SQL and RDBMS knowledge, Ember shares both posts about her core skill set and her journey into Data Science. Ember lives in Denver and works from home for XTIVIA, leading a team of Db2 DBAs.