An Unfortunate Series of TSAMP Events

A story of failure and recovery.

Problem Discovery and Description

Sometimes, I think that I subconsciously knew that something was wrong. I woke up before 5 AM and couldn’t get back to sleep for no real reason that I could figure out. I gave up on sleep around 5:15 and went to take a shower. On the way to the shower, I read work email and found this:

[Image: TSAMP_issue]

At the same time, I received a note from the project manager indicating that the hosting provider had apparently rebooted all of our “PROD” and “UAT” database servers in the night in an attempt to uninstall HACMP.

This is a newer hosting provider for us, chosen by our client without much input from us. The databases in question are not yet supporting a “live” site; developers from several companies are still configuring and developing on them. With tight timelines and global companies, they’re in use most of the time.

We had discovered two weeks earlier that the hosting provider was not taking any system level backups because the systems were not yet considered live. This was a surprise to us, as we are used to hosting providers providing basic services like backup from the moment servers are provided. We had brought it up as a risk and a problem with the hosting provider.

Anyway, I went straight to my home office and logged on to the server in question. The first issue I saw was that the many-line /etc/services file had been replaced by a simple two-line file that did not list any database entries at all. That’s an easy fix on its own, and the SA and I quickly worked to get the required lines back in place.
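
A quick way to confirm the entries are back is to check the file for each service name the instance needs. This is a minimal sketch, not the commands used during the incident; has_service is a hypothetical helper, and the service name in the usage comment is made up, so substitute the SVCENAME and HADR service names from your own configuration.

```shell
# Succeed if NAME has a port entry in a services file.
# Usage: has_service NAME [FILE]   (FILE defaults to /etc/services)
has_service() {
    # Match NAME at the start of a line, then whitespace, then port/protocol.
    grep -Eqs "^$1[[:space:]]+[0-9]+/(tcp|udp)" "${2:-/etc/services}"
}

# Example (hypothetical service name):
#   has_service db2c_db2inst2 || echo "db2c_db2inst2 is missing from /etc/services"
```

Commented-out entries do not count as present, because the name has to start the line.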

Did I mention that this happened on Friday the 13th?

Once /etc/services was corrected, db2 still would not start, this time with:

db2start
06/16/2014 16:25:13     0   0   SQL1042C  An unexpected system error occurred.
SQL1032N  No start database manager command was issued.  SQLSTATE=57019

An error message that strikes fear into the heart of any DB2 DBA.

I found that I could get the db2 instance up on the primary server in each TSA/HADR cluster by doing the following. I don’t pretend to understand why, but doing parts of it in various combinations did not work.

  1. db2iupdt
  2. installSAM – which failed because it found TSA was already installed
  3. uninstallSAM – which failed because it thought TSA was still active
  4. installSAM – which failed because it found TSA was already installed
  5. db2iupdt

Once I had the db2 instance up, I could not use db2haicu -delete. My lssam on the primary at this point looked like this:

[Image: TSAMP_Issue2]

This was ugly. But it actually wasn’t as scary as what I got on the standby. I couldn’t get the standby DB2 instance to start with any combination of commands, and it similarly would not let me install or uninstall TSAMP. This is what I got from an lssam there:

[Image: TSAMP_issue3]

On the standby, any attempt at db2haicu failed because the DB2 instance was down.

Clearly, the hosting provider uninstalling HACMP had uninstalled some files that TSAMP uses, and had also altered /etc/services and wiped out most of the entries there. Because the hosting provider had not been taking system backups, there was no way to restore the system and apparently no rollback plan. Reportedly, the hosting provider was following a series of steps provided by IBM. To exacerbate the problem, the hosting provider performed these steps on 12 database servers in 6 HADR/TSAMP clusters at the same time.

Resolving the problem

After opening a PMR with DB2 support (and getting DB2 support to consult someone with TSA expertise), our system admin was able to get TSAMP to a point where I could uninstall it successfully. I uninstalled it and re-installed it. I was then still unable to start the DB2 instances. I got the same error:

db2start
06/16/2014 16:25:13     0   0   SQL1042C  An unexpected system error occurred.
SQL1032N  No start database manager command was issued.  SQLSTATE=57019

And in the db2diag.log, I found this:

2014-06-16-16.25.13.072800+000 E1823213A907         LEVEL: Error
PID     : 40960086             TID : 1              PROC : db2star2
INSTANCE: db2inst2             NODE : 000
HOSTNAME: redacted
EDUID   : 1
FUNCTION: DB2 UDB, high avail services, sqlhaGetObjectState2, probe:400
MESSAGE : ECF=0x90000552=-1879046830=ECF_SQLHA_OBJECT_DOES_NOT_EXIST
          Cluster object does not exist
DATA #1 : String, 35 bytes
Error during vendor call invocation
DATA #2 : unsigned integer, 4 bytes
29
DATA #3 : String, 28 bytes
db2_db2inst2_redacted02s_0-rs
DATA #4 : signed integer, 4 bytes
4
DATA #5 : unsigned integer, 4 bytes
1
DATA #6 : String, 0 bytes
Object not dumped: Address: 0x000000011018C324 Size: 0 Reason: Zero-length data
DATA #7 : unsigned integer, 8 bytes
1
DATA #8 : signed integer, 4 bytes
0
DATA #9 : String, 0 bytes
Object not dumped: Address: 0x000000011018B11C Size: 0 Reason: Zero-length data

2014-06-16-16.25.13.073839+000 E1824121A586         LEVEL: Error
PID     : 40960086             TID : 1              PROC : db2star2
INSTANCE: db2inst2             NODE : 000
HOSTNAME: redacted_02s
EDUID   : 1
FUNCTION: <0>, <0>, <0>, probe:1164
RETCODE : ECF=0x90000552=-1879046830=ECF_SQLHA_OBJECT_DOES_NOT_EXIST
          Cluster object does not exist
DATA #1 : String, 63 bytes
libsqlha: sqlhaGetObjectState() call error from wrapper library
DATA #2 : String, 0 bytes
Object not dumped: Address: 0x000000011018B11C Size: 0 Reason: Zero-length data
DATA #3 : signed integer, 4 bytes
0

2014-06-16-16.25.13.097492+000 E1824708A387         LEVEL: Error
PID     : 40960086             TID : 1              PROC : db2star2
INSTANCE: db2inst2             NODE : 000
HOSTNAME: redacted02s
EDUID   : 1
FUNCTION: DB2 UDB, high avail services, sqlhaSetStartPreconditions, probe:18246
RETCODE : ECF=0x90000557=-1879046825=ECF_SQLHA_CLUSTER_ERROR
          Error reported from Cluster

2014-06-16-16.25.13.097733+000 E1825096A465         LEVEL: Severe
PID     : 40960086             TID : 1              PROC : db2star2
INSTANCE: db2inst2             NODE : 000
HOSTNAME: redacted02s
EDUID   : 1
FUNCTION: DB2 UDB, base sys utilities, DB2StartMain, probe:5104
MESSAGE : ZRC=0x827300D4=-2106392364=HA_ZRC_CLUSTER_ERROR
          "Error reported from Cluster"
DATA #1 : String, 66 bytes
An error was encountered when interacting with the cluster manager
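
With this many entries, it can help to boil the log down to one line per Error or Severe record. The following is a rough sketch using awk, assuming the record layout shown above (a header line containing LEVEL:, then FUNCTION: and a MESSAGE or RETCODE line carrying an ECF= or ZRC= code). summarize_db2diag is a hypothetical helper name, not a DB2 tool.

```shell
# Print "timestamp level code (function)" for each Error/Severe record
# in a db2diag.log that carries an ECF= or ZRC= return code.
summarize_db2diag() {
    awk '
        /LEVEL: (Error|Severe)/ { ts = $1; level = $NF; keep = 1; next }
        /LEVEL: /               { keep = 0; next }   # skip other levels
        keep && /^FUNCTION:/    { sub(/^FUNCTION: */, ""); fn = $0 }
        keep && /^(MESSAGE|RETCODE) *: *(ECF|ZRC)=/ {
            n = split($0, parts, "=")   # symbolic name is the last "=" field
            print ts, level, parts[n], "(" fn ")"
        }
    ' "$1"
}

# Example:
#   summarize_db2diag ~/sqllib/db2dump/db2diag.log
```

The path in the example is only the usual default location of the diagnostic log; adjust it to your DIAGPATH.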

I could not get db2haicu -delete to work at any point. Apparently in 10.1/10.5 the CLUSTER_MGR DBM cfg parameter became informational and can only be set through db2haicu. This meant that in order to un-set it so I could get back to a point where I could reconfigure TSAMP, I had to drop and re-create every DB2 instance. There were 14 of them.
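
To see whether CLUSTER_MGR is still set, the value can be pulled out of the dbm cfg output. This is a small sketch, assuming the parameter appears on a line labeled "Cluster manager" followed by an equals sign; cluster_mgr_value is a hypothetical helper name.

```shell
# Read `db2 get dbm cfg` output on stdin and print the cluster manager value.
# Prints an empty line when the parameter is unset.
cluster_mgr_value() {
    awk -F'=' 'tolower($1) ~ /cluster manager/ {
        v = $2
        gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim surrounding whitespace
        print v
        exit
    }'
}

# Example, run as the instance owner:
#   db2 get dbm cfg | cluster_mgr_value
```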

I made sure I had database backups before undertaking this. I then did the following:

  1. db2cfexp backup.db2cfexp backup
  2. db2 get dbm cfg |tee dbmcfg.out
  3. db2 list db directory |tee dbdir.out
  4. db2set -all |tee db2set.out
  5. db2 list node directory
  6. switched to root and did a db2idrop
  7. re-created the instance with a db2icrt
  8. db2cfimp backup.db2cfexp
  9. Set the parameters not covered by cfexp. In my case, this included:
    1. all DFT_MON parameters
    2. SVCENAME
    3. SYSMON group
  10. I then compared the dbm cfg and db2set from before and after to make sure everything was fine
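
The capture and compare steps above lend themselves to a small script. This is a sketch under assumptions rather than the exact commands from the incident: the function names and the nodedir.out file name are made up for illustration, and the db2idrop and db2icrt steps are left out because they run as root with site-specific arguments.

```shell
# Capture instance-level settings into DIR before dropping the instance.
# Run as the instance owner with the DB2 environment loaded.
capture_instance_config() {
    dir=$1
    mkdir -p "$dir" || return 1
    db2cfexp "$dir/backup.db2cfexp" backup
    db2 get dbm cfg         > "$dir/dbmcfg.out"
    db2 list db directory   > "$dir/dbdir.out"
    db2 list node directory > "$dir/nodedir.out"   # hypothetical file name
    db2set -all             > "$dir/db2set.out"
}

# Diff two capture directories file by file.
# Returns 0 when every file matches, 1 otherwise.
compare_instance_config() {
    before=$1
    after=$2
    status=0
    for f in dbmcfg.out db2set.out dbdir.out nodedir.out; do
        if ! diff "$before/$f" "$after/$f"; then
            echo "DIFFERS: $f"
            status=1
        fi
    done
    return $status
}

# Example:
#   capture_instance_config /tmp/before    # before db2idrop
#   capture_instance_config /tmp/after     # after db2icrt and db2cfimp
#   compare_instance_config /tmp/before /tmp/after
```

Some differences in the dbm cfg are expected after a rebuild, such as the cluster manager setting, so the diff output still needs to be read rather than treated as pass or fail.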

The db2cfimp re-cataloged the database for me, meaning I did not have to restore it.

After I had all of the DB2 instances re-created and started, I was then able to fully re-do the TSAMP configuration on all 6 HADR/TSAMP clusters. This is where I was so grateful I had taken complete documentation when I originally set up the clusters. I had Word documents with all the info I needed to re-do each and every cluster.

At some point, DB2 support referred to the approach I was taking as a sledgehammer approach. It may well have been, but having software partially uninstalled is a scary thing to me. How do I know I’m not missing some critical files somewhere? Also, when I asked, DB2 support confirmed that CLUSTER_MGR is not configurable other than through db2haicu, and that dropping and recreating the db2 instances was my only option at that point. I call on IBM to make the CLUSTER_MGR parameter configurable again! I could have saved several hours of work by not having to do that part, at least.

I thought I would share this issue in the hopes that it helps someone else. I don’t claim that the actions above are the best – if you’re in a similar scenario, please consult IBM support to get what you need to recover. With luck, others are taking system level backups or have hosting providers that have a rollback plan for every system change, so no one will ever encounter this but me.

Ember Crooks

Ember is always curious and thrives on change. She has built internationally recognized expertise in IBM Db2, spent a year working with high-volume MySQL, and is now learning Snowflake. Ember shares both posts about her core skill sets and her journey learning Snowflake.

Ember lives in Denver and works from home.


12 Comments

  1. Hi Ember, great post and history.

    I did not know it’s not possible to set CLUSTER_MGR anymore except by db2haicu.

    I think what you could have done to completely erase the TSA configuration is to list and remove the domain as root. Like this:

    [root@server1 ~]# lsrpdomain
    Name OpState RSCTActiveVersion MixedVersions TSPort GSPort
    dpa_domain Pending online 3.1.5.2 No 12347 12348
    [root@server1 ~]#

    [root@server1 ~]# rmrpdomain -f dpa_domain

    I’ve done this a few times when I couldn’t use db2haicu to delete the TSA configuration. But the instance parameter CLUSTER_MGR is still there.

    I don’t know if it’s an IBM best practice, but it worked.

      • Hello Ember,

        Great post. Initially I tried your solution.

        When I had the same issue again, I carefully removed all the resources associated with the database and was able to bring the instance up.

        Thanks

  2. Hi,

    Nice article, great blog.

    Not sure if this was an option at your DB2 level (or would have helped even if it was) but I had a similar issue and the following allowed me to start DB2 again:

    db2haicu -disable

    I notice after running this, CLUSTER_MGR gets unset.

  3. Hi

    Been following you for a while now, so nice that you share your experience and knowledge. Been playing around so much with TSAMP, forgot about the “-f” option and because “TSA” was the cluster manager …my instances fully setup with HADR would not start. I did not want to spend hours on an IBM sev 3 ticket as this is not production. I knew, I could trust your article and it was fab!!

    I have a question regarding db2 with HADR (11.1.3.3) and TSAMP (4.1).
    If I need to rebuild the db2 databases on the standby only (disk corruption), is it better to db2haicu -delete and rmrpdomain -f and rebuild TSAMP after HADR setup? That seems the cleanest way so far.
    I tried db2haicu -disable and stoprpdomain -f and got all this headache after db2haicu enable…

    Thanks Ember, You are my star!!

    • Thanks!

      If all you’re doing is re-initializing the standby, it is fine to just disable TSAMP and then re-enable after you have things back up. There are a few rare cases where you end up with things in odd states, and occasionally, I end up having to rebuild TSAMP. One of the big things for me is to remember to be patient and give TSAMP a while to get back into a good state (10 minutes is fine). Sometimes I’m not very patient with TSAMP. Usually, full rebuild is only required after something that affects TSAMP directly, not just something that impacts HADR. There is a resetrsrc command that is useful for manually forcing a reset after HADR is back in sync.

  4. I fell into this hole recently – DB2 instance had come down and wouldn’t restart as Cluster Manager = TSA; db2haicu wouldn’t work because instance not started.
    So, following your steps I was waiting for my AIX administrator to drop/recreate instance (I don’t have root access). While waiting I did some more Googling. I came across the db2gcf command. So I tried:
    db2gcf -u -i
    – it started?
    So, Cluster Manager was still set to TSA.
    Tried db2haicu -disable, got response:
    “The DB2 database high availability configuration parameter is set, but a valid db2instanceinfo.reg file could not be found.”
    Tried db2haicu -delete – worked ok:
    db2 get dbm cfg | grep -i cluster
    Cluster manager =

  5. I know it’s been a long time since this post, but today I hit the same error, and your steps, which did not work well for you, actually helped me get back to normal.

    Thanks Ember, always a big fan of your knowledge sharing and writing

    Cheers !!!
