Using TSA/db2haicu to automate failover Part 4: Dealing with Problems After Setup

Updated March 2019 with the command to get the first output

Most of what you’ll need to set up and test TSA using db2haicu is in my first few posts on the topic:
Using TSA/db2haicu to automate failover – Part 1: The Preparation
Using TSA/db2haicu to automate failover – Part 2: How it looks if it goes smoothly
Using TSA/db2haicu to Automate Failover Part 3: Testing, Ways Setup can go Wrong and What to do.

But there is one ongoing issue that I’ve seen that I thought I would share. Most of the time, this issue comes from not shutting down the two database servers in the right order when both of them are shut down. Most of my clients never, ever, ever shut down both servers at once anyway.

TSA States

From the time you first get db2haicu set up, you should be looking at the states of the TSA resources and resource groups, so you know what looks normal for your implementation. I’ve found minor differences in different implementations done in the same way – I don’t know if that’s tied to the Fix Pack or what, but there are a few different things that can be normal.

Viewing States Using TSA Commands as Root

On one system I have, the following TSA states are what is normal:

$ lssam
Online IBM.ResourceGroup:db2_db2inst1_Prod-db1.adomain.com_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_Prod-db1.adomain.com_0-rs
                '- Online IBM.Application:db2_db2inst1_Prod-db1.adomain.com_0-rs:Prod-db1
Online IBM.ResourceGroup:db2_db2inst1_Prod-db2.adomain.com_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_Prod-db2.adomain.com_0-rs
                '- Online IBM.Application:db2_db2inst1_Prod-db2.adomain.com_0-rs:Prod-db2
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCP01-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCP01-rs
                |- Online IBM.Application:db2_db2inst1_db2inst1_WCP01-rs:Prod-db1
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCP01-rs:Prod-db2
        '- Online IBM.ServiceIP:db2ip_172_12_12_12-rs
                |- Online IBM.ServiceIP:db2ip_172_12_12_12-rs:Prod-db1
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:Prod-db2
Online IBM.Equivalency:db2_db2inst1_Prod-db1.adomain.com_0-rg_group-equ
        '- Online IBM.PeerNode:Prod-db1.adomain.com:Prod-db1
Online IBM.Equivalency:db2_db2inst1_Prod-db2.adomain.com_0-rg_group-equ
        '- Online IBM.PeerNode:Prod-db2.adomain.com:Prod-db2
Online IBM.Equivalency:db2_db2inst1_db2inst1_WCP01-rg_group-equ
        |- Online IBM.PeerNode:Prod-db1.adomain.com:Prod-db1
        '- Online IBM.PeerNode:Prod-db2.adomain.com:Prod-db2
Online IBM.Equivalency:db2_public_network_0
        |- Online IBM.NetworkInterface:bond0:Prod-db2
        '- Online IBM.NetworkInterface:bond0:Prod-db1

Now, if you’re viewing that on Linux, the “Online”s are all green, and the expected “Offline”s are all blue. If there’s a problem it will be in red.

This is my favorite way of looking at it. The red highlighting made it easy to understand if there was a problem, even when I understood very little about what it all meant.
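
If the colors are not available (output redirected to a file, a terminal without color, a monitoring script), the same idea can be approximated by filtering on the state text. This is only a sketch: the function name tsa_problem_lines is made up for illustration, and the list of "problem" states is an assumption built from the states shown in this post and its comments plus the other TSA operational states I am aware of. Adjust it to what is normal for your implementation.

```shell
# Print only the lssam lines whose state suggests trouble.
# Reads lssam output on stdin, so it can be tried without a cluster.
# The state list below is an assumption, not an official definition.
tsa_problem_lines() {
    esc="$(printf '\033')"
    # Strip ANSI color escapes first, then keep lines that start
    # (after the tree markers) with one of the listed states.
    sed "s/${esc}\[[0-9;]*m//g" |
        grep -E "^[[:space:]]*([|']-[[:space:]]+)?(Pending (online|offline)|Failed offline|Stuck online|Unknown) "
}
```

On a cluster node this would be used as `lssam | tsa_problem_lines`. An expected "Offline" (such as the standby side of the HADR resource) is deliberately not reported, since it is normal.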

Viewing States Using db2pd

You can also use db2pd to look at the states. I’m not as big of a fan of this method, but I think it’s a matter of preference. Here’s what the same system as above looks like using that method:

$ db2pd -d wc005p01 -ha

Option -ha is an instance scope option.  The database option has been ignored.
           DB2 HA Status
Instance Information:
Instance Name                  = db2inst1
Number Of Domains              = 1
Number Of RGs for instance     = 2

Domain Information:
Domain Name                    = prod_db2ha
Cluster Version                = 3.1.0.3
Cluster State                  = Online
Number of nodes                = 2

Node Information:
Node Name                     State
---------------------         -------------------
Prod-db1.adomain.com         Online
Prod-db2.adomain.com          Online

Resource Group Information:
Resource Group Name            = db2_db2inst1_db2inst1_WCP01-rg
Resource Group LockState       = Unlocked
Resource Group OpState         = Online
Resource Group Nominal OpState = Online
Number of Group Resources      = 2
Number of Allowed Nodes        = 2
   Allowed Nodes
   -------------
   Prod-db1.adomain.com
   Prod-db2.adomain.com
Member Resource Information:
   Resource Name                  = db2_db2inst1_db2inst1_WCP01-rs
   Resource State                 = Online
   Resource Type                  = HADR
   HADR Primary Instance          = db2inst1
   HADR Secondary Instance        = db2inst1
   HADR DB Name                   = WCP01
   HADR Primary Node              = Prod-db1.adomain.com
   HADR Secondary Node            = Prod-db2.adomain.com

   Resource Name                  = db2ip_172_12_12_12-rs
   Resource State                 = Online
   Resource Type                  = IP

Resource Group Name            = db2_db2inst1_Prod-db1.adomain.com_0-rg
Resource Group LockState       = Unlocked
Resource Group OpState         = Online
Resource Group Nominal OpState = Online
Number of Group Resources      = 1
Number of Allowed Nodes        = 1
   Allowed Nodes
   -------------
   Prod-db1.adomain.com
Member Resource Information:
   Resource Name                  = db2_db2inst1_Prod-db1.adomain.com_0-rs
   Resource State                 = Online
   Resource Type                  = DB2 Partition
   DB2 Partition Number           = 0
   Number of Allowed Nodes        = 1
      Allowed Nodes
      -------------
      Prod-db1.adomain.com

Network Information:
Network Name                  Number of Adapters
-----------------------       ------------------
db2_public_network_0          2

   Node Name                     Adapter Name
   -----------------------       ------------------
   Prod-db2                      bond0
   Prod-db1                      bond0

Quorum Information:
Quorum Name                                  Quorum State
------------------------------------         --------------------
Operator                                     Offline
db2_Quorum_Network_172_10_10_10:11_36_34     Online
Fail                                         Offline

I guess I can see how this method might be more understandable. But it doesn’t highlight problems in red!

It also has the advantage of being something you can execute as the db2 instance owner rather than as root.
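
Since the db2pd output is plain text with no highlighting, a small filter can pull out just the resource group states. The sketch below assumes the "Resource Group Name" and "Resource Group OpState" labels look exactly as in the output above; the function name ha_rg_states is hypothetical.

```shell
# Summarize each resource group's OpState from `db2pd -ha` output on stdin.
# Assumes the "Label = value" layout shown in the sample output above.
ha_rg_states() {
    awk -F' *= *' '
        /^Resource Group Name/    { name = $2 }
        /^Resource Group OpState/ { print name ": " $2 }
    '
}
```

As the instance owner, this would be run as `db2pd -ha | ha_rg_states`, printing one line per resource group, for example the group name followed by Online.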

Changing States

So, what do you do if things are highlighted in red?

Well, the first course of action is to check into HADR. First make sure that neither database is waiting on the other to start. Verify that HADR shows as “Connected” in “Peer” state with little or no log gap, using db2pd -d <dbname> -hadr:

$ db2pd -d wcp01 -hadr

Database Partition 0 -- Database WCP01 -- Active -- Up 71 days 16:06:16 -- Date 01/29/2013 20:33:23

HADR Information:
Role    State                SyncMode HeartBeatsMissed   LogGapRunAvg (bytes)
Primary Peer                 Nearsync 0                  1238

ConnectStatus ConnectTime                           Timeout
Connected     Mon Nov 19 04:27:21 2012 (1353320841) 120

PeerWindowEnd                         PeerWindow
Tue Jan 29 20:37:59 2013 (1359513479) 300

LocalHost                                LocalService
Prod-db1.adomain.com                     18819

RemoteHost                               RemoteService      RemoteInstance
Prod-db2.adomain.com                     18820              db2inst1

PrimaryFile  PrimaryPg  PrimaryLSN
S0009993.LOG 9847       0x00000081FF427CE6

StandByFile  StandByPg  StandByLSN
S0009993.LOG 9846       0x00000081FF426FF3
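
The check described above can be scripted roughly as follows. This is a sketch that assumes the older db2pd layout shown here, where a header row is followed by a row of values; newer Db2 versions print KEY = VALUE pairs instead (see the output in the comments below) and would need a different parser. It also only reads the first word of the state, so a multi-word state would be truncated. The function names are made up for illustration.

```shell
# Extract role, state and connect status from old-style `db2pd -hadr`
# output on stdin (header row followed by a value row).
hadr_summary() {
    awk '
        want == "role" && NF { role = $1; state = $2; want = "" }
        want == "conn" && NF { conn = $1; want = "" }
        $1 == "Role" && $2 == "State" { want = "role" }
        $1 == "ConnectStatus"         { want = "conn" }
        END { print role, state, conn }
    '
}

# Succeeds only when the state is Peer and the pair is Connected.
hadr_is_healthy() {
    set -- $(hadr_summary)
    [ "$2" = "Peer" ] && [ "$3" = "Connected" ]
}
```

For example, `db2pd -d wcp01 -hadr | hadr_is_healthy && echo "HADR looks fine"`. The log gap still deserves a look by eye, because what counts as "little" depends on the workload.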

If HADR is working properly, then you may want to try to disable and re-enable db2haicu.

Finally, if your situation matches the one below, you can try (at your own risk) the following procedure.

“Pending Online”

This is an issue that pops up sometimes with a running setup. If you ever have to down both servers, please follow the steps in section 7 of this document: http://download.boulder.ibm.com/ibmdl/pub/software/dw/data/dm-0908hadrdb2haicu/HADR_db2haicu.pdf. If you don’t, you’re likely to get TSA into an inconsistent state and have to mess with it for a while. I’m going to share the steps that I use to get TSA out of this pending online state – but please note, these can be extremely dangerous, and if you don’t understand what you’re doing, you probably don’t want to use them – contact IBM to see if these steps work for you or not. Use at your own risk. I got these steps from a colleague who got them from support, but later support told him they might be dangerous.

You’ll need to run these as root. Even if your instance owner can run lssam, you still need root for the rest of these commands.

After you have verified that HADR is properly running, look at the states of the resources to ensure that your problem matches the one I am describing:

> su - root
# lssam
Pending online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs Control=SuspendedPropagated
                |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs Control=SuspendedPropagated
                |- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs Control=SuspendedPropagated
                '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02

After you have confirmed that this matches your issue, find which is the master in the resource group:

# lssamctrl -V
Starting to list SAM Control information.
lssamctrl: Executed on Fri Apr 22 12:45:08 2011 at "dbserver01", master node "dbserver01".
Displaying SAM Control information:
SAMControl:
        TimeOut                = 60
        RetryCount             = 3
        Automation             = Auto
        ExcludedNodes          = {}
        ResourceRestartTimeOut = 5
        ActiveVersion          = [3.1.0.1,Fri Mar 11 16:10:54 EST 2011]
        EnablePublisher        = Disabled
        TraceLevel             = 31
        ActivePolicy           = []
        CleanupList            = {}
        PublisherList          = {}
Completed Listing SAM Control information.

That told us: master node “dbserver01”
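
If you would rather not pick the name out by eye, the master node can be extracted from that "Executed on" line. A minimal sketch, assuming the line keeps the wording shown above; sam_master_node is a hypothetical helper name.

```shell
# Print the master node name found in `lssamctrl -V` output on stdin.
# Relies on the text: master node "<name>"
sam_master_node() {
    sed -n 's/.*master node "\([^"]*\)".*/\1/p'
}
```

Run as root: `lssamctrl -V | sam_master_node`. Compare the result with the host you are logged in to before going any further.
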
Now, on the master node, get the process id for the recovery manager:

# ps -ef |grep -i recoveryrm  
    root  7929864  3866752   0   Apr 07      -  0:36 /usr/sbin/rsct/bin/IBM.RecoveryRMd

Now kill that process id:

# kill 7929864

Next, confirm that the recovery manager starts a new process:

# ps -ef |grep -i recoveryrm 
    root  7929866  3866752   1 12:54:17      -  0:00 /usr/sbin/rsct/bin/IBM.RecoveryRMd 
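
When comparing the process id before and after, keep in mind that a `ps | grep` pipeline can list the grep process itself. The sketch below matches only on the daemon path shown above and prints the PID column; it assumes the `ps -ef` column order in the sample (PID in the second field, command last), and the function name is made up.

```shell
# Print the PID of the IBM.RecoveryRMd daemon from `ps -ef` output on stdin.
# Matching on the last field keeps the grep process out of the result.
recoveryrm_pid() {
    awk '$NF ~ /\/IBM\.RecoveryRMd$/ { print $2 }'
}
```

Capture the value once before the kill and once after (`ps -ef | recoveryrm_pid`); a different number means a new recovery manager process was started.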

Validate that the “In Config State” is TRUE:

# lssrc -ls IBM.RecoveryRM |grep "In Config State"
   In Config State      : TRUE

Now see the changes in status. The Pending Online state is now Offline, the Nominal state changed to Offline, and the Control=SuspendedPropagated is removed:

# lssam  
Offline IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Nominal=Offline
        |- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs
                |- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs
                |- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Offline IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Offline
        '- Offline IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Offline IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Offline IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Offline
        '- Offline IBM.Application:db2_db2inst1_dbserver02_0-rs
                '- Offline IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02

Now issue commands to properly set the Resource Groups – first set the Resource Group online for the Master server, and then set it online for the Standby server:

# chrg -o online db2_db2inst1_dbserver01_0-rg 
# chrg -o online db2_db2inst1_dbserver02_0-rg 

Check the status again, and note the differences – the Resource groups at the bottom now show as online:

# lssam
Offline IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Nominal=Offline
        |- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs
                |- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs
                |- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02

Now, set the Resource Group online for the Database:

# chrg -o online db2_db2inst1_db2inst1_WCQ01-rg

You may note a Lock state while the Resource Group switches to ONLINE:

# lssam  
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Request=Lock Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs Control=SuspendedPropagated
                |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Online IBM.ServiceIP:db2ip_172_12_12_12-rs Control=SuspendedPropagated
                |- Online IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02

After a bit, everything should show as normal again:

# lssam 
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs
                |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Online IBM.ServiceIP:db2ip_172_12_12_12-rs
                |- Online IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02

What TSA Looks Like if HADR is Simply Down

Always make sure you get HADR up before digging into TSA states. It looks similar (but slightly different) if HADR is just down. Notice the “Request=Lock” that’s in there – that’s different than the issue above.

Online IBM.ResourceGroup:db2_db2inst1_Prod-db1.adomain.com_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_Prod-db1.adomain.com_0-rs
                '- Online IBM.Application:db2_db2inst1_Prod-db1.adomain.com_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_Prod-db2.adomain.com_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_Prod-db2.adomain.com_0-rs
                '- Online IBM.Application:db2_db2inst1_Prod-db2.adomain.com_0-rs:dbserver02
Pending online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCP01-rg Request=Lock Nominal=Online
        |- Offline IBM.Application:db2_db2inst1_db2inst1_WCP01-rs Control=SuspendedPropagated
                |- Offline IBM.Application:db2_db2inst1_db2inst1_WCP01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCP01-rs:dbserver02
        '- Online IBM.ServiceIP:db2ip_172_12_12_12-rs Control=SuspendedPropagated
                |- Online IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.Equivalency:db2_db2inst1_Prod-db1.adomain.com_0-rg_group-equ
        '- Online IBM.PeerNode:Prod-db1.adomain.com:dbserver01
Online IBM.Equivalency:db2_db2inst1_Prod-db2.adomain.com_0-rg_group-equ
        '- Online IBM.PeerNode:Prod-db2.adomain.com:dbserver02
Online IBM.Equivalency:db2_db2inst1_db2inst1_WCP01-rg_group-equ
        |- Online IBM.PeerNode:Prod-db1.adomain.com:dbserver01
        '- Online IBM.PeerNode:Prod-db2.adomain.com:dbserver02
Online IBM.Equivalency:db2_public_network_0
        |- Online IBM.NetworkInterface:bond0:dbserver02
        '- Online IBM.NetworkInterface:bond0:dbserver01
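
To tell the two situations apart quickly, the presence of “Request=Lock” on a pending resource group can be reported by a small filter. This is a sketch based only on the two lssam samples in this post, so treat the result as a hint and not a diagnosis; pending_rg_report is a hypothetical name.

```shell
# For every resource group in "Pending online", report whether the
# line also carries Request=Lock. Reads lssam output on stdin.
pending_rg_report() {
    awk '
        /^Pending online IBM\.ResourceGroup:/ {
            split($3, parts, ":")   # third field is IBM.ResourceGroup:<name>
            lock = ($0 ~ /Request=Lock/) ? "Request=Lock present" : "no Request=Lock"
            print parts[2] ": " lock
        }
    '
}
```

With `lssam | pending_rg_report`, a group reported with Request=Lock resembles the HADR-down output in this section, while one without it resembles the earlier “Pending Online” case. Either way, check HADR first.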

I’d love to hear problems that others have encountered and how you’ve resolved them to help others! Leave a comment with your situation and solution.

Other Posts In This Series

This series consists of four posts:
Using TSA/db2haicu to automate failover – Part 1: The Preparation
Using TSA/db2haicu to automate failover – Part 2: How it looks if it goes smoothly
Using TSA/db2haicu to Automate Failover Part 3: Testing, Ways Setup can go Wrong and What to do.
Using TSA/db2haicu to automate failover Part 4: Dealing with Problems After Setup

Search this blog on “TSA” for other posts on TSA issues and tips.

Ember Crooks

Ember is always curious and thrives on change. She has built internationally recognized expertise in IBM Db2, spent a year working with high-volume MySQL, and is now learning Snowflake. Ember shares both posts about her core skill sets and her journey learning Snowflake.

Ember lives in Denver and works from home.


16 Comments

  1. Situation:
    The server that holds the standby database went down. After it came back up,
    you can see Control=SuspendedPropagated,
    but no lock on the resource group.
    What should I do to remove this flag?

    Thank you.
    DB21085I Instance “db2pb1” uses “64” bits and DB2 code release “SQL09075” with
    level identifier “08060107”.
    Informational tokens are “DB2 v9.7.0.5”, “special_28492”, “IP23285_28492”, and
    Fix Pack “5”.
    Product is installed at “/db2/db2pb1/db2_software”.
    arlpb1ci:db2pb1 7> oslevel -s
    7100-01-05-1228

    Online IBM.ResourceGroup:db2_db2pb1_db2pb1_PB1-rg Nominal=Online
            |- Online IBM.Application:db2_db2pb1_db2pb1_PB1-rs Control=SuspendedPropagated
                    |- Online IBM.Application:db2_db2pb1_db2pb1_PB1-rs:arlpsap11
                    '- Offline IBM.Application:db2_db2pb1_db2pb1_PB1-rs:arlpsap12
            |- Online IBM.ServiceIP:db2ip_10_180_0_111-rs Control=SuspendedPropagated
                    |- Online IBM.ServiceIP:db2ip_10_180_0_111-rs:arlpsap11
                    '- Offline IBM.ServiceIP:db2ip_10_180_0_111-rs:arlpsap12
            '- Online IBM.ServiceIP:db2ip_10_194_6_209-rs Control=SuspendedPropagated
                    |- Online IBM.ServiceIP:db2ip_10_194_6_209-rs:arlpsap11
                    '- Offline IBM.ServiceIP:db2ip_10_194_6_209-rs:arlpsap12
    Resource Group Information:
    Resource Group Name = db2_db2pb1_db2pb1_PB1-rg
    Resource Group LockState = Unlocked
    Resource Group OpState = Online
    Resource Group Nominal OpState = Online
    Number of Group Resources = 3
    Number of Allowed Nodes = 2

    • The only series of steps I have to try are the ones in this blog entry. Did you resolve this? Sorry for the late response, I was taking a vacation – camping with the family.

  2. […] As stated before, I wish there was an option on db2haicu that basically said “I’ve fixed the original problem, reset the TSA states”. This one is a bit easier than the problem and reset I describe in Using TSA/db2haicu to automate failover Part 4: Dealing with Problems After Setup […]

  3. On the Pending Online issue, my problems were as follows:

    Softdog issues:
    I viewed the lssam output and can see that the instance on db2prod02 is showing “Pending online”. The reason for this is a 3rd party watchdog module that is preventing IBM’s cluster software from loading its own (there can only be one watchdog module active for a given server). The syslog shows the problem:

    Feb 24 14:21:51 db2prod02 hatsd[19978]: hadms: Loading watchdog softdog, timeout = 8000 ms.
    Feb 24 14:21:51 db2prod02 hatsd[19978]: hadms: Found loaded iTCO_vendor_support with count 1
    Feb 24 14:21:51 db2prod02 hatsd[19978]: hadms: iTCO_vendor_support has a use count of 1 and cannot be unloaded

    The “iTCO_vendor_support” module needs to be disabled (preferably uninstalled). You should check db2prod01 as well so there is no unexpected issue in the future. This is the advice I asked Adam to pass on to you last Friday. Looks like you’re still working on this, with your SysAdmin I’m assuming.

    Once the instance is able to reach an “Online” state, db2haicu will be able to add HADR databases again.

    and then just permissions issues getting db2haicu to run:

    I had to do the following to get it to work as well as to do a hadr takeover before it would let me add secondary and tertiary db’s into the cluster. On the primary, it would refuse to add databases into the cluster stating a problem with error:

    2014-02-27-15.11.02.709792-420 E51459483E655 LEVEL: Error
    PID : 28178 TID : 139851322767136PROC : db2haicu
    INSTANCE: atlinst NODE : 000
    FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure, sqlhaUICreateHADR, p
    robe:900
    RETCODE : ECF=0x9000056F=-1879046801=ECF_SQLHA_HADR_VALIDATION_FAILED
    The HADR DB failed validation before being added to the cluster
    MESSAGE : Please verify that HADR_REMOTE_INST and HADR_REMOTE_HOST are correct
    and in the exact format and case as the Standby instance name and
    hostname.
    DATA #1 : String, 7 bytes
    atlinst
    DATA #2 : String, 9 bytes
    db2prod02

    On new instances, I would get the following technote issue regarding db2havend and the library file:

    http://www-01.ibm.com/support/docview.wss?uid=swg21649212

    Also had issue on CT_MANAGEMENT_SCOPE:

    http://www-01.ibm.com/support/docview.wss?uid=swg1IC64785
    db2set DB2_DIRECT_IO=false
    export CT_MANAGEMENT_SCOPE=2

    But my main hurdle, which I spent all of last Fri/Sat night on, was:
    - change setuid permissions on db2havend(s) and lib32
    - http://www-01.ibm.com/support/docview.wss?uid=swg21649212

    MUST BE:
    -r-sr-xr-x 1 root db2inst1 4642211 Apr 3 18:17 db2havend
    -r-sr-xr-x 1 root db2inst1 3990657 Apr 3 18:17 db2havend32

    lrwxrwxrwx 1 root root 14 Apr 11 13:10 libdb2tsa.so -> libdb2tsa.so.1
    -r-xr-xr-x 1 bin bin 152529 Mar 19 01:32 libdb2tsa.so.1

    check by using
    ls -l | grep db2have

    FIX by using:

    chmod 555 on libdb2tsa.so.1 in dir sqllib\lib64
    chmod 4555 on db2havend and db2havend64 in sqllib\adm

    Thank you as your post did help me… Not same issue but it did good to know I wasn’t alone … Thank you Ember

  4. Hi Ember,

    Can you please let me know what can be done in below situation.

    Failed offline IBM.ResourceGroup:db2_tdbin02_tdbin02_XXX-rg Nominal=Online
            |- Failed offline IBM.Application:db2_tdbin02_tdbin02_XXX-rs
                    |- Failed offline IBM.Application:db2_tdbin02_tdbin02_XXX-rs:IDOCTOHADR01
                    '- Failed offline IBM.Application:db2_tdbin02_tdbin02_XXX-rs:IDOCTOHADR02
            '- Offline IBM.ServiceIP:db2ip_172_20_62_108-rs
                    |- Offline IBM.ServiceIP:db2ip_172_20_62_108-rs:IDOCTOHADR01
                    '- Offline IBM.ServiceIP:db2ip_172_20_62_108-rs:IDOCTOHADR02

    When I try to switch over from server 1 to server 2, some of the databases go into Failed Offline mode. There are 14 databases in one instance.

    • Does only one database go into failed offline or all 14? Do you have all 14 fully configured in TSAMP? How are you doing the failover – through TAKEOVER command or db2haicu?

      Multiple databases on one instance can be problematic with TSAMP – especially when using the VIP as you are, as you have to ensure that all databases fail over at the same time or you have to define different virtual IP addresses for each database.

      • Hi Ember,

        I’m doing the failover with the db2haicu command.
        All 14 databases are configured in TSAMP with different VIPs. Out of the 14, sometimes 3 or 4 databases go into Failed Offline mode.

    • Maximum time should be hadr_peer_window plus hadr_timeout. The actual failover, when initiated, depends on volume, but is frequently less than 30 seconds.

  5. Hi Ember,
    I have gone through your article and it is really very descriptive and easy to understand. However, I am recently facing one strange issue and I am unable to figure it out what is going wrong in this case. If you can give input on this it will be very helpful.
    Recently one of our servers, which hosts the PRIMARY database of an HADR pair, went down. But it did not automatically fail over to the STANDBY. I had to manually do a TAKEOVER. Once the PRIMARY came up, I switched back to the original setup.
    To find out why automatic failover did not work, I issued the lssam command first. I am seeing the unusual output below. The HADR db status shows as Pending Online and Unknown. Googling did not serve much purpose; however, I found one link (http://www-01.ibm.com/support/docview.wss?uid=swg21961711) which suggests that TSAMP is not able to monitor the db2 HADR status. I tried to run the db2pd -hadr command as root, and it works perfectly fine.
    Can you please suggest what can be done to diagnose further?

    Pending online IBM.ResourceGroup:db2_dbins371_dbins371_DSIMPR-rg Nominal=Online
            |- Unknown IBM.Application:db2_dbins371_dbins371_DSIMPR-rs
                    |- Unknown IBM.Application:db2_dbins371_dbins371_DSIMPR-rs:server1
                    '- Unknown IBM.Application:db2_dbins371_dbins371_DSIMPR-rs:server2

    • TSAMP state problems can be difficult. You can try running db2haicu and see if it just needs to be enabled after an extended outage. There are some suggestions on other approaches in my blog articles, but they should be used at your own risk.

  6. Your post about HADR and TSA was very helpful for me, and I would like to ask some questions.

    I have an HADR environment with TSA on db2 v10.5, and it works well. I intend to add an auxiliary standby and add it to the TSA cluster. I have read that the TSA cluster does not support a second standby for role switching, but in my test environment I have created a cluster with 3 nodes (primary, principal standby, and auxiliary standby):

    Online IBM.ResourceGroup:db2_db2inst1_db2inst1_TESTDB-rg Nominal=Online
            |- Online IBM.Application:db2_db2inst1_db2inst1_TESTDB-rs
                    |- Online IBM.Application:db2_db2inst1_db2inst1_TESTDB-rs:primary1
                    '- Offline IBM.Application:db2_db2inst1_db2inst1_TESTDB-rs:standby1
            '- Online IBM.ServiceIP:db2ip_10_120_202_58-rs
                    |- Online IBM.ServiceIP:db2ip_10_120_202_58-rs:primary1
                    '- Offline IBM.ServiceIP:db2ip_10_120_202_58-rs:standby1
    Online IBM.ResourceGroup:db2_db2inst1_db2inst1_QADB-rg Nominal=Online
            |- Online IBM.Application:db2_db2inst1_db2inst1_QADB-rs
                    |- Online IBM.Application:db2_db2inst1_db2inst1_QADB-rs:primary1
                    '- Offline IBM.Application:db2_db2inst1_db2inst1_QADB-rs:standby1
            '- Online IBM.ServiceIP:db2ip_10_120_202_59-rs
                    |- Online IBM.ServiceIP:db2ip_10_120_202_59-rs:primary1
                    '- Offline IBM.ServiceIP:db2ip_10_120_202_59-rs:standby1
    Online IBM.ResourceGroup:db2_db2inst1_primary1_0-rg Nominal=Online
            '- Online IBM.Application:db2_db2inst1_primary1_0-rs
                    '- Online IBM.Application:db2_db2inst1_primary1_0-rs:primary1
    Online IBM.ResourceGroup:db2_db2inst1_standby1_0-rg Nominal=Online
            '- Online IBM.Application:db2_db2inst1_standby1_0-rs
                    '- Online IBM.Application:db2_db2inst1_standby1_0-rs:standby1
    Online IBM.ResourceGroup:db2_db2inst1_standby2_0-rg Nominal=Online
            '- Online IBM.Application:db2_db2inst1_standby2_0-rs
                    '- Online IBM.Application:db2_db2inst1_standby2_0-rs:standby2
    Online IBM.Equivalency:db2_db2inst1_db2inst1_TESTDB-rg_group-equ
            |- Online IBM.PeerNode:primary1:primary1
            '- Online IBM.PeerNode:standby1:standby1
    Online IBM.Equivalency:db2_db2inst1_db2inst1_QADB-rg_group-equ
            |- Online IBM.PeerNode:primary1:primary1
            '- Online IBM.PeerNode:standby1:standby1
    Online IBM.Equivalency:db2_db2inst1_primary1_0-rg_group-equ
            '- Online IBM.PeerNode:primary1:primary1
    Online IBM.Equivalency:db2_db2inst1_standby1_0-rg_group-equ
            '- Online IBM.PeerNode:standby1:standby1
    Online IBM.Equivalency:db2_db2inst1_standby2_0-rg_group-equ
            '- Online IBM.PeerNode:standby2:standby2
    Online IBM.Equivalency:db2_public_network_0
            |- Online IBM.NetworkInterface:eth1:standby1
            |- Online IBM.NetworkInterface:eth1:primary1
            '- Online IBM.NetworkInterface:eth1:standby2
    Online IBM.Equivalency:db2_public_network_1
            |- Online IBM.NetworkInterface:eth0:standby1
            |- Online IBM.NetworkInterface:eth0:primary1
            '- Online IBM.NetworkInterface:eth0:standby2
    [db2inst1@primary1 ~]$ db2pd -db deltas -hadr

    But something goes wrong when I switch roles from the primary to the auxiliary standby. I run it manually from the auxiliary standby: “db2 takeover hadr on db testdb”. The command executes successfully,

    HADR_ROLE = PRIMARY
    REPLAY_TYPE = PHYSICAL
    HADR_SYNCMODE = NEARSYNC
    STANDBY_ID = 1
    LOG_STREAM_ID = 0
    HADR_STATE = PEER
    HADR_FLAGS =
    PRIMARY_MEMBER_HOST = standby2
    PRIMARY_INSTANCE = db2inst1
    PRIMARY_MEMBER = 0
    STANDBY_MEMBER_HOST = primary1
    STANDBY_INSTANCE = db2inst1
    STANDBY_MEMBER = 0
    HADR_CONNECT_STATUS = CONNECTED
    HADR_CONNECT_STATUS_TIME = 10/03/2017 07:49:06.422433 (1507042146)
    HEARTBEAT_INTERVAL(seconds) = 30
    HEARTBEAT_MISSED = 0
    HEARTBEAT_EXPECTED = 83
    HADR_TIMEOUT(seconds) = 300
    TIME_SINCE_LAST_RECV(seconds) = 0
    PEER_WAIT_LIMIT(seconds) = 0
    LOG_HADR_WAIT_CUR(seconds) = 0.000
    LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000000
    LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.000
    LOG_HADR_WAIT_COUNT = 0
    SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 16384
    SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 87380
    PRIMARY_LOG_FILE,PAGE,POS = S0000002.LOG, 86, 49264705
    STANDBY_LOG_FILE,PAGE,POS = S0000002.LOG, 86, 49264705
    HADR_LOG_GAP(bytes) = 0
    STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000002.LOG, 86, 49264705
    STANDBY_RECV_REPLAY_GAP(bytes) = 0
    PRIMARY_LOG_TIME = 10/03/2017 08:18:32.000000 (1507043912)
    STANDBY_LOG_TIME = 10/03/2017 08:18:32.000000 (1507043912)
    STANDBY_REPLAY_LOG_TIME = 10/03/2017 08:18:32.000000 (1507043912)
    STANDBY_RECV_BUF_SIZE(pages) = 512
    STANDBY_RECV_BUF_PERCENT = 0
    STANDBY_SPOOL_LIMIT(pages) = 13000
    STANDBY_SPOOL_PERCENT = 0
    STANDBY_ERROR_TIME = NULL
    PEER_WINDOW(seconds) = 300
    PEER_WINDOW_END = 10/03/2017 08:26:24.000000 (1507044384)
    TAKEOVER_APP_REMAINING_PRIMARY = 0
    READS_ON_STANDBY_ENABLED = N

    And suddenly it goes down:

    db2pd -db testdb -hadr

    Database TESTDB not activated on database member 0 or this database name cannot be found in the local database directory.

    And the primary automatically took over control of the database, and I don’t know why.

    I would like to know the correct way to make the auxiliary standby the new primary.

    • Adding a third node to the TSAMP domain is not an officially supported solution from IBM. It is also not one that I have ever attempted to implement. One of the reasons is that often the auxiliary standby(s) are used for DR purposes and not HA. Often they are in a geographically separate location with a similar set of application servers that would be activated if they were ever used. Usually communication between the app servers and the database servers if they were across a significant distance would not achieve reliable or acceptable performance for the end-user. I have a number of clients using auxiliary standbys, but never integrating them into the TSAMP domain.

      I suspect you have some sort of problem in the TSAMP setup that is causing TSAMP to issue commands to re-establish the primary. Clustering scripts can get complex if you’ve ever tried to build them yourself – I suspect there is additional scripting you would need to do for the cluster to make this work.

  7. Hello,
    I didn’t see a lock on the resources, but one of them is still pending online:
    Online IBM.ResourceGroup:db2_db2inst1_s00114_0-rg Nominal=Online
            '- Online IBM.Application:db2_db2inst1_s00114_0-rs
                    '- Online IBM.Application:db2_db2inst1_s00114_0-rs:s00114
    Pending online IBM.ResourceGroup:db2_db2inst1_s00115_0-rg Nominal=Online
            '- Pending online IBM.Application:db2_db2inst1_s00115_0-rs
                    '- Pending online IBM.Application:db2_db2inst1_s00115_0-rs:s00115

    Can someone help with this, please?

    • Hello All,

      I solved this issue by running the commands below:
      rmmod iTCO_wdt
      rmmod iTCO_vendor_support

      Thanks
      Harshavardhan L
