Updated March 2019 with the command used to get the first output below.
Most of what you’ll need to set up and test TSA using db2haicu is in my first few posts on the topic:
Using TSA/db2haicu to automate failover – Part 1: The Preparation
Using TSA/db2haicu to automate failover – Part 2: How it looks if it goes smoothly
Using TSA/db2haicu to Automate Failover Part 3: Testing, Ways Setup can go Wrong and What to do.
But there is one ongoing issue I've seen that I thought I would share. Most of the time, this issue comes from not shutting the two database servers down in the proper order when both are taken down at once. Most of my clients never, ever shut down both servers at the same time anyway.
TSA States
From the time you first get db2haicu set up, you should be looking at the states of the TSA resources and resource groups, so you know what looks normal for your implementation. I’ve found minor differences in different implementations done in the same way – I don’t know if that’s tied to the Fix Pack or what, but there are a few different things that can be normal.
Viewing States Using TSA Commands as Root
On one system I manage, the following TSA states are normal:
$ lssam
Online IBM.ResourceGroup:db2_db2inst1_Prod-db1.adomain.com_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_Prod-db1.adomain.com_0-rs
                '- Online IBM.Application:db2_db2inst1_Prod-db1.adomain.com_0-rs:Prod-db1
Online IBM.ResourceGroup:db2_db2inst1_Prod-db2.adomain.com_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_Prod-db2.adomain.com_0-rs
                '- Online IBM.Application:db2_db2inst1_Prod-db2.adomain.com_0-rs:Prod-db2
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCP01-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCP01-rs
                |- Online IBM.Application:db2_db2inst1_db2inst1_WCP01-rs:Prod-db1
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCP01-rs:Prod-db2
        '- Online IBM.ServiceIP:db2ip_172_12_12_12-rs
                |- Online IBM.ServiceIP:db2ip_172_12_12_12-rs:Prod-db1
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:Prod-db2
Online IBM.Equivalency:db2_db2inst1_Prod-db1.adomain.com_0-rg_group-equ
        '- Online IBM.PeerNode:Prod-db1.adomain.com:Prod-db1
Online IBM.Equivalency:db2_db2inst1_Prod-db2.adomain.com_0-rg_group-equ
        '- Online IBM.PeerNode:Prod-db2.adomain.com:Prod-db2
Online IBM.Equivalency:db2_db2inst1_db2inst1_WCP01-rg_group-equ
        |- Online IBM.PeerNode:Prod-db1.adomain.com:Prod-db1
        '- Online IBM.PeerNode:Prod-db2.adomain.com:Prod-db2
Online IBM.Equivalency:db2_public_network_0
        |- Online IBM.NetworkInterface:bond0:Prod-db2
        '- Online IBM.NetworkInterface:bond0:Prod-db1
Now, if you're viewing that on Linux, the "Online"s are all green, and the expected "Offline"s are all blue. If there's a problem, it will be in red.
This is my favorite way of looking at it. The red highlighting made it easy to see that there was a problem, even back when I understood very little about what it all meant.
Viewing States Using db2pd
You can also use db2pd to look at the states. I’m not as big of a fan of this method, but I think it’s a matter of preference. Here’s what the same system as above looks like using that method:
$ db2pd -d wc005p01 -ha
Option -ha is an instance scope option. The database option has been ignored.

DB2 HA Status
Instance Information:
Instance Name                  = db2inst1
Number Of Domains              = 1
Number Of RGs for instance     = 2

Domain Information:
Domain Name                    = prod_db2ha
Cluster Version                = 3.1.0.3
Cluster State                  = Online
Number of nodes                = 2

Node Information:
Node Name                     State
---------------------         -------------------
Prod-db1.adomain.com          Online
Prod-db2.adomain.com          Online

Resource Group Information:
Resource Group Name            = db2_db2inst1_db2inst1_WCP01-rg
Resource Group LockState       = Unlocked
Resource Group OpState         = Online
Resource Group Nominal OpState = Online
Number of Group Resources      = 2
Number of Allowed Nodes        = 2
   Allowed Nodes
   -------------
   Prod-db1.adomain.com
   Prod-db2.adomain.com

Member Resource Information:
   Resource Name               = db2_db2inst1_db2inst1_WCP01-rs
   Resource State              = Online
   Resource Type               = HADR
      HADR Primary Instance    = db2inst1
      HADR Secondary Instance  = db2inst1
      HADR DB Name             = WCP01
      HADR Primary Node        = Prod-db1.adomain.com
      HADR Secondary Node      = Prod-db2.adomain.com

   Resource Name               = db2ip_172_12_12_12-rs
   Resource State              = Online
   Resource Type               = IP

Resource Group Name            = db2_db2inst1_Prod-db1.adomain.com_0-rg
Resource Group LockState       = Unlocked
Resource Group OpState         = Online
Resource Group Nominal OpState = Online
Number of Group Resources      = 1
Number of Allowed Nodes        = 1
   Allowed Nodes
   -------------
   Prod-db1.adomain.com

Member Resource Information:
   Resource Name               = db2_db2inst1_Prod-db1.adomain.com_0-rs
   Resource State              = Online
   Resource Type               = DB2 Partition
      DB2 Partition Number     = 0
      Number of Allowed Nodes  = 1
      Allowed Nodes
      -------------
      Prod-db1.adomain.com

Network Information:
Network Name                  Number of Adapters
-----------------------       ------------------
db2_public_network_0          2

Node Name                     Adapter Name
-----------------------       ------------------
Prod-db2                      bond0
Prod-db1                      bond0

Quorum Information:
Quorum Name                                    Quorum State
------------------------------------           --------------------
Operator                                       Offline
db2_Quorum_Network_172_10_10_10:11_36_34       Online
Fail                                           Offline
I guess I can see how this method might be more understandable. But it doesn’t highlight problems in red!
It also has the advantage of being something you can execute as the db2 instance owner rather than as root.
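If you just want a quick scan rather than the whole report, filtering for the state lines works. This is just a convenience I use, not an official check; anything that is not Online or Unlocked deserves a closer look:
$ db2pd -ha | grep -i state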
Changing States
So, what do you do if things are highlighted in red?
Well, the first course of action is to check into HADR. First make sure that neither database is waiting on the other to start. Then verify that HADR shows as "Connected" and in "Peer" state with little or no log gap, using db2pd -d <dbname> -hadr:
$ db2pd -d wcp01 -hadr

Database Partition 0 -- Database WCP01 -- Active -- Up 71 days 16:06:16 -- Date 01/29/2013 20:33:23

HADR Information:
Role    State                SyncMode HeartBeatsMissed   LogGapRunAvg (bytes)
Primary Peer                 Nearsync 0                  1238

ConnectStatus ConnectTime                           Timeout
Connected     Mon Nov 19 04:27:21 2012 (1353320841) 120

PeerWindowEnd                          PeerWindow
Tue Jan 29 20:37:59 2013 (1359513479)  300

LocalHost                                LocalService
Prod-db1.adomain.com                     18819

RemoteHost                               RemoteService      RemoteInstance
Prod-db2.adomain.com                     18820              db2inst1

PrimaryFile  PrimaryPg  PrimaryLSN
S0009993.LOG 9847       0x00000081FF427CE6

StandByFile  StandByPg  StandByLSN
S0009993.LOG 9846       0x00000081FF426FF3
If HADR is working properly, then you may want to try to disable and re-enable db2haicu.
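If you go that route, here is a minimal sketch of the general shape, run as the instance owner. The exact prompts vary by version and Fix Pack, so treat it as an outline rather than the literal procedure:
# Put the instance's cluster configuration into maintenance mode; TSA stops reacting to failures
db2haicu -disable
# ...investigate and fix the underlying problem (HADR, network, permissions, etc.)...
# Rerunning db2haicu interactively should detect the disabled configuration and offer
# to make the instance highly available again
db2haicu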
Finally, if your situation matches the one below, you can try (at your own risk) the following procedure.
“Pending Online”
This is an issue that pops up sometimes with a running setup. If you ever have to bring both servers down, please follow the steps in section 7 of this document: http://download.boulder.ibm.com/ibmdl/pub/software/dw/data/dm-0908hadrdb2haicu/HADR_db2haicu.pdf. If you don't, you're likely to get TSA into an inconsistent state and spend a while untangling it. I'm going to share the steps that I use to get TSA out of this pending online state – but please note, these can be extremely dangerous, and if you don't understand what you're doing, you probably don't want to use them. Contact IBM to confirm whether these steps are appropriate for your situation, and use them at your own risk. I got these steps from a colleague who got them from support, but support later told him they might be dangerous.
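For reference, a planned shutdown of both servers generally looks something like the sketch below. This is only an outline from memory, using names from the examples above (database WCP01, domain prod_db2ha); the linked document is the authoritative procedure:
# As the instance owner: put cluster automation into maintenance mode first
db2haicu -disable
# Bring DB2 down cleanly on both nodes
db2 deactivate db WCP01
db2stop
# As root: stop the peer domain before powering the servers off
stoprpdomain prod_db2ha
# On the way back up, reverse the order: startrpdomain, db2start, activate the database,
# confirm HADR reaches peer state, then rerun db2haicu to re-enable automation
That is the planned-maintenance path; the rest of this section is about recovering after the fact, when the cluster is already stuck in "Pending online".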
You’ll need to run these as root. Even if your instance owner can run lssam, you still need root for the rest of these commands.
After you have verified that HADR is properly running, look at the states of the resources to ensure that your problem matches the one I am describing:
> su - root
# lssam
Pending online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs Control=SuspendedPropagated
                |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs Control=SuspendedPropagated
                |- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs Control=SuspendedPropagated
                '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02
After you have confirmed that this matches your issue, find which node is the master:
# lssamctrl -V
Starting to list SAM Control information.
lssamctrl: Executed on Fri Apr 22 12:45:08 2011 at "dbserver01", master node "dbserver01".
Displaying SAM Control information:

SAMControl:
        TimeOut                = 60
        RetryCount             = 3
        Automation             = Auto
        ExcludedNodes          = {}
        ResourceRestartTimeOut = 5
        ActiveVersion          = [3.1.0.1,Fri Mar 11 16:10:54 EST 2011]
        EnablePublisher        = Disabled
        TraceLevel             = 31
        ActivePolicy           = []
        CleanupList            = {}
        PublisherList          = {}
Completed Listing SAM Control information.
That told us: master node “dbserver01”
Now, on the master node, get the process id for the recovery manager:
# ps -ef |grep -i recoveryrm
    root  7929864  3866752   0   Apr 07      -  0:36 /usr/sbin/rsct/bin/IBM.RecoveryRMd
Now kill that process id:
# kill 7929864
Next, confirm that the recovery manager starts a new process:
# ps -ef |grep -i recoveryrm
    root  7929866  3866752   1 12:54:17      -  0:00 /usr/sbin/rsct/bin/IBM.RecoveryRMd
Validate that the “In Config State” is TRUE:
# lssrc -ls IBM.RecoveryRM |grep "In Config State"
   In Config State              : TRUE
Now look at the changes in status: the Pending online state is now Offline, Nominal has changed to Offline, and the Control=SuspendedPropagated flags are gone:
# lssam
Offline IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Nominal=Offline
        |- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs
                |- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs
                |- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Offline IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Offline
        '- Offline IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Offline IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Offline IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Offline
        '- Offline IBM.Application:db2_db2inst1_dbserver02_0-rs
                '- Offline IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02
Now issue commands to properly set the Resource Groups – first set the Resource Group online for the Master server, and then set it online for the Standby server:
# chrg -o online db2_db2inst1_dbserver01_0-rg
# chrg -o online db2_db2inst1_dbserver02_0-rg
Check the status again, and note the differences – the Resource groups at the bottom now show as online:
# lssam
Offline IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Nominal=Offline
        |- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs
                |- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs
                |- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02
Now, set the Resource Group online for the Database:
# chrg -o online db2_db2inst1_db2inst1_WCQ01-rg
You may note a Lock state while the Resource Group switches to ONLINE:
# lssam
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Request=Lock Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs Control=SuspendedPropagated
                |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Online IBM.ServiceIP:db2ip_172_12_12_12-rs Control=SuspendedPropagated
                |- Online IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02
After a bit, everything should show as normal again:
# lssam
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCQ01-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs
                |- Online IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCQ01-rs:dbserver02
        '- Online IBM.ServiceIP:db2ip_172_12_12_12-rs
                |- Online IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.ResourceGroup:db2_db2inst1_dbserver01_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver01_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_dbserver02_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs
                '- Online IBM.Application:db2_db2inst1_dbserver02_0-rs:dbserver02
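One small convenience while you wait for states to settle: instead of rerunning lssam by hand, it can refresh continuously (a side note from me, not part of the recovery procedure; check that your TSAMP version supports the option):
# Continuously refreshing display of resource states, similar to top; Ctrl-C to exit
lssam -top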
What TSA Looks Like if HADR is Simply Down
Always make sure you get HADR up before digging into TSA states. The output looks similar (but slightly different) if HADR is simply down. Notice the "Request=Lock" in there – that's different from the issue above.
Online IBM.ResourceGroup:db2_db2inst1_Prod-db1.adomain.com_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_Prod-db1.adomain.com_0-rs
                '- Online IBM.Application:db2_db2inst1_Prod-db1.adomain.com_0-rs:dbserver01
Online IBM.ResourceGroup:db2_db2inst1_Prod-db2.adomain.com_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_Prod-db2.adomain.com_0-rs
                '- Online IBM.Application:db2_db2inst1_Prod-db2.adomain.com_0-rs:dbserver02
Pending online IBM.ResourceGroup:db2_db2inst1_db2inst1_WCP01-rg Request=Lock Nominal=Online
        |- Offline IBM.Application:db2_db2inst1_db2inst1_WCP01-rs Control=SuspendedPropagated
                |- Offline IBM.Application:db2_db2inst1_db2inst1_WCP01-rs:dbserver01
                '- Offline IBM.Application:db2_db2inst1_db2inst1_WCP01-rs:dbserver02
        '- Online IBM.ServiceIP:db2ip_172_12_12_12-rs Control=SuspendedPropagated
                |- Online IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver01
                '- Offline IBM.ServiceIP:db2ip_172_12_12_12-rs:dbserver02
Online IBM.Equivalency:db2_db2inst1_Prod-db1.adomain.com_0-rg_group-equ
        '- Online IBM.PeerNode:Prod-db1.adomain.com:dbserver01
Online IBM.Equivalency:db2_db2inst1_Prod-db2.adomain.com_0-rg_group-equ
        '- Online IBM.PeerNode:Prod-db2.adomain.com:dbserver02
Online IBM.Equivalency:db2_db2inst1_db2inst1_WCP01-rg_group-equ
        |- Online IBM.PeerNode:Prod-db1.adomain.com:dbserver01
        '- Online IBM.PeerNode:Prod-db2.adomain.com:dbserver02
Online IBM.Equivalency:db2_public_network_0
        |- Online IBM.NetworkInterface:bond0:dbserver02
        '- Online IBM.NetworkInterface:bond0:dbserver01
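If that is what you see, the cure is usually just to get HADR running again. A minimal sketch using the database name from the examples above (start the standby side first, then the primary):
# On the standby server, as the instance owner:
db2 start hadr on db WCP01 as standby
# On the primary server, as the instance owner:
db2 start hadr on db WCP01 as primary
# Confirm the pair is Connected and reaches Peer state before worrying about TSA states
db2pd -d WCP01 -hadr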
I’d love to hear problems that others have encountered and how you’ve resolved them to help others! Leave a comment with your situation and solution.
Other Posts In This Series
This series consists of four posts:
Using TSA/db2haicu to automate failover – Part 1: The Preparation
Using TSA/db2haicu to automate failover – Part 2: How it looks if it goes smoothly
Using TSA/db2haicu to Automate Failover Part 3: Testing, Ways Setup can go Wrong and What to do.
Using TSA/db2haicu to Automate Failover Part 4: Dealing with Problems After Setup
Search this blog on “TSA” for other posts on TSA issues and tips.
Situation:
A server that holds the standby database went down; after it came back up,
you can see Control=SuspendedPropagated
but no lock on the resource group.
What should I do to remove this flag?
Thank you.
DB21085I Instance “db2pb1” uses “64” bits and DB2 code release “SQL09075” with
level identifier “08060107”.
Informational tokens are “DB2 v9.7.0.5”, “special_28492”, “IP23285_28492”, and
Fix Pack “5”.
Product is installed at “/db2/db2pb1/db2_software”.
arlpb1ci:db2pb1 7> oslevel -s
7100-01-05-1228
Online IBM.ResourceGroup:db2_db2pb1_db2pb1_PB1-rg Nominal=Online
|- Online IBM.Application:db2_db2pb1_db2pb1_PB1-rs Control=SuspendedPropagated
|- Online IBM.Application:db2_db2pb1_db2pb1_PB1-rs:arlpsap11
‘- Offline IBM.Application:db2_db2pb1_db2pb1_PB1-rs:arlpsap12
|- Online IBM.ServiceIP:db2ip_10_180_0_111-rs Control=SuspendedPropagated
|- Online IBM.ServiceIP:db2ip_10_180_0_111-rs:arlpsap11
‘- Offline IBM.ServiceIP:db2ip_10_180_0_111-rs:arlpsap12
‘- Online IBM.ServiceIP:db2ip_10_194_6_209-rs Control=SuspendedPropagated
|- Online IBM.ServiceIP:db2ip_10_194_6_209-rs:arlpsap11
‘- Offline IBM.ServiceIP:db2ip_10_194_6_209-rs:arlpsap12
Resource Group Information:
Resource Group Name = db2_db2pb1_db2pb1_PB1-rg
Resource Group LockState = Unlocked
Resource Group OpState = Online
Resource Group Nominal OpState = Online
Number of Group Resources = 3
Number of Allowed Nodes = 2
The only series of steps I have to try are the ones in this blog entry. Did you resolve this? Sorry for the late response, I was taking a vacation – camping with the family.
On the Pending Online issue, my problems were as follows:
Softdog issues:
I viewed the lssam output and could see that the instance on db2prod02 was showing “Pending online”. The reason for this is a third-party watchdog module that is preventing IBM’s cluster software from loading its own (there can only be one watchdog module active on a given server). The syslog shows the problem:
Feb 24 14:21:51 db2prod02 hatsd[19978]: hadms: Loading watchdog softdog, timeout = 8000 ms.
Feb 24 14:21:51 db2prod02 hatsd[19978]: hadms: Found loaded iTCO_vendor_support with count 1
Feb 24 14:21:51 db2prod02 hatsd[19978]: hadms: iTCO_vendor_support has a use count of 1 and cannot be unloaded
The “iTCO_vendor_support” module needs to be disabled (preferably uninstalled). You should check db2prod01 as well so there is no unexpected issue in the future. This is the advice I asked Adam to pass on to you last Friday. It looks like you’re still working on this, with your SysAdmin I’m assuming.
Once the instance is able to reach an “Online” state, db2haicu will be able to add HADR databases again.
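(For anyone else hitting this watchdog conflict, the checks and fix look roughly like the sketch below. Run it as root; the blacklist file name is just an arbitrary choice, and unloading kernel modules affects the whole server, so coordinate with your sysadmin.)
# See which watchdog-related kernel modules are loaded and their use counts
lsmod | grep -iE "itco|softdog|wdt"
# Unload the conflicting modules; iTCO_wdt holds iTCO_vendor_support, so remove it first
rmmod iTCO_wdt
rmmod iTCO_vendor_support
# Keep them from loading again at boot
echo "blacklist iTCO_wdt" >> /etc/modprobe.d/blacklist-watchdog.conf
echo "blacklist iTCO_vendor_support" >> /etc/modprobe.d/blacklist-watchdog.conf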
and then just permissions issues getting db2haicu to run:
I had to do the following to get it to work, as well as do an HADR takeover, before it would let me add secondary and tertiary databases into the cluster. On the primary, it would refuse to add databases into the cluster, stating a problem with this error:
2014-02-27-15.11.02.709792-420 E51459483E655 LEVEL: Error
PID     : 28178                TID : 139851322767136      PROC : db2haicu
INSTANCE: atlinst NODE : 000
FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure, sqlhaUICreateHADR, probe:900
RETCODE : ECF=0x9000056F=-1879046801=ECF_SQLHA_HADR_VALIDATION_FAILED
The HADR DB failed validation before being added to the cluster
MESSAGE : Please verify that HADR_REMOTE_INST and HADR_REMOTE_HOST are correct
and in the exact format and case as the Standby instance name and
hostname.
DATA #1 : String, 7 bytes
atlinst
DATA #2 : String, 9 bytes
db2prod02
On new instances, I would get the following technote issue regarding db2havend and the library file:
http://www-01.ibm.com/support/docview.wss?uid=swg21649212
Also had an issue with CT_MANAGEMENT_SCOPE:
http://www-01.ibm.com/support/docview.wss?uid=swg1IC64785
db2set DB2_DIRECT_IO=false
export CT_MANAGEMENT_SCOPE=2
But my main hurdle I spent all of last Fri/Sat night on was:
-- change setuid permissions on db2havend(s) and lib32
-- http://www-01.ibm.com/support/docview.wss?uid=swg21649212
MUST BE:
-r-sr-xr-x 1 root db2inst1 4642211 Apr 3 18:17 db2havend
-r-sr-xr-x 1 root db2inst1 3990657 Apr 3 18:17 db2havend32
lrwxrwxrwx 1 root root 14 Apr 11 13:10 libdb2tsa.so -> libdb2tsa.so.1
-r-xr-xr-x 1 bin bin 152529 Mar 19 01:32 libdb2tsa.so.1
check by using
ls -l | grep db2have
FIX by using:
chmod 555 on libdb2tsa.so.1 in dir sqllib/lib64
chmod 4555 on db2havend and db2havend64 in sqllib/adm
Thank you, as your post did help me… not the same issue, but it was good to know I wasn’t alone. Thank you, Ember.
Hi Ember,
Can you please let me know what can be done in the situation below?
Failed offline IBM.ResourceGroup:db2_tdbin02_tdbin02_XXX-rg Nominal=Online
|- Failed offline IBM.Application:db2_tdbin02_tdbin02_XXX-rs
|- Failed offline IBM.Application:db2_tdbin02_tdbin02_XXX-rs:IDOCTOHADR01
‘- Failed offline IBM.Application:db2_tdbin02_tdbin02_XXX-rs:IDOCTOHADR02
‘- Offline IBM.ServiceIP:db2ip_172_20_62_108-rs
|- Offline IBM.ServiceIP:db2ip_172_20_62_108-rs:IDOCTOHADR01
‘- Offline IBM.ServiceIP:db2ip_172_20_62_108-rs:IDOCTOHADR02
When I’m trying to switch over from server 1 to server 2, some of the databases go into Failed offline mode. There are 14 databases in one instance.
Does only one database go into failed offline or all 14? Do you have all 14 fully configured in TSAMP? How are you doing the failover – through TAKEOVER command or db2haicu?
Multiple databases on one instance can be problematic with TSAMP – especially when using the VIP as you are, as you have to ensure that all databases fail over at the same time or you have to define different virtual IP addresses for each database.
Hi Ember,
I’m doing the failover by using the db2haicu command.
All 14 databases are configured in TSAMP with different VIPs… out of the 14, sometimes 3 or 4 databases go into Failed offline mode.
I don’t know the issue off the top of my head. Sounds like a PMR with IBM might be in order.
hi Ember
How much time will the standby take to take over if the primary fails, when using TSA with DB2?
Maximum time should be hadr_peer_window plus hadr_timeout. The actual failover, once initiated, depends on volume, but is frequently less than 30 seconds.
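Both of those parameters live in the database configuration; a quick way to check them for your own database (the database name here is just an example):
db2 get db cfg for WCP01 | grep -iE "hadr_timeout|hadr_peer_window"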
Hi Ember,
I have gone through your article and it is really very descriptive and easy to understand. However, I am recently facing one strange issue and I am unable to figure out what is going wrong in this case. If you can give input on this, it will be very helpful.
Recently one of our servers, which hosts the PRIMARY database of the HADR pair, went down, but it did not automatically fail over to the STANDBY. I had to manually do a TAKEOVER. Once the PRIMARY came up, I switched back to the original setup.
To find out why the automatic failover did not work, I issued the lssam command first. I am seeing the unusual output below: the HADR database status shows as Pending online and Unknown. Googling did not serve much purpose, but I found one link (http://www-01.ibm.com/support/docview.wss?uid=swg21961711) suggesting that TSAMP is not able to monitor the DB2 HADR status. I tried running the db2pd -hadr command as root and it works perfectly fine.
Can you please suggest what can be done to diagnose further?
Pending online IBM.ResourceGroup:db2_dbins371_dbins371_DSIMPR-rg Nominal=Online
|- Unknown IBM.Application:db2_dbins371_dbins371_DSIMPR-rs
|- Unknown IBM.Application:db2_dbins371_dbins371_DSIMPR-rs:server1
‘- Unknown IBM.Application:db2_dbins371_dbins371_DSIMPR-rs:server2
TSAMP state problems can be difficult. You can try running db2haicu and see if it just needs to be enabled after an extended outage. There are some suggestions on other approaches in my blog articles, but they should be used at your own risk.
Your post about HADR and TSA was very helpful for me, and I would like to ask some questions.
I have an HADR environment with TSA on DB2 v10.5, and it currently works well. I intend to add an auxiliary standby and add it to the TSA cluster. I have read that the TSA cluster does not support a second standby for role switching, but in my test environment I have created a cluster with 3 nodes (primary, principal standby, and auxiliary standby):
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_TESTDB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst1_db2inst1_TESTDB-rs
|- Online IBM.Application:db2_db2inst1_db2inst1_TESTDB-rs:primary1
‘- Offline IBM.Application:db2_db2inst1_db2inst1_TESTDB-rs:standby1
‘- Online IBM.ServiceIP:db2ip_10_120_202_58-rs
|- Online IBM.ServiceIP:db2ip_10_120_202_58-rs:primary1
‘- Offline IBM.ServiceIP:db2ip_10_120_202_58-rs:standby1
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_QADB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst1_db2inst1_QADB-rs
|- Online IBM.Application:db2_db2inst1_db2inst1_QADB-rs:primary1
‘- Offline IBM.Application:db2_db2inst1_db2inst1_QADB-rs:standby1
‘- Online IBM.ServiceIP:db2ip_10_120_202_59-rs
|- Online IBM.ServiceIP:db2ip_10_120_202_59-rs:primary1
‘- Offline IBM.ServiceIP:db2ip_10_120_202_59-rs:standby1
Online IBM.ResourceGroup:db2_db2inst1_primary1_0-rg Nominal=Online
‘- Online IBM.Application:db2_db2inst1_primary1_0-rs
‘- Online IBM.Application:db2_db2inst1_primary1_0-rs:primary1
Online IBM.ResourceGroup:db2_db2inst1_standby1_0-rg Nominal=Online
‘- Online IBM.Application:db2_db2inst1_standby1_0-rs
‘- Online IBM.Application:db2_db2inst1_standby1_0-rs:standby1
Online IBM.ResourceGroup:db2_db2inst1_standby2_0-rg Nominal=Online
‘- Online IBM.Application:db2_db2inst1_standby2_0-rs
‘- Online IBM.Application:db2_db2inst1_standby2_0-rs:standby2
Online IBM.Equivalency:db2_db2inst1_db2inst1_TESTDB-rg_group-equ
|- Online IBM.PeerNode:primary1:primary1
‘- Online IBM.PeerNode:standby1:standby1
Online IBM.Equivalency:db2_db2inst1_db2inst1_QADB-rg_group-equ
|- Online IBM.PeerNode:primary1:primary1
‘- Online IBM.PeerNode:standby1:standby1
Online IBM.Equivalency:db2_db2inst1_primary1_0-rg_group-equ
‘- Online IBM.PeerNode:primary1:primary1
Online IBM.Equivalency:db2_db2inst1_standby1_0-rg_group-equ
‘- Online IBM.PeerNode:standby1:standby1
Online IBM.Equivalency:db2_db2inst1_standby2_0-rg_group-equ
‘- Online IBM.PeerNode:standby2:standby2
Online IBM.Equivalency:db2_public_network_0
|- Online IBM.NetworkInterface:eth1:standby1
|- Online IBM.NetworkInterface:eth1:primary1
‘- Online IBM.NetworkInterface:eth1:standby2
Online IBM.Equivalency:db2_public_network_1
|- Online IBM.NetworkInterface:eth0:standby1
|- Online IBM.NetworkInterface:eth0:primary1
‘- Online IBM.NetworkInterface:eth0:standby2
[db2inst1@primary1 ~]$ db2pd -db deltas -hadr
But something wrong happens when I switch the roles from the primary to the auxiliary standby. Manually, from the auxiliary standby, I run “db2 takeover hadr on db testdb”. The command executes successfully:
HADR_ROLE = PRIMARY
REPLAY_TYPE = PHYSICAL
HADR_SYNCMODE = NEARSYNC
STANDBY_ID = 1
LOG_STREAM_ID = 0
HADR_STATE = PEER
HADR_FLAGS =
PRIMARY_MEMBER_HOST = standby2
PRIMARY_INSTANCE = db2inst1
PRIMARY_MEMBER = 0
STANDBY_MEMBER_HOST = primary1
STANDBY_INSTANCE = db2inst1
STANDBY_MEMBER = 0
HADR_CONNECT_STATUS = CONNECTED
HADR_CONNECT_STATUS_TIME = 10/03/2017 07:49:06.422433 (1507042146)
HEARTBEAT_INTERVAL(seconds) = 30
HEARTBEAT_MISSED = 0
HEARTBEAT_EXPECTED = 83
HADR_TIMEOUT(seconds) = 300
TIME_SINCE_LAST_RECV(seconds) = 0
PEER_WAIT_LIMIT(seconds) = 0
LOG_HADR_WAIT_CUR(seconds) = 0.000
LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000000
LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.000
LOG_HADR_WAIT_COUNT = 0
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 16384
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 87380
PRIMARY_LOG_FILE,PAGE,POS = S0000002.LOG, 86, 49264705
STANDBY_LOG_FILE,PAGE,POS = S0000002.LOG, 86, 49264705
HADR_LOG_GAP(bytes) = 0
STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000002.LOG, 86, 49264705
STANDBY_RECV_REPLAY_GAP(bytes) = 0
PRIMARY_LOG_TIME = 10/03/2017 08:18:32.000000 (1507043912)
STANDBY_LOG_TIME = 10/03/2017 08:18:32.000000 (1507043912)
STANDBY_REPLAY_LOG_TIME = 10/03/2017 08:18:32.000000 (1507043912)
STANDBY_RECV_BUF_SIZE(pages) = 512
STANDBY_RECV_BUF_PERCENT = 0
STANDBY_SPOOL_LIMIT(pages) = 13000
STANDBY_SPOOL_PERCENT = 0
STANDBY_ERROR_TIME = NULL
PEER_WINDOW(seconds) = 300
PEER_WINDOW_END = 10/03/2017 08:26:24.000000 (1507044384)
TAKEOVER_APP_REMAINING_PRIMARY = 0
READS_ON_STANDBY_ENABLED = N
And suddenly it goes down:
db2pd -db testdb -hadr
Database TESTDB not activated on database member 0 or this database name cannot be found in the local database directory.
And the primary automatically took over control of the database, and I don’t know why.
I would like to know the correct way to make the auxiliary standby the new primary.
Adding a third node to the TSAMP domain is not an officially supported solution from IBM. It is also not one that I have ever attempted to implement. One of the reasons is that often the auxiliary standby(s) are used for DR purposes and not HA. Often they are in a geographically separate location with a similar set of application servers that would be activated if they were ever used. Usually communication between the app servers and the database servers if they were across a significant distance would not achieve reliable or acceptable performance for the end-user. I have a number of clients using auxiliary standbys, but never integrating them into the TSAMP domain.
I suspect you have some sort of problem in the TSAMP setup that is causing TSAMP to issue commands to re-establish the primary. Clustering scripts can get complex if you’ve ever tried to build them yourself – I suspect there is additional scripting you would need to do for the cluster to make this work.
Hello,
I didn’t see a lock on the resources, but they are still pending online:
Online IBM.ResourceGroup:db2_db2inst1_s00114_0-rg Nominal=Online
‘- Online IBM.Application:db2_db2inst1_s00114_0-rs
‘- Online IBM.Application:db2_db2inst1_s00114_0-rs:s00114
Pending online IBM.ResourceGroup:db2_db2inst1_s00115_0-rg Nominal=Online
‘- Pending online IBM.Application:db2_db2inst1_s00115_0-rs
‘- Pending online IBM.Application:db2_db2inst1_s00115_0-rs:s00115
Can someone help with this, please?
Hello All,
I solved this issue by running the commands below:
rmmod iTCO_wdt
rmmod iTCO_vendor_support
Thanks
Harshavardhan L