It has taken me a while to fully understand the difference between HADR_TIMEOUT and HADR_PEER_WINDOW. I think there is some confusion here, so I’d like to address what each means and some considerations when setting them. In general, you’ll only need HADR_TIMEOUT when using HADR alone, and only need HADR_PEER_WINDOW when using TSA (db2haicu) or some other automated failover tool.
HADR_TIMEOUT defines, in seconds, how long after the other HADR server is first noticed to be unavailable the HADR state changes from connected to disconnected. If you are starting HADR on the primary server and the primary cannot connect to the standby within this number of seconds, the start fails and HADR will not be running. Assuming no failover software and HADR_PEER_WINDOW set to 0, once the timeout expires the primary server continues processing transactions without sending them to the standby. It periodically retries the connection to the standby, and if the standby becomes available, it again processes transactions with commits tied to the requirements of the SYNCMODE in use.
If attempting a takeover without force, DB2 will wait this amount of time to attempt to communicate with the other server before failing and returning an error message.
The real point of this time period is to let minor network hiccups pass without any other action being taken, yet still consider the connection failed after a reasonable period of time so that transactions are not impeded indefinitely.
Setting this value depends on your network. I have a client with frequent network issues where I keep this value at 300. I have other clients where I simply use 120, which seems to work well for most environments. I have seen it set as low as 10 seconds for a very highly available network where even seconds of slowdown are not acceptable, but I would be very cautious about setting it that low.
HADR_PEER_WINDOW is not usually used when only HADR is in place with manual failover, but it is critical when using automated failover for HADR, such as TSA (db2haicu) or others. It tells DB2 how long AFTER the connection is considered failed to continue to behave as if the connection had not failed. That may sound a bit odd, but the real intention is to allow the connection to be considered failed and then give the failover automation software time to detect that failure before any transactions are allowed to complete and compromise the data. This means you can easily have connections waiting for as much as HADR_TIMEOUT plus HADR_PEER_WINDOW before a failover is completed and your database is again available.
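To make that arithmetic concrete, here is a rough sketch in Python. It is only a simplified model of the timeline described above, not DB2's actual state machine, and the function names (primary_phase, worst_case_wait) are made up for illustration. The sample values are ones mentioned in this post and in the comments.

```python
def primary_phase(elapsed, hadr_timeout, hadr_peer_window):
    """Simplified label for the primary's behaviour `elapsed` seconds
    after contact with the standby is lost (illustrative model only)."""
    if elapsed < hadr_timeout:
        # Still inside HADR_TIMEOUT: the connection is not yet considered failed.
        return "waiting: connection not yet considered failed"
    if elapsed < hadr_timeout + hadr_peer_window:
        # Connection is considered failed, but the peer window is still open.
        return "peer window: connection failed, commits still held"
    # Both periods have expired: the primary carries on alone.
    return "independent: primary commits without the standby"


def worst_case_wait(hadr_timeout, hadr_peer_window):
    """Upper bound, in seconds, on how long connections may wait."""
    return hadr_timeout + hadr_peer_window


# (HADR_TIMEOUT, HADR_PEER_WINDOW) pairs taken from the post and a comment.
for label, timeout, window in [("120/300", 120, 300), ("20/60", 20, 60)]:
    print(label, "worst case wait:", worst_case_wait(timeout, window), "seconds")
```

With HADR_TIMEOUT at 120 and HADR_PEER_WINDOW at 300, that is up to 420 seconds of waiting in the worst case, which is worth keeping in mind when picking either value.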
Most frequently I see HADR_PEER_WINDOW set to 300 out of an abundance of caution – actual takeovers do not generally take that long, though in a failure state there may be multiple factors slowing down the failover.
Good to know !
We have just implemented HADR + TSA and our settings are
HADR timeout value (HADR_TIMEOUT) = 20
HADR log write synchronization mode (HADR_SYNCMODE) = NEARSYNC
HADR peer window duration (seconds) (HADR_PEER_WINDOW) = 60
Nobody has complained so far, and HADR + TSA behave as expected (very well), for now.
Let’s give hope a chance. 🙂
Thanks for sharing, Amber.
In our DB2 HADR / db2haicu system, DB2 behaves strangely when there is a network issue: DB2 initiates a takeover, then the new primary crashes and HADR shuts down to avoid split brain. Later there is a takeover back to the old primary, and then both end up in a hung state.
Manual takeover works perfectly, as does any system-down scenario. We only have issues with network failures. We are also using a VIP.
I found that both HADR_TIMEOUT and HADR_PEER_WINDOW are set to 120 seconds. Could that cause these issues?
I would disagree a bit with “This parameter (HADR_PEER_WINDOW) is not usually used when only HADR is in place with manual failover.”
It’s critical, especially when we use manual failover.
Your business continuity planning policy must contain a decision about which is more important: reducing outage time in the case of hardware failures, or data integrity.
If we set the peer window to a large value (hours or even days), we can be sure that the primary will not commit any transaction before it has been processed by the standby. That gives us the ability to switch to the standby and be sure that no transaction is lost, even in the case of a BY FORCE takeover:
1) Ask the data center admins to shut down the former primary.
2) Take over HADR on the standby with BY FORCE PEER WINDOW ONLY.
The disadvantage of this solution is the outages caused by hardware/software failures at the standby.
The primary will also be stuck because of such a failure. So the overall probability of getting stuck doubles (and the overall expected outage time doubles as well).
I have seen HADR solutions that were built ideally from a technical point of view, but which didn’t actually work when disaster struck because no one could decide whether the possible data loss from a TAKEOVER BY FORCE action was acceptable.
It’s a system architecture or client agreement level decision, and it must be made in advance.
BTW, the systems I mentioned had successfully passed BCP tests, but all of those tests were done in “graceful” mode (while connectivity between the HADR pair was active).