HADR does an awesome job of replicating all logged operations to 1-3 standby databases. It is remarkably simple to use and pretty resilient. More than once I’ve started talking to a client about PureScale only to discover their actual high availability and/or disaster recovery needs can easily be met by a 4-server HADR implementation. Sometimes even by a two-server implementation.
There are a few sticking points, though. These include:
- Keeping registry, DBM, and database configuration in sync
- Keeping automatically tuned buffer pools in sync
- Preventing non-logged operations
- Identifying if a non-logged operation has slipped through
EVERY HADR database should have the BLOCKNONLOGGED database configuration parameter set to YES. This parameter can be configured online, so it is super easy to change. BLOCKNONLOGGED prevents most non-logged operations from occurring on the primary database in the first place, so non-logged changes cannot happen on the primary and then fail to reach the standbys. When BLOCKNONLOGGED is set, the following operations fail:
- CREATE or ALTER TABLE statements that specify:
  - The NOT LOGGED INITIALLY parameter
  - A LOB column (CLOB, DBCLOB, or BLOB) defined with the NOT LOGGED parameter
- LOAD using the NONRECOVERABLE option
- LOAD using the COPY NO option
- Using EXPLAIN.DDL to create explain tables (use SYSINSTALLOBJECTS instead)
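Turning the parameter on is a single online command. A sketch, with SAMPLE standing in as a placeholder for your database name:

```shell
# BLOCKNONLOGGED is configurable online; no instance restart is required.
db2 update db cfg for SAMPLE using BLOCKNONLOGGED YES

# Verify the change took effect:
db2 get db cfg for SAMPLE | grep -i BLOCKNONLOGGED
```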
BLOCKNONLOGGED set to YES is a huge step in the right direction; however, there's a gap that I have experienced in the real world. Luckily, I discovered it before I was stuck in a recovery situation. (Always pay attention if the diag log on your standby has a lot of messages.) With BLOCKNONLOGGED set, the only LOAD syntax that will work is LOAD with COPY YES, but there is no way to force the LOAD COPY location to a filesystem that is available on all standbys. In my particular situation, I had a shared location for this purpose, but users decided to specify a different location. I now run a script that queries the database history to detect this situation. I have an RFE open, so if this issue affects you, please go vote on it: http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=83961
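A history check along those lines can be sketched as a small script. This is a hypothetical sketch, not the script I actually run: the database name SAMPLE, and the assumption that the LOAD COPY target shows up in the LOCATION column of the SYSIBMADM.DB_HISTORY view, should both be verified in your environment. The script only prints the db2 commands; review them, then pipe its output to sh on the database server to execute them.

```shell
#!/bin/sh
# Build (but do not run) the db2 CLP commands that list recent LOAD entries
# from the recovery history. OPERATION = 'L' marks LOAD operations; eyeball
# the LOCATION column for copy images written outside your shared filesystem.
DBNAME=SAMPLE   # placeholder database name
SQL="SELECT START_TIME, TABSCHEMA, TABNAME, LOCATION FROM SYSIBMADM.DB_HISTORY WHERE OPERATION = 'L' ORDER BY START_TIME DESC"

echo "db2 connect to $DBNAME"
echo "db2 \"$SQL\""
```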
What Happens If Something Slips Through the Cracks?
If a non-logged operation happens to tables on your primary database, and a failover ever occurs after that point, the table(s) in question will be inoperative. In that state, literally the only thing you can do with them is drop them. You cannot even keep them as empty tables. If even one table becomes inoperative, you must restore (tablespace or database) to get that table back to an operative state on all standbys. This is why it is critical both to prevent non-logged operations and to monitor for any that happen anyway.
Detecting Inoperative Tables on the Standby
There are several ways to detect inoperative tables on your standby. A technote documents additional methods beyond the ones detailed in this blog entry.
If you have Reads on Standby (ROS) enabled, you can connect to your standby database and run this query:
```
$ db2 connect to sample

   Database Connection Information

 Database server        = DB2/LINUXX8664 10.5.8
 SQL authorization ID   = VDBA
 Local database alias   = SAMPLE

$ db2 "select substr(TABSCHEMA,1,12) as tabschema
         , substr(TABNAME,1,35) as tabname
         , TABTYPE
         , AVAILABLE
       from TABLE(ADMIN_GET_TAB_INFO(null, null))
       where AVAILABLE='N'"

TABSCHEMA    TABNAME                             TABTYPE AVAILABLE
------------ ----------------------------------- ------- ---------

  0 record(s) selected.
```
db2dart can also be used to find tables in an inoperative state on the standby, but the database must be deactivated for the results to be reliable. For each tablespace, you can issue a command like this (here for tablespace ID 15):

```
db2dart <dbname> /TS /TSI 15 /QCK
```
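If you want to sweep every tablespace, the per-tablespace commands can be generated with a loop. A sketch, where the database name and the tablespace ID range are assumptions to adjust for your environment; the script only prints the commands, so you can review them and pipe the output to sh to run them while the database is deactivated:

```shell
#!/bin/sh
# Print one db2dart tablespace check command per tablespace ID.
DBNAME=SAMPLE              # placeholder database name
for TSID in 0 1 2 3; do    # adjust to your actual tablespace IDs
  echo "db2dart $DBNAME /TS /TSI $TSID"
done
```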
IBM-Supplied Perl Script
IBM offers a Perl script to perform this work as well. The script essentially automates the db2dart method for you.
Diagnostic Log on Standby
If you have this scenario occurring on the standby, there will be error messages written to the diagnostic log. Unfortunately, they are some of the least useful error messages I have seen. I was unable to decipher them, and opened a PMR with support to understand why my diag log was spinning with so many error messages.
My strategy for avoiding inoperative tables on the standby database is to enable BLOCKNONLOGGED, run a script to check for incorrect load copy locations, and to monitor the diagnostic log on the standby database server. Whatever strategies you employ, this is an important area to be aware of when supporting HADR.
Hi. As of 10.5 FP9 and 11.1.2.2, Db2 has introduced two new features that some customers may find interesting to help with table spaces going unavailable on the standby:
1. Monitoring and identifying table spaces in invalid or error state, or tables in Inoperative state on the HADR Standby database
A new flag, STANDBY_TABLESPACE_ERROR, was added to the HADR_FLAGS field of db2pd -hadr (and the MON_GET_HADR() table function).
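For alerting, checking that flag reduces to a string match on the HADR_FLAGS value. A minimal sketch of the parsing side only (check_flags and the sample values are made up for illustration; feed it the real value from db2pd -db <dbname> -hadr):

```shell
#!/bin/sh
# check_flags: report whether a given HADR_FLAGS value includes the
# STANDBY_TABLESPACE_ERROR flag.
check_flags() {
  case "$1" in
    *STANDBY_TABLESPACE_ERROR*) echo "ALERT: table space error on standby" ;;
    *)                          echo "no standby tablespace error flagged" ;;
  esac
}

# Sample values standing in for real db2pd output:
check_flags "STANDBY_RECV_BLOCKED STANDBY_TABLESPACE_ERROR"   # prints the ALERT line
check_flags ""                                                # prints the all-clear line
```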
For more details about monitoring this error condition, please see:
2. How to recover from table space errors on an HADR Standby database
A table space backup taken on the primary can now be restored on the standby. Yes, an offline table space restore can now be issued on the standby in order to repair a bad error state, while all along keeping the HADR role as a standby.
For more details on how to recover from this condition using a table space backup image, please see:
I read the documentation on this feature carefully, and thought it was just for tablespace errors – are they caused by inoperative tables?
Recently in my PROD environment, one of the tables went into roll-forward pending state on the standby DB. I have 2 standby servers, but this happened on only 1 standby, which is connected through LAN with my primary DB. Could you tell me what the reason for this might be and how we can prevent it in the future? The only option left for me to recover was a restore or re-initializing the tablespace, which was again a very lengthy process. Any input on this would be helpful 🙂
Hi Ember, thanks for the post. Your RFE link is broken… can’t access it.
It works for me in both Chrome and Firefox. RFE links have major problems in some scenarios. If you log in with your IBM ID first, it is more likely to work.
Quoted from the TechNote,
3. A TRUNCATE operation is performed on the table on the Primary database.
4. An IMPORT operation with the REPLACE option is performed on the table on the Primary database.
What is the best practice to deal with them in HADR environment?
That technote is inaccurate. TRUNCATE is supported in HADR, and I think IMPORT/REPLACE is as well.
Thanks for pointing this out. We’ll look to update in the near future.
Thank you very much