HADR does an awesome job of replicating all logged operations to 1-3 standby databases. It is remarkably simple to use and pretty resilient. More than once I’ve started talking to a client about PureScale only to discover their actual high availability and/or disaster recovery needs can easily be met by a 4-server HADR implementation. Sometimes even by a two-server implementation.
There are a few sticking points, though. These include:
- Keeping registy, DBM, and database configuration in sync
- Keeping automatically tuned buffer pools in sync
- Preventing non-logged operations
- Identifying if a non-logged operation has slipped through
EVERY HADR database should have the BLOCKNONLOGGED database configuration parameter set to “YES”. This parameter can be configured online, so it is super easy to change. BLOCKNONLOGGED prevents most non-logged operations from occurring on the primary database, thus offering no conflict in non-logged changes that don’t make it to the standbys. When BLOCKNONLOGGED is set, the following operations fail:
- CREATE or ALTER TABLE statements that specify
- The NOT LOGGED INITIALLY parameter
- A LOB column with the NOT LOGGED parameter
- A CLOB, DBCLOB, or BLOB column defined as not logged
- LOAD using the NONRECOVERABLE option
- LOAD using the COPY NO option
- Using EXPLAIN.DDL to create explain tables (use SYSINSTALLOBJECTS instead)
BLOCKNONLOGGED set to YES is a huge step in the right direction, however there’s a gap that I have experienced in the real world. Luckily I discovered it before I was stuck in a recovery situation. (Always pay attention if the diag log on your standby has a lot of messages.) BLOCKNONLOGGED means that the only LOAD syntax that will work has to use “COPY YES”, but there is no way to force the LOAD COPY location to a filesystem that is available on all standbys. In my particular situation, I had a shared location for this purpose, but users decided to specify a different location. I now run a script to query the database history to detect this situation. I have an RFE open, so if this issue affects you, please go vote on it: http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=83961
What Happens If Something Slips Through the Cracks?
If some non-logged operation happens to tables on your primary database, then if a failover ever occurs after that point, the table(s) in question will be inoperative. In this case, that means that literally the only thing you can do with them is to drop them. You cannot even keep them empty. If even one table becomes inoperative, you must restore (tablespace or database) to get that table back to an operative state on all standbys. This is why it is critical to prevent non-logged operations and to monitor for them happening anyway.
Detecting Inoperative Tables on the Standby
There are several ways to detect inoperative tables on your standby. There are other method in addition to the ones detailed in this blog article in a technote.
If you have Reads on Standby (ROS) enabled, you can connect to your standby database and run this query:
$ db2 connect to sample Database Connection Information Database server = DB2/LINUXX8664 10.5.8 SQL authorization ID = VDBA Local database alias = SAMPLE select substr(TABSCHEMA,1,12) as tabschema , substr(TABNAME,1,35) as tabname , TABTYPE , AVAILABLE from TABLE(ADMIN_GET_TAB_INFO(null, null)) where AVAILABLE='N' TABSCHEMA TABNAME TABTYPE AVAILABLE ------------ ----------------------------------- ------- --------- 0 record(s) selected.
db2dart can also be used to find tables in an inoperative state on the standby, but this requires the database to be deactivated to be reliable. For each tablespace, you can issue a command like this:
/TS /TSI /QCK 15
IBM-Supplied Perl Script
IBM offers a perl script to perform this work as well. This script really mostly does the db2dart method for you.
Diagnostic Log on Standby
If you have this scenario occurring on the standby, there will be error messages written to the diagnostic log. Unfortunately, they are some of the least useful error messages I have seen. I was unable to decipher them, and opened a PMR with support to understand why my diag log was spinning with so many error messages.
My strategy for avoiding inoperative tables on the standby database is to enable BLOCKNONLOGGED, run a script to check for incorrect load copy locations, and to monitor the diagnostic log on the standby database server. Whatever strategies you employ, this is an important area to be aware of when supporting HADR.