My goal here is to inspire you to test at least some small part of your recovery strategy. This post is tied to my spot in the Talent Search portion of the DB2’s Got Talent competition on the DB2Night Show, but it can also stand alone.
Your Recovery Strategy
If you’re like me, you spend a fair amount of time thinking about your recovery strategy. Imagining disasters and how you could recover from them. Debating if you really should add in another incremental. Worrying that you’ve made a mistake and won’t have all the log files.
The thing is that I’ve been shocked at some of the actual disasters that have happened. Last year, I lost an entire RAID array (two disks at once) just 3 hours after HADR decided to shut itself down. And of course alerting on HADR was disabled because we were getting other alerts due to the one bad disk we had just identified. The entire production server was essentially gone. Luckily we had an hour or so of read-only access to some of the data, and that included the transaction logs. So I snagged them and restored. But the restore took forever because I had to use a backup that was two weeks old – the previous week’s backup had spontaneously failed. And then a table had been loaded non-recoverably since that backup as part of a data pruning process, so we had to reload that table from data kept elsewhere.
And that makes me sound like I’m not doing my job, but sometimes the bad breaks just pile up like that. Despite all the problems, I still got every scrap of data back and did it within 12 hours. Because I had backup plans to my backup plans. I lost a production (standby) server last week, and there was no impact at all, and I restored it in less than an hour once the hardware issues were resolved. I’m not even sure if the client was aware of that one.
The best DBA is a paranoid, detail oriented control freak.
Sometimes, you put a lot of thought into a recovery strategy. Sometimes you just go with an already-used standard recovery strategy. Sometimes you inherit one from another DBA or a vendor or a contractor. Wherever your recovery strategy comes from, think about it, analyze it, change it if needed, and by all means test it!
Why Test your Recovery Strategy
This seems pretty obvious to me. You test your recovery strategy to make sure that you could get your database back in an emergency – any emergency you can imagine that would not also immediately make your business useless (you probably don’t need to prepare for Global Thermonuclear Warfare, though maybe there are some systems out there that would).
In testing your recovery strategy, you’ll figure out the answers to questions like:
- How do you actually issue the restore command? Do you need to do an incremental restore? Can you do an automatic one, or will you need to go manual? Do you need to extract logs or do a redirected restore? (See the sketch after this list.)
- How do you retrieve needed files? Do you need to request OS level backup files from someone else? Can you access them 24/7 in an emergency? Do you know the TSM extraction commands?
- Can you actually access all the files and data you need to restore?
- How long does it take when things go well and when they go badly?
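To make the first question concrete, here is a minimal sketch of the kind of commands those answers lead to. The database name SAMPLE, the path /backups, the timestamp, and the overflow log path are placeholders, and whether you actually need INCREMENTAL, an overflow log path, or a redirected restore depends entirely on your own strategy:

# Automatic incremental restore: DB2 locates the required chain of images itself
db2 "restore db sample incremental automatic from /backups taken at 20120101120000"

# Roll forward through the archived/extracted logs and open the database
db2 "rollforward db sample to end of logs and stop overflow log path (/extracted_logs)"

# If paths differ on the target server, generate a redirected restore script to edit and run
db2 "restore db sample from /backups taken at 20120101120000 redirect generate script restore_sample.clp"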
When to Test your Recovery Strategy
Interestingly enough, I see varying levels of willingness to test recovery strategies on build. I have one client who three months after go-live is still balking at giving me a window for a basic HADR failover test. I have another who spent two or three days testing failovers during load testing to determine what exactly happened, weeks before go-live. I cannot say it strongly enough that you must do at least minimal testing of your recovery strategy during your Build phase – prior to go-live.
It’s also good to have a test scheduled periodically. Maybe you’re testing different parts of your strategy quarterly. Or maybe you’re doing it yearly. In any case, an unhurried test sure is nice, and you can schedule it for times that are less hectic.
If you make any major environmental changes (moving servers, changing storage, changing backup strategy/location), then it is also a good idea to test your recovery strategy to make sure these changes have not impacted your ability to fail over or to recover.
Testing HADR
I’m going to address HADR here because it is the High Availability strategy that I deal with most often. Some of this will apply to other technologies that perform the same function as well. I’m also addressing this for DB2 9.5/9.7 where we have tight integration with TSA. There are actually a surprising number of tests you can/should perform to make sure you can handle various kinds of failures.
A few guidelines
- These tests are most frequently done directly on production at a low-volume time (frequently with the application unavailable to users), and with the approval to cause outages – even if everything goes perfectly, some of these tests may still take minutes to recover from.
- Though the application(s) may not be available to users, you still want to have testers able to use the application to ensure that the application is functioning – it is NOT enough to simply see that the database is up and thinks it’s primary in the right place.
- Things to check for each test (a quick check sketch follows this list):
- Output of lssam on each node
- Output of db2pd -db <dbname> -hadr on each node
- Ability to connect to the database from application server or other common location
- Basic application functionality
- You also need to test any components you may have in your HA strategy in addition to HADR (HACMP, whatever)
- Just to be safe, take a backup of your database before starting (yes, this is probably overly cautious)
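As a minimal sketch of those checks – assuming a database named SAMPLE and a TSA-controlled cluster, run as the instance owner on each node before and after every test – something like this gives you a quick picture of the state of things:

# Cluster resource state as seen by TSA
lssam

# HADR role, state, and connection status for the database
db2pd -db sample -hadr

# Prove you can actually connect (run this from the application server too)
db2 connect to sample
db2 "select count(*) from syscat.tables"
db2 connect reset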
Tests you can try
- Basic db2stop / db2stop force test on each server – ensure that your HA solution doesn’t try to start the database on your standby when you issue these on the primary
- Manual failover test – issue the TAKEOVER HADR command on the standby, then, after testing, issue it again from the other server to fail back (the commands are sketched after this list)
- Power off test – take each server down hard in turn – unplug it if possible
- Kill test – abnormally stop the instance using db2_kill – on each of primary and standby
- Persistent instance failure test – be cautious with this one. It’s described in detail here: http://download.boulder.ibm.com/ibmdl/pub/software/dw/data/dm-0908hadrdb2haicu/HADR_db2haicu.pdf The steps essentially are:
> mv $HOME/sqllib/adm/db2star2 db2star2.mv
> db2_kill
Undo: > mv db2star2.mv $HOME/sqllib/adm/db2star2
- Network failure test – for each network connection on each server, unplug them one at a time and see what happens. Your hosting provider or hardware support folks may balk at this one
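For the manual failover test, the commands themselves are the easy part – a sketch assuming a database named SAMPLE; the BY FORCE variant is for emergencies only (it carries a risk of data loss), not for planned tests:

# On the current standby: graceful role switch
db2 takeover hadr on db sample

# ... check lssam, db2pd, and the application, then fail back from the other server
db2 takeover hadr on db sample

# Emergency-only variant, for when the old primary is unreachable
db2 takeover hadr on db sample by force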
Note: I monitor HADR and treat HADR down as a sev 1, call me in the middle of the night kind of thing. I highly recommend this because if you have another failure shortly after an HADR failure, it sucks.
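If you want a starting point for that kind of alerting, here is a rough sketch – no retries, no paging integration, and the mail address is a placeholder – that uses the database snapshot to complain when HADR is not in Peer state:

#!/bin/ksh
# Rough check: alert if the HADR state reported by the database snapshot is not Peer
DATABASE=sample
PEER=`db2 get snapshot for database on $DATABASE | grep -i "state" | grep -i "peer"`
if [ -z "$PEER" ]; then
   echo "HADR for $DATABASE is not in Peer state" | mailx -s "HADR ALERT: $DATABASE" dba@example.com
fi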
Testing your restore
I recently had a client ask if we even needed to do backups if we had HADR. The answer is a resounding yes. Not just to deal with unlikely situations I’ve already discussed, but also to deal with:
- Human Error – say someone accidentally deletes data from a table and commits it – HADR cannot help you with that.
- HADR Failure, as mentioned before
- Re-syncing HADR – if HADR is down for a while, you’ll have to use a restore to get your databases back in sync (a rough sketch of this follows the list)
- Data Movement – in some situations, restores from one location to another are a part of normal business
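On that re-syncing point, re-establishing a broken standby is itself a restore exercise. A rough sketch, assuming a database named SAMPLE with its HADR configuration parameters already set, and with the backup path and timestamp as placeholders:

# On the standby server: restore a recent backup of the primary over the old standby copy
db2 "restore db sample from /backups taken at 20120101120000 replace existing"

# Do NOT roll forward - the database stays in rollforward pending and HADR catches it up
db2 start hadr on db sample as standby

# On the primary, only if HADR was stopped there as well
db2 start hadr on db sample as primary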
Finding a target
There are several choices when looking for somewhere to test your restore:
- Cloud or VM – the advantage being you can allocate space only while you need it and then release it for other uses.
- Spare Server – sometimes when you’re upgrading hardware or re-working your environment you may have a spare server for a short period of time – this can be an excellent opportunity to test your restore process
- Development, QA, or other test system – you may have such an environment that you can get exclusive use of for long enough to test your restore
- Production during off hours – When all else fails, you may be able to use production itself during a maintenance window – assuming that is acceptable for your up-time requirements
Performing the restore
When performing the actual restore, the most important thing is to pretend you cannot access your source production server. We’re trying to simulate the real-world condition that your production server has died and is simply gone. This will force you to make sure you have things like the dbm cfg, DB2 registry, and filesystem layout documented somewhere.
You’ll also want to retrieve files from tape, tape library, or whatever backup resource you use at that level. This will help in the act of pretending you cannot access your source server, and will also let you make sure you know how to retrieve these files. Are there commands you can use yourself to do it? Do you need to make the request of another person or group, and do you know how to contact them in the middle of the night?
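If those backups live in TSM, the DB2-side extraction commands are worth practicing before you need them in the middle of the night. A minimal sketch, assuming a database named SAMPLE and that the server you run this on has access to the images in TSM:

# List the backup images and log files TSM knows about for this database
db2adutl query db sample

# Verify a full backup image is readable without actually restoring it
db2adutl verify full taken at 20120101120000 db sample

# Pull the image out of TSM to local disk for the restore
db2adutl extract full taken at 20120101120000 db sample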
You’ll want to note how long the process takes – not just the restore/rollforward, but also retrieving the needed files and getting configs set correctly.
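A low-tech way to capture those numbers is simply to timestamp each phase as you go – something like this, where the commented steps stand in for your own retrieval and restore commands:

# Record real elapsed time for every phase, not just the restore itself
date > restore_timing.log          # start of backup image / log retrieval
# ... retrieve images and logs here ...
date >> restore_timing.log         # retrieval done, restore starting
# ... restore and rollforward here ...
date >> restore_timing.log         # restore and rollforward done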
If at all possible, you’ll also want to have someone use a non-production implementation of the application to access the restored database. This can help you identify environment-specific tables or other things that would need to change if you moved your prod database from one IP/server to another.
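If the restored copy is on a different host or IP, the test client or non-production application server will also need to be pointed at it. For example (the host name, port, node name, and alias below are all made up):

# On the test application server: catalog the restored database at its new location
db2 catalog tcpip node restnode remote restoredhost.example.com server 50000
db2 catalog database sample as sample_r at node restnode
db2 terminate

# Quick connectivity check
db2 connect to sample_r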
A final part of the test that I don’t usually include in my overall timing is to get copies of my basic maintenance scripts out there and working. This is especially important if you don’t do many builds and don’t have to set up such scripts often. It would not be good to have the only copy of some script on the server that you lost.
Below is what I call a “quick and dirty” script. It doesn’t meet my general scripting guidelines. I’d minimally add error checking and email of the output. But if you need something in the interim, this can gather some of the information for you, and is easy to add on to if you need to. It also keeps only one file per month and writes over the output after a year so you won’t fill up a disk with it:
#!/bin/ksh
# quick and dirty script to gather configuration info

DATABASE=sample
DATE_SUFFIX=`date '+%m'`
INSTANCE=`echo $DB2INSTANCE`
OUTPUTDIR='/db_adm/output'

# Directories, registry, and configuration
db2 list db directory > $OUTPUTDIR/dbdir.$INSTANCE.$DATE_SUFFIX.out
db2 list node directory > $OUTPUTDIR/nodedir.$INSTANCE.$DATE_SUFFIX.out
db2set -all > $OUTPUTDIR/db2reg.$INSTANCE.$DATE_SUFFIX.out
db2 get dbm cfg > $OUTPUTDIR/dbmcfg.$INSTANCE.$DATE_SUFFIX.out
db2 get db cfg for $DATABASE > $OUTPUTDIR/dbcfg.$INSTANCE.$DATABASE.$DATE_SUFFIX.out
db2cfexp $OUTPUTDIR/cfexp.backup.$INSTANCE.$DATE_SUFFIX backup
db2 get cli cfg > $OUTPUTDIR/clicfg.$INSTANCE.$DATE_SUFFIX.out

# Tablespace layout and filesystem usage
db2 get snapshot for tablespaces on $DATABASE > $OUTPUTDIR/tbspsnap.$INSTANCE.$DATABASE.$DATE_SUFFIX.out
df -k > $OUTPUTDIR/filesystems.$DATE_SUFFIX.out
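Of course it only helps if it actually runs. A hypothetical crontab entry for the instance owner (the script path is made up) might look like this – note that the DB2 environment has to be sourced, since cron will not do it for you:

# Gather configs every Sunday at 2 AM, sourcing the DB2 environment first
0 2 * * 0 . $HOME/sqllib/db2profile && /db_adm/scripts/gather_config.ksh > /dev/null 2>&1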
Defining Success
There are several ways I define success in this process. Generally success means I was able to:
- Restore the database
- Accomplish the whole process without accessing the production server
- Get all the settings in sync
- Get test applications to connect to the restored database
- Accomplish the whole process in a reasonable time frame
Timing is the one I most commonly end up making corrections on. Obviously what is reasonable depends on your environment. An hour or two is an unrealistic target for most data warehouses, but entirely reasonable for a small e-commerce database. This matters even more if you have a defined SLA.
Dealing with Failure
Especially if it’s the first time you’ve tested, you’ll probably run into at least one small problem. The big thing is not to get discouraged. Any problem you run into has a fix, and sometimes they’re easy. Did the process take too long? Consider taking more frequent backups or looking at how you are connecting to your TSM server – or any one of a dozen other factors that could affect timing.
The important thing here, especially if you had project management or management involved, is to sell the failures as successes of the testing process and as a reason to repeat the testing again – both after you’ve corrected the problems and periodically. If you had no failures, great, then you’re good until next quarter or next year when you schedule another test.
So, anyone inspired to go test their recovery strategy?