What is HADR?
HADR is DB2’s implementation of log shipping, which makes it a shared-nothing kind of product. But it ships logs at the Log Buffer level instead of the Log File level, so it can be extremely up to date. It even has a synchronous mode that guarantees committed transactions on one server are also on the other. (In years of experience on dozens of clients, I’ve only ever seen NEARSYNC used.) It can only handle two servers (there’s no adding a third in), and it is active/warm spare – only with 9.7 FixPack 1 and later can you do reads on the standby, and you can never do writes on the standby.
How much does it cost?
As always, verify with IBM, because licensing changes by region and other factors I’m not aware of. But generally HADR is included with DB2 licensing – the rub is usually in licensing DB2 on the standby server. Usually the standby server can be licensed at only 100 PVUs, which is frequently much cheaper than full DB2 licensing. If you want to be able to do reads on the standby, though, you’ll have to go in for full licensing. Usually clients run HADR only in production, though I have seen a couple lately doing it in QA as well to have a testing ground.
What failures does it protect against?
HADR protects against hardware failures – CPU, disk, memory, controllers, and other hardware components. Tools like HACMP and Veritas use a shared-disk implementation, so they cannot protect against disk failure. I have seen both SAN failures and whole-RAID-array failures, so while it may seem like one in a million, even the most redundant disks can fail. HADR can also be used to facilitate rolling hardware maintenance and rolling FixPacks, though you are not guaranteed to be able to keep the database up during a full DB2 version upgrade. It must be combined with other (included) products to automatically sense failures and fail over.
What failures does it not protect against?
HADR does not protect against human error, data issues, or HADR failures. If someone deletes everything from a table and commits the delete, HADR is not going to be able to recover from that. It is not a replacement for a good backup and recovery strategy. You must also monitor HADR – I treat HADR being down in production as a sev-1 issue where a DBA needs to be called out of bed to fix it. I have actually lost a production RAID array around 5 am when HADR had gone down around 1 am. Worst-case scenarios do happen.
How to set it up
HADR is really not too difficult to set up on its own. Configuring automatic failover is a bit more difficult, though DB2 has made it significantly easier in 9.5 and above with the introduction of bundled TSA and the db2haicu tool. I’m not going to list every detail here because there are half a dozen white papers out there on how to set it up. The general idea is:
- Set the HADR parameters on each server
HADR local host name (HADR_LOCAL_HOST) = your.primary.hostname
HADR local service name (HADR_LOCAL_SVC) = 18819
HADR remote host name (HADR_REMOTE_HOST) = your.secondary.hostname
HADR remote service name (HADR_REMOTE_SVC) = 18820
HADR instance name of remote server (HADR_REMOTE_INST) = inst1
HADR timeout value (HADR_TIMEOUT) = 120
HADR log write synchronization mode (HADR_SYNCMODE) = NEARSYNC
HADR peer window duration (seconds) (HADR_PEER_WINDOW) = 120
- Set the Alternate Servers on the Primary and the standby (for Automatic Client Reroute)
- Set db configuration parameters INDEXREC to RESTART and LOGINDEXBUILD to ON
- Take a backup (preferably Offline) of the database on the primary server
- Restore the database on the standby server, leaving it in rollforward pending state
- Start HADR on the standby
- Start HADR on the primary
- Wait 5 minutes and check HADR status
- Run db2haicu to set up TSA for automated failover
- Test multiple failure scenarios at the app and database level
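Putting the steps above together, the core build might look something like this from the command line. Everything here – the database name SAMPLE, hostnames, ports, paths, and the backup timestamp – is a placeholder; adjust for your environment:

```shell
# On each server, set the HADR parameters (LOCAL/REMOTE values swap
# between the primary and the standby)
db2 update db cfg for SAMPLE using HADR_LOCAL_HOST your.primary.hostname \
    HADR_LOCAL_SVC 18819 HADR_REMOTE_HOST your.secondary.hostname \
    HADR_REMOTE_SVC 18820 HADR_REMOTE_INST inst1 \
    HADR_SYNCMODE NEARSYNC HADR_TIMEOUT 120 HADR_PEER_WINDOW 120

# Recommended index-handling settings
db2 update db cfg for SAMPLE using INDEXREC RESTART LOGINDEXBUILD ON

# On the primary: take a backup and ship the image to the standby
db2 backup db SAMPLE to /db2backups

# On the standby: restore, leaving the db in rollforward pending state
db2 restore db SAMPLE from /db2backups taken at 20130905090559

# Start HADR - standby first, then primary
db2 start hadr on db SAMPLE as standby     # run on the standby
db2 start hadr on db SAMPLE as primary     # run on the primary

# Check status
db2pd -db SAMPLE -hadr
```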
For chunks of this, your database will be unavailable. There are also a number of inputs you need to have ready for running db2haicu, and you will need ongoing sudo authority to execute at least one TSA related command.
Remember that the primary and standby servers should be as identical as possible – filesystems, hardware, and software.
Some clients also neglect the final step – testing of failovers. This is an important step to make sure you really can fail over. It is possible to think you have everything set up right, do a failover, and then find it does not work properly from the application’s perspective.
Gotchas
This section represents hours spent troubleshooting different problems or recovering from them. I hope it can help someone find an issue faster.
HADR is extremely picky about its variables. They must be exactly right with no typos, or HADR will not work. I have, on several occasions, had numbers reversed or the instance name off, and spent a fair amount of time looking for the error before finding it. Because of this, it can help to have another DBA look over the basics if things aren’t working on setup. HADR is also picky about hosts file and/or db2nodes.cfg setup, and in some cases you may end up using an IP address in the db cfg parameters instead of a hostname.
HADR also sometimes fails after it tells you it has successfully started, so you must check the status after you start it.
Occasionally HADR doesn’t like to work from an Online backup, so an Offline one will be required. I have one note about it not going well with a compressed backup, but that was years ago, and I have frequently used compressed backups without trouble since.
HADR does not copy things that aren’t logged – so it is not a good choice if you have non-logged LOBs or if you do non-recoverable loads. If you are using HADR and you do a non-recoverable load, you have to take a backup on the primary and restore it into the standby – if you don’t, any table with a non-recoverable load will not be copied over, nor will future changes, and if you fail over, you will not be able to access that table. For this reason, I wouldn’t use HADR in a scenario where you don’t have good control over data being loaded into the database. If you do run into that, you have to back up your primary database, restore it into your standby database, and start HADR again.
HADR does go down sometimes without warning – so you must monitor it using whatever monitoring tools you have, and ensure that you respond very quickly when it goes down. I use db2pd to monitor (parsing output with scripts), partially because db2pd works when other monitoring tools hang. We look at ConnectStatus, State, and LogGapRunAvg.
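A minimal monitoring check in the spirit of those scripts might look like this. The db2pd output embedded below is a hand-trimmed approximation of the 9.7 format, not verbatim output – real output varies between versions and fixpacks, so treat the parsing as a sketch:

```python
def parse_db2pd_hadr(output):
    """Pull a few key fields out of `db2pd -db <db> -hadr` text.

    The column layout here is approximated from DB2 9.7 output;
    adjust the parsing for the format your version actually emits.
    """
    lines = output.splitlines()
    result = {}
    for i, line in enumerate(lines[:-1]):
        cols = line.split()
        if cols[:2] == ["Role", "State"]:
            # The value row sits directly under the header row
            vals = lines[i + 1].split()
            result["Role"], result["State"] = vals[0], vals[1]
            result["LogGapRunAvg"] = int(vals[-1])  # last column of the value row
        elif cols and cols[0] == "ConnectStatus":
            result["ConnectStatus"] = lines[i + 1].split()[0]
    return result

# Hypothetical sample output, trimmed to the fields we care about
sample = """\
HADR Information:
Role    State  SyncMode HeartBeatsMissed LogGapRunAvg (bytes)
Primary Peer   Nearsync 0                0

ConnectStatus ConnectTime                           Timeout
Connected     Thu Sep  5 09:05:59 2013 (1378386359) 120
"""

status = parse_db2pd_hadr(sample)
if status.get("ConnectStatus") != "Connected" or status.get("State") != "Peer":
    print("ALERT: HADR is down or not in peer state")
```

A real script would shell out to db2pd on a schedule and page someone on the alert condition instead of printing.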
On reboot, HADR comes up with database activation, which means it usually comes up just fine on your primary database, but not on your standby database (there are no connections to prompt activation). So you’ll generally need to manually start HADR on your standby after a reboot. The primary database will not allow connections on activation until after it can communicate with the standby. This is to prevent a DBA’s worst nightmare – ‘Split Brain’. DB2’s protections against split-brain are pretty nifty. But this means that if you reboot both your primary and your standby at the same time and your primary comes up first, your primary will not allow any connections until your standby is also up. This can be very confusing the first time or two that you see it. You can manually force the primary to start if you’re sure that the standby is not also up and taking transactions. Or, if you’re rebooting both, just do the standby first and do the primary after the standby is back up and activated. If you need your standby down for a while, stop HADR before you stop the servers. I would recommend NOT stopping HADR automatically on reboot, because the default behavior protects you from split-brain.
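The post-reboot handling boils down to two commands (SAMPLE is a placeholder database name):

```shell
# On the standby after a reboot - HADR does not restart on its own there
db2 start hadr on db SAMPLE as standby

# On the primary, ONLY if you are certain the standby is not up and
# taking transactions - this bypasses the split-brain protection
db2 start hadr on db SAMPLE as primary by force
```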
What is split-brain? It is simply both your primary and standby databases thinking they are the primary and taking transactions – getting you into a nearly impossible to resolve data conflict.
You must keep the same ids/groups on the primary and standby database servers. I’ve seen a situation on initial set up where the id that Commerce was using to connect to the database was only on the primary server, and not on the standby server, and thus on failover, the database looked fine, but Commerce could not connect.
You also want to be aware of any batch-jobs, data loads, or even scheduled maintenance like runstats or backups – when you fail over, you’ll need to run these on the other database server. Or you can also run them from a client that will get the ACR value and always point to the active database server. Frequently we don’t care which database server the database is running on, and may have it on what was initially the “standby” for months at a time.
Overall, I really like HADR and its ease of administration. The level of integration for TSA in 9.5/9.7 is great.
Very good contribution, I really watch for posts by you.
Thanks! It’s always good to hear that someone is reading!
We are using HACMP and HADR in our configuration. Do we have to do something special to capture disk failure, or will HACMP detect it?
It depends on what kind of architecture you have. Do you have two shared-nothing database servers with HACMP *ONLY* providing heartbeat? Or are you doing four servers – an HADR pair with each member of that pair also in an HACMP pair with other servers?
Also what version of DB2?
Yes, we have two shared-nothing database servers with HACMP ONLY providing heartbeat. Also, we are currently using DB2 9.7.
It really depends on exactly how you’ve configured HACMP. This doc – ftp://ftp.software.ibm.com/software/data/pubs/papers/hadr-hacmp.pdf – describes the main method of detecting a failure: a script built around the db2gcf command to get the status of the instance. The redbook http://www.redbooks.ibm.com/redbooks/pdfs/sg247363.pdf seems to focus mainly on simple IP pings. Frequently in this kind of setup, you are not monitoring the disk itself – if the disk has a serious problem, then DB2 (the instance and the databases) will become unavailable, and generally that unavailability is the main thing you’re looking for.
On 9.7, most distributions of (enterprise) DB2 come with TSA (Tivoli Systems Automation) integrated into the database. The db2haicu command (see this doc for intimate details: http://download.boulder.ibm.com/ibmdl/pub/software/dw/data/dm-0908hadrdb2haicu/HADR_db2haicu.pdf) is actually pretty easy to use, and they’ve got the DB2 monitoring down pat. It’s certainly my tool of choice to use with HADR on db2 9.5 and above.
Did that answer your question adequately?
Hi
Do you have an updated link for
ftp://ftp.software.ibm.com/software/data/pubs/papers/hadr-hacmp.pdf
please, as this one does not seem to be working?
Thank you.
I can’t find it on IBM’s site any more. Here’s an uploaded copy of it. Thanks for asking.
Thank you so much for the documentation and information. It should help me to resolve my outstanding questions and issues.
Hi Ember,
I’m trying to set up HADR in my environment. By the way, setting up HADR is very, very easy, but I’m having some issues managing it.
I forced a failover by breaking the network connection on the primary database and ran the takeover on the standby using the parameter BY FORCE. Of course, that was OK, but how can I undo the takeover now without issues? Because now I have 2 primary databases – what procedure do I need to follow?
Thank you very much!
Robelis
When you end up with two primaries, you have to pick one database as the desired Primary, and then back up that database and restore into the other (standby) before you can start HADR again. Assuming this is testing, that’s not usually a problem, and it’s actually a procedure you should be familiar with because there are a number of scenarios where you end up doing this restore to get things back in sync. Does that answer your question?
First, Thank you very much for your fast answer …
Wow… in this case, the only option is restoring the old primary?
So, can I do that with an ONLINE backup (remember, the application will be connected to the old standby now – the new primary)? And what about the parameters – do I need to set up HADR all over again?
And only one more question: what is the best tool to manage failover/takeover for DB2 V9.7 on Windows?
Thank you very much
Róbelis Mussatto
Yes, restoring is the only option that I am aware of. Having two primaries is the split-brain situation I talked about, and since both databases could be taking traffic, the databases are no longer considered to be in sync by DB2. This is one of the reasons you want to be very careful with a takeover by FORCE.
The vast majority of the time, online backup will work. I’ve seen a few scenarios where it didn’t, but it was back when I was learning hadr, so I’m not sure of the factors that make it not work. I’m working with at least 6 implementations of HADR at any one time, and build at least another 4 each year, so that’s quite a lot over the last 3-5 years where online backups have worked. I’m someone who if I have the chance for an offline backup will always take it, but don’t get the chance often after build is complete.
If you restore “INTO” the existing database, your parameter settings will be retained. If you drop the database before restoring, your parameters will be lost and you’ll have to set them up all over again. Because of this, I’m a big fan of restoring into the existing database. You don’t even have to extract logs with the logtarget keyword, because HADR will pick up and roll forward through them as long as your backup image is not too old.
I’m a command line person, so I don’t use the GUIs, and couldn’t recommend a GUI. I imagine the Control Center has something for it, and since the Control Center is being deprecated, I imagine Optim/Data Studio must have something for it too. For plain takeover commands, I issue them from the command line. If you’re looking at hadr status, my favorite methods are using db2pd -db db_name -hadr or using db2top. If you’re talking about automating failover, I’ve used TSA (via db2haicu) with a lot of success, though I have only done it on Linux and AIX – I haven’t tried it on Windows. There are also some issues in it (especially when rebooting one or both database servers), so be ready to call db2support. I or a colleague will be writing something up on TSA soon. db2haicu is bundled with db2 on 9.7. A good whitepaper on it is: http://download.boulder.ibm.com/ibmdl/pub/software/dw/data/dm-0908hadrdb2haicu/HADR_db2haicu.pdf
Hi, I am new to DB2. Can you please help with the command for “Restore the database on the standby server, leaving it in rollforward pending state”? How will I know that the restored database is in rollforward pending state?
A sample restore command would be something like this:
db2 "restore db <dbname> from <path> taken at <timestamp> without prompting"
This post assumes the database already exists on the standby, so you might need:
db2 "restore db <dbname> from <path> taken at <timestamp> into <dbname> replace existing without prompting"
It’s critical not to include the keywords “without rolling forward” on the restore command, as that will both restore the database and take it out of rollforward pending state.
To verify after the restore that it’s in rollforward pending state, you can use:
db2 get db cfg for <dbname>
Hi Ember
Thanks for your quick reply. I managed to restore the database in rollforward pending state (Rollforward pending = DATABASE). But when I try to configure HADR, I get stuck trying to start the database as primary. h:\>db2 START HADR ON DB BANKFRS AS PRIMARY
SQL1768N Unable to start HADR. Reason code = “7”. Can you please help me out.
Here’s RC 7 from the info center:
7
The primary database failed to establish a connection to its standby database within the HADR timeout interval. There are multiple scenarios that can cause this error condition. For example, this message can be returned with reason code 7 if the value of the HADR_TIMEOUT configuration parameter or the HADR_PEER_WINDOW configuration parameter on the primary database is not the same as the value on the standby database.
I would double-check all of your configuration on both servers. You also started the standby before the primary, right? And you know via ping or telnet or whatever that a connection is possible between the hostnames and ports specified in the cfg, right? There are no firewalls in the way?
Thanks Ember,
My environment is Windows; just for testing purposes I created the standby instance on the same server, and my HADR worked. But it’s not automatic – if I want to move from primary to standby, I do it manually. How do I make it automatic? I tried stopping the database service for the primary instance, but my standby wasn’t activated.
The command for take over that I used.
SET DB2INSTANCE=TESTDBH
db2 TAKEOVER HADR ON DB BANKFLS
SET DB2INSTANCE=TESTDB
db2 TAKEOVER HADR ON DB BANKFLS
Thanks
That’s a post I’ve been meaning to write. It’s actually a bit complicated. You have to use a utility called db2haicu. This whitepaper seems a bit complicated to look at, but it’s actually really good: http://download.boulder.ibm.com/ibmdl/pub/software/dw/data/dm-0908hadrdb2haicu/HADR_db2haicu.pdf
Going in, you minimally need to know:
1. fully qualified hostnames of both hosts
2. IP addresses of each host (and of private network, if you’re using one)
3. VIP if you’re using one (you don’t have to)
4. IP of a quorum device. A quorum device can be any very highly available pingable IP address. Domain controllers make good choices.
Pay attention to the use of lssam – it is critical for troubleshooting.
If you have problems with a db2haicu set up, you may need to alter your hosts file or db2nodes.cfg
Also be aware that you may have to follow a special process to reboot without getting yourself in a pickle.
Support has extra documents and procedures for db2haicu issues.
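For reference, the basic invocations look something like this (the inputs listed above are gathered interactively unless you feed db2haicu an XML input file; the file name is a placeholder):

```shell
# Run interactively - on the standby first, then the primary
db2haicu

# Or drive it from a prepared XML input file
db2haicu -f hadr_cluster.xml

# Check TSA cluster resource states - critical for troubleshooting
lssam

# Tear down the cluster configuration if you need to start over
db2haicu -delete
```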
Hi Ember
We have a requirement for online reports with real-time data. We have large transactional tables and want a separate reporting server. So can you suggest whether it would be a good idea to use HADR to have a secondary db for reporting purposes?
Also, can we create additional indexes and views on the secondary db to optimise the reporting while using HADR (and not have these indexes/views on the primary db)?
Your comments will be of lot help.
Thanks & Kind Regards
You’ve zeroed in on the problems of using an HADR standby for reporting purposes. The first disadvantage is that if you have to fail over, you lose your reporting server, which is unacceptable to some people. The second is that you cannot use views or indexes that don’t exist on your production server. Yet another disadvantage is that you cannot have different data retention policies for your reporting server.
Those disadvantages may lead you away from HADR for reporting purposes. If they do for you, you may want to consider using data propagator to move data to another database server. That shouldn’t involve any additional licensing, since homogeneous dprop is free (at least last I checked, please verify with IBM), and to use your HADR for reporting purposes, you have to fully license the standby server instead of the reduced standby licensing that is usually done.
I supported Data Propagator back on version 7, and it was a real pain to support. I hear it has gotten better, but don’t have any more recent experience with it.
Hello Ember
I have already set up DB2 9.5 HADR, and now I have a requirement to restore data into this HADR environment from another DB2 database with the same DB name but a previous version.
When I restore the data on the primary server, it breaks HADR with the secondary server. Do I have to restore the data on the secondary server also? As the business requirement is that I have to restore the database many times, it is impractical for me to restore the db and then redo the HADR setup each time.
Please tell me the steps to restore the data in an HADR environment. Looking forward to your reply.
So sorry – my spam filter hid this from me until now.
Yes, you have to restore on the secondary server as well. But the restore and start hadr commands should be all that’s required IF you do the restore properly.
I generally do it this way:
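A sketch along those lines, with a hypothetical database name, path, and timestamp:

```shell
# On the standby: restore INTO the existing database so the HADR
# settings are kept; do NOT use "without rolling forward" - the db
# must stay in rollforward pending state
db2 restore db SAMPLE from /db2backups taken at 20130905090559 replace existing

# Then simply restart HADR on the standby
db2 start hadr on db SAMPLE as standby
```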
I believe it would also be technically valid (but have not tested) just restoring the same database image into both databases, ensuring that you use the “replace existing” syntax for both. I think the database seed would have to be the same to work, but again, have not tested it.
If at any time you drop either of the databases, you will lose your HADR settings, and have to re-set them (which before version 10 requires a recycle of the database to take effect).
Does that answer your questions?
Hi Ember,
In our environment we have HADR with TSA..
the DB2 version is 9.7 fp 7..
my question is that.. if there is a requirement to restore database both on primary and secondary..
is it enough to stop TSA before restoring the database.. or is it required to reconfigure the TSA again after HADR setup..
thank you in advance..
It should be enough to stop TSA as long as you’re not changing any HADR parameters. Restoring INTO an existing database should allow you to not change them. See my post yesterday for a link to detailed instructions on the right way to stop TSA.
Hi Ember.
I am planning to relocate the standby to a different location. I need to do it while the application is up and running, so I have to use online compressed backups to prepare the standby at the different location. I have a few doubts.
1. The transaction logs generated between the online backup and the initialisation of the new standby – does it cause any issues if we have too many of them, and will DB2 be able to start from the point where it needs to?
2. You mentioned that there are some issues with online compressed backups – what are those that you have experienced, please? We are using version 9.5. Do you still get issues nowadays?
3. If we want to go back to the old standby – say after some time we find issues with the new standby – do we need to re-initialise again, or will the parameter changes suffice, with DB2 able to recognise the point where it left the old standby?
Thanks in advance.
Looking forward to your reply!
Ok, answers to your questions and another issue:
1. If you have too many of them, it will cause issues, but “too many” is relative. Most sites that I support can easily be a week off, as long as any archived transaction logs can still be accessed by the database. That said, a high volume site or inaccessibility of archived transaction logs on the primary could be a problem. Usually it is an easy to resolve one – just take another backup.
2. I haven’t had an issue with an online compressed backup since 8.2 about 5 years ago. I’ve done several HADR setups on 9.5 and probably a dozen on 9.7, so I would be surprised if you had an issue. I just like to mention it in case someone else bangs their head against the wall of that problem.
3. Ok, so here’s the big problem you’re going to have. You CANNOT make HADR parameter changes online. You can make them and then quickly deactivate and reactivate the primary for them to take effect, but there is an outage required. That goes away in db2 10.1, but that’s not much help to you. There’s no way I can see to leave the application up UNLESS you change the IP and hostname of the new standby to match the IP and hostname of the old standby – which I’m guessing is not an option. Going back to your old standby will, like #1, depend on how much activity there has been. If there has been a month of activity, I’m guessing it won’t be able to re-initialize, and you’ll have to restore. But if it has been a day or two and you still have all the log files, then no restore should be needed.
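Through 9.7, an HADR parameter change on the primary looks something like this – note the deactivate/activate outage. The database name and hostname are placeholders:

```shell
# Change the parameter(s)
db2 update db cfg for SAMPLE using HADR_REMOTE_HOST new.standby.hostname

# Recycle the database for the change to take effect
# (applications must be disconnected first)
db2 deactivate db SAMPLE
db2 activate db SAMPLE
```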
Also, make sure the hosts file on the new standby is pristine. I spent an hour or so last week on an HADR problem where the hostname for the local server was under the entry for 127.0.0.1.
Thanks Ember.
It was really helpful.
My new standby will have the same hostname as the primary, but a different IP. I will use IP addresses in the HADR parameters. Do you think that could still cause issues?
Kind Regards
Unless you’re on version 10, you will have to recycle the database for the HADR parameter changes to take effect.
In order for a DB2 LOAD to be replicated from an HADR primary server to a standby server, DB2 must be able to access the load copy on the standby.
This is done using a shared file system and the COPY YES option on the LOAD command at the primary server. “Copy yes” makes the load a RECOVERABLE one, meaning load copy files will be created and saved on the specified file system on the primary.
I see WebSphere Commerce use the dataload utility via dataload.sh for numerous loads into inventory, price, catalog, attribute, etc. tables.
The question is: is “copy yes” already part of the WC dataload.sh?
In an HADR environment, will the loaded data in the primary tables get replicated to the standby without the DBA having to do anything else?
Is using a shared filesystem for load copy files applicable in WC?
Will there be a need to configure dataload.sh so the command looks like: “db2 load from data1.del OF DEL XML FROM /home/user/xmlpath replace into wcomusr.inventory copy yes to /mysharedloadfile”?
I would appreciate if you could share your experience with websphere commerce dataloads in an HADR environment
In my experience, it is rare for people to actually use the “LOAD” methods of the dataload utilities, and almost unheard of for ongoing work. Because of that, I haven’t dug into them much. Past experience would lead me to guess that they don’t play nice with specifying things like COPY YES. Many of the WC data load utilities may be called “dataload”, but use import or even insert. Make sure the utility in question is actually using LOAD.
Ember,
I captured the data below from WC info center.
This is a load output for one of the dataload utilities run.
Looking at the output tells me the WC data load utility uses ‘db2 load from…’ behind the covers, and not import/export.
If my observation is true, this will mean that in an HADR environment, since loads are not logged, the same data is not going to be replicated on the standby after the load completes on the primary, unless the registry variable DB2_LOAD_COPY_NO_OVERRIDE is set.
I believe most times the standby is in rollforward pending mode and therefore not accessible unless it is configured as ROS (reads on standby). Hence most people will not even notice they have an invalid table/tablespace until they need to use the standby.
what do you think?
===============================================================================
WebSphere Commerce Data Load
===============================================================================
Load started at: Thu Sep 05 09:05:59 EDT 2013
Initialization completed in 2.229 seconds.
Processing CatalogEntry…
———————————————————————————-
Load summary for load item: CatalogEntry.
———————————————————————————-
Business Object Configuration: wc-loader-catalog-entry.xml
Data loader mode: Replace.
Batch size: 1.
Commit count: 100.
Error Tolerance Level: 1.
Error Count: 0.
Amount of business objects processed: 13.
Amount of business objects committed: 13.
Data loader initialization time: 0 seconds.
Data loader execution began: Thu Sep 05 09:06:07 EDT 2013
Data loader execution ended: Thu Sep 05 09:06:13 EDT 2013
Data loader completed in 5.765 seconds.
Total flush time: 0 seconds.
Total commit time: 0.003 seconds.
CSV file location: C:\IBM\WCDE_ENT70\bin\..\samples\DataLoad\Catalog\CatalogEntries.csv.
Affected tables (13):
Table name: BASEITEM, Affected number of rows: 5.
Table name: BASEITMDSC, Affected number of rows: 5.
Table name: STOREITEM, Affected number of rows: 5.
Table name: ITEMVERSN, Affected number of rows: 5.
Table name: DISTARRANG, Affected number of rows: 5.
Table name: CATENTRY, Affected number of rows: 13.
Table name: STORECENT, Affected number of rows: 13.
Table name: CATENTDESC, Affected number of rows: 13.
Table name: CATGPENREL, Affected number of rows: 13.
Table name: LISTPRICE, Affected number of rows: 13.
Table name: ITEMSPC, Affected number of rows: 8.
Table name: VERSIONSPC, Affected number of rows: 8
Table name: CATENTREL, Affected number of rows: 8
———————————————————————————-
Program exiting with exit code: 0.
Load completed successfully with no errors.
Load ended at: Thu Sep 05 09:06:13 EDT 2013
Load completed in 13.901 seconds.
What here leads you to believe it’s using LOAD? The data load utilities I’ve worked with require special effort to use LOAD. Use of the word “load” or even “replace” is no guarantee.
I agree with you that it’s a concern. It’s just that based on past WCS experience, I’m doubting the LOAD utility is even being used. Note the commit count – not something you generally specify on a non-logged load, but something you do specify on an import.
I have a client using DB2 9.7 HADR with ACR. While connecting to the primary DB host/port works fine from WC, a connect request to the standby db doesn’t route the connection to the primary. As per the Info Center, this should work. The alternate server definition is all OK as well.
The “update alternate server” commands have to be run on each of the primary and the standby. Have you run them on the standby? Does a LIST DB DIRECTORY on the standby show the IP/Hostname of the primary?
Finally(and I suspect this may be your issue?), for a client to actually get the alternate server, it must successfully make a database connection – meaning that if you want the alternate server cached by clients connecting to the standby, you would have to fail over so that successful connection can be made. Have you done that?
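For reference, the alternate server setting is a single command on each side – hostnames and port below are placeholders:

```shell
# On the primary, point at the standby
db2 update alternate server for database SAMPLE using hostname standby.host port 50000

# On the standby, point at the primary
db2 update alternate server for database SAMPLE using hostname primary.host port 50000

# Verify - look for the alternate server entry in the output
db2 list db directory
```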
I made some checks on the alternate server config on both nodes (using list db directory). They are pointing to the correct IP and port. In the last incident, the node hosting the primary role went down for some reason, and the standby took over the primary HADR role. When node 1 came back online, it took the standby role. Clients pointing to node 1 couldn’t connect to node 2, where the DB was primary. To resolve the issue, we had to run an HADR takeover on node 1 to make it primary again, and then the application could connect to the DB. As per your second comment, do you mean to say we have to do an HADR takeover on node 2, and then the app pointing to node 1 will be redirected to the primary db on node 2 (as the alternate server details will be cached by the Java clients)?
No, basically what I’m saying is that the alternate server must be set on both the primary and the standby, and whichever one clients connect to, they must make one successful connection before the alternate server value will be cached.
The most common ACR error I see is when people expect the alternate server value to be cached when a connection has not succeeded from that client to the server.
There can be other issues – such as when the ids the applications use to connect exist on the primary, but not on the standby, or a client type that does not work with ACR (JDBC type 4 drivers?).
Another potential problem: a firewall between the clients and the standby server that blocks communication with the standby while allowing communication with the primary server.
Is it possible to reboot both the HADR primary and standby at the same time?
In short, yes, just make sure the standby comes up first when they start up. Check out this blog entry to avoid a common problem: https://datageek.blog/2012/10/26/when-both-hadr-servers-are-rebooted-at-the-same-time-db2-wont-work/
An offline backup is preferred – why not online? What are the drawbacks of using an online backup for HADR configuration?
In previous versions, I’ve had issues with using an online backup. Though we’re talking the old days of HADR, so it may be a bug that was fixed. Whenever I have a choice, I take an offline backup. That said, I have supported plenty of databases where I NEVER got to take an offline backup, outside of version upgrades.
In 9.7, between FP6 and FP10, we have been using scheduled online backups from the primary to restore on standby servers for our WC clients, and it has worked well in every single instance to date. It's been a while since we stopped asking clients for any downtime for offline backups, especially for this purpose.
I agree, I haven’t seen an issue caused by using an online backup for a solid 6 or more years now. But if it’s a new system and I have the ability for an outage, I still take an offline. Speaking of no downtime, the online configuration of HADR parameters in 10.5 is really nice!
Hi Ember,
I have a PowerHA DB2 environment (on two different pieces of hardware, but in the same lab) and I am planning to extend it to another location for disaster recovery using HADR. In this case the PowerHA primary would act as the HADR primary as well, and the HADR standby would be in the other location. I have 25 databases on the PowerHA cluster, but I plan to put only 5-6 critical databases on HADR. Can you suggest whether combining the existing PowerHA environment with HADR is practical? If so, what precautions should I take with this kind of setup?
I have seen exactly this kind of setup, using Veritas or RHCS, so it’s absolutely doable. Power-HA will have a floating IP address that is failed over as a part of the shared-disk failover. That is usually your source for HADR. Sometimes the HADR target is also a Power-HA cluster with a floating IP address.
In the implementations I’ve seen this in, HADR is used for Disaster Recovery, so there is a distance component involved. I’ve seen ASYNC used as the HADR syncmode in these situations. It is also possible that you may need to use SUPERASYNC depending on the network. The other syncmodes are really only appropriate when you have a very fast and reliable network and a distance of less than 100 km between the data centers.
When used in this manner, failover is usually not automated, as it is often a human decision to fail over to a disaster recovery site. One gotcha that I recall is that db2nodes.cfg may need to be changed when the PowerHA failover occurs. This makes it more likely that you will use IP addresses for your local and remote servers instead of host names. Even if it requires an application outage, you must thoroughly test failover to make sure that it works as you expect it to.
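Setting the syncmode for a DR distance is a one-line configuration change – a sketch, where ASYNC is an assumption based on the distances discussed, SAMPLE is a placeholder database name, and the IP addresses are hypothetical:

```shell
# Choose a syncmode appropriate for a WAN link:
db2 update db cfg for SAMPLE using HADR_SYNCMODE ASYNC

# Use IP addresses rather than hostnames, per the db2nodes.cfg gotcha above:
db2 update db cfg for SAMPLE using HADR_LOCAL_HOST 10.0.0.10 HADR_REMOTE_HOST 10.1.0.10
```

In pre-10.5 versions, changing these parameters requires the database to be deactivated and HADR restarted for the change to take effect.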
Hi,
These discussions are very good. I have a problem in my environment: DB2 9.7 FP7, WSE, on Red Hat Linux, 32-bit.
I am setting up HADR on test servers.
Actually, I want to enable the ACR property – how do I enable it? To my knowledge, I did this: I took the IP addresses and port numbers from the primary and standby servers, then ran on the primary:
db2 update alternate server for database dbname using hostname standbyip port standbyport
and on the standby:
db2 update alternate server for database dbname using hostname primaryip port primaryport
After that, from the client, I tried to connect to the primary using
db2 connect to dbname user uname using pwd
and it connected.
Then I went to the standby machine and ran
db2 takeover hadr on db dbname
and it completed successfully.
But when I then tried to connect from the client, the connection failed – why? How do I fix this? dir_cache is also enabled at the instance level on both servers.
Many thanks in advance.
You have the correct steps for using it, but ACR does not work with every method of connecting – it depends on the client being used. I don't know every method it works with or doesn't, but for example, JDBC type 2 drivers with a full DB2 client installed work with ACR, while JDBC type 4 drivers (which do not require a DB2 client to be installed) generally do not. The general steps are to set the alternate server on both databases and then have each client make at least one successful connection so the value is cached.
Hi, can you please write an article based only on log shipping, without HADR, or suggest a method to implement it?
Sorry, it's very selfish of me to ask, but I am new to this and have not found any thread on it.
Can anyone help me with this?
Why do you want to do log shipping without HADR?
Hi Ember,
Thanks so much for the information.
Is there a table function or view for monitoring HADR on DB2 9.7?
Do you know what query the db2top HADR panel runs in the background?
MON_GET_HADR lets you get the information that way (note that it was introduced in 10.1 – on 9.7 you are limited to the SNAPHADR administrative view, snapshot monitoring, or db2pd -hadr). It requires a connection to the database to use, of course. Also, I've seen an APAR in 10.5 fixpack 4 where every time I queried MON_GET_HADR, it would crash the entire DB2 instance (I saw it on Windows; I don't know if it occurs on other OSes). So if you're on 10.5, make sure you're on fixpack 5 or above before querying MON_GET_HADR.
My understanding is that db2top pulls from in-memory areas directly. It has not been significantly improved or changed since 8.2.
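A minimal MON_GET_HADR query through the CLP, as a sketch – the function is available from 10.1, the columns are from its documented output, and SAMPLE is a placeholder database name:

```shell
db2 connect to SAMPLE
db2 "SELECT HADR_ROLE, HADR_STATE, HADR_SYNCMODE, HADR_CONNECT_STATUS,
            STANDBY_ID, HADR_LOG_GAP
     FROM TABLE(MON_GET_HADR(NULL))"
```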
Hello Ember,
Today I see some applications staying a long time in "commit active" state. What exactly is the commit active state? Because of it, I am getting lock timeouts at regular intervals…
My version is V10.5 FP5, in an HADR environment.
Commit active means the commit is actually taking place at that moment. The amount of time a commit takes depends on how large the transaction is, how large your log buffer is, how fast the disks behind your active transaction logs are, and your HADR SYNCMODE (and depending on SYNCMODE, possibly the speed of your network to the standby and the disk on your standby). I certainly see connections in it from time to time, but I wouldn't expect to see commits taking a long time. You may want to run the HADR tools on your environment to make sure your SYNCMODE choice isn't slowing down your primary database.
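One way to see whether HADR is contributing to commit time on 10.5 is the transaction-log monitor function – a sketch, assuming the documented MON_GET_TRANSACTION_LOG columns, with SAMPLE as a placeholder:

```shell
db2 connect to SAMPLE
db2 "SELECT LOG_WRITES, LOG_WRITE_TIME,
            LOG_HADR_WAIT_TIME, LOG_HADR_WAITS_TOTAL
     FROM TABLE(MON_GET_TRANSACTION_LOG(-2))"
```

A high HADR wait time relative to log write time suggests the syncmode or network is slowing down commits.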
Hi Ember,
Thank you for the post.
We have a DPF database consisting of 2 nodes (each on a different physical server with its own memory and storage). Now we want to configure HADR using just those 2 nodes.
Currently, for example, there is a table ABC residing in tablespace DEF, with containers on both nodes.
So to implement HADR, do we need to move from DPF to a standalone server?
What steps can we follow to implement HADR?
As far as I know HADR is not supported for DPF.
Hello Ember, thanks for all the knowledge over the years,
Just wanted to check one thing with you… we have a TSA HADR setup on version 9.7.0.6. TSA failed the primary over to the passive DB, but I don't see any errors in the log during that time. It happened around the same time last year and again now.
One strange thing I observed: this is a PCI-compliance DB, and I don't see any notifications in the error log for a couple of hours before the switch.
Sounds like you may want to open a PMR to troubleshoot. I have heard of a couple of apars for this. You’ve appropriately turned off ASLR if it’s Linux, right? ASLR can cause unplanned and unneeded failovers if left on.
Hi Ember,
Nice article.
While trying to initialize HADR on a production database with the prod and DR data centers geographically separated, it will take hours to ship the backup image (depending on size and WAN speed), followed by a restore on the DR server. My question: can we allow application connections right after the offline backup has been taken on the primary database, rather than waiting for the file transfer and DR restore? If so, will there be issues while initializing HADR on the primary with existing connections? Or is it imperative/best practice to keep applications down until HADR is initialized and the databases reach PEER state? Thanks!
Usually there is activity. As long as there is not too much, HADR will roll forward once it comes up – provided all log files since the backup can be retrieved. Depending on how long that takes, it may also require a long time to catch up. I have some databases where an hour or two of HADR being down requires two or three hours of catch-up.
The larger problem brought up by your question is that HADR has to be re-initialized with a restore from time to time. Certainly at upgrade, but randomly with problems as well.
Also if your link is slow, it can cause performance issues on the primary database, especially if you’re trying to run in SYNC or NEARSYNC.
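For reference, the initialization sequence under discussion looks roughly like this. It is a sketch: database name, hosts, and paths are placeholders, and the service port and remote instance parameters are omitted for brevity (see the HADR configuration parameters for the full list):

```shell
# On the primary:
db2 backup db SAMPLE to /backup     # applications can reconnect as soon as this completes
# ...ship the image (and all subsequent log files) to the DR site...

# On the DR server:
db2 restore db SAMPLE from /backup
db2 update db cfg for SAMPLE using HADR_LOCAL_HOST dr-host HADR_REMOTE_HOST prod-host
db2 start hadr on db SAMPLE as standby

# Back on the primary:
db2 update db cfg for SAMPLE using HADR_LOCAL_HOST prod-host HADR_REMOTE_HOST dr-host
db2 start hadr on db SAMPLE as primary
```

The standby then retrieves and replays logs until it catches up with the primary.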
Hi Ember
I am new to being a DBA. I want to know what steps to follow when the primary server is under maintenance.
If I need to stop the primary instance, what are the steps to make the standby the primary?
For that, you need to issue a TAKEOVER HADR command on the standby. Do NOT use the "by force" keywords. Once the takeover is complete, you can then do whatever you like to the former primary. When done, make sure that HADR is up and running and in sync. Once you're sure it is in a peer state, you can issue the TAKEOVER command again to move the primary back to its original location, if needed. You'll want to be sure that applications are properly configured to connect to the database on the other server before relying on it, of course.
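Sketched as CLP commands, with SAMPLE as a placeholder database name:

```shell
# On the standby – clean role switch, NOT by force:
db2 takeover hadr on db SAMPLE

# ...perform maintenance on the former primary...

# Confirm the pair is back in a peer state before switching back:
db2pd -db SAMPLE -hadr

# On the original primary's server (now the standby), move the primary role back:
db2 takeover hadr on db SAMPLE
```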
Hi Ember
Thanks for the reply – but how will applications connect to the standby server when we switch its role from standby to primary?
There are several options – most commonly automatic client reroute (with the alternate server configured as discussed above), or a virtual IP address that moves with the primary role.
Hi Ember
Thank you one more time for your article.
One question: how does HADR behave when I issue a LOAD FROM CURSOR command on the primary?
There is no file to place on a shared device, so I wonder how DB2 will accomplish that.
Thank you in advance for your help.
Regards
`COPY YES` can still be used with a load from cursor. The copy file is not the file being loaded from, but a file generated during the load process. The location for the copy file must be a shared location between the two servers.
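A sketch of what that can look like – schema, table, and path are hypothetical, and /sharedfs must be a filesystem visible to both HADR servers:

```shell
db2 "DECLARE c1 CURSOR FOR SELECT * FROM source_schema.orders"
db2 "LOAD FROM c1 OF CURSOR INSERT INTO target_schema.orders
     COPY YES TO /sharedfs/loadcopy"
```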
Hi Ember,
V10.5 FP7.
We have a primary, a principal standby, and an auxiliary standby.
The primary and principal standby are in the same data center; the auxiliary standby is in a different data center.
ACR is configured from the WebSphere application.
1) The primary crashed, and the standby lost its storage.
2) We rebooted the standby server.
3) Once the standby server came up, we executed the takeover command with the force option.
4) The takeover command took 40 minutes to come back, and in the diag log we can see connections hitting the standby from the application, since ACR is configured.
I can't understand what caused the takeover by force command to take so much time – could the incoming connections have caused it?
The time this takes depends on the settings of HADR_PEER_WINDOW, HADR_TIMEOUT, and can also be longer if log spooling is enabled and the principal standby is far behind the primary. What entries do you have in the diagnostic log during the time the standby was taking over? Did it happen that the standby came up when you happened to get the primary back up again? What state was HADR in when you issued the takeover command? Do you have a copy of the db2pd -hadr output from just before issuing the takeover command? You don’t have a replay delay on this standby, do you?
Hi Ember,
Thanks for these gems – I’ve been blessed with the opportunity to learn my job…on the job, so I’ve been reading a bunch of your blog pieces.
For the last couple of primary reboots, we've been experiencing disk full errors on our system. The involved databases don't have HADR shut off, and the standby appears to begin crash recovery when the primary is rebooting – and then I see a bunch of confusion, and split-brain warnings when everything comes back up. It appears that the process has always struggled, if I understand the logs correctly, but we've only just begun to get the disk full errors – I think because of a rather large set of transaction processes that happen to occur at the same time.
My newbie brain wants to try deactivating the primary and shutting HADR off there before it's rebooted – but I think that pretty much does away with the whole reason we put HADR on these databases to begin with.
My apologies if there's a load of information missing here… I admit to having been a DB2 LUW DBA for about 9 weeks now, and I have a lot to learn. Would you have any suggestions for good reading about HADR, how one might tweak things to avoid disk full, or how one might experiment to see what works best?
Thanks again for sharing your big brain.
Are you doing a clean takeover on the standby before the reboot? What method are you using to automate failover? What is your procedure? Assuming the primary is going down and you want the database to remain up through either a VIP or ACR, you need to do a graceful role switch to the standby first.
Never at any time use the “by force” commands on the takeovers in this process unless something unusual happens.
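Assuming database SAMPLE and a planned reboot of the primary host, that graceful sequence looks roughly like this:

```shell
# On the standby – clean role switch (no BY FORCE):
db2 takeover hadr on db SAMPLE

# Now reboot the old primary host. When it is back:
db2start
db2 activate db SAMPLE          # the database comes back up in its standby role

# Wait for the pair to reach a peer state:
db2pd -db SAMPLE -hadr

# Optionally switch the primary role back:
db2 takeover hadr on db SAMPLE
```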
Thank you very much, Ember.
I refer to your blog most of the time for learning/troubleshooting.
Your posts have helped me a lot in day-to-day DB administration.
Keep up the good work.
Hi Ember,
Whenever I run RUNSTATS on the production server, the secondary server goes down at the same time with "Replay Window Active".
Could you describe how and why this happens with RUNSTATS?
Are there other commands that are part of your runstats process? For example, explicit rebinds can cause this. There are so many reasons that reads on standby (ROS) can be interrupted that, if uptime for your read-only workloads is critical, it is critical to have a VIP or other methodology to reroute connections when reads are not available on the standby.