To eliminate any confusion – I am not Ember. Ember is a little shorter, with a big smile, and glasses. I was lucky enough to meet her through a mutual friend at the IDUG 2012 Conference in Denver just a few weeks ago. After enough networking events and sessions, we found we had similar technical backgrounds and began swapping stories. Eventually, she was kind enough to ask me to write a guest entry for you. Having a background in 24×7 production support, I have picked up a pretty efficient method of troubleshooting problems quickly. As a result, I wanted to pass some of this knowledge to others through my session at IDUG and now in a condensed two-part article on Ember’s blog. I hope you enjoy it.
* * *
“$237 a minute. Can you remember that? That is what we pay in fines to the client every minute we are down.” In an hour, that’s enough money to buy a Toyota Camry, and in 24 hours, enough to buy a house. Twice. That is not even taking into account user frustration or lost sales.
Anyone can be a good production support DBA. However, there are only a few true DBAs who are natural “ER doctors”, able to successfully triage an emergency situation. Only a few DBAs can be on the clock for $237 a minute, with a manager looking over one shoulder, as they assess and respond in rapid fashion.
It is the equivalent of doing surgery while the patient’s family watches over you as you work. No pressure, really.
The key to rapid recovery from a failure involves two things: a solid baseline and plenty of preparation. We will cover how to baseline in the first part of this series. In part two we will look at how to attack the problem quickly.
Baseline:
How do you know if that “blip” was normal or an anomaly? Is a three-second query on a sub-second OLTP database normal? How about a twenty-five-minute query on a data warehouse? You just do not know without getting a good picture, or baseline, before the proverbial poo hits the fan.
A good baseline will include a few things:
- The historical usage of a database over time. The longer, the better. I recommend data be collected over a day, week, month, and quarter. If you want a gold star, gather it for one year.
- The historical data needs to come from both DB2 and the Operating System. Think of a marriage: generally the two are happy together, but occasionally one spouse irritates the other. So you need to know how the O/S behaves when DB2 is irritating it, and vice versa.
These are some of the pieces of data you should collect for your baseline:
- DB2:
- Avg Execution time of SQL
- Number of Executions for SQL
- Bufferpool Hit Ratios
- Locking
- Normal db2diag.log activity
- Operating System:
- CPU
- Memory (and paging)
- Disk I/O
There are a few good tools that are native to both DB2 and AIX to help you gather this information.
Baseline DB2
From a DB2 perspective you have a ton of tools at your disposal, you know, the ones you have in a “DB2 Notes” folder hidden on your laptop. The “Hey, how did you do that?” commands you protect like a dragon does his hoard. Ember and others have written blog entries about these awesome little commands, so I will only mention a few.
- TOP_DYNAMIC_SQL (SYSIBMADM administrative view)
TOP_DYNAMIC_SQL is your ‘go to’ guy. The SQL can easily be manipulated to show you “Top SQL by Number of Executions” or “Top SQL by Execution Time”. Just remember that the numbers are cumulative from the time of activation, and the cache is flushed when the database is deactivated. You can also flush the cache manually with a “FLUSH PACKAGE CACHE DYNAMIC” command*. Personally, I rolled this into a script that sends me a Top 20 every week or so; a sketch of that script follows below.
Top 20 SQL by Number of Executions:
db2 "SELECT NUM_EXECUTIONS as EXECUTIONS, AVERAGE_EXECUTION_TIME_S as TIME_SECONDS, STMT_TEXT as TEXT FROM SYSIBMADM.TOP_DYNAMIC_SQL WHERE (STMT_TEXT like 'SELECT%' or STMT_TEXT like 'select%') ORDER BY NUM_EXECUTIONS DESC FETCH FIRST 20 ROWS ONLY"
Top 20 SQL by Execution Time:
db2 "SELECT AVERAGE_EXECUTION_TIME_S as TIME_SECONDS, NUM_EXECUTIONS as EXECUTIONS, STMT_TEXT as TEXT FROM SYSIBMADM.TOP_DYNAMIC_SQL WHERE (STMT_TEXT like 'SELECT%' or STMT_TEXT like 'select%') ORDER BY AVERAGE_EXECUTION_TIME_S DESC FETCH FIRST 20 ROWS ONLY"
* Disclaimer: Remember, caching is there for a reason; it speeds up queries. There could be a slight performance impact from flushing your dynamic cache.
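For illustration, here is a minimal sketch of that weekly script, assuming ksh; the db2profile path, the SAMPLE database alias, the temp file, and the e-mail address are all placeholders to swap for your own environment:
#!/bin/ksh
# Weekly Top 20 SQL report - a sketch, not a polished tool
# Load the DB2 environment (path assumes instance owner db2inst1)
. /home/db2inst1/sqllib/db2profile
db2 connect to SAMPLE > /dev/null
# -x suppresses column headers so the mail body stays readable
db2 -x "SELECT NUM_EXECUTIONS, AVERAGE_EXECUTION_TIME_S, SUBSTR(STMT_TEXT,1,100) FROM SYSIBMADM.TOP_DYNAMIC_SQL ORDER BY NUM_EXECUTIONS DESC FETCH FIRST 20 ROWS ONLY" > /tmp/top20_sql.txt
db2 connect reset > /dev/null
mailx -s "Weekly Top 20 SQL" dba@example.com < /tmp/top20_sql.txt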
- SNAPLOCKWAIT (SYSIBMADM administrative view)
There is one candy bar, and two kids are fighting for it. Who is shoving the other away and yelling “Mine!”?
Command:
db2 "SELECT AGENT_ID, LOCK_MODE, LOCK_OBJECT_TYPE, AGENT_ID_HOLDING_LK, LOCK_MODE_REQUESTED FROM SYSIBMADM.SNAPLOCKWAIT"
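If you want to put a name to those agent IDs, you can join over to SYSIBMADM.SNAPAPPL_INFO for the application name. A sketch (verify the columns against your DB2 version before relying on it):
db2 "SELECT W.AGENT_ID, A.APPL_NAME, W.LOCK_MODE, W.LOCK_OBJECT_TYPE, W.AGENT_ID_HOLDING_LK FROM SYSIBMADM.SNAPLOCKWAIT W JOIN SYSIBMADM.SNAPAPPL_INFO A ON W.AGENT_ID = A.AGENT_ID"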
- DB2TOP
A powerful tool, DB2TOP is DB2’s crack at a fancy GUI. It’s not up to par with, say, Quest Spotlight, DBI Brother-Panther, or IBM Optim, but when you need to get data quick and dirty, this tool will get the job done. DB2TOP is actually a real-time tool. However, it can be run in REPLAY mode with captured session information (capturing the historical data you need). The binary session information is captured with a press of “C” while in real-time mode, or with “-C” via the command line. You can replay against the binary file with the “-f” argument.
Command: db2top -d <dbname>
Useful interactive keys: B Bottleneck; b Bufferpools; D Dyn. SQL; U Locks; C Data Capture
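As a concrete capture-and-replay pair (the -m duration flag and the db2snap file naming follow the defaults I have seen; check your version’s help screen before trusting them):
Capture: db2top -d <dbname> -C -m 120 (collect snapshot data in the background for 120 minutes; the collector writes a db2snap-*.bin file in the current directory)
Replay: db2top -d <dbname> -f <db2snap file>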
- DB2DIAG.log or the administration notification log (<instance>.nfy)
Is your db2diag.log the strong silent type or a chatty Kathy? Does it babble on and on about an EVENT or an ERROR? Make sure this log is being archived off daily, weekly, or monthly using the “-A” argument of the db2diag tool. It pays to know what a normal, healthy log looks like.
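For example, a nightly cron entry along these lines (the paths are placeholders) archives the current db2diag.log with a timestamp suffix and lets DB2 start a fresh one:
0 1 * * * /home/db2inst1/sqllib/bin/db2diag -A /db2/archive/diag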
Baseline AIX
I do not know why I picture DB2 and its Operating System as an old married couple. They work well together, have done so efficiently for a long time, but occasionally they start tossing snippy remarks at each other and the harmony is gone until one admits defeat and apologizes for their behavior. Sometimes it is DB2; sometimes it’s AIX. But in the end they get back to normal and plod along happily.
AIX tools are not just for System Admins. To be a better DBA, you need to learn to speak a different dialect of geek. There are some awesome commands to know, and one tool you should know intimately.
- IOSTAT
Disk utilization for a defined time over X intervals.
Command: iostat <delay in seconds> <iterations>
Example: “iostat 3 2” (wait 3 seconds between 2 iterations of output)
- VMSTAT
CPU and memory utilization for a defined time over X intervals.
Command: vmstat <delay in seconds> <iterations>
Example: “vmstat 5 3” (wait 5 seconds between 3 iterations of output)
- NMON or TOPAS (Real Time)
Pick your poison. Either one of these will do the job and show you useful information in real time.
Command: NMON
Interactive keys: C CPU; M Memory; D Disk; T Top Processes. Start with -f, -x, or -X to send output to a file instead.
Command: TOPAS
No arguments; everything is on screen by default.
Regardless of whether we are talking about AIX or DB2, the key to baselining is having archived data to refer to in a time of crisis. So ask yourself this: How do I use all the fancy commands listed above? Do I pump the output to a table, e-mail it to myself, archive it to a file? All of this is up to you, but have a plan in place; one possible plan is sketched just below.
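As one possible sketch of such a plan, a couple of crontab entries plus a tiny wrapper script would do (all paths are placeholders; note that classic AIX cron wants an explicit minute list rather than */10):
# crontab entries
0,10,20,30,40,50 * * * * /perf/scripts/os_snap.ksh
0 6 * * 1 /perf/scripts/top20_sql.ksh
# os_snap.ksh - append OS samples to a dated baseline file
#!/bin/ksh
STAMP=$(date +%Y%m%d)
vmstat 5 3 >> /perf/baseline/vmstat.$STAMP
iostat 5 3 >> /perf/baseline/iostat.$STAMP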
In a class by itself – NMON Analyzer
There is an awesome little tool called NMON analyzer. It takes NMON data captured in a file and displays all sorts of operating system information. It will graphically show spikes in CPU, Disk, and memory. However, that just scratches the surface of what it can do. You can also choose to see your HDISK Layout, your operating system level information, disk read/write output, your I/O adapter, or network I/O.
In our shop, this data is captured on a daily and weekly basis and is stored in an “.nmon” file on the server. When needed, the DBA can pull down the NMON file and run it through the NMON Analyzer to see how the system is performing.
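The capture side is a single command. For example, 300-second snapshots taken 288 times cover a 24-hour day (the output directory is a placeholder):
nmon -f -s 300 -c 288 -m /perf/nmon
The -f flag writes spreadsheet-style output to a <hostname>_<date>_<time>.nmon file instead of painting the screen.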
NMON Analyzer is free, but it is not an official IBM tool and it is not supported by IBM. However, you can read a detailed write-up on developerWorks as well as download the XLS file.
NMON Analyzer URL:
http://www.ibm.com/developerworks/wikis/display/WikiPtype/nmonanalyser
Next Steps
Baseline and preparation: they are the solid foundation you need in your “bag of tricks”. It is just like a physical at the physician’s office, where the doctor gets an idea of what things look like at base state. In the next part of this series, we will look at how to logically attack the problem once it arises and compare it to a healthy database.
Edit: see part 2 of the series here: https://datageek.blog/2012/07/05/10-minute-triage-assessing-problems-quickly-part-2/
About the Author: Michael Krafick has been a production support DBA for over 12 years in data warehousing and highly transactional OLTP environments. He was a session speaker for “10 Minute Triage” at the
2012 IDUG Technical Conference. Michael also has extensive experience in setting up monitoring configurations for DB2 databases. He can be reached at “Mkrafick @ bellsouth.net”. LinkedIn profile: http://www.linkedin.com/in/michaelkrafick
Hello,
I just wanted to say that I found this article, and many of Ember’s, fascinating and very useful. I am not a DBA, but I am responsible for a Commerce 6 system that runs websites for 70 companies. We recently discovered the issues caused by junk records as a result of, well, a complete lack of routine housekeeping on the systems. Now we have the task of catching up on two years’ worth of system operations.
Thank you both for these wonderful articles, sharing such knowledge is greatly appreciated.
Kind regards,
Brian