(For Part 1 of this series, please click 10 Minute Triage: Assessing Problems Quickly, Part 1).
One of my favorite commercials is from a motor oil company. The scene begins with a man who is sleeping soundly. His slumber ends abruptly when he is physically grabbed by the collar, shoved on a treadmill and forced to run in an all out sprint from a cold start to prove a point about the abuse we give our cars in the morning.
Getting paged at 3 a.m. is like being that poor man on the treadmill: not quite awake, still in your PJ's and bunny slippers. Now add people running around with their hair on fire, an alarm going off in the background, and bombs going off over your head as if you were storming the beaches of Normandy.
That is what being on call at 3 a.m. is like. You are tired, confused, everything is loud, and you are still in your bunny slippers.
So how do you quickly solve the problem while only half awake? In Part 1 of this series, you learned about baselining and what "normal" looks like. So now let's attack the problem.
Storming the beaches of Normandy.
In my case, every production server has what I call a "Triage Kit," which is essentially a directory containing SQL and scripts that I can get to very quickly.
Here is what I have in it:
- SQL – Top SQL by Execution Time (a sketch of this one follows the list)
- SQL – Top SQL by Number of Executions
- SQL – Capture Bufferpool Utilization
- SQL – Capture Locking Information
- KSH – Flush Cache, and capture No Executions /Execution Time in intervals
- Triage.txt, containing:
  - db2expln and db2advis commands with common arguments
  - db2top and useful arguments
  - Various notes on AIX utilities and commands
  - IBM Support Center information to open a PMR
- Not a file, but also part of the kit: a directory that is safe to use (i.e. large and not on a busy filesystem) where I can send output and archive it off if necessary.
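As a concrete illustration, here is a minimal sketch of the kind of "Top SQL by Execution Time" query I keep in the kit, wrapped in a db2 CLP call so it can be fired off quickly. It is only a sketch: it assumes the snapshot-based administrative view SYSIBMADM.SNAPDYN_SQL (present at the DB2 9.7 level shown later in this article) and the SAMPLE database from the examples below; newer levels may prefer the MON_GET_PKG_CACHE_STMT table function, and the columns you care about will vary.

# Sketch only: top 10 dynamic SQL statements by total execution time
db2 connect to SAMPLE
db2 -v "SELECT NUM_EXECUTIONS,
               TOTAL_EXEC_TIME,
               TOTAL_USR_CPU_TIME + TOTAL_SYS_CPU_TIME AS TOTAL_CPU_TIME,
               SUBSTR(STMT_TEXT, 1, 100) AS STATEMENT
        FROM SYSIBMADM.SNAPDYN_SQL
        ORDER BY TOTAL_EXEC_TIME DESC
        FETCH FIRST 10 ROWS ONLY"

Swap the ORDER BY column to NUM_EXECUTIONS and you have the "Top SQL by Number of Executions" flavor.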
When it is time to troubleshoot, I have three session windows going at one time. The top half of my screen is divided in two: one window shows a trailing db2diag.log (tail -n 100 -f db2diag.log), and one window has nmon running with CPU and disk displayed. The bottom half is one lone window where I toss in commands from my triage directory.
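For reference, those two top windows boil down to something like the commands below. The diagnostic path shown is the common default under the instance home, so adjust it to your own DIAGPATH and instance name.

# Window 1: trail the diagnostic log (last 100 lines, then follow)
tail -n 100 -f ~db2inst1/sqllib/db2dump/db2diag.log

# Window 2: interactive nmon; press 'c' for CPU and 'd' for disk once it starts
nmon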
But the key here is speed. Scripts and SQL are great, but what if you have an OLTP database with over 10,000 connections running SQL rapidly? Can you quickly find the details you need in a snapshot so long that the file won't even open properly in vi? What are your alternatives?
GUI is not a four letter word.
I am a firm believer that every DBA must be self-sufficient via command line, but when speed is a factor (remember $237 a minute from the last article) GUI is the way to go. Heck, I can wash, wax, and detail my truck over a four hour period, but if I can donate $20 to a high school band who can attack the truck with an army and crank it out in 30 minutes, that is time and money well spent.
All hail freeware. I love freeware like NMON Analyzer. But for real-time DB2 monitoring, there are tools like DB2MON (www.db2mon.com), which, as far as I can tell, has been around since the beginning of time (v6.1 at least).
[Screenshot: DB2MON. Names have been blacked out to protect the innocent.]
Essentially, DB2MON is a simple GUI front end to your snapshot data that gives you some basic drill-down capability. It is simple, it is free, and it saves a lot of time. It should be part of any DBA's arsenal.
The big guns for quickly triaging situations would probably be Quest Spotlight, DBI Brother-Panther, and IBM Optim. All are very powerful tools that give you various advantages. Optim is really good at graphs and easily ties into DB2. DBI’s Brother-Panther is very good at finding the cumulative effect of a single SQL statement (run 10,000 times) on a server. Spotlight gives very good visual representation and drill down capabilities.
[Screenshot: Quest Spotlight]
As attractive as GUI tools are, they are not perfect. Most come with a licensing cost; some (but not all) have a large footprint; some may need their own infrastructure. Even freeware has the drawback that it may not run perfectly every time.
When used to complement your arsenal, rather than as your only tool, GUI and freeware tools can be very effective in troubleshooting when performance starts to suffer.
I can say from personal experience that having at least one GUI tool has cut my analysis time in half.
Beyond the basics – Using the DB2DIAG.log and TRAP Files
Why is my bufferpool only at 70%? Why is my CPU spiking at 90%? How come my disks are going crazy? Going into the "how and why" of DB2 and operating system behavior is an article on its own, and there are those who can speak about it in much more depth than I can. My purpose was to give you a method to analyze quickly and come to your own conclusions. But before I wrap up the article, let me give you some interesting tidbits from my IDUG 2012 co-presenter, Pavel Sustr (IBM Toronto Lab).
When analyzing your DB2DIAG.log, where do you look? Well, there are a few phrases you can search for to help narrow down where the problem started.
Key phrases to search for in a DB2DIAG.log (a grep one-liner follows this list):
- “Error in agent”
- “pdEDUIsInDB2KernelOperation”
- “pdResilienceIsSafeToSustain”
- “FODC_Trap…” (name/type of the FODC directory)
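A quick way to scan for all of these phrases at once is a single grep. The path below assumes the default db2dump diagnostic directory; point it at your own DIAGPATH.

# Flag any line containing one of the trigger phrases, with line numbers
grep -n -E "Error in agent|pdEDUIsInDB2KernelOperation|pdResilienceIsSafeToSustain|FODC_Trap" \
    ~db2inst1/sqllib/db2dump/db2diag.log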
Note the EDUID from the DB2DIAG.log entry and backtrack to the first unrelated entry (such as an STMM adjustment, which is informational, not an error).
2012-03-26-11.30.45.695289-240 I9822E563          LEVEL: Error
PID     : 26689                TID : 140093448775424 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000           DB   : SAMPLE
APPHDL  : 0-14                 APPID: *LOCAL.DB2.120326153046
AUTHID  : DB2INST1
EDUID   : 31                   EDUNAME: db2agent (SAMPLE) 0
FUNCTION: DB2 UDB, base sys utilities, sqleagnt_sigsegvh, probe:1
MESSAGE : Error in agent servicing application with coor_node:
DATA #1 : Hexdump, 2 bytes
0x00007F6A0C00FA42 : 0000
Use this EDUID to find the corresponding TRAP file produced by DB2 in the sqllib/db2dump directory.
$ ls -l FODC_Trap_2012-03-26-11.30.45.417849
total 8784
-rw-r--r-- 1 db2inst1 db2iadm1   17811 Mar 26 11:30 03915407.000.locklist.txt
drwxr-x--- 2 db2inst1 db2iadm1    4096 Mar 26 11:30 26689.000.core
-rw-r--r-- 1 db2inst1 db2iadm1  223795 Mar 26 11:30 26689.31.000.apm.bin
-rw-r----- 1 db2inst1 db2iadm1    1193 Mar 26 11:30 26689.31.000.cos.txt
-rw-r--r-- 1 db2inst1 db2iadm1  209197 Mar 26 11:30 26689.31.000.db2pd.SAMPLE.txt
-rw-r--r-- 1 db2inst1 db2iadm1 1894480 Mar 26 11:30 26689.31.000.dump.bin
-rw-r--r-- 1 db2inst1 db2iadm1  164078 Mar 26 11:30 26689.31.000.rawstack.txt
-rw-r--r-- 1 db2inst1 db2iadm1   29693 Mar 26 11:30 26689.31.000.trap.txt
-rw-r--r-- 1 db2inst1 db2iadm1    2206 Mar 26 11:30 chkconfig.txt
-rw-r----- 1 db2inst1 db2iadm1 6291312 Mar 26 11:30 db2eventlog.000.crash
-rw-r--r-- 1 db2inst1 db2iadm1    3724 Mar 26 11:30 db2pd.bufferpools.SAMPLE.txt
-rw-r--r-- 1 db2inst1 db2iadm1    9883 Mar 26 11:30 db2pd.dbcfg.SAMPLE.txt
-rw-r--r-- 1 db2inst1 db2iadm1    8716 Mar 26 11:30 db2pd.dbmcfg.txt
-rw-r--r-- 1 db2inst1 db2iadm1     743 Mar 26 11:30 db2pd.dbptnmem.txt
-rw-r--r-- 1 db2inst1 db2iadm1    7213 Mar 26 11:30 db2pd.memory.SAMPLE.txt
drwxr-xr-x 2 db2inst1 db2iadm1    4096 Mar 26 11:30 OSSNAPS
-rw-r--r-- 1 db2inst1 db2iadm1   21376 Mar 26 11:30 procmaps.txt
-rw-r--r-- 1 db2inst1 db2iadm1   22565 Mar 26 11:30 sysctl.txt
-rw-r--r-- 1 db2inst1 db2iadm1    9271 Mar 26 11:30 top.txt
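Note how the file names encode the process ID and EDU ID: 26689.31.000.trap.txt is PID 26689, EDUID 31, member 000. If the FODC directory is crowded, a quick filter like the one below (using EDUID 31 from the log entry above) narrows things down.

# List only the files written for EDUID 31 in this FODC directory
ls FODC_Trap_2012-03-26-11.30.45.417849/ | grep "\.31\."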
Looking in a TRAP file is like deciphering hieroglyphics: unless you have a secret decoder ring, it is hard to read. Most DBAs leave it to IBM support to work through. However, if you pay attention, you can glean some information out of the trap file itself.
To demangle the C++ symbols in the stack trace, run the trap file through c++filt. See the example below.
$ cat 26689.31.000.trap.txt | c++filt > 26689.31.000.trap.txt.filtered
$ vi 26689.31.000.trap.txt.filtered

DB2 build information: DB2 v9.7.0.0 s090521 SQL09070
timestamp: 2012-03-26-11.30.45.445398
instance name: db2inst1.000
EDU name    : db2agent (SAMPLE) 0
EDU ID      : 31
Signal #11
…
---FUNC-ADDR---- ------FUNCTION + OFFSET------
00007F6A1804D109 ossDumpStackTraceEx + 0x01e5
00007F6A18047F2A OSSTrapFile::dumpEx(unsigned long, int, siginfo*, void*, unsigned long) + 0x00cc
00007F6A1AA6A2A9 sqlo_trce + 0x02eb
00007F6A1AAAB9B1 sqloEDUCodeTrapHandler + 0x0167
…
00007F6A19D3FCF6 sqlbAlterPoolAct(unsigned short, SQLP_LSN8*, SQLB_GLOBALS*) + 0x03f8
…
00007F6A19E0D5BB sqldmpnd(sqeAgent*, int, char*, SQLP_LSN8*, SQLD_RECOV_INFO*) + 0x01cb
00007F6A1AADD1A9 sqlptppl(sqeAgent*) + 0x02f7
00007F6A1969EE2A sqlpxcm1(sqeAgent*, SQLXA_CALL_INFO*, int) + 0x05ae
00007F6A1970EE5D sqlrrcom(sqlrr_cb*, int, int) + 0x0467
00007F6A19D16AC5 sqlbEMReduceContainers(SQLB_POOL_CB*, unsigned int, sqeBsuEdu*) + 0x0341
00007F6A19D15672 sqlbLockAndMoveExtents(SQLB_POOL_CB*, bool, unsigned int, sqeBsuEdu*) + 0x04be
00007F6A19D1B4C5 sqlbExtentMovementEntryPoint(sqeBsuEdu*, void*) + 0x00bf
00007F6A192708B3 sqleIndCoordProcessRequest(sqeAgent*) + 0x062b
00007F6A1927D5D7 sqeAgent::RunEDU() + 0x0381
It is best to read a TRAP file from the bottom up. For example, you can see DB2 was moving extents and playing with containers before a bufferpool adjustment, which is when the TRAP occurred.
Notice the Signal code in the TRAP file? That gives a good clue as to what was going on and points to a possible cause of the problem.
Here is a list of useful signals to watch for; a grep one-liner for pulling the signal out of a trap file follows the list. Notice that signal 11 shows the instance itself was the source of the trap.
- SIGILL(4), SIGFPE(8), SIGTRAP(5), SIGBUS(10; Linux: 7), SIGSEGV(11), SIGKILL(9)
Instance trap. Bad programming, HW errors, invalid memory access, stack and heap collisions, problems with vendor libraries, OS problems. The instance shuts down.
- Most UNIXes: SIGABRT(6); HP-UX: SIGIOT(6)
Instance panic. Self-induced by DB2 due to unrecoverable problems. Typically associated with data (disk) corruption. The instance shuts down.
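If you have a pile of trap files and just want to see which signal each one received, a plain grep does the trick when run from the FODC or db2dump directory.

# Print the signal line recorded in each trap file
grep "Signal #" *.trap.txt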
To see if an APAR already exists for this behavior, search IBM’s website:
https://www-947.ibm.com/support/entry/portal/search_results/information_management/db2_for_linux%2c_unix_and_windows?sortby=-dcdate_sortrange&filter=language:en+scope%3Astgsysx%2Cdblue%2Cic%2Cpubs+DC.Type_avl%3ACT748&prod=B732565I80401W32&prod=C085867V89032N86&prod=D053518M30225P30&prod=G096328S00454E20&prod=J038363J86299A04&prod=J257661H55054Y90&prod=O372542I12371Q30&prod=P067787Y85362R93&prod=P383241Y60897N93&prod=P849948M05373L86&prod=S008045U16263E24&prod=S350547K04077A23&prod=T509412Q88783P22&prod=X816880U50489Z11&prod=Z983319R13364R33&prod=P849948M05373L86&matches=0&sn=spe&productContext=1690484112
In this case, search for “trap sqlbAlterPoolAct sqlbExtentMovementEntryPoint” and you’ll see that an APAR does exist for this issue.
Moving Forward.
In this series, we have learned the basics of troubleshooting in a rapid fashion. We learned the importance of baselining your production systems both at a DB2 and OS level. We learned various DB2 and OS commands to gather the information we need. We have learned that GUI is not a four letter word and that GUI tools can help when time is of the essence (at a price). Preparation, we see, is the true key to troubleshooting rapidly. Finally, we learned some advanced techniques in sifting through a DB2DIAG.log and a DB2 TRAP file.
Armed with the techniques and tips in both articles, you can now go forth and diagnose your own ill-performing database servers.
About the Author: Michael Krafick has been a production support DBA for over 12 years in data warehousing and highly transactional OLTP environments. He was a session speaker for “10 Minute Triage” at the 2012 IDUG Technical Conference. Michael also has extensive experience in setting up monitoring configurations for DB2 Databases. He can be reached at “Mkrafick @ bellsouth.net”. Linked-in Profile: http://www.linkedin.com/in/michaelkrafick .