10 Minute Triage: Assessing Problems Quickly (Part 2)

(For Part 1 of this series, please click 10 Minute Triage: Assessing Problems Quickly, Part 1).

One of my favorite commercials is from a motor oil company. The scene begins with a man who is sleeping soundly. His slumber ends abruptly when he is physically grabbed by the collar, shoved on a treadmill  and forced to run in an all out sprint from a cold start to prove a point about the abuse we give our cars in the morning.

Getting paged at 3 a.m. is like that poor man running on the treadmill–not quite awake in his PJ’s and bunny slippers. That is close to being on call … if you added people running around with their hair on fire, an alarm going off in the background, and bombs going off over your head like you were storming the beaches of Normandy.

That is what being on call is like at 3am. You are tired, confused, everything is loud, and you’re still in your bunny slippers.

So how do you quickly solve the problem while only half awake? In the first half of this article, you learned about baselining and have learned what “normal looks like.”  So now let’s attack the problem.

Storming the beaches of Normandy.

In my case, every production server has what I call a “Triage Kit,” which is essentially a directory containing SQL and scripts that I can get very quickly.

Here is what I have in it:

  1. SQL – Top SQL by Execution Time
  2. SQL – Top SQL by Number of Executions
  3. SQL – Capture Bufferpool Utilization
  4. SQL – Capture Locking Information
  5. KSH – Flush Cache, and capture No Executions /Execution Time in intervals
  6. Triage.txt –
    1. db2expln and db2advis commands with common arguments
    2. db2top and useful arguments
    3. Various notes on AIX Utilities and Commands
    4. IBM Support Center Information to open a PMR
  7. It may not be a file, but there is a directory that is safe to use (i.e. large and not on a busy filesystem) where I can send output and archive off if necessary.

When it is time to troubleshoot, I have three session windows going at one time. The top half of my screen is divided in two. One window is showing a trailing db2diag.log (tail -f -100 db2diag.log) and one window has NMON online with CPU and Disk displayed. The bottom half is one lone window where I am tossing commands from my triage directory.

But the key here is speed. Scripts and SQL are great, but what if you have an OLTP database with over 10,000 connections running SQL rapidly? Can you quickly find the details you need in a snapshot that is so long that the file can’t open in VI properly? What are your alternatives?

 GUI is not a four letter word.

I am a firm believer that every DBA must be self-sufficient via command line, but when speed is a factor (remember $237 a minute from the last article) GUI is the way to go. Heck, I can wash, wax, and detail my truck over a four hour period, but if I can donate $20 to a high school band who can attack the truck with an army and crank it out in 30 minutes, that is time and money well spent.

All hail freeware.  I love freeware like NMON Analyzer. But for real time DB2 monitoring, there are tools like DB2MON (www.db2mon.com), which as far as I can tell, has been around since the beginning of time (v6.1 at least).

Names have been blacked out to protect the innocent.
(Click to enlarge)

Essentially DB2MON is a simple GUI front end to your snapshot data that gives you some basic drill down capability. It is simple, is free, and it saves a lot of time. This should be part of any DBA’s arsenal.

The big guns for quickly triaging situations would probably be Quest Spotlight, DBI Brother-Panther, and IBM Optim. All are very powerful tools that give you various advantages. Optim is really good at graphs and easily ties into DB2. DBI’s Brother-Panther is very good at finding the cumulative effect of a single SQL statement (run 10,000 times) on a server. Spotlight gives very good visual representation and drill down capabilities.

Quest Spotlight
(Click to enlarge)

As attractive as GUI tools are, they are not perfect. Most tools come with a licensing cost; some have a large footprint (but not all); some may need their own infrastructure. Even freeware can have a drawback that it may not run perfectly every time.

When used as a complimentary tool to your arsenal–not just the only tool–freeware tools can be very effective in troubleshooting when performance starts to suffer.

I can say from personal experience that having at least one GUI tool has cut my analysis time in half.

Beyond the basics – Using the DB2DIAG.log and TRAP Files

Why is my bufferpool only at 70%? Why is my CPU spiking at 90%? How come my disks are going crazy? Going into the “how and why” of DB2 and Operating System behavior is an article on its own. There are also those who can speak about it in much more depth than I. My purpose was to give you a method to analyze quickly and come to your own conclusions. But before I wrap up the article let me give you some interesting tidbits from my IDUG 2012 co-presenter, Pavel Sustr (IBM Toronto Lab).

When analyzing your DB2DIAG.log, where do you look? Well, there are a few phrases that you can search to help narrow down where the problem started.

Key Phrases to search in a DB2DIAG.log:

  • “Error in agent”
  • “pdEDUIsInDB2KernelOperation”
  • “pdResilienceIsSafeToSustain”
  • “FODC_Trap…” (name/type of the FODC directory)

Note the EUID from the DB2DIAG.log entry and backtrack to the first unrelated entry (such as a STMM adjustment, which is informational – not an error).

2012-03-26-11.30.45.695289-240 I9822E563           LEVEL: Error
PID     : 26689                TID  : 140093448775424PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000          DB   : SAMPLE
APPHDL  : 0-14                 APPID: *LOCAL.DB2.120326153046
AUTHID  : DB2INST1
EDUID   : 31                   EDUNAME: db2agent (SAMPLE) 0
FUNCTION: DB2 UDB, base sys utilities, sqleagnt_sigsegvh, probe:1
MESSAGE : Error in agent servicing application with coor_node:
DATA #1 : Hexdump, 2 bytes
0x00007F6A0C00FA42 : 0000                                       

Use this EDUID to find the corresponding TRAP file produced by DB2 in the sqllib/db2dump directory.

$ ls -l FODC_Trap_2012-03-26-11.30.45.417849
total 8784
-rw-r--r-- 1 db2inst1 db2iadm1   17811 Mar 26 11:30 03915407.000.locklist.txt
drwxr-x--- 2 db2inst1 db2iadm1    4096 Mar 26 11:30 26689.000.core
-rw-r--r-- 1 db2inst1 db2iadm1  223795 Mar 26 11:30 26689.31.000.apm.bin
-rw-r----- 1 db2inst1 db2iadm1    1193 Mar 26 11:30 26689.31.000.cos.txt
-rw-r--r-- 1 db2inst1 db2iadm1  209197 Mar 26 11:30 26689.31.000.db2pd.SAMPLE.txt
-rw-r--r-- 1 db2inst1 db2iadm1 1894480 Mar 26 11:30 26689.31.000.dump.bin
-rw-r--r-- 1 db2inst1 db2iadm1  164078 Mar 26 11:30 26689.31.000.rawstack.txt
-rw-r--r-- 1 db2inst1 db2iadm1   29693 Mar 26 11:30 26689.31.000.trap.txt
-rw-r--r-- 1 db2inst1 db2iadm1    2206 Mar 26 11:30 chkconfig.txt
-rw-r----- 1 db2inst1 db2iadm1 6291312 Mar 26 11:30 db2eventlog.000.crash
-rw-r--r-- 1 db2inst1 db2iadm1    3724 Mar 26 11:30 db2pd.bufferpools.SAMPLE.txt
-rw-r--r-- 1 db2inst1 db2iadm1    9883 Mar 26 11:30 db2pd.dbcfg.SAMPLE.txt
-rw-r--r-- 1 db2inst1 db2iadm1    8716 Mar 26 11:30 db2pd.dbmcfg.txt
-rw-r--r-- 1 db2inst1 db2iadm1     743 Mar 26 11:30 db2pd.dbptnmem.txt
-rw-r--r-- 1 db2inst1 db2iadm1    7213 Mar 26 11:30 db2pd.memory.SAMPLE.txt
drwxr-xr-x 2 db2inst1 db2iadm1    4096 Mar 26 11:30 OSSNAPS
-rw-r--r-- 1 db2inst1 db2iadm1   21376 Mar 26 11:30 procmaps.txt
-rw-r--r-- 1 db2inst1 db2iadm1   22565 Mar 26 11:30 sysctl.txt
-rw-r--r-- 1 db2inst1 db2iadm1    9271 Mar 26 11:30 top.txt

Looking in a TRAP file is like deciphering hieroglyphics. Unless you have some secret decoder ring, it is often hard to decipher. Most leave it to IBM support to work through. However, if you pay attention, you can gleam some information out of the trap file itself.

To detangle some of the stack signals, you will want to run the trap file through “c++filt”. See the example below.

 
$ cat 26689.31.000.trap.txt | c++filt > 26689.31.000.trap.txt.filtered

$ vi 26689.31.000.trap.txt.filtered
DB2 build information: DB2 v9.7.0.0 s090521 SQL09070
timestamp: 2012-03-26-11.30.45.445398
instance name: db2inst1.000
EDU name     : db2agent (SAMPLE) 0
EDU ID       : 31
Signal #11
…

---FUNC-ADDR---- ------FUNCTION + OFFSET------
00007F6A1804D109 ossDumpStackTraceEx + 0x01e5
00007F6A18047F2A OSSTrapFile::dumpEx(unsigned long, int, siginfo*, void*, unsigned long) + 0x00cc
00007F6A1AA6A2A9 sqlo_trce + 0x02eb
00007F6A1AAAB9B1 sqloEDUCodeTrapHandler + 0x0167
…
00007F6A19D3FCF6 sqlbAlterPoolAct(unsigned short, SQLP_LSN8*, SQLB_GLOBALS*) + 0x03f8
…
00007F6A19E0D5BB sqldmpnd(sqeAgent*, int, char*, SQLP_LSN8*, SQLD_RECOV_INFO*) + 0x01cb
00007F6A1AADD1A9 sqlptppl(sqeAgent*) + 0x02f7
00007F6A1969EE2A sqlpxcm1(sqeAgent*, SQLXA_CALL_INFO*, int) + 0x05ae
00007F6A1970EE5D sqlrrcom(sqlrr_cb*, int, int) + 0x0467
00007F6A19D16AC5 sqlbEMReduceContainers(SQLB_POOL_CB*, unsigned int, sqeBsuEdu*) + 0x0341
00007F6A19D15672 sqlbLockAndMoveExtents(SQLB_POOL_CB*, bool, unsigned int, sqeBsuEdu*) + 0x04be
00007F6A19D1B4C5 sqlbExtentMovementEntryPoint(sqeBsuEdu*, void*) + 0x00bf
00007F6A192708B3 sqleIndCoordProcessRequest(sqeAgent*) + 0x062b
00007F6A1927D5D7 sqeAgent::RunEDU() + 0x0381

It is best to read a TRAP file from the bottom up. For example, you can see DB2 was moving extents and playing with containers before a bufferpool adjustment, which is when the TRAP occurred.

Notice the Signal code in the TRAP? That gives a good clue what was going on and gives a possible cause of the problem.

Here is a list of useful signals to watch for. Notice signal 11 shows that the instance was the source of the trap.

  • SIGILL(4), SIGFPE(8), SIGTRAP(5), SIGBUS(10), Linux: (7), SIGSEGV(11), SIGKILL(9)
    Instance trap. Bad programming, HW errors, invalid memory access, stack and heap collisions, problems with vendor libraries, OS problems. The instance shuts down.
  • Most UNIX’s:    SIGABRT(6); HP-UX:    SIGIOT(6)
    Instance panic. Self induced by DB2 due to unrecoverable problems. Typically associated with data (disk) corruption. The instance shuts down.

To see if an APAR already exists for this behavior, search IBM’s website:
https://www-947.ibm.com/support/entry/portal/search_results/information_management/db2_for_linux%2c_unix_and_windows?sortby=-dcdate_sortrange&filter=language:en+scope%3Astgsysx%2Cdblue%2Cic%2Cpubs+DC.Type_avl%3ACT748&prod=B732565I80401W32&prod=C085867V89032N86&prod=D053518M30225P30&prod=G096328S00454E20&prod=J038363J86299A04&prod=J257661H55054Y90&prod=O372542I12371Q30&prod=P067787Y85362R93&prod=P383241Y60897N93&prod=P849948M05373L86&prod=S008045U16263E24&prod=S350547K04077A23&prod=T509412Q88783P22&prod=X816880U50489Z11&prod=Z983319R13364R33&prod=P849948M05373L86&matches=0&sn=spe&productContext=1690484112

In this case, search for “trap sqlbAlterPoolAct sqlbExtentMovementEntryPoint” and you’ll see that an APAR does exist for this issue.

Moving Forward.

In this series, we have learned the basics of troubleshooting in a rapid fashion. We learned the importance of baselining your production systems both at a DB2 and OS level. We learned various DB2 and OS commands to gather the information we need. We have learned that GUI is not a four letter word and that GUI tools can help when time is of the essence (at a price). Preparation, we see, is the true key to troubleshooting rapidly. Finally, we learned some advanced techniques in sifting through a DB2DIAG.log and a DB2 TRAP file.

Armed with the techniques and tips in both articles you now can go forth and diagnose your own ill performing database servers.


About the Author: Michael Krafick has been a production support DBA for over 12 years in data warehousing and highly transactional OLTP environments. He was a session speaker for “10 Minute Triage” at the 2012 IDUG Technical Conference. Michael also has extensive experience in setting up monitoring configurations for DB2 Databases. He can be reached at “Mkrafick @ bellsouth.net”. Linked-in Profile: http://www.linkedin.com/in/michaelkrafick .

 

Michael Krafick
Michael Krafick

Michael Krafick is an aspiring Zamboni driver and well known twig-eater. During the day he is a Sr. Database Engineer for a Fortune 500 company. He is a frequent contributor to datageek.blog, an IBM champion, member of the DB2 Technical Advisory Board, and IDUG speaker Hall of Fame inductee. With over 18 years experience in DB2 for LUW, Mike works hard to educate and mentor others.

Articles: 32

10 Comments

  1. […] blog posts include: 
10 Minute Triage: Assessing Problems Quickly (Part I) 
10 Minute Triage: Assessing Problems Quickly (Part II)
 Now, now you two play nice … DB2 and HACMP failover Technical Conference – It’s a skill […]

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.