I get it. The database server is a single point in many environments that have many points at most levels. Therefore any problem that affects more than one point must be the database.
As DBAs, sometimes we feel the database is under attack. We may therefore develop the knee-jerk reaction “It isn’t the database!” If a DBA is on top of things, and given reasonable power and resources, then it isn’t ALWAYS the database. I make a conscious effort to go into every outage call with the assumption by me that it IS the database, and it is my job to find out how. This approach has me digging a lot deeper than if I came in with the preconceived opinion that it wasn’t the database.
A client called me recently, reporting that only some of their servers were getting SQL300081 with return codes ,11004,. This is a DB2 on Windows environment. Many DBAs know that SQL30081 is a client-side code that basically means “I can’t talk to that database!” It can have such a vast array of causes from DB2 not being started on the database server, to a firewall between the servers, to an incorrect hosts file, and so much more. This technote is often extremely useful in troubleshooting. It has been around in one format or another since I started as a DBA.
Before they even gave me the error message, I was betting that’s what they’d say. But I didn’t say it out loud because I come across as enough of a know-it-all as it is. I dutifully linked through to the Windows message page, and got really nothing out of it. However, once the time frame of the fleeting issue that had already resolved itself by the time I joined the bridge call had been defined, I started parsing through the diag log for that time. I saw this error message:
2016-08-18-18.28.01.903000-300 I21542872F451 LEVEL: Warning PID : 87484 TID : 8816 PROC : db2syscs.exe INSTANCE: DB2 NODE : 000 DB : SAMPLE APPHDL : 0-5573 EDUID : 8816 EDUNAME: db2agent (SAMPLE) 0 FUNCTION: DB2 UDB, bsu security, sqlexLogPluginMessage, probe:20 DATA #1 : String with size, 62 bytes Password validation for user user1 failed with rc = -2146500252
(as always, names have been changed to protect the innocent)
I counted how many occurrences of this were seen in the half hour in which the problem occurred to the half hour before and the half hour after to make sure it wasn’t just something silly like a user mistyping a password that happened all the time. I found 251 occurrences in the half hour where the problem occurred, and ZERO in either the half hour before or the half hour after.
I’ve been finding the db2diag tool particularly useful, particularly on Windows lately. I used this command to count the occurrences in the problem half-hour:
PS D:\xtivia> db2diag -H 30m:2016-08-18-18.30 -g data:="Password validation for user user1" -c db2diag: 251 matches found of total 1543 records
The following also works on Windows:
PS D:\xtivia> db2diag -H 30m:2016-08-18-18.30 -g data:="Password validation for user user1" -fmt %ts |measure-object -line Lines Words Characters Property ----- ----- ---------- -------- 251
After reporting this to everyone on a large call, they were able to identify that both the primary and the secondary domain controllers were rebooted at the same time. The USER1 user id is an Active Directory domain ID. All of the application servers that had problems connecting were in the same site as these domain controllers.
In this case, it wasn’t the database, but the DB2 DBA helped identify the root cause of the problem.