I learn something new every day. The times I’m most surprised about what I learn is when it happens to be something pretty fundamental about how Db2 uses resources.
This situation happened on an AIX environment I’m mentoring someone to build. The client has competent AIX admins, but has no on-staff Db2 expertise. My role is to build the Db2 servers out (more than half a dozen across two data centers) while teaching their cross-platform DBA staff how to build and administer Db2. All are single-partition AWSE implementations, some using HADR and TSAMP. The Db2 version is 126.96.36.199. The environment in question is a lightly used development system for a production system that will go live in a month or two.
A week ago we were troubleshooting some backup issues. As a part of that troubleshooting process, a junior system administrator NFS mounted a file system as a sub-directory of /tmp. What we didn’t realize at that time was that he accidentally unmounted the /tmp file system entirely. /tmp is one of those necessary locations to AIX, so AIX just went and created /tmp as a directory in the root file system and moved on.
Presentation of The Problem
This came to me as the client reporting to me that queries were failing against several of the databases. When digging into the error, we discovered that TEMPSPACE1 – the only temp space for some of the databases – was in a 0x4000 state. I find it is a bit less common to deal with non-standard table space states these days than it was 15 years ago. The Db2 8.1 certification for Advanced DBA had me studying and even memorizing some common table space states, so I did not find this part of the problem unusual.
On looking up the 0x4000 table space state, I found it translates to “Offline and Not Accessible”.
Working Through The Problem
The first thing to check with this kind of error message is to verify that the table space containers still exist, and that the permissions and ownership on them look correct. This problem can be caused by someone deleting, compressing, renaming, or changing permissions on a table space container. This was easy to do using
LIST TABLESPACE CONTAINERS for 2 and then simple OS commands to investigate the containers. Since this was for a temporary table space, the containers were of very small size, but they were there and the ownership/permissions did not look off.
My next step was to engage the system administrator to see if he saw any storage errors on the server. He was fairly sure there wasn’t, but looked just to humor me. His expression of shock when he saw that /tmp had been unmounted will bring a smile to my face for several weeks. “What happened to /tmp?!?!” He discovered the unmounting of /tmp the week before and immediately corrected it by copying any recently created files in it and mounting it again. I was not convinced that was the smoking gun at this point, because I was not aware of an ongoing reliance on /tmp by Db2. But problems are onions – you have to address one stinky layer at a time.
After /tmp was remounted, the table space state for TEMPSPACE1 in the database we were focusing on remained 0x4000. Since the environments were not being used, we recycled Db2 and checked again. At that point the table spaces for all databases on that instance were back to a normal state. There were several other instances on the server, and we then checked them. Their table spaces had all returned to a normal state WITHOUT recycling Db2 or altering the table spaces to bring them online.
One of the odd things about the manifestation of this problem is that we had 5 instances that each had 6 databases in them. In each instance, exactly 3 of the databases had TEMPSPACE1 in the 0x4000 state, and it was always the exact same 3 databases. The databases with the issue were not the ones the developers expected to be most heavily used by the development applications that were connected.
The real problem here was probably not the fact that /tmp was unmounted, but rather that the permissions on the directory the OS created to replace it were too restrictive. The default file mask on these servers is fairly restrictive, and the Db2 instance ids did not have write access. IBM has a technote describing the proper permissions for /tmp.
Reproducing this on a test system shows errors like these in the db2 diagnostic log:
2019-04-23-22.17.12.014477+000 E4921E934 LEVEL: Error (OS) PID : 56710 TID : 140077552363264 PROC : db2sysc 0 INSTANCE: db2inst1 NODE : 000 DB : TESTDB APPHDL : 0-65 APPID: *LOCAL.db2inst1.190423221711 AUTHID : DB2INST1 HOSTNAME: db2server EDUID : 22 EDUNAME: db2agent (TESTDB) 0 FUNCTION: DB2 UDB, oper system services, sqlomkdirp, probe:10 MESSAGE : ZRC=0x840F0001=-2079391743=SQLO_ACCD "Access Denied" DIA8701C Access denied for resource "", operating system return code was "". CALLED : OS, -, mkdir OSERR: EACCES (13) DATA #1 : File name, 15 bytes /tmp/SQLDIR.LK0 DATA #2 : signed integer, 4 bytes 509 DATA #3 : signed integer, 4 bytes 509 DATA #4 : String, 105 bytes Search for ossError*Analysis probe point after this log entry for further self-diagnosis of this problem.
2019-04-23-188.8.131.528279+000 I38455E12061 LEVEL: Error PID : 56710 TID : 140077552363264 PROC : db2sysc 0 INSTANCE: db2inst1 NODE : 000 DB : TESTDB APPHDL : 0-79 APPID: *LOCAL.db2inst1.190423221845 AUTHID : DB2INST1 HOSTNAME: db2server EDUID : 22 EDUNAME: db2agent (TESTDB) 0 FUNCTION: DB2 UDB, buffer pool services, sqlbSMSFreePoolContainers, probe:6113 MESSAGE : ZRC=0x8402001E=-2080243682=SQLB_CONTAINER_NOT_ACCESSIBLE "Container not accessible"
Thanks to @idbjorh for talking through the issue with me after the fact, confirming that I am not the only one who was not aware of Db2’s dependence on /tmp, and for doing some of the research and reproduction of the issue.
This makes the issue pretty clear, but until this issue, I was not aware of any Db2 dependency on /tmp. I’m not sure exactly what /tmp/SQLDIR.LK0 is used for, but I’ll be asking questions about it to some IBMers at the IDUG conference. I actually wish the panel question submission link was working – I would submit this now as a question if it were.
There were some access reasons that kept me from fully examining the db2 diagnostic log in this situation (this client only allows me to access the server through screen sharing, and I must tell someone else what to type), or I would have been more proactive in identifying the problem to the system administrator rather than the other way around. This is my monthly reminder that all troubleshooting starts in db2diag.log.
This also changes my file system alerting strategy for monitoring. I’ll more aggressively monitor /tmp now. I feel a bit uncomfortable that Db2 relies on a file system which others may think nothing of filling to 100%. I wonder what happens when /tmp is 100% full? I wonder if it is possible to specify a different location for Db2 to use for this purpose?