I learn something new every day. The times I’m most surprised are when what I learn turns out to be something pretty fundamental about how Db2 uses resources.
Environment
This situation happened on an AIX environment I’m mentoring someone to build. The client has competent AIX admins, but has no on-staff Db2 expertise. My role is to build the Db2 servers out (more than half a dozen across two data centers) while teaching their cross-platform DBA staff how to build and administer Db2. All are single-partition AWSE implementations, some using HADR and TSAMP. The Db2 version is 11.1.4.4. The environment in question is a lightly used development system for a production system that will go live in a month or two.
Situation
A week ago we were troubleshooting some backup issues. As a part of that troubleshooting process, a junior system administrator NFS mounted a file system as a sub-directory of /tmp. What we didn’t realize at the time was that he had accidentally unmounted the /tmp file system entirely. /tmp is one of those locations AIX requires, so AIX just created /tmp as a directory in the root file system and moved on.
Presentation of The Problem
This came to me as the client reporting that queries were failing against several of the databases. When digging into the error, we discovered that TEMPSPACE1 – the only temp space for some of the databases – was in a 0x4000 state. I find it is a bit less common to deal with non-standard table space states these days than it was 15 years ago. The Db2 8.1 Advanced DBA certification had me studying and even memorizing some common table space states, so I did not find this part of the problem unusual.
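For reference, the state can be seen with either the (deprecated but still handy) LIST TABLESPACES command or the MON_GET_TABLESPACE table function; the database name here is just illustrative:

$ db2 connect to devdb        # substitute the real database name
$ db2 "LIST TABLESPACES"      # shows State as a hex value, e.g. 0x4000
# or the SQL route, which reports the state as text:
$ db2 "SELECT VARCHAR(TBSP_NAME,20) AS TBSP_NAME, VARCHAR(TBSP_STATE,30) AS TBSP_STATE FROM TABLE(MON_GET_TABLESPACE('',-2))"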
On looking up the 0x4000 table space state, I found it translates to “Offline and Not Accessible”.
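The db2tbst utility that ships with Db2 does this translation for you:

$ db2tbst 0x4000
State = Offline and Not Accessible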
Working Through The Problem
The first thing to check with this kind of error message is to verify that the table space containers still exist, and that the permissions and ownership on them look correct. This problem can be caused by someone deleting, compressing, renaming, or changing permissions on a table space container. This was easy to check using LIST TABLESPACE CONTAINERS FOR 2 and then simple OS commands to investigate the containers. Since this was a temporary table space, the containers were very small, but they were there and the ownership/permissions did not look off.
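In command form, the check looks something like this; the container path shown is hypothetical, so use whatever LIST TABLESPACE CONTAINERS reports on your system:

$ db2 connect to devdb
$ db2 "LIST TABLESPACES"                                # note the Tablespace ID for TEMPSPACE1
$ db2 "LIST TABLESPACE CONTAINERS FOR 2 SHOW DETAIL"
# then verify each reported container at the OS level:
$ ls -ld /db2data/db2inst1/NODE0000/DEVDB/TEMPSPACE1    # hypothetical container path
$ df -g /db2data                                        # confirm the underlying file system is mounted and has space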
My next step was to engage the system administrator to see if he saw any storage errors on the server. He was fairly sure there weren’t any, but looked just to humor me. His expression of shock when he saw that /tmp had been unmounted will bring a smile to my face for several weeks. “What happened to /tmp?!?!” He discovered that /tmp had been unmounted the week before and immediately corrected it by copying off any recently created files and remounting it. I was not convinced this was the smoking gun at that point, because I was not aware of any ongoing reliance on /tmp by Db2. But problems are onions – you have to address one stinky layer at a time.
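For reference, a few quick AIX-level checks that would have shown /tmp was no longer its own file system:

$ df -g /tmp           # if /tmp reports the same device as /, it is not separately mounted
$ mount | grep /tmp    # lists /tmp (and anything mounted under it) only if actually mounted
$ lsfs /tmp            # shows the file system stanza AIX has defined for /tmp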
After /tmp was remounted, the table space state for TEMPSPACE1 in the database we were focusing on remained 0x4000. Since the environments were not being used, we recycled Db2 and checked again. At that point the table spaces for all databases on that instance were back to a normal state. There were several other instances on the server, and when we checked them, their table spaces had all returned to a normal state WITHOUT recycling Db2 or altering the table spaces to bring them online.
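Recycling Db2 and rechecking amounts to something like this sketch (the database name is illustrative, and the FORCE is shown just in case any connections are lingering):

$ db2 force application all
$ db2stop
$ db2start
$ db2 connect to devdb
$ db2 list tablespaces show detail | grep -e Name -e State    # expect State = 0x0000 across the board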
One of the odd things about how this problem manifested is that we had 5 instances, each with 6 databases. In each instance, exactly 3 of the databases had TEMPSPACE1 in the 0x4000 state, and it was always the exact same 3 databases. The databases with the issue were not the ones the developers expected to be most heavily used by the development applications that were connected.
Technical Details
The real problem here was probably not the fact that /tmp was unmounted, but rather that the permissions on the directory the OS created to replace it were too restrictive. The default umask on these servers is fairly restrictive, and the Db2 instance IDs did not have write access. IBM has a technote describing the proper permissions for /tmp.
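The technote has the authoritative values, but the conventional /tmp setup is mode 1777 (world-writable with the sticky bit) and owned by root. A quick check and correction, run as root, looks something like:

$ ls -ld /tmp
# healthy: drwxrwxrwt ... root system ... /tmp
# if the instance owner cannot write here, Db2 cannot create /tmp/SQLDIR.LK0
$ chmod 1777 /tmp
$ chown root:system /tmp    # typical AIX ownership; root:root on Linux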
Reproducing this on a test system shows errors like these in the db2 diagnostic log:
2019-04-23-22.17.12.014477+000 E4921E934            LEVEL: Error (OS)
PID     : 56710                TID : 140077552363264 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : TESTDB
APPHDL  : 0-65                 APPID: *LOCAL.db2inst1.190423221711
AUTHID  : DB2INST1             HOSTNAME: db2server
EDUID   : 22                   EDUNAME: db2agent (TESTDB) 0
FUNCTION: DB2 UDB, oper system services, sqlomkdirp, probe:10
MESSAGE : ZRC=0x840F0001=-2079391743=SQLO_ACCD "Access Denied"
          DIA8701C Access denied for resource "", operating system return code
          was "".
CALLED  : OS, -, mkdir                            OSERR: EACCES (13)
DATA #1 : File name, 15 bytes
/tmp/SQLDIR.LK0
DATA #2 : signed integer, 4 bytes
509
DATA #3 : signed integer, 4 bytes
509
DATA #4 : String, 105 bytes
Search for ossError*Analysis probe point after this log entry for further
self-diagnosis of this problem.
and
2019-04-23-22.18.46.338279+000 I38455E12061         LEVEL: Error
PID     : 56710                TID : 140077552363264 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : TESTDB
APPHDL  : 0-79                 APPID: *LOCAL.db2inst1.190423221845
AUTHID  : DB2INST1             HOSTNAME: db2server
EDUID   : 22                   EDUNAME: db2agent (TESTDB) 0
FUNCTION: DB2 UDB, buffer pool services, sqlbSMSFreePoolContainers, probe:6113
MESSAGE : ZRC=0x8402001E=-2080243682=SQLB_CONTAINER_NOT_ACCESSIBLE
          "Container not accessible"
Thanks to @idbjorh for talking through the issue with me after the fact, for confirming that I am not the only one who was not aware of Db2’s dependence on /tmp, and for doing some of the research and reproduction of the issue.
These errors make the problem pretty clear, but until it came up, I was not aware of any Db2 dependency on /tmp. I’m not sure exactly what /tmp/SQLDIR.LK0 is used for, but I’ll be asking some IBMers about it at the IDUG conference. I actually wish the panel question submission link were working – I would submit this as a question right now if it were.
Access restrictions kept me from fully examining the Db2 diagnostic log in this situation (this client only allows me onto the server through screen sharing, where I must tell someone else what to type), or I would have been more proactive in identifying the problem to the system administrator rather than the other way around. This is my monthly reminder that all troubleshooting starts in db2diag.log.
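When access isn’t an issue, the db2diag tool makes that first pass quick; something like this would have surfaced the sqlomkdirp errors right away:

$ db2diag -H 1d -level Error,Severe    # Error and Severe entries from the last day
$ db2diag -H 1d | grep -i SQLDIR.LK0   # or search for the file once you know it matters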
This also changes my file system alerting strategy for monitoring. I’ll monitor /tmp more aggressively now. I feel a bit uncomfortable that Db2 relies on a file system that others may think nothing of filling to 100%. I wonder what happens when /tmp is 100% full, and whether it is possible to specify a different location for Db2 to use for this purpose.
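A minimal sketch of what that alerting might look like, assuming a simple cron-driven shell check and an arbitrary 70% threshold (the percentage-used column sits in different positions in AIX and Linux df output, so this grabs the first field ending in %):

#!/bin/sh
# Warn when /tmp crosses a usage threshold; wire the echo into whatever
# alerting mechanism is already in place.
THRESHOLD=70
USED=$(df -k /tmp | awk 'NR==2 { for (i=1; i<=NF; i++) if ($i ~ /%$/) { sub("%","",$i); print $i; exit } }')
if [ "$USED" -ge "$THRESHOLD" ]; then
    echo "WARNING: /tmp is ${USED}% full on $(hostname)"
fi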
Take a look at http://www-01.ibm.com/support/docview.wss?uid=swg21988027. This is what it says about /tmp/SQLDIR.LK0:
“The /tmp/SQLDIR.LK0 is used to serialize access to the file system when creating directories and files needed for DB2. This mainly happens when creating directories and files for storage containers.”
https://www-01.ibm.com/support/docview.wss?uid=swg21683234 also has some interesting information about the file.
Hi Ember!! I was administering Db2 on a PDOA appliance and had several issues with /tmp, even though I didn’t have any Db2 files in /tmp. For example, when /tmp got close to full, Db2 and TSA stopped. Via PMR, support said that for Db2 to work properly, /tmp should be kept at 70% usage or less. So, one more example of the relationship between Db2 and /tmp.
Interesting issue!
From https://www-01.ibm.com/support/docview.wss?uid=swg21683234
The SQLDIR.LK0 file is used by DB2 to serialize access to all the filesystems when creating/deleting directories.
Thanks for sharing Ember!
According to the IBM technote you linked, the SQLDIR.LK0 file is “used to serialize access to the file system when creating directories and files needed for Db2. This mainly happens when creating directories and files for storage containers.” I’m guessing this is due to automatic storage.
Hope that helps.
Nice, 3 comments describing the “serialize access to the file system” verbiage! I get why this would be more likely to affect temp space, which is SMS under the covers. SMS creates a new file for every table, so it seems like it would need constant access to this file. Temp tables are created and dropped under the covers all the time. My questions are primarily:
What does it mean to serialize file system access? We know Db2 can access more than one file system at a time. Serialize usually means either to place in order or to ensure only one thing is occurring at a time, depending on how it is used. Why do we need to do these things with Db2’s file system access?
Why is this not occurring in a more typical location where I am aware Db2 must have space and access – such as the instance or database home directory, or even the DIAGPATH? Or in the transaction logs in some way?
Has this requirement/access always been there and I just never noticed it before, or is this a more recent addition? If a more recent addition, what version was it added in?
I’m sure IBMers will have better answers for these, and I’ll be asking at IDUG!
The /tmp directory has always been a requirement going back to at least Db2 V9.7. I think it was there in versions before that but I am not 100% sure.
Keep in mind that SMS and DMS have been deprecated starting in 10.1; the focus now is on automatic storage. If you think about it, that makes sense as a reason Db2 needs this file to keep track of new directories being created automatically. From some of the examples in the technotes, the error originates in either the sqlomkdirp or the sqlbLockAllDirs function.
I would be curious to see whether the file contents are cleaned up after the directories are created or altered, or if they persist even after a restart of the instance.
Yes, but automatic storage still uses the basics of SMS (for temporary table spaces only) or DMS under the covers. It has always felt to me like automatic storage is just Db2 managing SMS or DMS for you.
If it is tied to AST, then 9.7 makes sense, as that was the version where a single storage group was introduced. However, if it is instead tied to autoresize, that may have been available before AST?
I still don’t get why it isn’t in a more typical place for Db2. The database home directory seems like the obvious choice.
Db2 has always created some directories for us. Maybe not related to table spaces, but related to transaction logs and in the diagpath.
Hi Ember. Thanks for the blog entry – I would never have suspected /tmp, except when running the checks that say they use /tmp. It seems to make little sense to use a public file system for normal database operations.
Ember, thank you very much for sharing.
Out of curiosity, I tried a simulation, and it looked like the 0x4000 state ONLY occurs for SMS temporary table spaces, not for SMS regular table spaces or DMS temporary table spaces. See the output from my test DB:
$ db2 connect to sample; db2 list tablespaces show detail | grep -e Name -e State -e Content -e Type
Database Connection Information
Database server = DB2/LINUXX8664 11.1.3.3
SQL authorization ID = DB2INST1
Local database alias = SAMPLE
Name = SYSCATSPACE
Type = Database managed space
Contents = All permanent data. Regular table space.
State = 0x0000
Name = TEMPSPACE1
Type = System managed space
Contents = System Temporary data
State = 0x4000
Name = USERSPACE1
Type = Database managed space
Contents = All permanent data. Large table space.
State = 0x0000
Name = IBMDB2SAMPLEREL
Type = Database managed space
Contents = All permanent data. Large table space.
State = 0x0000
Name = SYSTOOLSPACE
Type = Database managed space
Contents = All permanent data. Large table space.
State = 0x0000
Name = SYSTOOLSTMPSPACE
Type = System managed space
Contents = User Temporary data
State = 0x4000
Name = SPACE8
Type = System managed space
Contents = All permanent data. Regular table space.
State = 0x0000
Name = DMS8
Type = Database managed space
Contents = User Temporary data
State = 0x0000