I recently had a client with a table in an inconsistent state. Apparently a load failed, which is common enough, but they then tried to do a reorg, thinking that would solve it, so I first had to identify and force off connections causing a lock chain before solving the root problem.
Environment and Complications
There were a couple of complications due to the client being on 9.5 FixPack 1. It was a small data warehouse (1.6 TB on 1 catalog and 4 data nodes). LOCKTIMEOUT is set to -1. The troubleshooting and resolution would have been significantly easier on a newer version of DB2. There were two complicating factors with the old environment:
- The db2top utility had not yet been incorporated into DB2 on this version, so I had manually installed it, and occasionally it doesn’t function quite right. At the time of this issue, it refused to show me the sessions screen without crashing.
- A host of table functions I use to administer DB2 were introduced in db2 9.7, and are not available in 9.5, and there is no way to back-port them. I normally would have made use of either MONREPORT or MON_GET functions or both.
Clearing up Lock Chain
The first thing I did was to clear up the reorg and other connections. As the instance owner, I made my putty session as large on my screen as possible. Next, I looked at the connections in db2top by issuing:
db2top -d sample
(where sample is the database name)and then pressing U to get the locking screen. I don’t have a screen shot of how that screen looked, but it showed me multiple connections in a red “lock waiting” state. By pressing L, I got the lock chain. While not from this specific issue, the lock chain screen looks something like this:
The main thing I’m getting from that screen are the three applicaiton ids that were involved in lock chains. Next, I use one of two methods to determine which application was the reorg and force off the other two.
- Method 1: From the locks screen in db2top, press A. It will prompt you for an agent id. Enter one of the agent ids and see if you can find where it lists an sql statement or former action. It may or may not give you this info in an easy to read format. This screen is the reason I make my putty window as large as possible – there is a lot of data to display
- Method 2: Exit out of db2top by pressing q or ctrl+c. At the command line, issue:
db2 get snapshot for applications on sample > appsnap.out
eral screens of text for each one to see if there is a reference to a reorg or to a script being run. Pay attention to the authorization IDs being used and the idle time. Alternately, search the file on “eorg” to find applications that have recently done reorgs. I copy about the first screen of this output when giving a client details about a connection I’d like to force, as it includes things like where the connection is coming from, the authorization id, and the idle time.
Using method 2 above, I identified which two of the three connections were not the reorg and forced them off (after verifying through the appsnap.out that they had been idle for a while – I hate to force an active or recently active connection).
Identifying and Clearing Table Problem
The reorg then completed or failed immediately. I attempted a simple select on the table and noted the exact error message and return code. This is what I got:
$ db2 "select * from cis_ods.cs_pgm fetch first 1 row only with ur"
CS_ID PGM_TYP_CD EFF_BGN_DT APP_ID EMR_REQ_SW PGM_REQ_DT PGM_STS_CD STS_RSN_CD STS_DT RESC_RSN_CD RESC_DT CSLD_ID CSLD_XFR_DT CSLD_XFR_RSN_CD PTNTL_ICT_DISCN_EFF_DT ACTN_DT EFF_END_DT CRT_USR_ID CRT_DTM UPD_USR_ID UPD_DTM HIST_IND LAST_RTRV_DT CNTY_SPFC_CS_TYP_IND BEN_SAFE_ARMS_NWBN_SW MCL_PROP_DSRG_EVA_SW TYP_OF_HH_IND AID_COD CSH_REACTVT_DT APP_CMP_SW APP_CMP_DT APP_CMP_OVR_SW EX_PARTE_CNTCT_DT EX_PARTE_CNTCT_CD FS_REACTVT_DT DISCV_RSN_CD DISCV_DT FS_APP_TM_LIM_RSN_CD ACA_TRNST_DT ACA_TRNST_OVR_SW INGEST_DATE
------- ---------- ---------- ----------- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ----------- ----------- --------------- ---------------------- ---------- ---------- --------------- -------------------------- --------------- -------------------------- -------- ------------ -------------------- --------------------- -------------------- ------------- ------- -------------- ---------- ---------- -------------- ----------------- ----------------- ------------- ------------ ---------- -------------------- ------------ ---------------- --------------------------
SQL0668N Operation not allowed for reason code "3" on table "CIS_ODS.CS_PGM".
SQLSTATE=57016
I then looked up that error message so I could understand what reason code 3 meant:
$ db2 "select * from cis_ods.cs_pgm fetch first 1 row only with ur"
CS_ID PGM_TYP_CD EFF_BGN_DT APP_ID EMR_REQ_SW PGM_REQ_DT PGM_STS_CD STS_RSN_CD STS_DT RESC_RSN_CD RESC_DT CSLD_ID CSLD_XFR_DT CSLD_XFR_RSN_CD PTNTL_ICT_DISCN_EFF_DT ACTN_DT EFF_END_DT CRT_USR_ID CRT_DTM UPD_USR_ID UPD_DTM HIST_IND LAST_RTRV_DT CNTY_SPFC_CS_TYP_IND BEN_SAFE_ARMS_NWBN_SW MCL_PROP_DSRG_EVA_SW TYP_OF_HH_IND AID_COD CSH_REACTVT_DT APP_CMP_SW APP_CMP_DT APP_CMP_OVR_SW EX_PARTE_CNTCT_DT EX_PARTE_CNTCT_CD FS_REACTVT_DT DISCV_RSN_CD DISCV_DT FS_APP_TM_LIM_RSN_CD ACA_TRNST_DT ACA_TRNST_OVR_SW INGEST_DATE
------- ---------- ---------- ----------- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ----------- ----------- --------------- ---------------------- ---------- ---------- --------------- -------------------------- --------------- -------------------------- -------- ------------ -------------------- --------------------- -------------------- ------------- ------- -------------- ---------- ---------- -------------- ----------------- ----------------- ------------- ------------ ---------- -------------------- ------------ ---------------- --------------------------
SQL0668N Operation not allowed for reason code "3" on table "CIS_ODS.CS_PGM".
SQLSTATE=57016
I then looked up that error message so I could understand what reason code 3 meant:
$ db2 ? SQL0668N
SQL0668N Operation not allowed for reason code "reason-code" on table
"table-name".
Explanation:
Access to table "table-name" is restricted. The cause is based on the
following reason codes "reason-code":
1 The table is in the Set Integrity Pending No Access state. The
integrity of the table is not enforced and the content of the
table may be invalid. An operation on a parent table or an
underlying table that is not in the Set Integrity Pending No
Access state may also receive this error if a dependent table
is in the Set Integrity Pending No Access state.
2 The table is in the No Data Movement state. When in this state,
operations that cause data movement are disallowed. Data
movement operations include REDISTRIBUTE, update of database
partitioning key, update of multi-dimensional clustering key,
update of range clustering key, update of data partitioning key
and REORG TABLE.
3 The table is in the Load Pending state. A previous LOAD attempt
on this table resulted in failure. No access to the table is
allowed until the LOAD operation is restarted or terminated.
4 The table is the Read Access state. This state can occur during
on-line LOAD processing (LOAD INSERT with the READ ACCESS
option), or after an on-line LOAD operation, but before all
constraints have been validated in the newly appended portion
of the table using the SET INTEGRITY statement. No update
activity is allowed on this table.
5 The table is in the Load In Progress state. The LOAD utility is
currently operating on this table, no access is allowed until
the LOAD is finished.
6 Materialized query tables that reference a nickname cannot be
refreshed in ESE.
7 The table is in the reorg pending state. This can occur after
an ALTER TABLE statement containing a REORG-recommended
operation.
8 The table is in the alter pending state. This can occur when
using the table in the same unit of work as an ALTER TABLE
statement containing a REORG-recommended operation.
9 The table is in Redistribute Pending state. The REDISTRIBUTE
utility is not completed on this table, no access is allowed
until the REDISTRIBUTE is finished.
User response:
1 Execute the SET INTEGRITY statement with the IMMEDIATE CHECKED
option on table "table-name" to bring the table out of the
Set Integrity Pending No Access state. For a user maintained
materialized query table, execute the statement with the
IMMEDIATE UNCHECKED option instead of the IMMEDIATE CHECKED
option.
2 Execute REFRESH TABLE statement on the dependent immediate
materialized query tables and staging tables of table
"table-name". The contents of these dependent immediate
materialized query tables and staging tables can be
incrementally maintained from the appended data of
"table-name" through previous LOAD INSERT operations and from
the attached data of "table-name" through previous ALTER
TABLE statements with the ATTACH clause.
3 Restart or terminate the previously failed LOAD operation on
this table by issuing LOAD with the RESTART or TERMINATE option
respectively.
4 Issue the LOAD QUERY command to check whether the table is in
the process of being loaded. If yes, wait until the LOAD
utility has completed, or if necessary, restart or terminate
previously failed LOAD operation. If LOAD is currently not in
progress, issue the SET INTEGRITY statement with the IMMEDIATE
CHECKED option, to validate constraints in the newly loaded
portion of the table.
5 Wait until the current LOAD operation has finished. You can use
the LOAD QUERY command to monitor the progress of load.
6 Define a materialized query table using the MAINTAIN BY USER
option. Then, use an INSERT statement with a subquery to
populate the materialized query table.
7 Reorganize the table using the REORG TABLE command (note that
INPLACE REORG TABLE is not allowed for a table that is in the
reorg pending state).
8 Complete the unit of work, and re-issue the command.
9 If the REDISTRIBUTE utility is working, wait until it finishes
working on the current table. You can use the LIST UTILITIES
command to monitor the progress of the REDISTRIBUTE utility. If
a previous REDISTRIBUTE operation failed and left the table in
this state, issue the REDISTRIBUTE utility again with the
CONTINUE or ABORT option and let it finish on this table.
sqlcode: -668
sqlstate: 57007
This was clearly telling me that this is not a reorg issue, but a table in a load pending state. I then tried a load terminate command that the client mentioned they had used in the past, and got this:
$ db2 load from /dev/null of del terminate into cis_ods.cs_pgm nonrecoverable
Agent Type Node SQL Code Result
______________________________________________________________________________
LOAD 001 +00000000 RESTART required.
______________________________________________________________________________
LOAD 002 -00002032 Error. RESTART required.
______________________________________________________________________________
LOAD 003 -00002032 Error. RESTART required.
______________________________________________________________________________
LOAD 004 -00002032 Error. RESTART required.
______________________________________________________________________________
RESULTS: 1 of 4 LOADs completed successfully.
______________________________________________________________________________
Summary of LOAD Agents:
Number of rows read = 0
Number of rows skipped = 0
Number of rows loaded = 0
Number of rows rejected = 0
Number of rows deleted = 0
Number of rows committed = 0
SQL2032N The "nonrecoverable" parameter is not valid. SQLSTATE=22531
After searching around a bit on google and IBM sites, I found some suggestions that perhaps I needed to use copy yes instead of nonrecoverable. I’m not sure if BLOCKNONLOGGED might be set on these servers. I did this:
$ db2 load from /dev/null of del terminate into cis_ods.cs_pgm copy yes to /dev/null
Agent Type Node SQL Code Result
______________________________________________________________________________
LOAD 001 +00000000 Success.
______________________________________________________________________________
LOAD 002 +00000000 Success.
______________________________________________________________________________
LOAD 003 +00000000 Success.
______________________________________________________________________________
LOAD 004 +00000000 Success.
______________________________________________________________________________
RESULTS: 4 of 4 LOADs completed successfully.
______________________________________________________________________________
Summary of LOAD Agents:
Number of rows read = 0
Number of rows skipped = 0
Number of rows loaded = 0
Number of rows rejected = 0
Number of rows deleted = 0
Number of rows committed = 0
Resolution
I then tried the simple select again, and this time it was successful! The state of the table was cleared and it was now empty. This is what the client wanted – an empty but accessible table.
I would not have used copy yes to /dev/null if HADR was in use for the database, because that would make the table unavailable on the HADR server. I would have had to do copy yes to a shared location between the two servers.
I would also suggest setting LOCKTIMEOUT to 120 and DLCHKTIME to 60000 for warehouse/OLAP based loads. From what I have researched on the web (including DBI’s blogs) these are the recommended settings for warehouses to prevent lock timeouts and dead locks.
Hi Ember,
Thanks for sharing your problem. I would also suggest if the customer is happy with an empty accessible table then you could also just drop and recreate it.
Regards
Hi Ember,
on db2 V10.5 load job was cancelled by the developer and let me know that table is not responding.
When I check the load utility status with query (load query table table-name), it shows “table is in progress”. When I checked for lock it has locked as well.
I had try multiple times to kill the session db2 force application all. But it is not kill the session till 4 hours.
I had execute db2 load terminate command as well, it has still running from couple of hours. And its blocked by the load id user.
Can you please help how I can terminate the load utility when table is in progress stat and table level lock on it
Is this a load replace where the entire content of the table is intended to be replaced?
Its a partition table and load job first truncate the table and then insert the data. Every time load job failed or cancel, table goes to hang state and show table level locked and we are unable to query. And to put table back to normal state I have to restart the DB.
The old trick was to perform a load replace from /dev/null
Many thanks Ember for sharing your exp, even I faced same issue and resolved.
Hi Ember,
An article related to difference between warmns start replication, full refresh and cold start will be very useful!
I had a problem on my project this week, it seems that I started replication warmns mode + enable full refresh and tables got into set integrity pending state. I solved the problem, but I forgot to disable full refresh and after a couple of days, we had a SEV1 again because tables were not accessible.
Thanks!