There have been a few times in my career when I have been stunned by something I have seen while at work.
Sometimes it was the good type of stunned. Like the first time I observed a truly obnoxiously large workload hit my production database and the database still performed like a champ. Sometimes it was the bad type of stunned. Those would be the “impressive, a developer corrupted everything” moments, the “what do you mean you don’t have a backup” moments, or the “don’t ask me that at 5pm on a Friday unless you want to get punched” moments.
Then there are the moments when you are really stunned or dumbfounded. Those are the times you sit at your desk, trying to process what you have seen, and mumble to yourself “I have to be misunderstanding this because this goes against everything I have learned”. This article is about one of those special types of stunned moments. Recently, I was investigating an IBM Sterling Order Management System database. This was an established Db2 11.5 database on an Amazon EC2 instance that had been running for years.
During an overall health check of an established database like this one, I switch from looking at general configuration and best practices to metrics that may show where inefficiencies lie. This is when I start looking for things like poor index read efficiency, wait times, and in this specific case, transaction rollback percentage.
Transaction Rollback Percent
Db2 is always trying to protect the integrity of the data and is constantly working with savepoints to know what work is in flight and what work is committed to the database. A commit makes the changes from a unit of work permanent, and a new savepoint is established. A rollback tells Db2 to undo the work in flight and go back to the last savepoint. This is an overly simplistic explanation, but for the most part it holds true.
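To make the commit/rollback distinction concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for Db2 (the table and rows are made up purely for illustration; the transactional semantics shown are the same idea described above):

```python
import sqlite3

# In-memory database standing in for Db2; the "orders" table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")

# Work that is committed becomes permanent.
conn.execute("INSERT INTO orders VALUES (1, 'SHIPPED')")
conn.commit()

# Work still in flight is undone by a rollback, back to the last commit point.
conn.execute("INSERT INTO orders VALUES (2, 'PENDING')")
conn.rollback()

rows = conn.execute("SELECT id, status FROM orders").fetchall()
print(rows)  # [(1, 'SHIPPED')] -- only the committed row survives
```

The second insert never becomes visible to other work because the rollback discards everything after the last commit.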
The transaction rollback percentage is an early warning system. If a lot of rollbacks are happening, it could be because of an abnormal interruption to a unit of work due to something like an error code, power outage, or network hiccup. If I am seeing anything over a minimal percentage of rollbacks compared to the number of transactions, I know something may be “off” and I may need to poke around some more.
According to the handy dandy Jupyter Notebook I had been developing for database health checks, my production database spent 34-35% of its total workload trying to roll back work. For context, anything over 1% would have caught my attention. At 35% I rocketed past “you have my attention” straight to “Holy Crap! This is not good. How is our database not on its knees with some sort of slowness or contention?”
SQL to determine Database Transaction Rollback Percentage
SELECT TOTAL_APP_COMMITS AS APP_COMMITS,
       TOTAL_APP_ROLLBACKS AS APP_ROLLBACKS,
       TOTAL_APP_COMMITS + TOTAL_APP_ROLLBACKS AS TOTAL,
       CASE
         WHEN TOTAL_APP_COMMITS + TOTAL_APP_ROLLBACKS > 0
         THEN DECIMAL(FLOAT(TOTAL_APP_ROLLBACKS) /
              FLOAT(TOTAL_APP_COMMITS + TOTAL_APP_ROLLBACKS), 10, 2) * 100
         ELSE 0
       END AS PCNT_ROLLBACK
FROM TABLE(MON_GET_DATABASE(-2))
WITH UR
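As a sanity check, the same arithmetic can be sketched in plain Python. The function name and the sample counter values below are mine, chosen only to illustrate the ratio this article describes; the real numbers come from the MON_GET_DATABASE counters in the query above.

```python
def rollback_percentage(app_commits: int, app_rollbacks: int) -> float:
    """Percentage of total transactions that ended in a rollback,
    mirroring the CASE expression in the monitoring SQL."""
    total = app_commits + app_rollbacks
    if total == 0:
        return 0.0  # guard against division by zero, like the CASE branch
    return round(app_rollbacks / total, 2) * 100

# Hypothetical counters roughly matching the ratio seen on the OMS database.
print(rollback_percentage(650_000, 350_000))  # 35.0
```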
At first, I assumed my SQL was wrong. I’ve worked with internal homegrown tool databases in the past and wouldn’t be surprised to find something so wasteful, but this doesn’t make sense for an established vendor database like IBM Sterling Order Management System. Upon further investigation, I saw roughly the same ratio in all OMS database environments ranging from development to production.
I turned to a second tool that I depend on for triage analysis, capacity planning, and performance analysis. Using ITGain’s Speedgain for Db2, I was able to generate a running graph of rollbacks per second. I figured this would validate my fledgling Jupyter notebook and prove I was not crazy. Sure enough, we were averaging 2-3 rollbacks a second. I believe, in one 24-hour period, I captured over 724,000 rollbacks.
The next few days were spent diving into the technical weeds to see if we could discover why this was happening. Was this a specific application server that was misconfigured? Could this be a job that kicks off every few minutes that runs tremendously inefficiently? Could we have a network connectivity issue?
Between graphs from Speedgain for Db2 and details gathered in various workload event monitors, I determined a few things:
- This happened all day, every day.
- It was not one specific application server or IP address.
- Almost all SQL seemed to be affected by a rollback at some point. Surprisingly, and more concerning, this even happened to out-of-the-box SQL from the vendor.
Engaging IBM Support
After a week of beating my head against the wall with the internal application and system engineering teams, we threw up our hands, punted, and opened a Case with IBM Sterling OMS support. A lot of time was spent with level one support. They were confused by the fact that things were working and there was no error code; we were just reporting some very strange behavior and a concern that this inefficiency would cause problems if we had to scale up dramatically. Many days later, I was faced with a response that kind of struck me dumb.
The response from IBM … things were working as designed.
From IBM Sterling OMS technical support via my Case:
“In OMS, after every commit, a rollback statement is fired, irrespective of whether there is something really needs to be rolled back or not.”
Remember my earlier explanation of a commit and a rollback statement? If we issue a commit at the end of our unit of work, there will be nothing to undo if we issue a rollback statement. Yet, this design will always issue that rollback statement. In our case this happened 724,000 times a day.
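The pattern support described can be sketched in a few lines, again using Python’s sqlite3 module as a stand-in for Db2 (table and values are hypothetical). The unconditional rollback fired after the commit finds nothing in flight and has nothing to undo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER)")

conn.execute("INSERT INTO orders VALUES (1)")
conn.commit()    # unit of work is now permanent
conn.rollback()  # fired "irrespective of whether there is something really
                 # needs to be rolled back" -- nothing is in flight, so this
                 # undoes nothing

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 1 -- the committed row is untouched by the extra rollback
```

Harmless in isolation, but in our case this extra statement was executed roughly 724,000 times a day and showed up in every rollback counter Db2 exposes.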
Reaching out to the Community
This is where I became so dumbfounded that I started to question the basics. “Ok, I’ve been dealing with relational databases for twenty years, but maybe I have misunderstood some fundamental concept all this time. I can think of no rational reason to issue a rollback after almost every transaction, especially in a highly transactional database.”
After verifying my grasp on the basics, I concluded that although this isn’t my first rodeo, there is probably some architecture that explains this behavior that I have just not run across in my 20-year career. Over the next day or two, I started engaging other senior-level Db2 administrators and experts through e-mail, chat, and even Twitter. Had anyone seen a design like this before? And if not, could they think of a reason why you would use this approach? If I could summarize most responses, it would be “<confused look on face> <slight pause> … That makes no sense”.
Interestingly, two responses via Twitter provided scenarios that would justify this design. In both cases, it was to address a possible overall design flaw where rollback statements acted as a safety net to close any unit of work that hadn’t been committed. Around that same time another update came through our Case from IBM Sterling OMS support.
Further on in IBM’s response I was pointed to an old technote from 2018:
“One function of the connection pooling is, when a connection is released, to automatically perform a rollback; this is in case the coding logic accidentally didn’t either commit or rollback. So, there will be a rollback for every connection used, independent of whether the code called commit or rollback, to this will prevent uncommitted transactions from causing downstream locking problems.”
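The behavior the technote describes can be sketched as a toy connection pool that fires a rollback on every release as a safety net. Everything here is hypothetical (the class, method names, and counter are mine, and sqlite3 again stands in for Db2); it only illustrates why a rollback shows up for every connection used:

```python
import sqlite3

class PooledConnections:
    """Toy pool that rolls back on release, in case the caller's code
    accidentally neither committed nor rolled back."""

    def __init__(self):
        self._idle = []
        self.rollbacks_on_release = 0  # what the monitoring counters would see

    def acquire(self):
        return self._idle.pop() if self._idle else sqlite3.connect(":memory:")

    def release(self, conn):
        # Unconditional rollback: a no-op if the caller already committed,
        # but it closes any unit of work accidentally left open, preventing
        # uncommitted transactions from causing downstream locking problems.
        conn.rollback()
        self.rollbacks_on_release += 1
        self._idle.append(conn)

pool = PooledConnections()
conn = pool.acquire()
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()
pool.release(conn)  # fires a rollback even though nothing is in flight
print(pool.rollbacks_on_release)  # 1
```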
Huh. There you go. The rollbacks seem to be a safety net just like the suggestions I saw via Twitter. I don’t agree with this design, but at least I understand the logic.
What I thought was interesting was an ongoing question pressed by IBM as we went back and forth. IBM wanted to know if the rollbacks were having a negative performance impact. I see their point. You can play the “if it isn’t broken, don’t fix it” card in this situation.
There is an argument that zero impact times our 724,000 rollbacks in a 24-hour period is still zero impact. If a commit is issued and the rollback that immediately follows it does nothing, there should be no impact to performance.
The problem is that I can’t quantify the impact of this design decision. Our configuration is slightly overpowered, which could mask any impact this could have. Unless there is a Db2 utility or SQL against system tables that I am not aware of, I can only see that the rollback statement was issued, not whether Db2 had to undo any work because the statement was issued.
What I can say to IBM is “I can’t answer that, but at a minimum this design muddies the waters”. It will be hard to sift through all the noise of rollbacks generated by this design if I have a separate problem that causes work to end abnormally. It also complicates capacity planning. If 30-40% of my transactions are trying to undo a unit of work that may or may not be there, how do I determine their impact so I can properly scale if we double, triple, or quadruple the number of transactions or the volume of data?
In the end, you have IBM and me in a stalemate of “Does it affect performance?” vs. “I’m not sure, but why would you decide on such a design? What issue are you trying to address with this failsafe?”
This should be considered hearsay, but I suspect this is a carryover from a design decision made years ago. Although I can’t confirm it, I’ve been told by more than one source that Sterling Order Management System was purchased by IBM. Supposedly, it was originally developed for an Oracle database, and IBM ported it to work with Db2 when it was purchased. The technote IBM referenced seemed to be Oracle-specific, so there may be some truth to that. Maybe this design complemented how the application would naturally interface with Oracle.
I can’t help but think about lessons learned in a book written by Craig Mullins. Although the target audience for “A Guide to Db2 Performance for Application Developers” is application developers, I gained a lot of insight into how an application developer may think and how they may naturally approach a problem. A database engineer has a macro point of view and looks more at the impact of a process, workaround, or feature on the database. We may look for increased resource consumption, overall impact on transaction speed, and how the database is affected as a whole. A developer may approach things from a more micro point of view, focusing on the specifics of a feature, ability, or fix with less thought given to overall impact. Few developers have a deep understanding of Db2 internals, and a shocking number don’t truly understand transactional control. I don’t think many developers would know the effect a decision like this one has on Db2 and data consistency. To a developer, this probably made sense at the time and acted as a safety net for a specific problem they were seeing.