There have been a few times in my career where I have been stunned by something I have seen while at work.
Sometimes it was the good type of stunned. Like the first time first time I observed a truly obnoxiously large workload hit my production database and the database still performed like a champ. Sometimes it was the bad type of stunned. That would be the “impressive, a developer corrupted everything” moments, “what do you mean you don’t have a backup” moments, or the “don’t ask me that at 5pm on a Friday unless you want to get punched” moments.
Then there are the moments when you are really stunned or dumbfounded. Those are the times you sit at your desk, trying to process what you have seen, and mumble to yourself “I have to be misunderstanding this because this goes against everything I have learned”. This article is about one of those special types of stunned moments. Recently, I was investigating an IBM Sterling Order Management System database. This was an established Db2 11.5 database on an Amazon EC2 instance that had been running for years.
During an overall health check of an established database like this one, I switch from looking at general configuration and best practices to metrics that may show where inefficiencies lie. This is when I start looking for things like poor index read efficiency, wait times, and in this specific case – transaction rollback percentage.
Transaction Rollback Percent
Db2 is always trying to protect integrity of the data and is constantly working with save points to know what work is in flight and what work is committed to the database. A commit makes changes from a unit of work permanent, and a new save point is established. A rollback tells Db2 to undo work in flight and to go back to the last save point. This is an overly simplistic explanation, but for the most part it holds true.
The transaction rollback percentage is an early warning system. If a lot of rollbacks are happening it could be because of an abnormal interruption to a unit of work due to something like an error code, power outage, or network hiccup. If I am seeing anything over a minimal percentage of rollbacks compared to the number of transactions, I know something may be “off” and I may need to poke around some more.
According to the handy dandy Jupyter Notebook I had been developing for database health checks, my production database spent 34-35% of its total workload trying to rollback work. For context, anything over 1% would have caught my attention. At 35% I rocketed past “you have my attention” to “Holy Crap! This is not good. How is our database not on its knees with some sort of slowness or contention”?
SQL to determine Database Transaction Rollback Percentage
SELECT TOTAL_APP_COMMITS AS APP_COMMITS, TOTAL_APP_ROLLBACKS AS APP_ROLLBACKS, TOTAL_APP_COMMITS + TOTAL_APP_ROLLBACKS AS TOTAL, CASE WHEN TOTAL_APP_COMMITS + TOTAL_APP_ROLLBACKS > 0 THEN DECIMAL(FLOAT(TOTAL_APP_ROLLBACKS)/FLOAT(TOTAL_APP_COMMITS + TOTAL_APP_ROLLBACKS),10,2)*100 ELSE 0 END AS PCNT_ROLLBACK FROM TABLE(MON_GET_DATABASE(-2)) WITH UR
At first, I assumed my SQL was wrong. I’ve worked with internal homegrown tool databases in the past and wouldn’t be surprised to find something so wasteful, but this doesn’t make sense for an established vendor database like IBM Sterling Order Management System. Upon further investigation, I saw roughly the same ratio in all OMS database environments ranging from development to production.
I turned to a second tool that I depend on for triage analysis, capacity planning, and performance analysis. Using ITGain’s Speedgain for Db2, I was able to generate a running graph of rollbacks per second. I figure this would validate my fledgling Jupyter notebook and prove I was not crazy. Sure enough, I am averaging 2-3 rollbacks a second. I believe, in a 24 hour period, I captured over 724,000 rollbacks.
The next few days were spent diving into the technical weeds to see if we could discover why this was happening. Was this a specific application server that was misconfigured? Could this be a job that kicks off every few minutes that runs tremendously inefficiently? Could we have a network connectivity issue?
Between graphs from Speedgain for Db2 and details gathered in various workload event monitors I determined a few things.
- This happened all day, every day.
- It was not one specific application server or IP address.
- Almost all SQL seemed affected with a rollback at some point. Surprisingly, and more concerningly, this seems to happen to out of box SQL from the vendor.
Engaging IBM Support
After a week of beating my head against the wall with the internal application and system engineering teams, we threw up our hands, punted, and opened a Case with IBM Sterling OMS support. A lot of time was spent with level one support. They were confused with the fact that things were working, there was no error code, we were just reporting some very strange behavior, and that we were concerned this inefficiency would cause problems if we had to scale up dramatically. Many days later, I was faced with the response that kind of struck me dumb.
The response from IBM … things were working as designed.
From IBM Sterling OMS technical support via my Case:
“In OMS, after every commit, a rollback statement is fired, irrespective of whether there is something really needs to be rolled back or not.”
Remember my earlier explanation of a commit and a rollback statement? If we issue a commit at the end of our unit of work, there will be nothing to undo if we issue a rollback statement. Yet, this design will always issue that rollback statement. In our case this happened 724,000 times a day.
Reaching out to the Community
This is where I became so dumbfounded that I start to question the basics. “Ok, I’ve been dealing with relational databases for twenty years, but maybe I have misunderstood some fundamental concept all this time. I can think of no rational reason to issue a rollback after almost every transaction. Especially in a highly transactional database”.
After verifying my grasp on the basics, I concluded that although this isn’t my first rodeo, there is probably some architecture that explains this behavior that I have just not run across in my 20-year career. Over the next day or two, I started engaging other senior level Db2 administrators and experts through e-mail, chat, and even Twitter. Had anyone seen a design like this before, and if you haven’t seen this application design could you think of a reason why you would use this approach? If I could summarize most responses, it would be “<confused look on face> <slight pause> … That makes no sense”.
Interestingly, two responses via Twitter provided scenarios that would justify this design. In both cases, it was to address a possible overall design flaw where rollback statements acted as a safety net to close any unit of work that hadn’t been committed. Around that same time another update came through our Case from IBM Sterling OMS support.
Further on in IBM’s response I was pointed to an old technote from 2018:
“One function of the connection pooling is, when a connection is released, to automatically perform a rollback; this is in case the coding logic accidentally didn’t either commit or rollback. So, there will be a rollback for every connection used, independent of whether the code called commit or rollback, to this will prevent uncommitted transactions from causing downstream locking problems.”
Huh. There you go. The rollbacks seem to be a safety net just like the suggestions I saw via Twitter. I don’t agree with this design, but at least I understand the logic.
What I thought was interesting was an ongoing question pressed by IBM as we went back and forth. IBM wanted to know if the rollbacks were having a negative performance impact. I see their point. You can play the “if it isn’t broken, don’t fix it” card in this situation.
There is an argument that zero impact times our 724,000 rollbacks in a 24 hour period is still zero impact. If a commit is issued and a rollback immediately follows it that does nothing, there should be no impact to performance.
The problem is I can’t quantify impact of this design decision. Our configuration is slightly overpowered which could mask any impact this could have. Unless there is a Db2 utility or SQL against system tables that I am not aware of, I can only see the rollback statement issued, not if Db2 had to undo any work because the statement was issued.
What I can say to IBM is that “I can’t answer that but at a minimum this design muddies the waters”. It will be hard to sift through all the noise of rollbacks generated by this design if I have a separate problem that causes work to end abnormally. It also makes it difficult for planned capacity planning. If 30-40% of my transactions are trying to undo a unit of work that may or may not be there, how do I determine its impact so I can properly scale if I double, triple, or quadruple the number of transactions or volume of data?
In the end you have IBM and I in a stalemate of “Does it affect performance” vs. “I’m not sure, but why would you decide on such a design. What issue are you trying to address with this failsafe”?
This should be considered as hearsay evidence, but I suspect this is a carryover from a design decision made years ago. Although I can’t confirm it, I’ve been told from more than one source that Sterling Order Management System was purchased by IBM. Supposedly, it was originally developed for an Oracle database and IBM ported it to work with Db2 when it was purchased. The technote IBM referenced seemed to be Oracle-specific, so there may be some truth to that. Maybe this design complimented how the application would naturally interface with Oracle.
I can’t help but think about lessons learned in a book written by Craig Mullins. Although the target audience for “A Guide to Db2 Performance of Application Developers” is application developers, I gained a lot of insight into how an application developer may think and how they may naturally approach a problem. A database engineer has a macro point of view and looks more at the impact of a process, workaround, or feature on the database. We may look for increased resource consumption, overall impact on transaction speed, and how the process is affected as a whole. A developer may approach things from a more micro point of view where they focus on the specifics of a feature, ability, or fix with less thought given to overall impact. Few developers would have a deep understanding of Db2 internals, and a shocking number don’t truly understand transactional control. I don’t think many developers would know the effect a decision like this one on Db2 and data consistency. To a developer, this probably made sense at the time and acted as a safety net for a specific problem they were seeing.
I know what you mean Mike; I’ve not hit this specific problem but have had several at customer sites where I’m thinking “why in the name of … did you design it like that”? And the nearest I can get to a response from the customer is something along the lines of “well we ported it over from (fill in name of some other RDMS) and it didn’t do any harm there”.
BTW; I like that Transaction Rollback percentage query. May I ‘borrow’ it?
I’m glad someone feels my pain. Absolutely feel free to borrow the SQL, that is why I posted it here.
Thank you Michael for a great article. I’m telling my team to read this. It’s good to understand the flow of your investigation.
About the excessive rollback, my personal opinion, not of IBM, is… I have to agree with you. Again, my personal opinion, not of IBM. 🙂 By the way, I checked with the dev team (Keri) and fortunately, the excessive rollback won’t impact the database instance. whew~~~
I appreciate you going the extra mile to talk to Keri about this. Logically, spinning your wheels for no reason and not expending some resource doesn’t make sense to me. If I am on an exercise bike and set to no resistance, I can still pedal and go a lot farther but I still expend a little extra energy just because of the friction caused by the belt sitting against the wheel with minimal tension. If the belt was completely removed, there is no tension at all – even minimal. Without a load test with and without this scenario I don’t know if we could definitively say this has no affect at that scale (724k in 24 hours). However, considering Keri is the
Db2 Architect, I will gladly shut the heck up and defer to her on this one. 😉
I really enjoyed reading your article, your logic, and the way you write. I am very non-technical now, but it makes a lot of sense to me. Thank you again for presenting at our WDUG (Wisconsin Db2 Users Group) and other combined RUGs on Sept 2, 2021.
Michael, very interesting post. With SQL in this post I have checked all of my production DB2 databases and found out all databases expect one have 0%, but one one of them has 10%.
How to get at least one SQL that is part of the rollback transactions? When I contact application developers they need something to search application’s source code.
Tony, you are lucky that I keep an obscene amount of notes when I work. I was able to go back in my ticket from a few months ago and sure enough I recorded what I did. In the end I created two different event monitors, discovered what tables were spawned, and found some SQL that seemed to join the two. Hopefully this will help. Let me know if it works for you.
CREATE EVENT MONITOR ROLLBACK_MONITOR FOR UNIT OF WORK WRITE TO TABLE
CREATE EVENT MONITOR ROLLBACK_STMT FOR STATEMENTS WRITE TO TABLE
Confirm the monitor is created and see if it is off (0) or on (1):
SELECT evmonname, EVENT_MON_STATE(evmonname) FROM syscat.eventmonitors
Turn Monitors on/off (recommended no more than 30 seconds):
ON - SET EVENT MONITOR ROLLBACK_MONITOR STATE 1
OFF - SET EVENT MONITOR ROLLBACK_MONITOR STATE 0
ON - SET EVENT MONITOR ROLLBACK_STMT STATE 1
OFF - SET EVENT MONITOR ROLLBACK_STMT STATE 0
Tables Created after event monitors are on:
SQL Joining the tables:
UOW.APPLICATION_HANDLE, UOW.CLIENT_HOSTNAME, UOW.start_time, UOW.COMPLETION_STATUS, STMT.STMT_TEXT
FROM UOW_ROLLBACK_MONITOR UOW, STMT_ROLLBACK_STMT STMT
ORDER BY UOW.APPLICATION_HANDLE, UOW.CLIENT_HOSTNAME, UOW.start_time
thanks for idea. Event monitors are the right direction if all SQLs are required to be caught and detailed analysis required. Thank you very much for all of SQLs. But… somehow I am not a fan of event monitors, because they can be pretty aggressive and to my experience should not be used on databases with limited hardware resources. I once got important database to crawl. Also saving data into tables “pollutes” the database with objects and database can get out of storage (specially if I get unexpectedly interrupted and I forget about event monitors), to get out of disk space problem new storage group is required to be created and new tablespace, but this further “pollutes” the database. Specially if restore is performed and all this “pollution” is restored to another database.
I have few questions about event monitors if I may:
1. Is it possible to store event monitor output data to another database? This would be very useful, to separate production database from monitoring data (except federated database).
2. When saving event monitor data to files (instead of database tables), is it possible to format monitoring output to be column like oriented (just like tables using event monitors)? Saving event monitors to files saves each column as row and it is not easy to manipulate event monitor data from files.
MY HUMBLE SOLUTION
I manage to get some of the data using MON_GET tables. This are for sure not all SQLs displayed when rollback was executed, but only last SQL executed in last transaction inside current connection. I hope application developers will have enough data to ping pong the problem. Bellow is my SQL I am executing every few seconds and so getting at least some SQLs at rollback situation:
CURRENT_TIMESTAMP AS TS,
A.APPLICATION_HANDLE AS APP_HANDLE,
REPLACE(REPLACE(C.STMT_TEXT, CHR(13), CHR(10)), CHR(10), ‘ ‘) AS STMT
TABLE(MON_GET_CONNECTION(CAST(NULL AS BIGINT), -1)) AS A
LEFT OUTER JOIN TABLE(MON_GET_PKG_CACHE_STMT (NULL, NULL, NULL, -2)) AS C ON A.LAST_EXECUTABLE_ID = C.EXECUTABLE_ID
WHERE A.APPLICATION_HANDLE (SELECT APPLICATION_HANDLE FROM TABLE(SYSPROC.MON_GET_CONNECTION(SYSPROC.MON_GET_APPLICATION_HANDLE(), -1)))
AND A.LAST_REQUEST_TYPE = ‘ROLLBACK’
Your assessment on event monitors was dead on. They are resource intensive, capture so much that it can often be hard to parse all the data to find what you want, and are generally something I try to avoid unless I really need them (especially in production). Now that I think of it, I was able to find some of the data I needed via MON_GET tables but at the time it was capturing so much that I wasn’t sure I was gathering/interpreting things right. (Remember that almost all my transactions had a rollback issued immediately after and I didn’t expect to see that when gathering data so I thought I was interpreting things wrong). The other scenario specific issue I had is that what I started to capture was OOB SQL behavior for the application so the SQL (by itself) didn’t provide a clue to system engineering on where the query was coming from. I needed to not only get the SQL but the IP of the server it was coming from. I found ways to get one or the other, but couldn’t gather the data at the same time without an event monitor and joining the tables. However, your badass SQL looks like a viable solution to this and I just didn’t see it.
Regarding your two questions, my response would be “I don’t know” purely because I don’t work with event monitors often anymore. I usually can get what I need via MON_GET tables.
If your database has always done rollbacks 100s of 1000s of time every day, how can you quantify the impact these rollbacks have on the performance ? A serious question, not a rhetorical one.
I assume buffers from other transactions will be waiting to write into the log buffers which is possible only after the commit is successful. If these buffers get into the log buffers before the rollback, then an incomplete log buffer will be flushed.
Does the cost of hosting on AWS increase because of this application design?
Completely agree with you, and I go a little into this at the end of the article. IBM’s argument is this should have no impact, 724k rollback statements that do nothing is still nothing. My response to that is … “Eh, I call bullcrap. There is something going on at a very minimum that consumes resources and you haven’t quantified that for me”. A bad example would be an exercise bike with a flywheel controlled by a belt with varying tension. There is a difference with a belt tension set to very low (0 or 1) and not having a belt attached at all. I keep thinking back to an old presentation Scott Hayes did for DBI at IDUG many years ago. IBM was touting the heck out of the Optim Tooling. Scott started a session and pointed his Brother-Panther software at the Optim database and used it as an example of how to find inefficiency. That sticks with me till this day. IBM is much better about it, but even IBM can get lazy on tuning it’s own stuff sometimes.
As for the cost of AWS hosting impact .. Again, can’t quantify that in my case as I have no idea if I am paying more for a beefier host to compensate for the inefficiency. I can’t give away specifics but lets say my Sterling OMS Database is less than 500G and runs fairly well. If I told you this DB could increase 1x, 2x, or even 3x you would be relatively confident that even with an unknown quantity of inefficiency you could compensate. But what if I told you that size and activity could exponentially get bigger – 20x, 50x, 100x. At point do you wonder about the impact of that inefficiency and cost on the hosting infrastructure.
In the end, there really could be no impact, but my spidey sense is tingling and it makes me a little nervous.