One of my least favorite phrases is “Nothing else changed!” More common than the performance problem that slowly creeps up, with performance getting worse and worse over time, is the sudden performance problem. Many times, sudden database performance problems can be mapped back to a specific change at some level.
Other times, everyone involved claims “Nothing changed.” The problem is that there are literally thousands of little things changing every moment, the vast majority of which we aren’t watching, and shouldn’t be. It is impossible to watch every single metric related to a database system at all times and reliably identify exactly what changed every time. (For details on what should be watched, see my blog entry on monitoring.)
One of the big problems is that there are tipping points with many different parts of a database system. A minuscule increase in data volume, which happens all the time, is just enough to kick over some threshold that makes the difference between stability and chaos.
The thing is that the same minuscule data addition that has happened a million times in the last year without a problem is suddenly just those few extra marbles that unbalance everything.
The same is true at the system level – we don’t know what servers are sharing storage, network, and a thousand other things.
Things that may have changed when “Nothing has Changed”:
- Connections to the database went up or down
- The server is virtual and:
  - The database server is on a different host than it was before (or even a different data center)
  - Other VMs on the host are using more CPU than they were before
  - Other VMs on the host are using more memory than they were before
  - Other VMs using the same data store are using more IOPS than they were before
  - More VMs are on the same host than before
- Any amount of data was added to any table
- Any amount of data was deleted from any table
- Any amount of data was updated in any table
- Network traffic volume changed in any way
With this in mind, unless literally no connections have been made and the server is not virtual, and it does not share any resources (including network) with anything else, then yes, something changed. Something always changes. The same changes that yesterday didn’t cause any problems may have just been enough today to tip something over.
In this case, a client had a DB2 database running on Windows. Around Wednesday, performance rather suddenly became bad enough that a process of inserts from another system was not able to keep up. Inserts to two tables specifically were running at a rate of about 100 per minute, when they had previously run at least four times faster. While the main concern was insert performance, the inserts in question do more than just the insert – they also look for duplicates and perform an update if needed.
The application structure had not changed. There were no deploys. The database structure had not changed. We had not changed how we manage basic database maintenance.
I was pulled in to help on Saturday. Since this was Windows, they had already rebooted the server. They had also run their monthly reorgs. Runstats were already up to date – but both runstats and reorgs were running slower than usual. This was an important clue that the problem might be systemic and local to the database server.
I started with the DB2 diagnostic log, looking for anything that could impact performance there.
Finding nothing concerning there, I moved on to key performance indicators. STMM was in use. Buffer pool hit ratios were excellent (above 99% at all times), both at the highest level and digging down into specific types of data and specific buffer pools.
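The buffer pool hit ratio is computed from logical reads (page requests) and physical reads (requests that actually had to go to disk). A minimal sketch of the arithmetic in Python – the helper function and sample counter values are hypothetical, though the underlying counters correspond to what DB2 exposes through interfaces like MON_GET_BUFFERPOOL:

```python
def bufferpool_hit_ratio(logical_reads: int, physical_reads: int) -> float:
    """Percentage of page requests satisfied from the buffer pool
    without a physical read from disk."""
    if logical_reads == 0:
        # No activity yet; treat as a perfect ratio rather than divide by zero.
        return 100.0
    return 100.0 * (logical_reads - physical_reads) / logical_reads

# Hypothetical counter values illustrating a healthy ratio above 99%:
print(round(bufferpool_hit_ratio(5_000_000, 32_000), 2))  # 99.36
```

A high hit ratio like this is why memory pressure was quickly ruled out as the cause here.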
The most harmful locking phenomena looked great – no deadlocks, no lock escalations, and fewer than 10 lock timeouts per day. There was clearly some lock waiting going on, but not enough to be a major concern.
Sort overflows were less than 0.03%.
When I went to calculate index read efficiency, I hit my first red flag. The value was 2,713. Looking at the package cache, this was largely attributable to a single query – one that ran over 4,000 times in 8 hours, each time scanning an 18-million-row table. A critical index was missing here. I wrote this up to ask the client whether that table and query were involved with the most problematically slow processes. While they said they were not, they are related to user login. At that point, I was shocked that no one had complained about this part of performance – we’re talking seconds added just for this one query. I recommended we add the index and keep looking – it was an issue, and potentially a contributing one, but not likely to be the silver bullet for this problem.
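Index read efficiency is the average number of rows scanned per row returned – a value near 1 is ideal, and large values usually point at scans caused by missing indexes. A minimal sketch, with hypothetical totals chosen to reproduce a red-flag value like the 2,713 above (in DB2 the inputs come from counters such as ROWS_READ and ROWS_RETURNED):

```python
def index_read_efficiency(rows_read: int, rows_returned: int) -> float:
    """Average rows scanned per row returned. Values in the thousands,
    like the one found here, usually indicate a missing index."""
    if rows_returned == 0:
        # Nothing returned: every row read was "wasted" work.
        return float(rows_read)
    return rows_read / rows_returned

# Hypothetical totals reproducing a red-flag value:
print(round(index_read_efficiency(271_300_000, 100_000)))  # 2713
```

The same calculation at the statement level (from the package cache) is what pinpointed the single table-scanning query.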
The log buffer is appropriately sized – Log Pages Written is high while Log Pages Read is 0. Both the catalog cache hit ratio and the package cache hit ratios are above 97%.
Now I move on to wait time. I used both MONREPORT.DBSUMMARY and the time-spent summary in the throughput area of dsmtop to look at things. This is where I see a significant amount of wait time. On many of my smoothly running database systems, I see less than 5% of elapsed time spent as wait time. The first time I run MONREPORT.DBSUMMARY for a 5-minute interval, I see this:
I’m a bit shocked to see that wait time is over 40% of elapsed time for this period. I run this for several more 5-minute periods to see if it is consistent. It’s not consistently quite that bad – my first run happened to hit some lock wait time. The second run shows this:
Without historical data, I cannot say for sure that this is out of the ordinary. However, I am consistently seeing high direct read time.
I’ve already asked the client to have their virtualization folks look at the systems. From the perspective of the guest OS, the servers are not starved for memory or CPU resources. The virtualization folks don’t see a problem.
Given I’m seeing disk-related wait time, I take a look at latency. This is what I see:
Wow, now that looks bad. Multiple spikes above 20 ms each minute, and some spikes above 80 ms. That is a problem. This is a very granular view, and the storage team likely has better statistics going back to before the problem started.
The storage team agreed that latency was a major issue and used vMotion to move the VMs to better balance them across data stores.
This is what finally seemed to solve the problem. 24 hours later, performance is fully back to normal, and the disk latency looks like this: