The first professional technical presentation I ever gave was on March 31, 2003. I had been a DBA for just less than 2 years. It was at an event called “SDC West Geek Fest”. It was an internal IBM event within Global Services where the technical folks delivering services got together to share technical expertise. The title was something along the lines of “10 ways to improve physical performance in Db2”. This was just after version 8.1 was released.
The top three on my 10-item list were “Bufferpools, Bufferpools, Bufferpools”. We didn’t have automatic tuning of any kind yet, and in addition to the size and number of buffer pools, we had to pay a lot more attention to page cleaning. That was my second buffer pools topic – tuning the page cleaners. At that time, if triggered, it was not unheard of to have the page cleaners absolutely pound a server into the ground if you made NUM_IO_CLEANERS too large. We’re talking back when 64-bit was an experimental mode.
In some ways default page cleaning is very similar today to what it was then, however the number of page cleaners can be automatically set by Db2, and is often entirely reasonable. The combination of that and the increase in the power of servers means it is less likely these days to completely tank a server with page cleaning. But that doesn’t mean it’s not an important concept for a DBA to understand.
What is Page Cleaning?
Page cleaning is an asynchronous process that is a big part of the huge power of Db2 buffer pools. When Db2 changes a page, the log file record is externalize to disk. This provides the “Durability” in ACID. However, the dirty page is still in the buffer pool. In this case, “dirty” means a page that has changed in memory, but that change has not yet been written out to disk.
Page cleaning happens when these dirty pages are cleaned by writing them out to disk. There are several things that can trigger page cleaners, but when they are triggered, the number of EDUs (it was processes before 9.5) spawned to do so is controlled by NUM_IO_CLEANERS.
What Happens When There are No Clean Pages
Db2 automatically determines for you what pages should stay in memory and which are a better choice to be replaced when push comes to shove, and a page must be overwritten. If all of the buffer pool is in use or dirty, and a connection comes in that needs space in the buffer pool, then the agent itself (and not an asynchronous process) must write the data out to disk before it can read the new data in from disk. This can understandably add even more I/O overhead to a query that was already experiencing cache misses.
The fewer dirty pages, the faster crash recovery occurs. The more data is on disk, the less Db2 has to read through transaction log files to find things to write to disk during recovery. Db2 tracks the location in the log files as of which the data on disk is fully current and knows where to start when recovery must be invoked.
You can actually force dirty pages to be written out to disk using the FLUSH BUFFERPOOL command. This may be a good idea to run before an online backup or at a time you think you’re likely to need to recover to.
Original Page Cleaning Algorithm
This is still the default today in all versions of single-node Db2, and in DPF. It is NOT used for PureScale or if you set DB2_WORKLOAD to ANALYTICS (for BLU).
With the default page cleaning, the following trigger page cleaners:
- Dirty Page Steal – when a synchronous write was performed by an agent because there were not enough clean pages available for it to use.
- LSN Gap Cleaner – when the SOFTMAX parameter is set to 0, page cleaners are triggered if the oldest page in the buffer pool is older than the log position that is PAGE_AGE_TRGT_MCR seconds. If PAGE_AGE_TRGT_MCR is not set, then these triggers happen when SOFTMAX percent of one log file is reached. The page cleaners are only active until the LSN gap is resolved. Note: SOFTMAX was deprecated in Db2 10.5.
- Changed Pages / Dirty Page Threshold – when CHNGPGS_THRESH percentage of pages in the buffer pool are dirty(changed).
When page cleaners are triggered, NUM_IOCLEANERS page cleaner EDUs start work. Before Db2 9.5, these were processes, not EDUs.
The default page cleaning reacts to either thresholds or dirty page steals to trigger page cleaning.
Alternate Page Cleaning Algorithm
An alternate method for page cleaning was introduced in DB2 8.1 Fix Pack 4. It is enabled with the Db2 registry variable DB2_USE_ALTERNATE_PAGE_CLEANING. The only time alternate page cleaning is used if you have not explicitly set this variable is if you are using PureScale or have set DB2_WORKLOAD to ANALYTICS (which is usually done for BLU).
To check if you have alternate page cleaning enabled, use:
select REG_VAR_VALUE from SYSIBMADM.REG_VARIABLES where REG_VAR_NAME='DB2_USE_ALTERNATE_PAGE_CLEANING'
To enable alternate page cleaning use:
Alternate Page Cleaning is also referred to as proactive page cleaning. With this methodology, the CHNGPGS_THRESH parameter is meaningless. Page cleaners more actively search the buffer pool for victim pages, and we are not able to configure this activity.
Proactive page cleaning also tries to spread out the writes. Instead of only cleaning pages that are causing an LSN gap, the page cleaners also look for pages that might cause a future LSN Gap. Essentially, alternate page cleaning is paying more attention to the velocity at which pages are going though the buffer pool and attempts to match that velocity.
Db2 also includes essentially a love/hate list for pages in the buffer pool. For example, if a table scan uses 75% of a buffer pool, those pages are likely to go on a hate list that means they are likely to be the first targets of page cleaners.
As a result of this, SOFTMAX or PAGE_AGE_TRGT_MCR is actually more likely to be respected and be closer to reality.
One of the hard things for a DBA is that there is less insight into what is going on here and fewer knobs we can turn to tune it, for better or worse.
How to Decide Which to Use
Any OLTP database that I build, I turn on alternate page cleaning. This is something I check for in health checks. There are a few edge cases where alternate page cleaning has caused issues, and it sounds like those edge cases are why it is not the default. It also sounds to me like Toronto is on the fence about whether to make it the default or not. It sounds to me like the IBM cloud offerings are likely leveraging alternate page cleaning.