This is a new series that has been percolating in my head for a while. When I help clients with SQL, we're often adding (or removing) indexes to improve SQL performance. It has been said (though not by me) that SQL rewriting is obsolete because the Db2 optimizer will take care of it. But there are still ways to write a query that force a much more expensive access plan than is needed. This series covers SQL that returns the required results, but does so in a way that could have been written much more efficiently.
Much of what is reviewed here would be true for any RDBMS, but I know it to be true for these queries on Db2 for Linux, Unix, and Windows. Table or column names may be changed to protect the innocent.
This particular SQL was found in an analysis of the package cache using my favorite SQL for finding problem SQL in the package cache. In several different time frames analyzed, it was found to be consuming more than 10% of the rows read by all the queries in the package cache.
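For readers who want to run a similar analysis, here is a minimal sketch of the idea, not the exact query referenced above: rank statements in the package cache by total rows read using the MON_GET_PKG_CACHE_STMT table function.

```sql
-- Simplified sketch only: the author's actual package cache query
-- is more involved. Rank cached statements by total rows read.
SELECT SUBSTR(stmt_text, 1, 100) AS statement,
       rows_read,
       num_executions
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS t
ORDER BY rows_read DESC
FETCH FIRST 10 ROWS ONLY;
```

A statement that sits at the top of this list across multiple collection intervals, as this one did, is a strong candidate for tuning.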
SQL Statement or Fragment Found in the Wild
SELECT *
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY person_id
                                ORDER BY person_id, environment_start_dt DESC) seq
           , person_id
           , sample1
           , sample2
           , sample3
           , environment_no
           , environment_start_dt
      FROM schema.environment) AS recentenv
WHERE seq = 1
  AND person_id = ?;
Explaining this query showed that most of the expense was in the table scan of schema.ENVIRONMENT.
                 Rows
                RETURN
                (   1)
                 Cost
                  I/O
                  |
             1.91471e-06
                FILTER
                (   2)
                29995.8
                 17868
                  |
                573015
                TBSCAN
                (   3)
                29880.2
                 17868
                  |
                573015
                 SORT
                (   4)
                29859.9
                 17868
                  |
                573015
                TBSCAN
                (   5)
                29505.3
                 17868
                  |
                573015
           TABLE: SCHEMA
             ENVIRONMENT
                  Q1
A Better Way to Get the Same Results
We tried this query instead to reduce the amount of data that actually needed to be processed:
SELECT *
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY person_id
                                ORDER BY person_id, environment_start_dt DESC) seq
           , person_id
           , sample1
           , sample2
           , sample3
           , environment_no
           , environment_start_dt
      FROM schema.environment
      WHERE person_id = ?) AS recentenv
WHERE seq = 1;
Modified Access Plan
Looking at the access plan, this syntax made a world of difference:
Total Cost:     32.5022
Query Degree:   1

              Rows
             RETURN
             (   1)
              Cost
               I/O
               |
                1
             FILTER
             (   2)
             32.5022
                3
               |
             1.09716
             FETCH
             (   3)
             32.5014
                3
           /---+----\
     1.09716         573015
     IXSCAN     TABLE: SCHEMA
     (   4)       ENVIRONMENT
     21.6697          Q1
        2
        |
     573015
  INDEX: DB2INST1
 IDX_ENVIRONMENT_01
        Q1
That’s roughly a 99.9% improvement in the timeron cost (from 29,995.8 down to 32.5) – not only do we need to evaluate fewer rows, but there is also an index Db2 can use to get those rows.
If we look at what we’re asking Db2 to do with each query, the difference is easy to understand:
Original Query: First, build a result set with 6 columns representing every row of ENVIRONMENT, along with a column ranking the rows for each person_id from most recent environment start date (1) to least recent. After you have that, find the rows that match a specific person_id, and out of those, find the one row with a rank of 1.
Modified Query: First, build a result set with the same 6 columns, but only for the rows matching a specific person_id, along with a column ranking those rows from most recent environment start date (1) to least recent. Then return the row with a rank of 1.
Because the condition is applied in the inner query, and person_id is nearly unique (only one or two rows per value), the number of rows that even need to be evaluated and sorted is much lower. The first query had to evaluate and work with over 500,000 rows to return one. The second query, on average, has to deal with one or two. This requires far fewer I/Os. Even if the table is already entirely in the buffer pool, it requires fewer CPU cycles to execute the query. With the modified query, we’re also getting the data in order from an index, which luckily allows us to eliminate the sort as well.
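The actual definition of IDX_ENVIRONMENT_01 isn't shown in the plan, but an index along these lines would support both the person_id predicate and the descending sort on environment_start_dt. This is a hypothetical sketch; the real index may include different columns:

```sql
-- Hypothetical index definition that would support this access plan:
-- equality on person_id plus ordered access on environment_start_dt,
-- letting Db2 avoid the sort entirely.
CREATE INDEX db2inst1.idx_environment_01
    ON schema.environment (person_id, environment_start_dt DESC);
```

With an index like this, the IXSCAN can return the qualifying rows already in the order the ROW_NUMBER() function needs, which is why the SORT operator disappears from the modified plan.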