Ok, so everyone has their top 10, and some lists differ significantly – but here’s my top 10 on SQL performance for DB2. I mostly share this information with developers, but it’s important for a DBA to be aware of these things too. Since it’s directed more at developers, it ignores some of the basics like “do runstats” and “do reorgs”. This top 10 is directed purely at WRITING SQL, not at full database support.
Where is the cost?
Some actions can occur either in the SQL or in the Application. Some of these recommendations assume that the cost is reduced at the database level, but do not take into account the extra cost at the application level. I don’t expect every query to meet all of these suggestions – sometimes you simply have to use an inefficient method to get the data the way you need it.
These recommendations are also focused on OLTP databases which favor small fast queries.
What we aim for
•Index-only access when possible – where all data can be retrieved from indexes without looking data up in the table itself
•Accurate estimates on the number of rows we’re dealing with
We don’t want:
•Table scans – where db2 has to examine each and every row in a table to determine which rows meet the query criteria
•Sorts – some are inevitable, but they can be extremely time consuming depending on where they appear in the query. Sorts are done not only when explicitly called for, but also when db2 decides they are part of the most efficient access plan
•Nested-loop joins – again, many queries will require these, but we’d like other join methods where possible
Predicate types from most to least efficient:
1.Range-delimiting predicates – start/stop condition narrows down the pages of the index
2.Index-sargable predicates – columns from a predicate are part of an index key in the right order
3.Data-sargable predicates – predicates can be evaluated on data pages in the buffer
4.Residual predicates – require I/O beyond the buffer pool – include subqueries or LONG/LOB data
The 10 tips:
Select only that which you need. Avoid SELECT *, opting instead for specifying the smallest number of columns that will meet your needs.
Each returned column has an overhead that is multiplied by the number of rows
If there are too many columns then index-only access will not be an option
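As a quick sketch (the table and index here are hypothetical), compare the two forms:

```sql
-- Less efficient: returns every column, ruling out index-only access
SELECT * FROM orders WHERE customer_id = 12345;

-- Better: only the columns actually needed; if an index exists on
-- (customer_id, order_date, total_amount), this can be index-only access
SELECT order_date, total_amount
FROM orders
WHERE customer_id = 12345;
```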
Lock not, lest ye be locked. Specify FOR READ ONLY and WITH UR if these restrictions meet your needs – they will reduce locking
Even if you don’t see errors due to locking, locking may still impact the execution time not just of this query, but of other queries as well
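A sketch of what those clauses look like (table name hypothetical) – only use WITH UR if reading uncommitted data is acceptable for your application:

```sql
-- FOR READ ONLY tells DB2 no positioned updates/deletes will occur;
-- WITH UR (uncommitted read) avoids taking row locks entirely
SELECT order_id, status
FROM orders
WHERE customer_id = 12345
FOR READ ONLY
WITH UR;
```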
Avoid functions and conversions when possible – CHAR, UPPER, etc. These will limit the use of indexes.
The most important place to do this is on the join predicates – and remember that join predicates include those not using the “join” keyword. Using functions or even expressions in join predicates also limits the join method to Nested Loop and may produce inaccurate results
The second most important place to do this is in the where clause
This may also be useful in the column list, depending on circumstances
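For illustration (hypothetical table), a function wrapped around an indexed column generally prevents DB2 from using an index on that column:

```sql
-- Limits index use: UPPER() must be evaluated against every row's value
SELECT cust_id, cust_name
FROM customers
WHERE UPPER(cust_name) = 'SMITH';

-- Index-friendly alternative: store and compare the data in a consistent
-- case so the predicate can be applied directly to the indexed column
SELECT cust_id, cust_name
FROM customers
WHERE cust_name = 'Smith';
```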
Avoid Sorts. Some sorts are expensive – avoid them where possible. Large sorts may spill from memory to disk, drastically impacting their performance.
Avoid DISTINCT and ORDER BY when possible
These often require sorts, and sorts are frequently time eaters
A GROUP BY may be a better choice than DISTINCT
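The two forms below return the same result set (hypothetical table); the GROUP BY version sometimes gives the optimizer more options:

```sql
-- DISTINCT frequently forces a sort to remove duplicates
SELECT DISTINCT customer_id FROM orders;

-- Equivalent result; the optimizer may satisfy the GROUP BY
-- without a full sort, for example via an index
SELECT customer_id FROM orders GROUP BY customer_id;
```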
In-lists are out. Avoid the use of “IN” lists where possible
Multiple smaller queries may be more efficient
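A sketch of the trade-off (hypothetical table and values):

```sql
-- An IN-list: DB2 may not choose the best plan for long lists
SELECT order_id
FROM orders
WHERE customer_id IN (101, 102, 103, 104, 105);

-- Sometimes several smaller queries perform better in total
SELECT order_id FROM orders WHERE customer_id = 101;
SELECT order_id FROM orders WHERE customer_id = 102;
-- one query per value, executed from the application
```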
Ensure you are specifying the appropriate join predicates to avoid Cartesian joins. This may seem obvious, but a really bad query can actually cause a database to be unresponsive to other users – something you want to avoid, especially in production.
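For example (hypothetical tables), note that the comma-separated FROM syntax makes it easy to forget the join predicate:

```sql
-- Missing join predicate: every row of orders pairs with every row
-- of customers – a Cartesian join
SELECT o.order_id, c.cust_name
FROM orders o, customers c;

-- With the join predicate specified
SELECT o.order_id, c.cust_name
FROM orders o
JOIN customers c ON c.cust_id = o.customer_id;
```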
Play fetch with as few sticks as possible. If using the FETCH FIRST N ROWS ONLY clause, include the OPTIMIZE FOR N ROWS clause.
Strangely enough, DB2 does not handle this for you.
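A sketch of the two clauses together (hypothetical table):

```sql
-- FETCH FIRST limits what is actually returned; OPTIMIZE FOR tells the
-- optimizer to favor a plan that delivers the first rows quickly
SELECT order_id, order_date
FROM orders
ORDER BY order_date DESC
FETCH FIRST 10 ROWS ONLY
OPTIMIZE FOR 10 ROWS;
```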
To parameter or not to parameter. Consider where to pass in specific values
If Parameter Markers are used, DB2 does not make use of distribution statistics
If Parameter Markers are not used, then DB2 re-calculates the access plan every time you run the query
Balance the reduced preparation time against the performance advantage gained by the use of distribution statistics – this is especially important when your data is not evenly distributed.
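The two options look like this (hypothetical table):

```sql
-- Parameter marker: one access plan can be reused across executions,
-- but distribution statistics cannot be applied at prepare time
SELECT order_id FROM orders WHERE status = ?;

-- Literal value: the plan is tailored using distribution statistics,
-- but the statement is re-prepared unless it matches the package cache
SELECT order_id FROM orders WHERE status = 'SHIPPED';
```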
Consider using temporary tables for repeatedly accessed data. I’ve actually seen the same stored procedure called 3 times with the same parameters in a single query – and since it calculated the distance to nearby stores, it was an expensive one. Storing such data in a temporary table before executing the query might be a good idea.
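As a sketch, the expensive result can be computed once into a declared temporary table and then referenced as many times as needed. Everything here is hypothetical (including the distance_from UDF), and a user temporary tablespace must already exist:

```sql
-- Declare a session-scoped temporary table
DECLARE GLOBAL TEMPORARY TABLE session.store_distance (
    store_id INTEGER,
    distance DOUBLE
) ON COMMIT PRESERVE ROWS NOT LOGGED;

-- Populate it once with the expensive calculation
INSERT INTO session.store_distance
SELECT store_id, distance_from(41.88, -87.63)  -- hypothetical UDF
FROM stores;

-- Subsequent queries join to session.store_distance instead of
-- invoking the expensive procedure repeatedly
```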
Ember is always curious and thrives on change. Working in IT provides a lot of that change, but after 18 years developing a top-level expertise on Db2 for mid-range servers and more than 7 years blogging about it, Ember is hungry for new challenges and looks to expand her skill set to the Data Engineering role for Data Science. With in-depth SQL and RDBMS knowledge, Ember shares both posts about her core skill set and her journey into Data Science.
Ember lives in Denver and works from home.