My background comes largely from the physical DBA world, though I’ve gained logical DBA skills over the years. Any reasonably mature IT organization will have a DBA at some level or another review SQL and DDL before it is implemented in production.
Why Reviewing SQL and DDL is Important
Reviewing SQL and DDL is important because there are some very simple things that a DBA can catch in SQL or DDL that developers may not think of. Even with vended DB2 databases, there is often a component of custom additions that should be reviewed.
What I Look for in DDL
DDL is Data Definition Language. It refers to statements that add to or change the structure of objects in the database. It is usually rarer to run DDL against a production database, but depending on the environment, it can happen anywhere from weekly to at the outside annually. The frequency depends on how frequently application functionality is changed – either by developers or by vendors. Often DDL is included as a part of a release cycle.
There are a number of things that I look for when reviewing DDL…
Coding Practices in Triggers and Stored Procedures
I review the logic in triggers and stored procedures and look for common mistakes like not using a cursor when one makes sense, using a join instead of referring to the new or old values available. In compound SQL objects like this, I also apply the items from the section on reviewing SQL below.
Indexes
There are so many opportunities for indexing, both in custom and vended databases. While some vendors will not allow indexes to be added, others and custom databases will. These opportunities are not just in response to SQL, but are in the creation of views, stored procedures, triggers, and even tables. One of the most common problems I see is under indexing, especially when custom tables are added to a vended database such as a WebSphere Commerce database. I go so far as to request that developers provide the SQL they plan to execute so I can index for it.
Data Types for Columns or Variables that May be Compared
As long as DB2 can convert between two data types, it will. For example, comparing a number that is stored in a string variable with an integer. Since these comparisons do not throw errors, I see this fairly commonly for developers on custom databases. The problem is that, often, the developers do not perform load testing to understand how their code performs, but only functionality testing to make sure it works. These cross-data-type comparisons are very slow and especially when a part of frequently executed SQL, can really drag database performance down. Because of this, I look for similarly named columns across objects with different data types. The performance problems can show up either in join or when the columns are used in comparisons in the where clause.
I recently went through some offline, after-hours work on one client’s databases to clean up and work around some issues in this area – largely by adding a generated column of the data type I wanted that was generated from the column with the bad data type. That same client’s developers recently asked me to review and approve a few hundred lines of DDL for a new project, and I found more than 5 examples of exactly the same problem.
Identifying this is time consuming and tedious, because columns that are likely to be compared or joined on are not always obvious. Often it involves looking through all instances of columns with the same name with different data types or looking for logical ways that the data is likely to be compared. Foreign Key definitions can also be clues for this.
No Stupid Data Types
I tried to come up with a nicer title for these, but frankly, sometimes there are just stupid data types. The most obvious example of this is VARCHAR(1). Declaring a data type as VARCHAR leads to some overhead, so a VARCHAR(1) actually takes up more space in the database than if you just declared the column as CHAR(1) and had all of the values as NULLs. Anything under about VARCHAR(5) is subject to questioning in my book to see if the variable portion of the definition ever makes any sense. The overhead for VARCHAR varies from 2 to 5 bytes per data depending on whether value compression is active and whether the column is nullable.
The other “stupid” thing I look for in data types is to see if everything is defined as some arbitrary value, particularly if that value is on the high side. Some developers tend to define all strings as VARCHAR(255), whether the data is likely to ever be 255 or not. Or even if it is something that is likely to be over 255. If I see that, I’ll ask the developers questions to try and help then determine what the real maximum size for the column is likely to be.
Sane Naming Standards
Allowing DB2 to choose the names of constraints I can take or leave. I’d prefer to have logical names for them, but it is honestly not a battle I am usually willing to fight. However, if I see that constraint names are explicitly defined, but simply the default names that DB2 chose to give them on some other system, I will usually raise an objection. I generally at least glance at object naming. If I know a client has a standard for object naming, I will enforce it. If they’ll let me use my own standards, all the better.
Existence of Foreign Keys
It is sometimes very clear that one column in one table has a parent/child relationship to a column in another table. That’s not always clear just from the naming, but if it appears to be the case, I will try to convince developers to define, and index for, that relationship.
Use of Sequences and Identity Columns
This is another area where I’m often asking questions rather than making proclamations. If I see columns that appear to be an incrementing ID, I’ll ask if they should be handled via an identity column or a sequence. Sometimes the answer is no because the number comes from somewhere else or is already coded into the application, and that is fine.
Table Space Placement
Table space placement is something that less experienced, smaller DBA teams, or just busy DBAs don’t always think about. Smaller or less active databases may work fine using only the default storage group and table space. I always look and think to see if I need to be adding or pushing some table space layout for the developers. Table space placement is often something that doesn’t matter much to developers, but can make a big difference to DBAs. There are four main things I consider with table space placement:
- Tables or groups of tables that are likely to need to be restored together
- Particularly large or very active tables that may benefit, now or in the near future, from buffer pool separation
- Backup speed – parallelism on backup occurs only by table space
- Storage layout – if I have fully separate I/O paths to play with, then how can I best use them for different types of data
Reorg Required and Reorg Recommended Operations
These are normally caught in a testing or development environment. Certain DDL operations may require a reorg of the table, while others, called reorg-recommended, can be done in limited numbers before the reorg is required.
To determine if reorg-recommended operations have been executed against a table, see this great technote.
What I Look for in SQL
In an e-commerce or OLTP environment, the database should only ever run tested and known SQL. In Decision Support Systems, Data Warehousing, or Analytics environments, there may be much more variability in the SQL, but there are still likely some cases for repetitive SQL such as periodic reporting. Any SQL that is repetitive in any way should be reviewed by a DBA. Any SQL for which performance is particularly critical should be reviewed by a DBA. There are a number of checks that I perform on the SQL I am asked to review. The details listed here are not the checks I perform for SQL that has already been deemed problematic, but for the straightforward, proactive review of SQL that is new to the database.
Clear and Sane Join Syntax and Strategies
I much prefer explicit join syntax (table1 JOIN table2 ON table1.column=table2.column). Simply putting the join predicates in the where clause can lead to mistakes later, including inadvertent Cartesian products. There are also some cases where the way a join or subquery is done makes more sense another way
Joins or Subqueries that Can be Eliminated
Sometimes, in SQL, I see joins or subqueries that are entirely unnecessary. Eliminating them can be an instant performance win, and cause less load on the database.
Indexes
Reviewing SQL is the best opportunity to intelligently add or alter indexes. When reviewing SQL, I will often run explains and/or the index advisor to understand any problems with the SQL given the database structure, and will recommend indexes be added or changed. However, I don’t index for every SQL that comes through. I also work with the developers to understand the importance and frequency of the SQL so I can understand whether indexing will help it and will be worth the negative impact on inserts/updates that every additional index has.
Repetitive SQL that May Fit Better in a View or Function
When looking at a number of new SQLs, I also keep my eyes open for repetitive SQL that may work better in a view or a function. If I see the same complicated calculations over and over again, that’s a sign that I may want to recommend additional objects to simplify that SQL and future SQL performing similiar work.
No Select *
I never say that I will never allow something to stand. Except perhaps CLIENT authentication. But I do fight back fairly strongly on some things, and ‘SELECT *’ is one of them. The vast majority of the time, ‘SELECT *’ is lazy coding. The developer uses it to get all of the data into their program and then cherry picks what they need. However, we can save a lot of workload on the server and achieve much better query performance by only returning exactly the data that is needed. Rarely in a relational database is it true that every column in the table is needed. But I never say never (see the irony?), because there are situations in which all the data in a table really is being used by the application. Even if every single column is needed, it is better application coding practice to specify the name of each column so that a change to add a column does not break the application.
Summary
I think that each DBA has their own standards and checks when reviewing SQL and DDL. Some are simply a rubber stamp, while others are a more thorough review process. What checks do you perform when asked to review SQL and DDL?
Ember, do you have a book about it?
No, I don’t. I’ve often been tempted to write books, but through this blog, I publish at least one book’s worth of words every year. If you’re looking for more on writing efficient SQL overall, I love http://use-the-index-luke.com/.
Oh, OK.
I knew use-the-index-luke . It’s great to know you like it too.
Thank you Ember.
Best regards from Spain !
(What I mean is that for me it’s a great reference if you recommend it to me)