This was really a revelation for me when I took DanL‘s SQL workshop before the IDUG conference in Phoenix.
I started out as a physical or systems DBA at IBM, and until studying for certification, I hardly wrote a statement more complicated than select * from syscat.bufferpools
. After being a DBA for probably 5 years or more I started to understand a bit more about SQL, but was very strictly limited by my box at IBM. I was not supposed to be writing SQL – that was someone else’s job. But I could look for indexes to help the SQL that ran against my databases. In the last 6 years, my role has involved more development support – helping developers write and tune SQL.
Sometimes the DB2 optimizer makes re-writing SQL irrelevant. But other times, rewriting SQL can get you performance improvements. This article focuses tightly on one kind of re-write you can try. According to DanL, you can rewrite a join as a non-correlated subquery. You can also rewrite it as a correlated subquery. All three methods generally work for the same results. I had never thought of it as quite that fluid – I thought each had their places. Let’s work through each of these to understand what it means, using a real-world example. I’ll use DB2 LUW system catalog tables on a 10.5 system.
The question this query is trying to answer is “What tablespaces do I have that have tables in them with LOB columns?”
Join
A join is simply a query with an explicit or implicit join in the query. The from clause lists two or more tables. One or more join predicates are specified either explicitly in join clauses within the from clause, or implicitly within the where clause. Explicit join syntax is considered best practice for clarity. There is no subquery.
This is probably the way I would have naturally written the query for this particular item. Here’s what this query looks like using join methodology:
select distinct( substr(tbspace,1,30) ) as tbspace from syscat.tables tab join syscat.columns col on tab.tabschema=col.tabschema and tab.tabname=col.tabname where col.typename in ('BLOB','CLOB') and tab.type='T' with ur; TBSPACE ------------------------------ DBA32K LOB_32K REF_TAB16K REF_TAB4K SYSCATSPACE SYSTOOLSPACE TAB16K TAB4K TAB8K USERSPACE1 10 record(s) selected.
Non-Correlated Subquery
A subquery is a query within parentheses that is within another select statement. It can be in the select clause, the from clause, or the where clause. A non-correlated subquery means that the inner query does not have in its where clause any references to the outer query.
This one came the least naturally to me of the three methods, and took me a bit to work out.
select distinct( substr(tbspace,1,30) ) as tbspace from syscat.tables tab where (tab.tabschema, tab.tabname) in (select tabschema, tabname from syscat.columns where typename in ('BLOB','CLOB')) and tab.type='T' with ur; TBSPACE ------------------------------ DBA32K LOB_32K REF_TAB16K REF_TAB4K SYSCATSPACE SYSTOOLSPACE TAB16K TAB4K TAB8K USERSPACE1 10 record(s) selected.
Correlated Subquery
A correlated subquery is a subquery that includes in its where clause references to one or more items from the outer query.
This one made a medium amount of sense to me – I could see myself writing the query this way.
select distinct( substr(tbspace,1,30) ) as tbspace from syscat.tables tab where exists (select 1 from syscat.columns col where tab.tabschema=col.tabschema and tab.tabname = col.tabname and typename in ('BLOB','CLOB')) and tab.type='T' with ur; TBSPACE ------------------------------ DBA32K LOB_32K REF_TAB16K REF_TAB4K SYSCATSPACE SYSTOOLSPACE TAB16K TAB4K TAB8K USERSPACE1 10 record(s) selected.
Which Performs Better?
There is no blanket answer to this. In the case of this specific query, I explained all three versions, and DB2 did the access plans all exactly identical, so there was no difference. This is a very simple example, however, and it could easily make a difference in some situations. It really is best to test a query all three ways to see if one works better. Testing involves minimally writing the query three ways and then running db2exfmt and db2advis on it each way and comparing. It may also involve db2caem to get actuals for each method and even db2batch to test the actual timing of the query each way.
Hey Ember,
Great article. Was actually going to ask Dan about whether there were any general rules about which join construct was the most efficient to use. The reason I was going to do this was my colleague and I have rewritten a couple of queries recently and saw significant improvements by changing non-correlated sub queries to correlated sub-queries.
In one case the improvements was a reduction of the magnitude of thousands of seconds down to sub-second. The other we removed between 100 and 150 MIPS at peak processing periods, between 20% and 25%, for 1 application from just 1 query.
I was hoping to set some guidelines for my developers but I guess that as usual the answer is the typical DBA one, it depends.
Regards,
Paul Murray.
Paul,
One of the easiest ways you would see a difference between correlated and non-correlated would be if you could filter data. For instance, taking Embers’ examples above. Let’s say we have a particular table we are interested in, then I have seen SQL like below examples, where you can achieve the significant savings that you speak of, but that can normally be achieved by changing the query in the non-correlated example as well.
Non-Correlated Subquery
select distinct( substr(tbspace,1,30) ) as tbspace
from syscat.tables tab
where (tab.tabschema, tab.tabname) in (select tabschema, tabname from syscat.columns where typename in (‘BLOB’,’CLOB’))
and tab.tabname = ?
and tab.type=’T’
with ur;TBSPACE
——————————
REF_TAB16K
1 record(s) selected.
Correlated Subquery
select distinct( substr(tbspace,1,30) ) as tbspace
from syscat.tables tab
where exists (select 1 from syscat.columns col where tab.tabschema =col.tabschema and tab.tabname = col.tabname and typename in (‘BLOB’,’CLOB’) )
and tab.tabname = ?
and tab.type=’T’
with ur;
TBSPACE
——————————
REF_TAB16K
1 record(s) selected.