This was really a revelation for me when I took DanL‘s SQL workshop before the IDUG conference in Phoenix.
I started out as a physical or systems DBA at IBM, and until studying for certification, I hardly wrote a statement more complicated than
select * from syscat.bufferpools. After being a DBA for probably 5 years or more I started to understand a bit more about SQL, but was very strictly limited by my box at IBM. I was not supposed to be writing SQL – that was someone else’s job. But I could look for indexes to help the SQL that ran against my databases. In the last 6 years, my role has involved more development support – helping developers write and tune SQL.
Sometimes the DB2 optimizer makes re-writing SQL irrelevant. But other times, rewriting SQL can get you performance improvements. This article focuses tightly on one kind of re-write you can try. According to DanL, you can rewrite a join as a non-correlated subquery. You can also rewrite it as a correlated subquery. All three methods generally work for the same results. I had never thought of it as quite that fluid – I thought each had their places. Let’s work through each of these to understand what it means, using a real-world example. I’ll use DB2 LUW system catalog tables on a 10.5 system.
The question this query is trying to answer is “What tablespaces do I have that have tables in them with LOB columns?”
A join is simply a query with an explicit or implicit join in the query. The from clause lists two or more tables. One or more join predicates are specified either explicitly in join clauses within the from clause, or implicitly within the where clause. Explicit join syntax is considered best practice for clarity. There is no subquery.
This is probably the way I would have naturally written the query for this particular item. Here’s what this query looks like using join methodology:
select distinct( substr(tbspace,1,30) ) as tbspace from syscat.tables tab join syscat.columns col on tab.tabschema=col.tabschema and tab.tabname=col.tabname where col.typename in ('BLOB','CLOB') and tab.type='T' with ur; TBSPACE ------------------------------ DBA32K LOB_32K REF_TAB16K REF_TAB4K SYSCATSPACE SYSTOOLSPACE TAB16K TAB4K TAB8K USERSPACE1 10 record(s) selected.
A subquery is a query within parentheses that is within another select statement. It can be in the select clause, the from clause, or the where clause. A non-correlated subquery means that the inner query does not have in its where clause any references to the outer query.
This one came the least naturally to me of the three methods, and took me a bit to work out.
select distinct( substr(tbspace,1,30) ) as tbspace from syscat.tables tab where (tab.tabschema, tab.tabname) in (select tabschema, tabname from syscat.columns where typename in ('BLOB','CLOB')) and tab.type='T' with ur; TBSPACE ------------------------------ DBA32K LOB_32K REF_TAB16K REF_TAB4K SYSCATSPACE SYSTOOLSPACE TAB16K TAB4K TAB8K USERSPACE1 10 record(s) selected.
A correlated subquery is a subquery that includes in its where clause references to one or more items from the outer query.
This one made a medium amount of sense to me – I could see myself writing the query this way.
select distinct( substr(tbspace,1,30) ) as tbspace from syscat.tables tab where exists (select 1 from syscat.columns col where tab.tabschema=col.tabschema and tab.tabname = col.tabname and typename in ('BLOB','CLOB')) and tab.type='T' with ur; TBSPACE ------------------------------ DBA32K LOB_32K REF_TAB16K REF_TAB4K SYSCATSPACE SYSTOOLSPACE TAB16K TAB4K TAB8K USERSPACE1 10 record(s) selected.
Which Performs Better?
There is no blanket answer to this. In the case of this specific query, I explained all three versions, and DB2 did the access plans all exactly identical, so there was no difference. This is a very simple example, however, and it could easily make a difference in some situations. It really is best to test a query all three ways to see if one works better. Testing involves minimally writing the query three ways and then running db2exfmt and db2advis on it each way and comparing. It may also involve db2caem to get actuals for each method and even db2batch to test the actual timing of the query each way.