I have been paying a bit of attention to cross-platform SQL optimization lately, and read this interesting post:
https://www.periscopedata.com/blog/use-subqueries-to-count-distinct-50x-faster.html
Being a bit of an experimenter, the first thing that I wondered is how DB2 would handle this scenario. Would the DB2 optimizer be smarter than others, or would the same hold true for DB2 that held true for PostgreSQL?
Environment
The environment I’m testing on is my little Linux VM (Ubuntu) running DB2 Express-C 10.5, Fix Pack 5, on my laptop. DB2 Express-C is free to use on a small machine, and it is my default experimentation environment for anything that doesn’t require features I cannot get in Express-C. Express-C is actually shockingly complete – I’m working on some experimentation with Native Encryption, and I can do that in this environment too.

I don’t have the same data set as the authors of the article. I am using the DB2 sample database, created using the db2sampl command. To generate a reasonable volume of data, I took the EMPPROJACT table and inserted it into itself until I had over 10 million rows.
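The doubling itself is nothing fancy; a minimal sketch of the approach is just a self-insert, repeated until the count passes 10 million:

$ db2 connect to sample
$ db2 "insert into empprojact select * from empprojact"
# repeat the insert above; each pass doubles the row count

The resulting row count: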
$ db2 "select count(*) from empprojact" 1 ----------- 10099008 1 record(s) selected.
I also ran RUNSTATS on the table:
$ db2 runstats on table empprojact with distribution and detailed indexes all
DB20000I The RUNSTATS command completed successfully.
I created the explain tables and the explain view as described in my post about a more platform-independent way of looking at explain.
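For anyone reproducing this without that post handy: one common way to create the explain tables is the SYSINSTALLOBJECTS stored procedure, and a plan can then be captured by flipping the explain special register around the query. A minimal sketch (the explain view itself is defined in the referenced post):

# create the explain tables, using the default tablespace and schema
$ db2 "call sysproc.sysinstallobjects('EXPLAIN', 'C', cast(null as varchar(128)), cast(null as varchar(128)))"
# capture the access plan without actually executing the statement
$ db2 "set current explain mode explain"
$ db2 -tvf distinct_query.sql
$ db2 "set current explain mode no"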
The Experiment
I plan to perform three tests: first, a test as-is with the default sample database schema; next, a test with an added (low-cardinality) index. Both of those use SQL similar to the original SQL in the article that inspired this post. For the final test, I’ll rewrite the SQL along the same lines as the article.
Base SQL
Here is the starting SQL. I’ve written it to be as similar to the PostgreSQL SQL as possible:
SELECT actkwd, count(distinct(empprojact.empno)) as count
FROM act
JOIN empprojact ON act.actno = empprojact.actno
GROUP BY actkwd
ORDER BY count DESC;
Test #1: Baseline
How long does this SQL run as a baseline?
$ db2batch -d sample -f distinct_query.sql
* Timestamp: Thu Nov 12 2015 13:53:49 MST
---------------------------------------------

* SQL Statement Number 1:

SELECT actkwd, count(distinct(empprojact.empno)) as count
FROM act join empprojact on act.actno=empprojact.actno
group by actkwd order by count desc;

ACTKWD COUNT
------ -----------
CODE             8
TEST             8
LOGIC            7
MANAGE           6
DOC              5
OPERAT           4
MAINT            3
ADMQS            2
TEACH            2
ADMDB            1
ADMDC            1
ADMSYS           1
COURSE           1
DEFINE           1
ECOST            1
LEADPR           1
SPECS            1

* 17 row(s) fetched, 17 row(s) output.
* Elapsed Time is:       5.544549 seconds

* Summary Table:

Type      Number      Repetitions Total Time (s) Min Time (s)   Max Time (s)   Arithmetic Mean Geometric Mean Row(s) Fetched Row(s) Output
--------- ----------- ----------- -------------- -------------- -------------- --------------- -------------- -------------- -------------
Statement           1           1       5.544549       5.544549       5.544549        5.544549       5.544549             17            17

* Total Entries:              1
* Total Time:                 5.544549 seconds
* Minimum Time:               5.544549 seconds
* Maximum Time:               5.544549 seconds
* Arithmetic Mean Time:       5.544549 seconds
* Geometric Mean Time:        5.544549 seconds
---------------------------------------------
* Timestamp: Thu Nov 12 2015 13:53:56 MST
The timing is pretty similar whether my bufferpools are hot or cold, but I’m checking both for each test to be sure.
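For the cold runs, I want the table’s pages out of the bufferpools between executions. One way to get back to a cold state is to deactivate and reactivate the database (a sketch, assuming no other connections are keeping the database active):

$ db2 connect reset
$ db2 deactivate db sample
$ db2 activate db sample
$ db2 connect to sample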
The explain looks like this:
Explain Plan
----------------------------------------------------------------------------------------------------
ID | Operation          |                           Rows | Cost
 1 | RETURN             |                                | 49569
 2 | TBSCAN             |             18 of 18 (100.00%) | 49569
 3 | SORT               |             18 of 18 (100.00%) | 49569
 4 | GRPBY (COMPLETE)   |            18 of 540 (  3.33%) | 49569
 5 | TBSCAN             |           540 of 540 (100.00%) | 49569
 6 | SORT (UNIQUE)      |      540 of 10099008 (   .01%) | 49569
 7 | HSJOIN             |                 10099008 of 18 | 46443
 8 | TBSCAN EMPPROJACT  | 10099008 of 10099008 (100.00%) | 46172
 9 | IXSCAN XACT2       |             18 of 18 (100.00%) | 0
As expected, the empprojact table is getting a full table scan, and that is taking up most of the time.
Test #2: Base SQL with (Low Cardinality) Index
Next, I added an index to eliminate that table scan:
$ db2 "create index ecrooks.I_empprojact_01 on empprojact (actno,empno) collect detailed statistics" DB20000I The SQL command completed successfully.
Note that this index has a cardinality of only 53 on a table of over 10 million rows. If this were the real world, I’d be looking at whether MDC could better help me with the performance of this query.
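For reference, testing MDC would mean building a copy of the table organized by the join column; something like this hypothetical EMPPROJACT_MDC (a sketch only, and not timed here):

# hypothetical MDC copy of the table, clustered by ACTNO
$ db2 "create table empprojact_mdc like empprojact organize by dimensions (actno)"
$ db2 "insert into empprojact_mdc select * from empprojact"
$ db2 "runstats on table empprojact_mdc with distribution and detailed indexes all"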
Here’s what the timing looks like with the index:
$ db2batch -d sample -f distinct_query.sql
* Timestamp: Thu Nov 12 2015 14:33:24 MST
---------------------------------------------

* SQL Statement Number 1:

SELECT actkwd, count(distinct(empprojact.empno)) as count
FROM act join empprojact on act.actno=empprojact.actno
group by actkwd order by count desc;

ACTKWD COUNT
------ -----------
CODE             8
TEST             8
LOGIC            7
MANAGE           6
DOC              5
OPERAT           4
MAINT            3
ADMQS            2
TEACH            2
ADMDB            1
ADMDC            1
ADMSYS           1
COURSE           1
DEFINE           1
ECOST            1
LEADPR           1
SPECS            1

* 17 row(s) fetched, 17 row(s) output.
* Elapsed Time is:       2.105718 seconds

* Summary Table:

Type      Number      Repetitions Total Time (s) Min Time (s)   Max Time (s)   Arithmetic Mean Geometric Mean Row(s) Fetched Row(s) Output
--------- ----------- ----------- -------------- -------------- -------------- --------------- -------------- -------------- -------------
Statement           1           1       2.105718       2.105718       2.105718        2.105718       2.105718             17            17

* Total Entries:              1
* Total Time:                 2.105718 seconds
* Minimum Time:               2.105718 seconds
* Maximum Time:               2.105718 seconds
* Arithmetic Mean Time:       2.105718 seconds
* Geometric Mean Time:        2.105718 seconds
---------------------------------------------
* Timestamp: Thu Nov 12 2015 14:33:26 MST
That cut the time by more than half. Here’s what the explain looks like:
Explain Plan
----------------------------------------------------------------------------------------------------
ID | Operation               |                           Rows | Cost
 1 | RETURN                  |                                | 15657
 2 | TBSCAN                  |             18 of 18 (100.00%) | 15657
 3 | SORT                    |             18 of 18 (100.00%) | 15657
 4 | GRPBY (COMPLETE)        |            18 of 540 (  3.33%) | 15657
 5 | TBSCAN                  |           540 of 540 (100.00%) | 15657
 6 | SORT (UNIQUE)           |      540 of 10099008 (   .01%) | 15657
 7 | NLJOIN                  |                 10099008 of 18 | 12531
 8 | IXSCAN XACT2            |             18 of 18 (100.00%) | 0
 9 | IXSCAN I_EMPPROJACT_01  |    561056 of 10099008 ( 5.56%) | 750

Predicate Information
 7 - JOIN (Q2.ACTNO = Q1.ACTNO)
 9 - START (Q2.ACTNO = Q1.ACTNO)
     STOP (Q2.ACTNO = Q1.ACTNO)
That is somewhat better. Note, though, that the SORT (UNIQUE) is still applied at step 6, only after the indexes for both tables have been scanned and joined.
Test #3: Rewrite
Finally, I rewrote the SQL to be similar to the article:
select actkwd, log_counts.ct
from act
join (
    select distinct_logs.actno, count(1) as ct
    from (
        select distinct actno, empno
        from empprojact
    ) as distinct_logs
    group by distinct_logs.actno
) as log_counts
    on log_counts.actno = act.actno
order by log_counts.ct desc;
And here’s what the timing looks like:
$ db2batch -d sample -f distinct_query_rewrite.sql
* Timestamp: Thu Nov 12 2015 14:52:01 MST
---------------------------------------------

* SQL Statement Number 1:

select actkwd, log_counts.ct from act
join ( select distinct_logs.actno, count(1) as ct
from ( select distinct actno, empno from empprojact ) as distinct_logs
group by distinct_logs.actno ) as log_counts
on log_counts.actno = act.actno
order by log_counts.ct desc;

ACTKWD CT
------ -----------
CODE             8
TEST             8
LOGIC            7
MANAGE           6
DOC              5
OPERAT           4
MAINT            3
ADMQS            2
TEACH            2
ECOST            1
DEFINE           1
LEADPR           1
SPECS            1
COURSE           1
ADMSYS           1
ADMDB            1
ADMDC            1

* 17 row(s) fetched, 17 row(s) output.
* Elapsed Time is:       1.136768 seconds

* Summary Table:

Type      Number      Repetitions Total Time (s) Min Time (s)   Max Time (s)   Arithmetic Mean Geometric Mean Row(s) Fetched Row(s) Output
--------- ----------- ----------- -------------- -------------- -------------- --------------- -------------- -------------- -------------
Statement           1           1       1.136768       1.136768       1.136768        1.136768       1.136768             17            17

* Total Entries:              1
* Total Time:                 1.136768 seconds
* Minimum Time:               1.136768 seconds
* Maximum Time:               1.136768 seconds
* Arithmetic Mean Time:       1.136768 seconds
* Geometric Mean Time:        1.136768 seconds
---------------------------------------------
* Timestamp: Thu Nov 12 2015 14:52:03 MST
Wow, that’s nearly half the time of the last test!
Here’s what the explain plan looks like for it:
Explain Plan
----------------------------------------------------------------------------------------------------
ID | Operation               |                           Rows | Cost
 1 | RETURN                  |                                | 12706
 2 | TBSCAN                  |             17 of 17 (100.00%) | 12706
 3 | SORT                    |             17 of 17 (100.00%) | 12706
 4 | HSJOIN                  |             17 of 17 (100.00%) | 12706
 5 | IXSCAN XACT2            |             18 of 18 (100.00%) | 0
 6 | GRPBY (COMPLETE)        |             17 of 53 ( 32.08%) | 12706
 7 | UNIQUE                  |       53 of 10099008 (   .00%) | 12706
 8 | IXSCAN I_EMPPROJACT_01  | 10099008 of 10099008 (100.00%) | 12527

Predicate Information
 4 - JOIN (Q3.ACTNO = Q4.ACTNO)
Notice here that the UNIQUE is applied as the second step reading the plan from the bottom up (operation 7, immediately after the index scan), and the total estimated cost is slightly lower, with the actual performance being half the time of the last run.
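As an aside, the same rewrite can also be expressed with common table expressions, which DB2 supports and which I find easier to read. This should be logically equivalent to the nested-subquery form in Test #3 (a sketch, not separately timed):

-- dedupe (actno, empno) pairs first, then count employees per activity
with distinct_logs as (
    select distinct actno, empno
    from empprojact
),
log_counts as (
    select actno, count(1) as ct
    from distinct_logs
    group by actno
)
select actkwd, log_counts.ct
from act
join log_counts
    on log_counts.actno = act.actno
order by log_counts.ct desc;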
Conclusions
I came into this experiment hoping the DB2 optimizer would manage to do things the most efficient way, even with the starting syntax. This was not the case. Even with DB2’s very strong optimizer, how you write SQL still has a significant impact on performance.