SQL That Meant Well: Unions vs. Leveraging COUNT and COUNT_IF

This was one of my favorite types of blog entries to write when I worked with IBM Db2, and I’d like to revive it for Snowflake queries. I spend a good part of my day job helping people solve SQL performance problems on Snowflake. I’m never happier at work than when I’m knee-deep in a query, figuring out why it’s slow and how it could be faster. Due to my role, I cannot share exact customer SQL, so I’ll have to re-create the scenarios that I encounter using sample data. The tables I’m using here are all in the shared SNOWFLAKE_SAMPLE_DATA database.

The entire point of this series is to share SQL patterns that gave a correct result, but were written in such a way that they didn’t perform as well as they could. I love learning by example – both good examples and bad examples.

SQL Statement

The statement I’m re-creating here was one of several we reviewed with a newer customer. It calculated three different counts: one of all rows matching a given set of conditions; one with the same conditions plus a where clause requiring a particular column to be non-null; and a third with the conditions of the first two plus a range predicate on another column. Here’s a reasonable representation of what it looked like:


SELECT 'preferred_customer_count', COUNT(*)
FROM tpcds_sf100tcl.customer
WHERE c_preferred_cust_flag = 'Y'
UNION ALL
SELECT 'preferred_cust_w_email', COUNT(*)
FROM tpcds_sf100tcl.customer
WHERE c_preferred_cust_flag = 'Y'
    AND c_email_address IS NOT NULL
UNION ALL
SELECT 'preferred_cust_w_email_and_old_review', COUNT(*)
FROM tpcds_sf100tcl.customer
WHERE c_preferred_cust_flag = 'Y'
    AND c_email_address IS NOT NULL
    AND c_last_review_date < 2452500;

Query Profile

The query profile for this query involved reading the same sizable table (though perhaps different parts of the table) three times. The query profile looks like this:

Note that even though this table has only 221 micro-partitions, we're reading 663 micro-partitions, because we're reading the table three times.

Why do the three reads take up different portions of time? There are a couple of possible contributing factors. One is that each count requires different columns, and when Snowflake reads a micro-partition, it doesn't have to read the entire micro-partition, only the columns required to satisfy the query. Each micro-partition contains all columns for its set of rows, so it still counts as a micro-partition read even if only specific columns are fetched.

The second reason the various reads may have different impacts on the timeline is that reading the table once may cache the data in the warehouse data cache, meaning that the data is already present in memory when other parts of the query read it.

A Better Way to Get the Same Results

Assuming exactly the same result set is required, this SQL can be used:


WITH counts AS (
    SELECT COUNT(*) AS preferred_customer_count
        , COUNT(c_email_address) AS preferred_cust_w_email
        , COUNT_IF(c_email_address IS NOT NULL AND c_last_review_date < 2452500) AS preferred_cust_w_email_and_old_review
    FROM tpcds_sf100tcl.customer
    WHERE c_preferred_cust_flag = 'Y'
)
SELECT 'preferred_customer_count', preferred_customer_count
FROM counts
UNION ALL
SELECT 'preferred_cust_w_email', preferred_cust_w_email
FROM counts
UNION ALL
SELECT 'preferred_cust_w_email_and_old_review', preferred_cust_w_email_and_old_review
FROM counts;
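If you want to convince yourself the two forms agree, the equivalence can be checked on any SQL engine. Here's a minimal sketch using Python's built-in SQLite, with a tiny made-up table standing in for tpcds_sf100tcl.customer; since SQLite has no COUNT_IF, a SUM(CASE ...) expression stands in for it:

```python
import sqlite3

# Toy stand-in for tpcds_sf100tcl.customer (hypothetical data, not TPC-DS).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        c_preferred_cust_flag TEXT,
        c_email_address TEXT,
        c_last_review_date INTEGER
    )
""")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        ("Y", "a@example.com", 2452000),  # preferred, email, old review
        ("Y", "b@example.com", 2452600),  # preferred, email, recent review
        ("Y", None,            2451000),  # preferred, no email
        ("N", "c@example.com", 2451000),  # not preferred
    ],
)

# Original approach: three scans glued together with UNION ALL.
union_sql = """
    SELECT 'preferred_customer_count' AS label, COUNT(*) AS n
    FROM customer WHERE c_preferred_cust_flag = 'Y'
    UNION ALL
    SELECT 'preferred_cust_w_email', COUNT(*)
    FROM customer
    WHERE c_preferred_cust_flag = 'Y' AND c_email_address IS NOT NULL
    UNION ALL
    SELECT 'preferred_cust_w_email_and_old_review', COUNT(*)
    FROM customer
    WHERE c_preferred_cust_flag = 'Y'
      AND c_email_address IS NOT NULL
      AND c_last_review_date < 2452500
"""

# Rewrite: one scan. SUM(CASE ...) substitutes for Snowflake's COUNT_IF,
# which SQLite does not provide.
cte_sql = """
    WITH counts AS (
        SELECT COUNT(*) AS preferred_customer_count,
               COUNT(c_email_address) AS preferred_cust_w_email,
               SUM(CASE WHEN c_email_address IS NOT NULL
                         AND c_last_review_date < 2452500
                        THEN 1 ELSE 0 END) AS preferred_cust_w_email_and_old_review
        FROM customer
        WHERE c_preferred_cust_flag = 'Y'
    )
    SELECT 'preferred_customer_count', preferred_customer_count FROM counts
    UNION ALL
    SELECT 'preferred_cust_w_email', preferred_cust_w_email FROM counts
    UNION ALL
    SELECT 'preferred_cust_w_email_and_old_review',
           preferred_cust_w_email_and_old_review FROM counts
"""

# Both forms produce the same three (label, count) rows: 3, 2, and 1.
assert sorted(conn.execute(union_sql).fetchall()) == \
       sorted(conn.execute(cte_sql).fetchall())
```

This is only a semantic check on toy data, of course; the performance benefit of the single scan is a Snowflake query-profile observation, not something SQLite will show.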

Obviously this could be simplified to return each count in its own column, using syntax like this:

SELECT COUNT(*) AS preferred_customer_count
    , COUNT(c_email_address) AS preferred_cust_w_email
    , COUNT_IF(c_email_address IS NOT NULL AND c_last_review_date < 2452500) AS preferred_cust_w_email_and_old_review
FROM tpcds_sf100tcl.customer
WHERE c_preferred_cust_flag = 'Y';
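The reason COUNT(c_email_address) can replace the IS NOT NULL predicate is that COUNT over a column, unlike COUNT(*), skips NULLs. A quick illustration of that behavior, using Python's built-in SQLite and a throwaway table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (email TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [("a@example.com",), (None,), ("b@example.com",)],
)

# COUNT(*) counts every row; COUNT(email) counts only non-NULL values.
star, non_null = conn.execute("SELECT COUNT(*), COUNT(email) FROM t").fetchone()
print(star, non_null)  # 3 2
```

This NULL-skipping behavior is standard SQL, which is why the trick carries over between engines.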

When I help a customer rewrite code like this, I always start by returning exactly the same result set, even if I personally don't see the reasoning behind it.

Modified Query Profile

The query profile looks much better:

The Difference

The modified query here reads the table only once, and then splits to do the different aggregations required. The timing difference in these examples takes the query duration from 3.0 seconds for the original UNION ALL methodology to just 1.7 seconds. Nearly halving the query time is a win in my book.

I don't want anyone's takeaway here to be that UNION (particularly UNION ALL) is bad. It is not inherently bad. I've suggested customers use UNION ALL in other scenarios where it made sense. It just doesn't make sense in this scenario, where we can instead leverage the way COUNT handles NULLs and the COUNT_IF aggregate function to reduce the amount of data read.

Ember Crooks

Ember is always curious and thrives on change. She has built internationally recognized expertise in IBM Db2, spent a year working with high-volume MySQL, and is now learning Snowflake. Ember shares both posts about her core skill sets and her journey learning Snowflake.

Ember lives in Denver and works from home.


3 Comments

  1. Nice article

    I still have faith that there are a ton of optimizer tricks to unfold, like this one case….
    But surely people must think better than to do UNIONs on the same table…
    A quick hint I used to use before COUNT_IF was

    SUM(CASE WHEN <condition> THEN 1 ELSE 0 END)

    which is probably what the optimizer does under the hood

    • I was shocked someone would choose this methodology. I don’t think I would have made this up as a possible scenario, but if one developer chose this method, there may be others out there. Amazing how many ways there are to get the same result set.

  2. Wooow, in your result the original query had a run time of 3 seconds? Our Java developers would have been happy enough, and we – as (application) DBAs – would not really notice this.
    I have even seen queries with GROUP BY where – once you check the table and indexes – the count would always be 1, because they grouped on the primary key index 🙂
    Developers should contact SQL specialists more often to support them with (complicated) queries. Just because the result is what was expected doesn't mean the SQL is okay as well.
    Thank you for the tip in this specific example. I am going to check whether this is also a good (better) solution in Db2, or is our optimizer smart enough? 😉
