I’ve spent 20 years of my career working with IBM Db2, then a year working with MySQL, and now my database platform of choice is Snowflake. When I was moving from Db2 to MySQL, there were a number of “WTH?” kind of moments. Things I was familiar with in Db2 that weren’t there or didn’t work the same way. Granted, part of this was the deep level I was digging into with MySQL and the deep level of expertise I had with Db2. The number of people who know about or care about the page structure of tables on either platform is probably pretty small.
Snowflake WTH
I have only had one such moment with Snowflake, and it was while I was still interviewing. I learned that essentially in Snowflake, indexes don’t exist. I’ll list some semi-exceptions to this below. I’ve probably spent a measurable part of the last 21 years working with indexes. I understand the difference between a b-tree and a b+tree, even if both MySQL and Db2 use a b+tree. Even having not touched a production Db2 environment in over a year and a half now, I could still get really deep in the discussion about selecting an optimal index, using clustering indexes, and how to understand if indexes are being used in Db2. My first independently published article was for IBM developerWorks (now defunct) on how low-cardinality indexes can actually be bad for performance.
Because of this, there may or may not have been a column in my spreadsheet comparing my job opportunities I was considering when I was laid off earlier this year titled “Indexes?” (right after the column titled “Relational?”).
It didn’t take me long to understand why B+ tree indexes are not a standard part of Snowflake. Snowflake takes a big data approach to an SQL-based database. There are a lot of controls over things like how much horsepower you apply to a query, and the starting point is a single vm (“warehouse”) with 8 vCPUs, with steps that very quickly have you applying dozens or hundreds of vms, each with 8 or more vCPUs to a single query. While I’ve absolutely seen Db2 systems applying even more horsepower than that, they were large systems where the companies paid millions in licensing and support costs (not to mention planning and capital spending for the hardware), not a system that allows paying a few dozen or few hundred dollars for such resources for the life of a query.
To me, that’s one of the amazing things about Snowflake. I still remember early in my career working through capital expenditure processes that took months to get hardware in place that today is less than what I have running in my MacBook pro. That’s partially how the world has changed around me, but I think that Snowflake is one of the leaders in applying this kind of computing power at the database level and making it really easy to use this capacity when needed and not have to pay for it the rest of the time. I can absolutely design a Db2 systems that has 128 8-core VMs, but I can’t do so in time to apply it to a query an hour from now. And I can’t change my mind and a week later only use a single 8-core VM, followed by compute twice the original size for the next peak period. Even the cloud offerings for Db2 don’t allow me to change sizes that quickly.
If Not Indexes, Then What?
Does this mean that there are no levers to pull/push for performance in Snowflake? Absolutely not.
One of the most powerful tools that Snowflake customers have to improve performance is the use of clustering. Snowflake uses the statistics collected about the data stored in micro-partitions to eliminate the reading of entire micro-partitions. I’ve seen queries on tables that were around 30 TB of compressed data have to read only a few compressed 16 MB micro-partitions to return the results of a query. Partition elimination is one of the key performance metrics that can be used to improve query performance.
A table can be explicitly clustered via the definition of a clustering key and the use of a serverless service called automatic clustering. The work that automatic clustering does is much like what we might do with reorg for a table in a Db2 database with a clustering index. It takes the data and tries to place it on micro-partitions grouped by the clustering key. One of the interesting things is that simply loading data into a table in an interesting order can also lead to the performance kick of partition elimination. This means that integrating ordering into a data engineering pipeline can lead to performance benefits without having to pay Snowflake for the service.
As with any database platform, there are some design patterns that lead to higher performance than others.
Index-Like Things in Snowflake
There is a new feature under development called Unistore. Snowflake announced it in 2022. It introduces Hybrid tables, which are designed differently than normal Snowflake tables, and designed for better performance on singleton-row lookups and updates. It will allow the creation of not just the indexes to support primary and foreign keys (also non-existent in regular Snowflake tables), but a full indexing capability.
Depending on your definition of “index”, the search optimization service may be somewhat similar. The search optimization service builds a materialized view that is a bloom filter of values. A bloom filter is a probabilistic structure that essentially lists micro-partitions where a specific value might exist. A false positive is possible, but that is filtered out in the course of running the query. What it really means is that far fewer micro-partitions need to be scanned to find a single unique value. The search optimization service greatly speeds up the lookup of fairly unique values in standard Snowflake tables. It must be explicitly enabled for a table, and is another serverless service where the creation and maintenance of that materialized view is billed separately. Use of that materialized view is transparent, and no change is required to queries to take advantage of it.
Each database platform I’ve worked with has quirks in how things should be done to get the best performance. Snowflake is no different. You can write bad SQL for any platform. Snowflake really benefits from queries that use SQL set-based logic and data warehousing data models. Data models that match Snowflake’s strengths will do more for performance than most other things.
I Don’t Miss Indexes
I don’t miss trying to find the perfect index for a workload of over a thousand queries. I do enjoy seeing near linear scalability on the processing of some queries with the growth of hardware. I really enjoy a small table being anything under a billion rows, and seeing query plans that sift through trillions of rows in minutes. I may be a tad biased, though!
Thank you for sharing your deep insight always.
Netezza is probably the first RDBMS or DBMS (starting 20 some years ago) which does not have indexes.