The revelation that Snowflake doesn’t (currently) use b-tree indexes may be shocking to some who are used to a more traditional RDBMS environment. It is interesting, therefore, to walk through the general methodology of each approach to understand why the lack of indexes didn’t scare me away from the platform for the right use cases.
Typical RDBMS B-Tree Indexes
My experience here comes mostly from Db2, though I’ll note a few differences I’m aware of for MySQL.
Most RDBMSes have the concept of a page or a block. This is the minimum unit of I/O for the database, meaning you can’t read less than one page, and when you write, you have to write out the entire page, even if you only changed one bit. Page size is often a factor in performance, and choices generally range from 4 KB to 32 KB, though some platforms may support larger page sizes. For the most part, a row should fit on a page, though most RDBMS platforms have ways of getting around that restriction for really wide rows.
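To make the page-as-minimum-I/O-unit idea concrete, here is a minimal sketch (not any real engine’s storage code) that models a tiny table space as a byte buffer: even to flip a single byte, the whole page is read and then written back.

```python
import io

PAGE_SIZE = 4096                              # a common default page size
storage = io.BytesIO(bytes(PAGE_SIZE * 4))    # a tiny 4-page "table space"

def read_page(page_no):
    """Read one whole page; we cannot read less than a page."""
    storage.seek(page_no * PAGE_SIZE)
    return bytearray(storage.read(PAGE_SIZE))

def write_page(page_no, page):
    """Write one whole page; partial pages cannot be written."""
    assert len(page) == PAGE_SIZE
    storage.seek(page_no * PAGE_SIZE)
    storage.write(page)

# Change a single byte on page 2: read the full page, modify it in memory,
# and write the entire page back out.
page = read_page(2)
page[17] = 0xFF
write_page(2, page)
print(read_page(2)[17])  # -> 255
```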
Generally, indexes are stored in a B+ tree format. The overall effect is to introduce a few extra index page reads in order to read far fewer table pages overall. In vague terms, the table and index can be thought of as having this kind of structure:
The root and non-root index pages essentially store a directory of the pages at the next level down in the hierarchy. The values stored are the page location and the maximum value of the index key stored on that page. The leaf pages list every index key value in order, along with a locator for the table page(s) where the matching row(s) can be found. In Db2, the locator is a RID, composed of the page number within the table space and a slot within that page. In MySQL, the locator is the primary key value of the row, since table data is stored within the b-tree structure of the primary key index.
This makes a single-row lookup on a unique key value extremely fast, as the index can be navigated to get the right locator, and then only the one page within the table has to be read.
Without the index, every table data page would have to be read to find the one row with that value.
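The navigation described above can be sketched in a few lines of Python. This is a deliberately simplified two-level B+ tree with hypothetical order-date keys, not Db2’s actual on-disk layout: non-leaf entries map the maximum key on a child page to that page, and leaf entries map a key to a (page, slot) locator.

```python
import bisect

# Hypothetical index over order dates. Leaf pages map key -> (page, slot)
# locator in the table; the root records the maximum key on each leaf.
leaf_pages = [
    {"1992-01-05": (101, 3), "1992-01-09": (214, 0)},   # leaf 0
    {"1992-02-11": (87, 1),  "1992-03-02": (101, 7)},   # leaf 1
]
root = [("1992-01-09", 0), ("1992-03-02", 1)]  # (max key on leaf, leaf number)

def lookup(key):
    """Find the table locator for key, reading one root page + one leaf page."""
    max_keys = [entry[0] for entry in root]
    i = bisect.bisect_left(max_keys, key)   # first leaf whose max key >= key
    if i == len(root):
        return None                         # key is beyond the last leaf
    return leaf_pages[root[i][1]].get(key)  # read just that one leaf page

print(lookup("1992-02-11"))  # -> (87, 1): 2 index page reads + 1 table page read
```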
The value of this becomes less and less impactful as the percentage of the table being returned gets higher. In an extreme case, we could read nearly every page of the index, only to then have to read nearly every page of the table to return 99% of the rows in the table.
This is actually more expensive than a full table scan, because we’re reading more pages. RDBMS optimizers look for this and are likely to choose a table scan instead.
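A back-of-envelope calculation shows why optimizers flip to a table scan. With poor clustering, each qualifying row may sit on a different table page, so at high selectivity the index path can read more pages than simply scanning the whole table. The numbers below are illustrative, not from any real system.

```python
def pages_via_index(selectivity, table_pages, index_pages, rows):
    """Rough worst-case page-read estimate for index access with no clustering."""
    qualifying_rows = rows * selectivity
    # Worst case: one table page read per qualifying row, capped at table size,
    # plus the fraction of index pages we touch along the way.
    return index_pages * selectivity + min(qualifying_rows, table_pages)

# Illustrative sizes loosely modeled on the example later in this article.
table_pages, index_pages, rows = 52_256, 3_000, 1_500_000

for sel in (0.0001, 0.01, 0.99):
    via_index = pages_via_index(sel, table_pages, index_pages, rows)
    plan = "index access" if via_index < table_pages else "table scan"
    print(f"selectivity {sel}: ~{via_index:,.0f} pages via index -> {plan}")
```

At 0.01% selectivity the index path reads a few hundred pages instead of 52,256; at 99% it would read more pages than the scan, which is exactly the case where the optimizer chooses the table scan.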
At moderate levels, clustering a table on an index (placing the table in approximately the same order as the index) can help, and some RDBMSes offer this feature.
In Db2, the threshold we used to define an index as “low-cardinality” and therefore unlikely to be helpful was when the index cardinality is below about 3% of the table cardinality.
Indexes are physical structures that must be maintained in real-time with table data. If we index every column (and every combination of columns), then that not only takes up a lot of space by storing the data in multiple places, it also slows down inserts, updates, and deletes considerably. Indexing is a balance of finding the indexes that help our query workload without slowing down performance too much. It was always a warning flag for me when I saw a table with more than 8 indexes on it, though I’ve absolutely seen tables in Db2 and MySQL with dozens of indexes.
It is also worth noting that we are not showing the reads of metadata (statistics) that are used by the optimizer to come up with an access plan in the first place. Db2’s statistical metadata can be out of date if statistics are not current.
Snowflake’s Micro-Partition Structure
I can’t stop thinking of Snowflake’s micro-partitions as giant pages. Micro-partitions are often 16 MB of compressed column-organized data. This is vastly larger than other RDBMS pages, which is actually appropriate when dealing with an analytical workload. The micro-partition is a file in cloud object storage. It is also kind of the minimum unit of I/O in that a partial partition cannot be written, though it is possible to read only the needed columns from a partition. Since cloud object storage is immutable, a change to a micro-partition actually involves creating a new file and deregistering the old file.
Snowflake targets larger data sets, and doesn’t have the same optimizations around single-row access. Avoiding reading an entire table is absolutely a priority, though, so Snowflake uses partition elimination as one way to achieve that. The first level of partition elimination is achieved by reading metadata about the micro-partitions. This metadata includes the minimum and maximum values for each column in each micro-partition. The metadata is stored in Snowflake’s cloud services layer, while the micro-partitions themselves are stored in cloud object storage owned by Snowflake.
The metadata for a micro-partition is written at the same time the micro-partition is written, as part of the transaction, so the metadata is never incorrect or out of date.
A table can be intentionally or accidentally clustered on one or more columns. If the table is appropriately clustered on a column that holds a value we’re looking for, we can read the metadata to understand which partition we need, and then go load that specific partition to retrieve the row.
The worse the clustering is for the column we’re concerned about, the worse partition elimination gets, with the worst-case being that we have to read the entire table to find a specific row that we’re looking for.
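The effect of clustering on min/max pruning can be sketched in Python. This is an illustrative model of the idea, not Snowflake internals: a partition can only be skipped when the value we want falls outside that partition’s [min, max] range, and shuffled data makes every partition’s range span nearly the whole table.

```python
import datetime as dt
import random

def make_partitions(values, n_parts):
    """Split values into n_parts partitions, recording per-partition min/max metadata."""
    size = len(values) // n_parts
    parts = [values[i * size:(i + 1) * size] for i in range(n_parts)]
    return [(min(p), max(p), p) for p in parts]

# Synthetic order dates cycling over roughly 6.5 years.
dates = [dt.date(1992, 1, 1) + dt.timedelta(days=d % 2400) for d in range(80_000)]

clustered = make_partitions(sorted(dates), 8)   # well clustered on date
shuffled = dates[:]
random.Random(0).shuffle(shuffled)
unclustered = make_partitions(shuffled, 8)      # no clustering at all

target = dt.date(1992, 1, 1)
counts = {}
for label, parts in (("clustered", clustered), ("random", unclustered)):
    # A partition must be scanned when target falls inside its [min, max] range.
    counts[label] = sum(1 for lo, hi, _ in parts if lo <= target <= hi)
    print(f"{label}: must scan {counts[label]} of {len(parts)} partitions")
```

With the clustered copy, only one partition’s range contains the target date; with the shuffled copy, essentially every partition’s range does, so nothing can be eliminated.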
Because micro-partitions are so much larger than pages, we can easily have thousands or even millions of rows in a single micro-partition.
We can’t cluster the table in different ways at the same time, so the column or columns that make up our clustering key must be carefully chosen to match our query workload. High cardinality on the clustering key can lead to higher charges if using Auto Clustering.
To help with single row queries, Snowflake also offers Search Optimization Service, which builds a search path that can be used to find unique values more quickly. This is only really helpful for single values or very short lists of values. The tech behind SOS can basically be thought of as a materialized view based on a bloom filter. The bloom filter points to partitions where specific values may exist, and false positives are possible, though filtered out during query processing.
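The bloom-filter idea behind this can be sketched as follows. This is an assumed, simplified model of the mechanics described above, not Snowflake’s implementation: one filter per partition answers “might value X be in this partition?”, with false positives possible but no false negatives.

```python
import hashlib

class PartitionBloom:
    """A tiny Bloom filter tracking which values a partition might contain."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.bitmap = bits, hashes, 0

    def _positions(self, value):
        # Derive several bit positions per value from independent hashes.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, value):
        for pos in self._positions(value):
            self.bitmap |= 1 << pos

    def might_contain(self, value):
        # False means definitely absent; True means "maybe" (false positives possible).
        return all(self.bitmap & (1 << pos) for pos in self._positions(value))

# One filter per partition; probe every filter to find candidate partitions.
partitions = {0: [1, 5, 9], 1: [2, 6, 42], 2: [3, 7, 11]}
blooms = {pid: PartitionBloom() for pid in partitions}
for pid, keys in partitions.items():
    for k in keys:
        blooms[pid].add(k)

candidates = [pid for pid, b in blooms.items() if b.might_contain(42)]
print(candidates)  # partition 1 is always listed; others only as false positives
```

Only the candidate partitions need to be read from object storage, and any false positives are filtered out during query processing, as described above.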
Explain Plan / Query Profile
Now, I’d like to show some actual explain plans and query profiles illustrating these concepts in Db2 and Snowflake. I had some problems getting access to a Db2 installation that worked for me, which delayed my ability to publish this article. The only Db2 I have regular free access to is Db2 on Cloud Lite, which has a 200 MB data limit, and I couldn’t install locally because IBM still does not have a version of Db2 that runs on M1/M2 Macs. I usually work with much more data than this in Snowflake. Snowflake has several sample data sets, and I usually work with the data in the TPCH_SF100 or TPCH_SF1000 schema. ‘SF’ in these schema names stands for “Scale Factor”, so TPCH_SF1000 is 10X the size of TPCH_SF100 in data volume. For this article, I’ll scale down to TPCH_SF1; even a single table from that schema exported to a 170 MB CSV. I was unable to load that into Db2 on Cloud Lite, so instead I pulled an old Mac from the basement that my husband uses for his 3D printer. I’ll be using the orders table.
I started by creating the table, and loading the data I exported from Snowflake, like this:
CREATE TABLE ORDERS (
    O_ORDERKEY      BIGINT NOT NULL PRIMARY KEY,
    O_CUSTKEY       BIGINT,
    O_ORDERSTATUS   CHAR(1),
    O_TOTALPRICE    DECIMAL(12,2),
    O_ORDERDATE     DATE,
    O_ORDERPRIORITY VARCHAR(15),
    O_CLERK         VARCHAR(15),
    O_SHIPPRIORITY  BIGINT,
    O_COMMENT       VARCHAR(79)
);

load from ORDERS.csv of del replace into orders;

backup db testdb to /dev/null;
The load command is a super-fast way of getting data into a Db2 table that bypasses logging, and this is why the backup command is included here. Backing up to /dev/null does nothing, but in a real environment, you'd want a backup after a load to maintain recoverability. You can also use the COPY YES or NONRECOVERABLE keywords if they are appropriate for you.
Snowflake Data and Query Profiles
I'm using some of the sample data that comes with Snowflake here, so I'm really just going to pull it into a table to cluster it however I'd like.
CREATE OR REPLACE TABLE orders_rand AS (
    SELECT *
    FROM snowflake_sample_data.TPCH_SF1.orders
    ORDER BY RANDOM()
);
I'm putting the data in random order here to ensure I don't have any accidental data clustering. Instead of adding indexes, I'll be clustering this data for the target query we're working with here.
Db2's explain plans are read much like Snowflake's query profiles - bottom to top. Their graphical representation is text-based in the interface I'm using - db2exfmt. Snowflake provides profiles for all queries, available by clicking on the query ID or by selecting the query from Query History (under Activity in the Snowsight interface).
No Index or Clustering
The query I'll use is just a select of order data from a specific date. The result set here is 621 rows in size, so a fairly small portion of the 1.5 million row table.
select * from orders where o_orderdate=date('01/01/1992')
The Db2 explain plan graph for this looks like this:
Access Plan:
-----------
        Total Cost:             46313.3
        Query Degree:           1

          Rows
         RETURN
         (   1)
          Cost
           I/O
           |
         613.407
         TBSCAN
         (   2)
         46313.3
          52256
           |
       1.49979e+06
     TABLE: DB2INST1
         ORDERS
           Q1
This is a typical table scan access plan. Db2 thinks it will end up with about 613 rows, which is a fairly close estimate for the 621 that are actually returned. To accomplish this, Db2 is scanning every row in the table to determine if each meets the criteria of the query conditions.
Each operator in an explain plan shows both the estimated cost and the estimated I/Os expected. The access plan above has only one operator (the table scan - TBSCAN). When there is more than one, the cost and I/O estimates are cumulative and include everything below them in the plan. In this case, Db2 expects to have to read 52,256 pages in order to scan this table. In this example, the pages are 4 KB in size.
In Snowflake, our query profile looks like this:
Note how similar this looks in structure to the Db2 access plan. Snowflake does represent the filtering of data separately, though the filtering is done while the table is being scanned. While Db2 has a separate node in the graph to represent the table object itself, Snowflake represents only the scan of the table, including details about the table in the right pane when you click on the table scan operator.
One important part to look at here is the number of partitions that were scanned. 8 out of 8 tells us that Snowflake had to scan all micro-partitions for the table to get this result. Like Db2, Snowflake is scanning each row in the table to determine whether it meets our query criteria.
Now what does this same query look like with an index on Db2? We'll add a simple index on o_orderdate:
db2 "create index idx1 on orders (o_orderdate) collect detailed statistics"
The access plan now looks much better:
Access Plan:
-----------
        Total Cost:             913.863
        Query Degree:           1

              Rows
             RETURN
             (   1)
              Cost
               I/O
               |
             582.605
             FETCH
             (   2)
             913.863
             135.046
           /---+----\
     582.605        1.5e+06
     IXSCAN   TABLE: DB2INST1
     (   3)        ORDERS
     26.542          Q1
     3.90992
       |
     1.5e+06
 INDEX: DB2INST1
       IDX1
Notice the difference in "Total Cost". This is expressed in a unit called timerons. Timerons cannot be translated to an actual unit of time, but can be compared between queries. In this case, we reduced the cost of the query to about 1/50th of what it was.
The estimate of the number of rows expected in the result set actually got a bit further off from what we know to be reality. The I/O estimated over the course of this query is far lower at only about 135 pages compared to the 52,256 pages that were read for the full table scan. We can also see that the index scan itself represents about 4 page I/Os, but those 4 I/Os are what enable us to read fewer pages from the table.
In Snowflake, we'll first cluster the table on the o_orderdate column by creating a new copy of the table like this:
CREATE OR REPLACE TABLE orders_sorted_orderdate AS (
    SELECT *
    FROM snowflake_sample_data.TPCH_SF1.orders
    ORDER BY o_orderdate
);
The query profile when running the same query against this table looks like this:
Notice there isn't a real change in the graphical structure here, but what we do see is that only 1/8th of the data had to be scanned: only one out of the 8 partitions for the table was even scanned.
There are some interesting things to note here. As the traditional RDBMS access plan gets better, it may actually look a bit more complicated. Instead of a straight line showing only the table, we have an additional physical object - the index - also in the plan. The table remains in the plan in this case because we query the index to figure out which rows we need, and then we go to the table to retrieve the rows we need. There are certainly queries and access plans in a traditional RDBMS where this would not be true. Index-only access is a possibility, and if we're querying by primary key on MySQL or another RDBMS that stores row data on the leaf pages of the primary key index, we could see only the index in the plan, without the table.
Snowflake's query profile looks the same in the graphical representation for more and less efficient access methods for the same table, but we need to look at the numbers for a better understanding of what is actually going on.
If we were to actually time these queries, both platforms would likely show faster clock time for their more efficient forms, though with significant variation based on whether their caches were warm or not.