Basics of Performance in Snowflake – for Relational DBAs

I spent more than 20 years working with relational databases, gaining deep expertise with Db2 LUW, lesser expertise with MySQL, and a bit of time spent on MS SQL Server, Oracle, and assorted other database platforms and database-adjacent tasks. I started my career as a physical/systems DBA and evolved to cover all manner of database-related tasks. Coming to Snowflake with this background has been an education, and I thought I’d share some of what I’ve learned.

Disclaimer

At the time of this article, I am a Snowflake employee. The statements here do not represent Snowflake. They are my personal opinions only. Any statements here should be verified through official channels before they are relied upon.

Buh-Bye Physical/Systems DBA Tasks

The lack of traditional physical/systems DBA tasks was hard for me to fully accept and let go of, and several customers I’ve talked to have also struggled with this. It is easy to get used to spending your days monitoring and managing database servers, tuning them, and identifying problems that are hardware-adjacent. Knowing how utilized a database server is was the bread and butter of managing database systems. Ensuring backups, copying databases, and building processes to accomplish these tasks, along with ongoing maintenance, have taken up thousands of hours of my time over my career.

Even running traditional databases in the cloud still involved some tasks in this direction – ensuring correct statistics and managing upgrades are things that still require effort, even if the database is on a service like RDS that does most of the work for you.

With Snowflake, there’s very little management of backup/recovery. If required for regulatory or business reasons, perhaps copying a database periodically would be as far as this might go. Managing settings for time-travel and fail-safe, and ensuring tables are appropriately created as transient or regular might fall under this category, but they require far less specialized expertise than designing and implementing a recovery or HA strategy for Db2 did.
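
As a concrete illustration, here is a minimal sketch of the small amount of recovery-related setup that remains. The table and database names (staging_events, sales_orders, sales_db) are made up for this example.

    -- A transient table skips Fail-safe and keeps minimal Time Travel, which is
    -- reasonable for easily reproducible staging data.
    CREATE TRANSIENT TABLE staging_events (
        event_id  NUMBER,
        payload   VARIANT,
        loaded_at TIMESTAMP_NTZ
    );

    -- A regular table with a longer Time Travel window for business-critical data.
    CREATE TABLE sales_orders (
        order_id   NUMBER,
        order_date DATE,
        amount     NUMBER(12,2)
    )
    DATA_RETENTION_TIME_IN_DAYS = 7;

    -- If a periodic copy is required for regulatory or business reasons,
    -- zero-copy cloning is one lightweight way to take it.
    CREATE DATABASE sales_db_quarterly_copy CLONE sales_db;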

Managing resources is still a thing, but at a different level. With the full separation of compute from storage in Snowflake, the tasks here are mostly less onerous, and often consist of having a strategy for who or what processes use which warehouses and helping users understand or have access to the right size of warehouses. Setting a sane value for auto suspend is maybe the most important and onerous task here, to make sure users don’t leave idle warehouses up and running or start the same warehouse a dozen times in a minute. Managing autoscaling may require a bit of testing and attention.
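
To make that concrete, here is a minimal sketch of setting up a warehouse for a workload with a sane AUTO_SUSPEND value; the warehouse, role, and user names are hypothetical.

    CREATE WAREHOUSE reporting_wh WITH
        WAREHOUSE_SIZE      = 'MEDIUM'
        AUTO_SUSPEND        = 60     -- seconds of idle time before the warehouse suspends
        AUTO_RESUME         = TRUE   -- resume automatically when a query arrives
        INITIALLY_SUSPENDED = TRUE;

    -- Point a group of users at the right warehouse so they don't have to think about it.
    GRANT USAGE ON WAREHOUSE reporting_wh TO ROLE reporting_role;
    ALTER USER report_user SET DEFAULT_WAREHOUSE = reporting_wh;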

There’s no watching CPU utilization to ensure we don’t hit a wall or tuning memory allocations to accommodate different kinds of workloads. Workload Management is mostly handled by separating workloads into their own warehouses.

One of the things I personally enjoy about letting go of this work is that it opens me up to spend more time in the fun (to me) areas of query performance and optimization. There are still plenty of tasks for database professionals – they’re just different.

Compute, Storage, and Cloud Services

Snowflake’s architecture, rather than being built around individual servers and the structures within them, is truly cloud-native. There are three basic layers to understand: compute, storage, and cloud services. I’m going to use AWS terms here because I’m more familiar with Amazon, but equivalents absolutely exist for each of the clouds where Snowflake is available (currently AWS, Azure, and GCP).

Storage

The storage layer in Snowflake is mostly cloud object storage such as S3 on AWS. There is some EBS that is used for local caching of data, and Unistore/Hybrid Tables use storage differently. There are some great things that come with cloud object storage, such as the famous 11 9's of durability. It is nearly unheard of to lose data. However, the access latency of this type of storage is much higher than what people who are used to working with SSDs on database servers may expect. Snowflake is designed for querying large amounts of data, and applying a lot of compute to read files in parallel goes a long way toward counter-balancing this disadvantage. Currently the smallest warehouse size in AWS has 8 cores (subject to change at any time). I have worked with many database servers smaller than that in my career.

Compute

The compute is not coupled with or tied to the storage in any way. Compute resources are referred to as warehouses, and the data is not owned by or tied to any one warehouse. That is different from any other database system I have worked with extensively. Even when running databases in AWS RDS, the storage there is coupled with the compute. Sure, you can size up or down, but not without causing an outage last time I worked with it, and you also can’t apply a server with 8 cores and a cluster of servers with a total of 1024 cores to the same data at the same time.

It is incredibly easy to size up not just from moment to moment, but also to use different compute sizes for different workloads running at the same time against the same data. Another characteristic of cloud object storage is the high throughput. The same data can be read by a lot of cores at the same time. I don’t know the limits there, but it is higher than I’ve ever had to think about.
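
Here is a sketch of what that looks like in practice: two differently sized warehouses working against the same table at the same time. The warehouse names are hypothetical, and sales_orders is the example table from earlier.

    CREATE WAREHOUSE adhoc_xs  WITH WAREHOUSE_SIZE = 'XSMALL'  AUTO_SUSPEND = 60;
    CREATE WAREHOUSE batch_xxl WITH WAREHOUSE_SIZE = 'XXLARGE' AUTO_SUSPEND = 60;

    -- Session 1: a small interactive lookup on the tiny warehouse.
    USE WAREHOUSE adhoc_xs;
    SELECT COUNT(*) FROM sales_orders WHERE order_date = CURRENT_DATE;

    -- Session 2, at the same time: a heavy aggregation over the same table
    -- on a much larger warehouse. Neither workload owns the data.
    USE WAREHOUSE batch_xxl;
    SELECT order_date, SUM(amount) FROM sales_orders GROUP BY order_date;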

Sizing a warehouse appropriately for a workload is still a bit of an art. I’ve seen cases where increasing the warehouse size was actually cheaper because the larger warehouse had to run for less than half the amount of time taken by the next smaller sized warehouse. As of today, there isn’t auto warehouse sizing (scaling up/down) for query complexity outside of tasks. There is automatic scale-out to address concurrency.
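
A quick back-of-the-envelope example of why bigger can sometimes be cheaper, using the standard published credit rates of 4 credits per hour for a Medium warehouse and 8 credits per hour for a Large one; the run times are hypothetical.

    -- A query that takes 60 minutes on a Medium versus 25 minutes on a Large.
    SELECT
        (60 / 60.0) * 4 AS medium_credits,  -- 4.00 credits
        (25 / 60.0) * 8 AS large_credits;   -- ~3.33 credits: the larger warehouse wins here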

Cloud Services

If we have compute running queries against storage on cloud object storage, there’s still a piece of magic missing. After all, anyone can have files on object storage and apply compute to them. The magic sauce is all the system-level work that any database system has to do. Cloud services is a mostly shared environment where system tasks like these occur:

  • Authentication
  • Access control
  • Infrastructure management
  • Metadata management
  • Query parsing and optimization

Notice that this is often a shared resource, unless you are an exceptionally large Snowflake customer. That means that this layer is where it might be possible to run into things like noisy neighbors or glitches with the cloud provider or Snowflake itself that may cause delays in processing a query. These things are very rare, but are possible.

Most customers don’t have to pay for using cloud services, but if the usage of cloud services exceeds a certain percentage, customers are billed for it. My understanding is that this is to prevent abuse of this layer.
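
One way to see how this plays out for your account is to compare cloud services usage to what was actually billed. This is a sketch against the SNOWFLAKE.ACCOUNT_USAGE.METERING_DAILY_HISTORY view; the adjustment column reflects the portion waived under the daily threshold (commonly cited as 10% of compute usage).

    SELECT
        usage_date,
        SUM(credits_used_compute)              AS compute_credits,
        SUM(credits_used_cloud_services)       AS cloud_services_credits,
        SUM(credits_adjustment_cloud_services) AS cloud_services_adjustment, -- negative: the waived portion
        SUM(credits_used_cloud_services)
          + SUM(credits_adjustment_cloud_services) AS cloud_services_billed
    FROM snowflake.account_usage.metering_daily_history
    GROUP BY usage_date
    ORDER BY usage_date DESC;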

Query Performance

When working with Db2, I found that about 20% of achieving good performance was things like appropriately sized hardware and working to tune memory allocations, while the other 80% was what I thought of as logical factors. That 20% was largely addressed by architecting a database server correctly and having reasonable values set for a number of parameters. Some ongoing monitoring and tuning was also needed. But the larger part was the logical factors. Use a bad data model or write a stupid query – that was 80% of the performance problems that I personally saw.

If it is really easy to use a larger warehouse, then it is really easy to throw hardware at a query or data modeling problem. I’ve seen people throw hardware at problems throughout my career, but with Snowflake, this can be done in seconds rather than hours, days, weeks, or months. The corollary of this is that throwing hardware at a problem is also throwing money at the problem. This can be expensive. Solving the underlying logical problem is difficult and takes time and expertise, but can be much cheaper, particularly in the long run.
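
This is quite literally how easy throwing hardware (and money) at a problem is; reporting_wh is the hypothetical warehouse from earlier.

    -- Takes effect in seconds, and multiplies the credit burn rate for as long
    -- as the warehouse runs.
    ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';

    -- Just as easy to put back once the underlying query or model is fixed.
    ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'MEDIUM';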

With Snowflake, I find the vast majority of workloads on well-designed data models just work, even with “well-designed” being a very broad net that can cover a vast array of data modeling strategies from Kimball to Inmon to Data Vault. I’m often asked, “What is the best practice for data modeling for Snowflake?”, and while I can point to a lot of mistakes and bad ideas, most well thought-out data modeling techniques work well. I like some denormalization if given the choice, but it can be taken too far – I’ve seen well in excess of 2000 columns in a table, and find that there’s an inflection point around 800 columns where compilation time goes up without a corresponding reduction in total execution time.

There are absolutely cases where query performance problems happen. Snowflake is making strides every week to reduce the situations where this occurs. I plan to write a number of articles detailing performance issues that I have seen and that you can identify yourself.

There are also normal failures on the part of a cloud provider or even Snowflake itself that can cause one-time query performance problems. When I’m looking at a query, one of the first questions I ask is whether this query always performs this way or whether there is a trend in performance over time. The QUERY_PARAMETERIZED_HASH column in query history can be particularly useful for this.
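
Here is a sketch of the kind of check I mean, grouping query history by the parameterized hash to see the trend over time. It assumes the QUERY_PARAMETERIZED_HASH column in SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY, and the hash literal is a placeholder for the query you are investigating.

    SELECT
        DATE_TRUNC('day', start_time)  AS run_date,
        COUNT(*)                       AS executions,
        AVG(total_elapsed_time) / 1000 AS avg_elapsed_seconds,
        MAX(total_elapsed_time) / 1000 AS worst_elapsed_seconds
    FROM snowflake.account_usage.query_history
    WHERE query_parameterized_hash = '<hash_of_the_query_in_question>'
      AND start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY run_date
    ORDER BY run_date;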

Concurrency

Concurrency can absolutely cause problems, but mostly in the contention for resources on a single warehouse or multi-cluster warehouse.

Many concurrency problems are solved by having different warehouses for different workloads. For many concurrency issues, a properly configured multi-cluster warehouse can provide more compute in a scalable way. In extreme cases, the parameter MAX_CONCURRENCY_LEVEL can be used to increase or decrease concurrency, but it should be used with caution, and only when other options have been exhausted.
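
A sketch of both options, with hypothetical names and cluster counts; the MAX_CONCURRENCY_LEVEL change is shown last because it really is the last resort.

    -- A multi-cluster warehouse: additional clusters start only when queries queue.
    CREATE WAREHOUSE bi_wh WITH
        WAREHOUSE_SIZE    = 'MEDIUM'
        MIN_CLUSTER_COUNT = 1
        MAX_CLUSTER_COUNT = 4
        SCALING_POLICY    = 'STANDARD'
        AUTO_SUSPEND      = 60;

    -- Only after other options are exhausted: change how many queries may run
    -- concurrently on each cluster of this warehouse.
    ALTER WAREHOUSE bi_wh SET MAX_CONCURRENCY_LEVEL = 4;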

Clustering

Snowflake does not have indexes for normal tables. Unistore/Hybrid Tables are an exception, intended for a specific kind of workload. However, Snowflake does have clustering. Clustering in Snowflake is somewhat like a clustering index in Db2: it affects the physical organization of data and enables partition pruning so less data has to be scanned. This is a powerful tool for performance. As part of the metadata that is recorded about each micro-partition, Snowflake stores the minimum and maximum values for each eligible column on every micro-partition. When Snowflake knows the values needed, through either a predicate or a join condition, it can then eliminate micro-partitions that cannot contain values required by the query. This can vastly reduce the I/O needed to satisfy a query. Choosing a clustering key is a bit of an art, though. Clustering keys that are too high in cardinality may be excessively expensive without a corresponding advantage in performance over a lower-cardinality key. I plan to cover clustering in much greater detail in future articles.
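
A small sketch of defining a clustering key and checking how well it is working; the table and column names are the hypothetical ones from earlier.

    -- Define (or change) the clustering key.
    ALTER TABLE sales_orders CLUSTER BY (order_date);

    -- Check clustering quality (average depth, overlaps) for the chosen key.
    SELECT SYSTEM$CLUSTERING_INFORMATION('sales_orders', '(order_date)');

    -- A query that prunes well: the min/max metadata on each micro-partition lets
    -- Snowflake skip partitions that cannot contain this date range.
    SELECT SUM(amount)
    FROM sales_orders
    WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';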

Materialization

If a particular intermediate data set is used frequently or by a number of different query patterns, then it may make sense to materialize it. That materialization can be done through a Materialized View, a Dynamic Table, or manually. Clustering can also be applied to the materialization to enable better performance through partition pruning, though caution should be used when clustering at multiple levels of a materialization hierarchy – this can cause a lot of expense without a corresponding performance benefit.
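
As one example, here is a sketch of materializing a frequently reused aggregation as a Dynamic Table, with clustering applied to the result. The names, target lag, and warehouse are hypothetical, and whether clustering the materialization is worth the cost depends on how it is queried.

    CREATE OR REPLACE DYNAMIC TABLE daily_sales_summary
        TARGET_LAG = '1 hour'
        WAREHOUSE  = transform_wh
        CLUSTER BY (order_date)   -- cluster the materialization itself, not every level of the hierarchy
    AS
    SELECT
        order_date,
        SUM(amount) AS total_amount,
        COUNT(*)    AS order_count
    FROM sales_orders
    GROUP BY order_date;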

Rate of Change in the Cloud

Another area that may be a bit frustrating to DBAs who are not used to cloud data services is the rate of change. When I was a Db2 DBA, I tried to push a fixpack (software update) 2-4 times a year. There was certainly never a software update or upgrade without my knowledge – more often, I was the one fighting for the update or upgrade. Snowflake is constantly fixing bugs and releasing new features. Earlier this year, I told a customer that automatic clustering was not available on dynamic tables, only to have auto clustering become available on dynamic tables two weeks later. If I can hardly keep up with the level of access I have internally, then others shouldn’t feel bad if they also struggle in this area. There is often a new release weekly that may include small or large changes. I am impressed by both the level of testing that is done before releases and the ability to quickly disable new behavior if problems are seen. These are hard to do right with a platform as complex as Snowflake.

Summary

Hopefully I’ve shared some good concepts and given you some terms to search or ideas to explore. Future articles will explore some of the concepts here in greater detail. There can be a bit of a learning curve coming to Snowflake from a traditional relational database, but it is clear to me that Snowflake loves and respects SQL and that Snowflake is constantly improving.

Ember Crooks

Ember is always curious and thrives on change. She has built internationally recognized expertise in IBM Db2, spent a year working with high-volume MySQL, and is now learning Snowflake. Ember shares both posts about her core skill sets and her journey learning Snowflake.

Ember lives in Denver and works from home.
