Disclaimer: this article is only my opinions. I do not represent Snowflake.
I realized after writing last week’s article that some people might be like me a few months ago, and not really understand what Snowflake is and how it fits into the database, data, and cloud markets.
I first heard of Snowflake from IBM when I was on-site in Toronto at the IBM lab where they write Db2. It was noted as one of the up-and-coming competitors that IBM was focused on understanding and comparing Db2 against. This was some years ago. I don’t think I paid it much attention other than noting that a few IBMers I respected in the Db2 world started working there. This isn’t unusual – I know people at quite a few cloud and database companies because I met them in the IBM Db2 world.
What is Snowflake?
Snowflake provides database as a service (DBaaS), along with other services, in the three major cloud providers (AWS, Azure, and Google Cloud). It’s a bit different from some of my previous experiences with DBaaS, though. AWS RDS is really nice – it makes administering databases mostly easy, automating a lot of what physical DBAs do, but it is fundamentally just the same relational databases that happen to be running in a managed services model in the cloud. This is true of all of the DBaaS providers I’ve worked with before. Nearly all the concepts of administering databases in the cloud still apply, just with some of them handled by the cloud provider. Snowflake can be looked at somewhat similarly, but there’s a really interesting aspect where compute and storage are completely decoupled.
Many cloud database implementations will say they decouple storage and compute, and what that normally means is that the storage is on something equivalent to EBS, and it can be detached from one “server” and attached to another – whether that server is a VM, a container, or something you never even see. Some providers are even starting to let databases use object storage for this purpose. But fundamentally, that storage still belongs to a single server at a time. Snowflake expects multiple servers to access the object storage, where all data is stored, at the same time. For analytics workloads, this works quite well, because generally the storage is not being accessed sequentially to pull all the data for a table together in one place; instead, different micropartitions are being accessed by different processes on the same server, or by different servers in a cluster, all working together to satisfy the same query.
In my head right now, as I’m learning about this, it feels like taking concepts from Spark and the associated technologies and applying them to a SQL-based database. I think I’ll find that an inadequate description the more I learn, but it’s really taking the strengths of MPP systems and incorporating them more fully than the other database platforms I have worked with.
Snowflake is optimized for analytics workloads, and has other nifty features, like some pretty slick tools for working with semi-structured and even unstructured data, and for data sharing. It’s also the first database I’ve worked with that provides built-in cross-cloud DR and failover options.
What is a Database?
I feel like this is a question that I’ve had to learn about and trip over with each new RDBMS that I’ve worked with. It seems like it should maybe have a simple answer, but each platform seems to use the term database somewhat differently. In Db2, a database is the physical files that contain the data stored in objects like tables and indexes, along with specific memory areas like buffer pools, and some of the processes that are used to query and work with the data as well. When I learned Oracle, Oracle defined the database as the files while the memory and processes all belonged to the instance. They may have changed this somewhat as they’ve expanded to allow more than one database per instance. MySQL really threw me for a loop at first because MySQL holds database as a synonym for schema. That still just makes my brain hurt – pick database or schema, do we really need both?
In Snowflake, a database is just another level of logical organization, above the schema. Since compute and storage are fully separated, it cannot include memory or processes in the definition. In fact, when I select which database to use, I can specify the schema in the same statement, like this:
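(A minimal sketch – the database and schema names here are just placeholders.)

```sql
USE SCHEMA sales_db.raw;
```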
This also means it is common to fully qualify object names using “double-dot” notation, which I had previously only seen with Netezza, like this:
SELECT * FROM dbname.schema_name.table_name;
How is Compute Organized?
There are actually two layers of compute in addition to the storage layer.
One is the global cloud services layer, which does things like compile SQL, store your object definitions, store statistics, that sort of thing. This layer is fully managed and configured by Snowflake, and customers have very little control over it. I think of it much like the instance layer in Db2, but it’s not a 1:1 comparison. For the most part, customers are not really directly billed for this layer.
The other compute layer is one that is generally managed by customers within the Snowflake UI or through SQL commands. It is also the layer we work with most for both scaling and billing. At this layer, you create warehouses. Now when I think of a warehouse, I think of data, but remember that a Snowflake warehouse is pure compute – it is not tied to any one database, and can access any of them (assuming permissions are in place). Users have the choice of warehouses of different sizes, with billing by the second while a warehouse is in use. Allocating a new warehouse or starting a suspended one takes at most a few seconds, and is often nearly instantaneous because Snowflake maintains a hot pool of warehouses.
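As a sketch of what creating one of these looks like in SQL (the warehouse name and values here are mine, chosen for illustration):

```sql
-- An XS warehouse that suspends itself after 60 seconds idle
-- and resumes automatically when a query arrives.
CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE      = 'XSMALL'
  AUTO_SUSPEND        = 60
  AUTO_RESUME         = TRUE
  INITIALLY_SUSPENDED = TRUE;
```

With AUTO_SUSPEND and AUTO_RESUME set, you only pay for the seconds the warehouse is actually doing work.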
All of this leads to some interesting scaling choices. Of course we have the traditional database approach of scaling up. Snowflake uses t-shirt sizes for warehouses, from XS up to 6XL. Snowflake intentionally does not publish the exact details of the number of cores, memory, and so forth. They may change over time, or be slightly different from cloud to cloud to deal with differences in the available hardware. Generally an XS has the ability to run about 8 processes in parallel, and each level up through 4XL doubles that, which leads to some pretty large resources. Note that this doesn’t mean that a 2XL is 2^6 cores on a single server – it may be harnessing multiple servers working in parallel and functioning like one large server. This kind of scaling is all about handling individual queries that are more and more complex – applying all that compute to solve a complex query.
Depending on the workload, the up/down scalability is fairly linear – meaning that if you have a query that runs for 10 minutes on an XS, it should probably run in around 5 minutes on an S. Additionally, 10 minutes on an XS should bill at about the same cost as 5 minutes on an S warehouse. So cautiously, with testing, you might be able to size up without a cost increase.
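To make the billing arithmetic concrete: an XS consumes 1 credit per hour, and each size up doubles the rate, so the two runs above cost roughly the same. A quick illustration (just arithmetic, runnable in any Snowflake session):

```sql
-- 1 credit/hr for 10 min vs. 2 credits/hr for 5 min – the same spend.
SELECT 1 * 10 / 60.0 AS xs_10min_credits,  -- ~0.167 credits
       2 *  5 / 60.0 AS s_5min_credits;    -- ~0.167 credits
```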
For managing concurrency, multi-cluster warehouses are used. Here you still specify a t-shirt size based on query complexity, but you then give Snowflake guard rails for the minimum and maximum number of clusters you would like. This allows scaling for concurrency – allowing larger numbers of things to run at the same time. Adding more clusters will generally not speed up a single complex query – though it could, if your query is having to wait for resources because of other concurrent queries.
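A multi-cluster warehouse sketch, with min/max guard rails (the name and numbers are illustrative, not from the article):

```sql
CREATE WAREHOUSE bi_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY    = 'STANDARD';  -- favor starting clusters over queueing queries
```

Snowflake spins clusters up toward the maximum as concurrency rises and shuts them back down as it falls.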
There’s another dimension to think about scaling, and that’s per workload. I’ve been fighting my own RDBMS biases here and have heard others doing so as well. Let’s say you have a complex data load/parsing process that needs to run once a day for an hour, and to get it within that hour, you need a 2XL warehouse. The rest of your workload is fairly light and can easily be served with an XS warehouse. For this scenario, you should have a warehouse that is dedicated to that daily load process, and only starts for that process, suspending when it’s done. Your normal workload should go through a separate XS warehouse. There’s no need to pick only the larger warehouse because just one workload needs it, and there’s no conflict in having the two warehouses up and running at the same time. In fact, you probably don’t want to resize one warehouse back and forth between workloads, because that just causes confusion.
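The two-warehouse pattern above might look something like this (names are mine; the sizes come from the hypothetical scenario):

```sql
-- Dedicated warehouse for the heavy daily load; suspends when idle.
CREATE WAREHOUSE daily_load_wh
  WAREHOUSE_SIZE = 'XXLARGE'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;

-- Small warehouse for the light day-to-day workload.
CREATE WAREHOUSE main_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;
```

Each workload then simply names its own warehouse at connection time, and the big warehouse only bills for that one hour a day.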
A DBA’s Role
I have spent most of my career in various DBA roles, and as I have mentioned in other articles, I don’t believe that moving databases to the cloud eliminates the need to have a DBA. A fully managed solution like Snowflake may reduce the DBA workload a bit more than others. Any fully managed cloud database is likely to reduce the physical DBA requirements, but experts are still needed to ask and drive answers to the big questions about SLAs, backup and recovery, HA, DR, and the like. They’re also needed to answer the small questions for users, like how to connect, and to do things like manage access. Where I see the role as just as much work is on the logical DBA side. Like any other platform, with Snowflake you can mitigate performance problems by throwing hardware at them – until you can’t, or can’t afford to, any more. This is when the logical tasks of organizing data into reasonable data models, investigating query performance, and matching the two up come into play. These roles are still likely to be needed.
Data engineering is probably still needed as well, unless your end-users are talented at extracting data from your system of record and conforming it to a reasonable data model – and automating that. Yes, Snowflake and partners make that easier, but it’s still work that goes faster with skilled professionals at the helm.
Please share your experiences with Snowflake – would love to hear them in the comments below!