As I’ve been learning about Snowflake, I find small things in the SQL or feature set at least once a week that just make me happy. Snowflake really feels like it was built for someone who loves SQL but also does other programming. Here are some of the things I’ve discovered that have made me happy in Snowflake.
Undrop
One of the commands that always made me stop and think twice as a DBA was a drop command. For most RDBMSes, the only way to undo dropping a table is to restore the entire database. IBM Db2 has dropped table recovery, but it has to be enabled for the table space ahead of time, and it is something you use so infrequently that it’s easy to forget it’s an option. It is also not an online operation: it requires a database restore, which either takes the database offline or has to happen on a different server, with the data then exported and imported back to the original target database.
In some of the first labs I did at Snowflake, the UNDROP
command was introduced, and it was amazing. Simply run it to undo a drop, as long as you’re within the data retention period for the table (default: 24 hours). Here’s a simple example in snowsql:
create table t1 (c1 string);
+--------------------------------+
| status |
|--------------------------------|
| Table T1 successfully created. |
+--------------------------------+
1 Row(s) produced. Time Elapsed: 0.155s
insert into t1 values ('test'), ('sample'), ('silly string');
+-------------------------+
| number of rows inserted |
|-------------------------|
| 3 |
+-------------------------+
3 Row(s) produced. Time Elapsed: 0.545s
select * from t1;
+--------------+
| C1 |
|--------------|
| test |
| sample |
| silly string |
+--------------+
3 Row(s) produced. Time Elapsed: 0.257s
drop table t1;
+--------------------------+
| status |
|--------------------------|
| T1 successfully dropped. |
+--------------------------+
1 Row(s) produced. Time Elapsed: 0.191s
select * from t1;
002003 (42S02): SQL compilation error:
Object 'T1' does not exist or not authorized.
undrop table t1;
+---------------------------------+
| status |
|---------------------------------|
| Table T1 successfully restored. |
+---------------------------------+
1 Row(s) produced. Time Elapsed: 0.121s
select * from t1;
+--------------+
| C1 |
|--------------|
| test |
| sample |
| silly string |
+--------------+
3 Row(s) produced. Time Elapsed: 0.127s
This is such a simple thing, but just amazing for anyone who has ever panicked, even for a moment, after realizing they weren’t supposed to run that drop command in the environment they were in. The old DBA joke – “if a cup of coffee doesn’t wake you up in the morning, try dropping a table in production!” – simply does not apply.
Don’t get me wrong, problems can still result if you don’t notice that you dropped the wrong table, but it’s really such an amazing little thing.
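The 24-hour window comes from the table’s Time Travel data retention setting. If you want a longer safety net, you can raise it per table; here’s a minimal sketch, assuming the same t1 table and an edition that allows longer retention (Enterprise edition and above go up to 90 days):
alter table t1 set data_retention_time_in_days = 7;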
Fast COUNT(*)
In my last job, I worked with a MySQL database that was sharded across hundreds of servers. A global count(*) would simply never return for many tables, and we mostly discouraged developers from even trying it. I can report I’ve run a COUNT(*)
on a table with over 2 trillion rows and had it return in less than 15 seconds, and on a 343-billion-row table had the result back in less than a second.
The reason is that Snowflake synchronously maintains an exact row count for each micropartition, so counting everything is merely a metadata operation. The metadata here is kept exactly. Add in conditions and the query will certainly run slower, because then it actually has to go out and count.
Is there a cost to exact counts? Sure, there is, but when you don’t have the expense of maintaining indexes, what does it make sense to invest in? The metadata we’re talking about here is at the micropartition level, so every time a micropartition is written, the metadata has to be updated anyway to reflect the details of the micropartition.
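As a quick illustration of the difference, here’s a sketch assuming the same ORDERS sample table used later in this post; the first count is answered purely from micropartition metadata, while the second has to actually scan whichever micropartitions survive pruning:
-- metadata-only: exact row count without scanning data
select count(*) from orders;
-- with a predicate, qualifying micropartitions must actually be read
select count(*) from orders where o_orderdate >= '1995-01-01';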
Group By All and Group By <Position>
Have you ever written a really messy, long group by, grouping on multiple columns and applying aggregate functions to others? Have you ever been writing or maintaining SQL like this, added a column to the select list, and gotten an error because it wasn’t also in the grouping columns? Or had to repeat the expression for a derived column in both the select list and the group by clause?
GROUP BY ALL
tells Snowflake to group by every column in the select list that isn’t wrapped in an aggregate function, so you don’t have to spell out the column names yourself. I particularly like it for derived columns:
select date_part('month', o_orderdate)
, sum(o_totalprice)
from orders
group by all
order by 1;
+---------------------------------+-------------------+
| DATE_PART('MONTH', O_ORDERDATE) | SUM(O_TOTALPRICE) |
|---------------------------------+-------------------|
| 1 | 204614487939.45 |
| 2 | 186564779387.12 |
| 3 | 204257402525.48 |
| 4 | 197786578627.44 |
| 5 | 204314572769.97 |
| 6 | 197865409275.28 |
| 7 | 204591807090.33 |
| 8 | 177031164640.81 |
| 9 | 169311453555.88 |
| 10 | 175062481374.29 |
| 11 | 169541947957.95 |
| 12 | 175356105604.43 |
+---------------------------------+-------------------+
12 Row(s) produced. Time Elapsed: 1.623s
This feature may not be for everyone, but it is nice for ad-hoc querying for sure. Unlike Db2, you can also use positional parameters in the group by clause:
select date_part('month', o_orderdate)
, sum(o_totalprice)
from orders
group by 1
order by 1;
+---------------------------------+-------------------+
| DATE_PART('MONTH', O_ORDERDATE) | SUM(O_TOTALPRICE) |
|---------------------------------+-------------------|
| 1 | 204614487939.45 |
| 2 | 186564779387.12 |
| 3 | 204257402525.48 |
| 4 | 197786578627.44 |
| 5 | 204314572769.97 |
| 6 | 197865409275.28 |
| 7 | 204591807090.33 |
| 8 | 177031164640.81 |
| 9 | 169311453555.88 |
| 10 | 175062481374.29 |
| 11 | 169541947957.95 |
| 12 | 175356105604.43 |
+---------------------------------+-------------------+
12 Row(s) produced. Time Elapsed: 0.532s
Positional arguments like this can be handy for ad-hoc querying, but if you’re coding something that will be reused, they are problematic: someone adds a column at the wrong point in the select list and suddenly the query groups on the wrong thing. So I thought I’d test using a column alias, and it works!
select date_part('month', o_orderdate) as order_month
, sum(o_totalprice)
from orders
group by order_month
order by 1;
+-------------+-------------------+
| ORDER_MONTH | SUM(O_TOTALPRICE) |
|-------------+-------------------|
| 1 | 204614487939.45 |
| 2 | 186564779387.12 |
| 3 | 204257402525.48 |
| 4 | 197786578627.44 |
| 5 | 204314572769.97 |
| 6 | 197865409275.28 |
| 7 | 204591807090.33 |
| 8 | 177031164640.81 |
| 9 | 169311453555.88 |
| 10 | 175062481374.29 |
| 11 | 169541947957.95 |
| 12 | 175356105604.43 |
+-------------+-------------------+
12 Row(s) produced. Time Elapsed: 1.616s
I tested each of these syntaxes on Db2 on Cloud, just to verify nothing has changed in this area since I was working with Db2 on a daily basis more than a year ago, and they all failed. The corresponding syntax you’d need to use is:
select month(o_orderdate)
, sum(o_totalprice)
from orders
group by month(o_orderdate)
order by 1;
While positional arguments are their own minefield, the column alias is both useful and maintainable.
HASH_AGG
I have been involved in many data migrations over the years where various techniques, scripts, and checks have been used to verify that the data on the target side of the migration matches the data on the source side. It has never been an easy thing to say “hey, the contents of this table are the same” without some serious scripting and querying.
Enter HASH_AGG
. This aggregate function is generally used like this:
select HASH_AGG(*) from orders;
+---------------------+
| HASH_AGG(*) |
|---------------------|
| 8019995652123155175 |
+---------------------+
1 Row(s) produced. Time Elapsed: 2.217s
The number will be the same as long as the same columns are present in the same order and the same rows are present in any order. It can be run on the source and target to ensure the data is the same with a reasonable degree of certainty. A useful little function.
I can think of another use case for it. Sometimes we’re writing complicated SQL and making changes to how that SQL works, which may also change the order in which rows are returned. This could be used to ensure that the result sets are the same without having to manually compare the data. This is an extra verification check that could be useful when writing a query that is five pages long.
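Here’s a minimal sketch of that idea; old_version and new_version are hypothetical views wrapping the original query and the rewritten one:
select
  (select hash_agg(*) from old_version) =   -- hypothetical view over the original query
  (select hash_agg(*) from new_version)     -- hypothetical view over the rewritten query
  as results_match;
If the two hashes match, the result sets contain the same rows and column values, regardless of row order.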
Normal Function
This is such a small thing, and one that I ran into when working on a project for a class I took at Georgia Tech. The class was on simulation (excellent, but very math heavy), and for the project at the end, I chose to write SQL functions to generate random numbers from different distributions, including the normal distribution. That involves a formula (the Box-Muller transform) using trigonometric functions and two different, independent random numbers from the uniform distribution (equal chance of all values between 0 and 1). I successfully used this code to generate those values in a Db2 function:
CREATE FUNCTION RAND_NORMAL (IN mu float, IN sigma float )
RETURNS FLOAT
LANGUAGE SQL
BEGIN
DECLARE ret_val float;
-- reject invalid spread values up front
IF sigma <= 0 THEN
SIGNAL SQLSTATE '75001'
SET MESSAGE_TEXT = 'standard deviation must be greater than 0';
END IF;
-- Box-Muller transform: combine two independent uniform random numbers
-- (the two RAND() calls) into one normally distributed value
SET ret_val = (values ( sqrt(sigma) * sqrt(-2*ln(rand())) * COS(2*ACOS(-1.0)*rand()) + mu));
RETURN ret_val;
END@
The problem came in when I wanted to add a seed to the random functions in the calculation. By specifying a seed, you get repeatable sets of random numbers, which is really important for data analytics and modeling processes. When I added in the seed, I was able to prove that the two random numbers I wanted were no longer independent, but showed a relationship, and therefore I failed to generate random numbers from the normal distribution. I found this little line on the IBM documentation page for the RAND
function:
The seed value is used only for the first invocation of an instance of the RAND function within a statement
Now, maybe I could have found a way around this in the long term, particularly had it been important enough to call on my considerable IBM contacts to resolve. But in the fashion of a college student a few days from a deadline, I documented my failure in the paper instead.
Because of this fairly recent experience, I thought to check on this when I started working with Snowflake. While trying to figure out how to test it without the project partner and test code I'd had on that earlier project, I looked to see whether Snowflake already had a function for generating normally distributed random numbers and discovered the NORMAL
function!
Woohoo, problem completely averted! I never got back to checking if two RANDOM(seed)
functions in the same statement would show a relationship or not.
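For anyone curious, here's a minimal sketch of how I'd use it; the seed on RANDOM is my own assumption to make the sequence repeatable, and GENERATOR just manufactures five rows:
-- five standard-normal values from a seeded generator
select normal(0, 1, random(42)) as z
from table(generator(rowcount => 5));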
get_ddl
In MySQL, generating the DDL used to create a table was easy using SHOW CREATE TABLE <tabname>
. Db2 uses the much more complicated db2look system tool, which is NOT available via the SQL interface, meaning you need an actual Db2 client or access to the command line on the server to use it. There is a hidden stored procedure I've heard about, but I don't think it's officially supported and documented to this day. This is somewhat frustrating, as creating a copy of a single table or understanding the details of how a table was created are fairly common tasks. Yes, there's the CREATE TABLE ... LIKE
statement, but that can only create a copy of the table structure, not show you the details of the command used to create it.
Snowflake has the GET_DDL function. It can be used like this:
select get_ddl('TABLE', 'T1');
+------------------------------+
| GET_DDL('TABLE', 'T1') |
|------------------------------|
| create or replace TABLE T1 ( |
| C1 VARCHAR(16777216) |
| ); |
+------------------------------+
1 Row(s) produced. Time Elapsed: 0.144s
I think for most purposes in Snowflake, I'd prefer to actually clone the table if I also needed to copy the data. Snowflake has this thing called zero-copy cloning, which only writes new micropartitions when you actually change data, so the cloned table can literally operate on the same micropartitions as the original table, saving a significant amount of storage if the table is sizable.
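Cloning is its own one-liner; a minimal sketch, assuming the T1 table from earlier:
create table t1_clone clone t1;
The clone shares the original table's micropartitions until data in one of them diverges, at which point only the changed micropartitions are written.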
Variable Definition
Coming from an IBM Db2 background, I had frequently interacted with developers who were used to being able to define variables and then use them in their SQL. At the time I frankly hardly saw the point, but now it's a feature I use almost every day when querying our internal data.
set var1='sample';
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
1 Row(s) produced. Time Elapsed: 0.103s
select * from t1 where c1 = $var1;
+--------+
| C1 |
|--------|
| sample |
+--------+
1 Row(s) produced. Time Elapsed: 0.724s
Even more amazing is the ability to use these variables for actual identifiers, like this:
set var2='t1';
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
1 Row(s) produced. Time Elapsed: 0.129s
select * from identifier($var2);
+--------------+
| C1 |
|--------------|
| test |
| sample |
| silly string |
+--------------+
3 Row(s) produced. Time Elapsed: 0.320s
I find this endlessly useful as I'm writing ad-hoc queries in Snowflake worksheets.
Summary
There's nothing here that's truly earth-shattering, though there may well be things here that make your life, or the life of your analysts or developers, easier. These are just little reminders to me that I've made the right choice for which DBMS I'm working with at the moment. I'd love to hear others' favorite little things about any RDBMS platform in the comments below!