Leveraging PARAMETERIZED_QUERY_HASH for Snowflake Query Analysis

I often start with just a query id and a general description of a performance problem. Almost as often as I look at query profiles (or their equivalent in internal tools), I want to see more of the context around the query. There are a number of dimensions to that, but one of the first things I check is whether the statistics I'm seeing for this one query id are typical or an outlier for this query. You can spend a lot of time in analysis, only to find that this is the one instance of the query that ran 10X as long as other executions.

Identifying a Query Pattern, Not Just a Query

Sometimes we think of analytical systems like Snowflake as handling very different queries all the time, and while that is sometimes true, it is more likely that a particular query pattern is used repeatedly. This might be by a dashboard, a data modeling tool, or something written by a person, but it is still common to run very similar queries over time.

Understanding one execution of a query in the context of other executions helps me determine whether a problem is transitory or persistent. Being transitory doesn't mean the problem is dismissed, but it suggests a different avenue of investigation than a query that shows the same poor performance repeatedly.

How can you tell what the performance of a query looks like over time, or across iterations with different string literals? PARAMETERIZED_QUERY_HASH to the rescue!

PARAMETERIZED_QUERY_HASH is derived from the query text itself, after treating literal values used with a set of supported comparison operators as parameters. The following operators are supported at this time:

  • = (equal to)
  • != (not equal to)
  • >= (greater than or equal to)
  • <= (less than or equal to)

Finding different instances of the same PARAMETERIZED_QUERY_HASH allows us to see the performance of the same query across time. This, of course, requires that the same query was executed multiple times with different parameter values. The PARAMETERIZED_QUERY_HASH also strips away differences in case, whitespace, comments, and several other areas. See the Snowflake documentation on PARAMETERIZED_QUERY_HASH for more details.
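For example, these two statements (against a hypothetical orders table, named here only for illustration) should normally collapse to the same hash, since the date literal is treated as a parameter and the case and whitespace differences are normalized away:

```sql
-- Both of these should normally yield the same PARAMETERIZED_QUERY_HASH:
-- the date literal follows a supported operator (>=), so it is treated
-- as a parameter, and case/whitespace differences are normalized.
SELECT order_id, amount FROM orders WHERE order_date >= '2024-01-01';

select ORDER_ID, AMOUNT
from   orders
where  ORDER_DATE >= '2024-06-15';
```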

Using PARAMETERIZED_QUERY_HASH

In my case, I’m often starting with a query id. That means that my query pattern to get more information often looks something like this:

set query_p_hash = (
    SELECT QUERY_PARAMETERIZED_HASH
    FROM snowflake.account_usage.query_history
    WHERE query_id = '01b3d8b4-0002-5225-0058-f8870001b25a'
);

select $query_p_hash;

Then the result of that last select is:

$QUERY_P_HASH
a38255a435c05afe54d97adf02eed443

Since I've set that as a variable, I can now use it in other queries. To list all executions of this query, I can use this SQL:

SELECT *
FROM snowflake.account_usage.query_history
WHERE query_parameterized_hash = $query_p_hash
ORDER BY start_time DESC;

I'm not pasting the output here, because 75 columns are not terribly fun for a human to parse through in a web interface like this, but there is a wealth of information in them.

In this case, I’m using a free trial account, so I don’t have a lot of history. But using Snowsight to generate some graphs, I can come up with things like this:

This graph tells me which days this query was executed on and how many times on those days. This can be useful for understanding how much a query is used and if its use has a time-based correlation. For example, if we had a query that was used as a part of a month-end process, that would be fairly clear. Also if we had a load test and a query involved in that load test, that peak in load would be obvious in a simple count like this.
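A sketch of the aggregation behind a graph like that, reusing the variable set earlier:

```sql
-- Executions per day for this query pattern
SELECT DATE_TRUNC('day', start_time) AS exec_day,
       COUNT(*)                      AS executions
FROM snowflake.account_usage.query_history
WHERE query_parameterized_hash = $query_p_hash
GROUP BY 1
ORDER BY 1;
```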

Perhaps my favorite way to use QUERY_PARAMETERIZED_HASH is to look at the median execution time over time:

While that is pretty boring looking for a handful of executions in a month, it can get much more interesting on a real workload. One common pattern I’ve seen includes a sudden spike in the execution time in a very limited time frame. Often the query id I’m being asked to look at is a part of that spike. The spike can be due to things like problems with the cloud provider’s response time for returning objects or other infrastructure issues, or tests that involve loading additional data and then returning the table to its original state. Another common pattern to see is response time increasing over time. Without reasonable micro-partition pruning, performance of one query might degrade with larger amounts of data in a table being queried.

It is also interesting sometimes to look at the minimum or maximum execution times by day to see the outliers and how far from the median those might have been.
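Extending the same grouping, a sketch that surfaces the daily outliers alongside the median:

```sql
-- Min, median, and max elapsed time per day, to spot outliers
SELECT DATE_TRUNC('day', start_time) AS exec_day,
       MIN(total_elapsed_time)       AS min_elapsed_ms,
       MEDIAN(total_elapsed_time)    AS median_elapsed_ms,
       MAX(total_elapsed_time)       AS max_elapsed_ms
FROM snowflake.account_usage.query_history
WHERE query_parameterized_hash = $query_p_hash
GROUP BY 1
ORDER BY 1;
```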

Note that PARAMETERIZED_QUERY_HASH is also available from the information_schema.query_history table function. That table function only has about two weeks' worth of history, though, and I often find longer time frames more instructive for understanding a query in context, which is why I'm looking at SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY.
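If recent history is all you need (and you want to avoid the latency of ACCOUNT_USAGE views), a sketch of the table-function approach looks like this, assuming the column carries the same name in the table function's output:

```sql
-- Recent executions of the same pattern from the table function
SELECT query_id, start_time, total_elapsed_time
FROM TABLE(information_schema.query_history(result_limit => 10000))
WHERE query_parameterized_hash = $query_p_hash
ORDER BY start_time DESC;
```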

Caveats

It is important to keep in mind that changing a parameter can change the data that a query has to process drastically. Selecting everything after yesterday can be a very different amount of data than selecting everything after a date a year ago. We wouldn’t expect response time to be similar for those queries depending on the data size and other conditions involved. Grouping queries by PARAMETERIZED_QUERY_HASH can lump these queries together as if they were the same thing. If we’re charting them on the same graph, we might get some surprises.

There are also quite a few ways to end up with different parameterized query hashes. Only the operators listed above are used to define parameters, which means something as simple as using BETWEEN or a function can prevent the PARAMETERIZED_QUERY_HASH from matching across otherwise similar queries.
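A sketch of the kind of difference that matters here, again against a hypothetical orders table:

```sql
-- The literals here follow >= and <=, which are supported operators,
-- so they are treated as parameters; different date ranges should
-- still map to the same hash.
SELECT COUNT(*) FROM orders
WHERE order_date >= '2024-01-01' AND order_date <= '2024-01-31';

-- BETWEEN is not in the supported operator list, so these literals are
-- not parameterized the same way; each date range can produce a
-- distinct hash even though the query is logically equivalent.
SELECT COUNT(*) FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
```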

In my previous career working with IBM Db2, similar statistics were stored in and queried from the package cache. Query plans were cached, and statistics aggregated by the query plan. Snowflake, at this time, does not make use of a plan cache. On the other hand, this means that I have detailed statistics in Snowflake for each query execution (except when using Unistore/Hybrid Tables).

Summary

Aggregating and grouping based on PARAMETERIZED_QUERY_HASH can let us analyze the performance of a query over time, and look for outliers or interesting patterns in the performance of a particular query. This can be an important part of the puzzle when trying to understand the performance of an individual query.

Ember Crooks

Ember is always curious and thrives on change. She has built internationally recognized expertise in IBM Db2, spent a year working with high-volume MySQL, and is now learning Snowflake. Ember shares both posts about her core skill sets and her journey learning Snowflake.

Ember lives in Denver and works from home.

