MySQL’s “partial” Group By

Posted by

This is one area that was not a surprise to me coming from Db2 to MySQL, as I’ve had to deal with people confused in the other direction for years. Essentially, MySQL allows a syntax for GROUP BY that Db2 and most RDBMSes do not allow. Using the syntax can actually lead to some unexpected query results in MySQL.

Query Order of Operations

Logically, though maybe not literally, we write SQL in a different order than it is executed in. I sometimes think this is one of the things that has so many avoiding SQL in favor of ORMs – that and the set-based instead of procedural nature of SQL. Logically, the order in which SQL is processed is generally:

  1. FROM
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. SELECT
  6. DISTINCT
  7. UNION
  8. ORDER BY
  9. LIMIT (or for Db2, FETCH FIRST N ROWS ONLY)

I’m not saying that this is the literal order because optimizers can do … interesting … things. But logically, the query will start with a table, and first apply filtering conditions from the WHERE clause. It will then process GROUP BY and HAVING if they exist.

This order might make it clearer why some things work the way they do.

Experimentation Environments

The environment I’m using here for MySQL experimentation is a docker container that you can easily pull and use to run the SQL in the examples below yourself. It comes from https://github.com/sakiladb/mysql. The Db2 environment comes from using the code in https://github.com/ivanceras/sakila/tree/master/db2-sakila-db to build the sakila database in a vanilla db2 docker container.

The general query used in the examples is meant to find the most recent rental from the rental table for customer id 1, by staff_id 1, and return the inventory_id, the customer_id, and the date of that rental.

What is GROUP BY?

GROUP BY often tells the database that we’re going to take multiple rows and get only one row (per group) back. If you’ve read the definition of an aggregate function, that may sound familiar. In using GRROUP BY, we’re largely telling the database to break some of the normal row associations that we’re used to relying on. Essentially each column becomes a group of values that the database needs help in understanding how to whittle down to one value. This means that for any columns that are not defined in our GROUP BY clause, we need to specify an aggregate function to tell the database HOW to turn multiple rows into just one row. Aggregate functions include things like COUNT(), MIN(), MAX(), and AVG(), among others.

Every column specified in our SELECT clause must fall into one of these two categories:

  • Column is included in the GROUP BY clause
  • Column has an aggregate function applied

In Db2, you get an error message if you do not follow this rule – in all cases. There is no way I know of in Db2 to disable this behavior:

$ db2 "select inventory_id, customer_id, max(rental_date) from rental where staff_id=1 group by customer_id limit 1"
SQL0119N  An expression starting with "INVENTORY_ID" specified in a SELECT
clause, HAVING clause, or ORDER BY clause is not specified in the GROUP BY
clause or it is in a SELECT clause, HAVING clause, or ORDER BY clause with a
column function and no GROUP BY clause is specified.  SQLSTATE=42803

Because MySQL (with some settings) does not return such an error message, this is a very common error to get questions about when working with Db2.

GROUP BY in MySQL

Up through 5.6, and with certain settings in 5.7 and above, MySQL behaves a bit differently. Assuming sql_mode does not include only_full_group_by, instead of returning an error, MySQL will transparently use the ANY_VALUE() function to select a value to return. It looks something like this:

mysql> select inventory_id, customer_id, max(rental_date) from rental where staff_id=1 group by customer_id limit 1;
+--------------+-------------+---------------------+
| inventory_id | customer_id | max(rental_date)    |
+--------------+-------------+---------------------+
|         4020 |           1 | 2005-08-22 19:41:37 |
+--------------+-------------+---------------------+
1 row in set (0.02 sec)

I get values for the same query that failed in Db2, with no warnings, but what values am I getting? The intent here is to find the latest rental for this customer, along with the inventory_id for that rental so I can join to get things like the name of the film. If I query to see what inventory_id 4020 for customer_id 1 looks like, I get this:

mysql> select inventory_id, customer_id, rental_date from rental where staff_id=1 and inventory_id=4020 and  customer_id=1;
+--------------+-------------+---------------------+
| inventory_id | customer_id | rental_date         |
+--------------+-------------+---------------------+
|         4020 |           1 | 2005-05-28 10:35:23 |
+--------------+-------------+---------------------+
1 row in set (0.00 sec)

I see from this that this customer did indeed rent inventory_id 4020, but on May 28th, not on August 22 as the first query results suggested.

If I look up what the last rental was indeed for this customer, I get:

mysql> select inventory_id, customer_id, rental_date from rental where staff_id=1 and rental_date='2005-08-22 19:41:37' and  customer_id=1;
+--------------+-------------+---------------------+
| inventory_id | customer_id | rental_date         |
+--------------+-------------+---------------------+
|         1446 |           1 | 2005-08-22 19:41:37 |
+--------------+-------------+---------------------+
1 row in set (0.00 sec)

So why did MySQL come up with this data match? Our initial query was:

select inventory_id, customer_id, max(rental_date) 
from rental 
where staff_id=1 
group by customer_id 
limit 1

MySQL translates this to effectively:

select ANY_VALUE(inventory_id), customer_id, max(rental_date) 
from rental 
where staff_id=1 
group by customer_id 
limit 1

The resulting value often happens to be the “first” value for that column, but it is not guaranteed to be so. It is specifically non-deterministic, which means that we could get different values every time we execute this query.

SQL Mode only_full_group_by

The above behavior only happens when the the sql_mode does not include only_full_group_by. Before MySQL 5.7, only_full_group_by was not in place by default. Starting with 5.7, only_full_group_by is part of the default. If a MySQL installation has been upgraded, it’s likely the defaults from the older version are still in place. How can you tell if only_full_group_by is part of your sql_mode? The easiest way is a couple of queries:

mysql> select @@global.sql_mode;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| @@global.sql_mode                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION |
+-------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select @@session.sql_mode;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| @@session.sql_mode                                                                                                                        |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION |
+-------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

It is important to check both of these. sql_mode can be set both at the global/instance level and also at the individual session/connection level.

What does this kind of query look like with only_full_group_by enabled in MySQL? It looks like this:

mysql> select inventory_id, customer_id, max(rental_date) from rental where staff_id=1 group by customer_id limit 1;
ERROR 1055 (42000): Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'sakila.rental.inventory_id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

This is similar to what we see from Db2 and other RDBMS platforms.

Better Syntax Alternatives

There are several better options for the syntax to solve the real question the example query here is trying to solve. Below MySQL 8, the most obvious one (and only one I can think of) is using a subquery something like this:

mysql> select inventory_id, customer_id, rental_date
    -> from rental a
    -> where staff_id=1
    -> and rental_date=(
    -> select max(rental_date)
    -> from rental b
    -> where a.customer_id=b.customer_id
    -> and staff_id=1)
    ->
    -> and a.customer_id=1;
+--------------+-------------+---------------------+
| inventory_id | customer_id | rental_date         |
+--------------+-------------+---------------------+
|         1446 |           1 | 2005-08-22 19:41:37 |
+--------------+-------------+---------------------+
1 row in set (0.02 sec)

This gets us the right combination of rental_date and inventory_id.

My go-to solution here coming from Db2 and if I were working with MySQL 8 or higher would be to use a CTE (Common Table Expression), or a window function.

Summary

In summary, be particularly careful with using GROUP BY in MySQL without the sql_mode of only_full_group_by enabled. You can get changing and incorrect results that are almost never what the people writing queries actually meant.

ChatGPT

Last week, when I was experimenting with ChatGPT, I asked it to write this article for me. The resulting article didn’t look horrible, but didn’t cover things in the level of detail I have. I used the prompt:

please write a blog entry of 500 words or more in the style of https://datageek.blog describing the pitfalls of not using ONLY_FULL_GROUP_BY with MySQL and how it can lead to unexpected query results.

Check it out here, and see what you think:

As a data geek, one of the most important things to understand when working with SQL databases is the concept of a “group by” clause. The group by clause is used to aggregate data, allowing you to perform calculations on multiple rows of data and return a single result. For example, you might use the group by clause to calculate the average salary of all employees in a specific department.

But what happens when you’re working with a MySQL database and you forget to include the “ONLY_FULL_GROUP_BY” mode? The answer is that you may end up with unexpected query results.

The ONLY_FULL_GROUP_BY mode is a MySQL setting that enforces the SQL standard for the group by clause. When this mode is enabled, the query must include all non-aggregated columns in the group by clause. This means that if you’re trying to group data by a specific column, you must also include that column in the group by clause.

For example, let’s say you have a table called “employees” that contains the following columns: “employee_id”, “first_name”, “last_name”, and “salary”. If you want to group the data by the “last_name” column, your query should look something like this:

SELECT last_name, AVG(salary)
FROM employees
GROUP BY last_name;

If you forget to include the “last_name” column in the group by clause, you’ll get unexpected results. Instead of calculating the average salary for each last name, you’ll get an average salary for the entire table.

But the problem is not only the query but also the default mode of MySQL, before version 8.0, that is set to not enable this mode by default. This means that if you’re working with an older version of MySQL, you might not even realize that you’re not including all non-aggregated columns in the group by clause. And this can lead to unexpected results.

Another pitfall is when you’re trying to select non-aggregated columns that are not included in the group by clause. In this case, MySQL will return an arbitrary value for that column. This can lead to even more confusion and unexpected results.

So, how can you avoid these pitfalls? The first step is to make sure that you have the ONLY_FULL_GROUP_BY mode enabled. You can do this by adding the following line to your MySQL configuration file:

sql_mode = 'ONLY_FULL_GROUP_BY'

If you are using version 8.0 or later, this mode is enabled by default.

The second step is to be mindful of the columns that you’re including in your group by clause. Make sure that you’re including all non-aggregated columns, and be aware of the potential pitfalls of selecting non-aggregated columns that are not included in the group by clause.

In conclusion, the group by clause is a powerful tool for aggregating data in MySQL. However, it’s important to understand the potential pitfalls of not using the ONLY_FULL_GROUP_BY mode. By enabling this mode and being mindful of the columns that you’re including in your group by clause, you can avoid unexpected results and ensure that your queries are returning accurate data.

Ember is always curious and thrives on change. She has built internationally recognized expertise in IBM Db2, and is now pivoting to focus on learning MySQL. Ember shares both posts about her core skill set and her journey learning MySQL. Ember lives in Denver and work from home

One comment

  1. I do not understand why I would write sql statements like these, selecting some columns, then a max, avg, min or whatever function that would require a group by, not writing the group by. If it is to confuse the Russians, well I don’t think they are that stupid. Lucky that the option ONLY_FULL_GROUP_BY is now default.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.