Using ChatGPT to Write SQL

Working in technology, I’ve heard a number of things about ChatGPT from various news sources.

I was particularly interested hearing some teachers saying it was the end of being able to properly detect students cheating when writing papers. Then I heard a teacher talking about how this was the new ‘calculator’ that teachers need to learn how to leverage in their curriculums. Some talented developers at work were also talking about ways to use ChatGPT to write code and sharing experiences they’ve had with it.

Then a colleague mentioned that ChatGPT can write valid SQL, and I could hardly wait to try it.

The Challenge

At work, I’ve been playing around with some of MySQL’s unusual ways of handling GROUP BY, and I thought I would use the SQL I’ve been toying with there to experiment with ChatGPT.

The SQL I’m playing with is using the Sakila database. I’m using the rental table to demonstrate some of the unexpected behavior I get in MySQL with ONLY_FULL_GROUP_BY mode disabled. The results can be unexpected when I include a column in the SELECT clause that neither has an aggregate function specified, nor is one of the columns I’m grouping on. For reference, Db2 will simply fail and require that every column either have an aggregation function or be included in the GROUP BY clause. MySQL behaves that way only when ONLY_FULL_GROUP_BY is enabled.

Note that I’m using the sakila database in a docker container for this, if you want to try any of the SQL.

The SQL I’m using is designed to display some of the surprising behavior MySQL has in this area. The goal is to list the name and rental date of the last film a user rented, using the Sakila database. This is the SQL that I wrote to display what someone might think works, but it drastically doesn’t:

select inventory_id, customer_id, max(rental_date)  
from rental   
where `customer_id`=1
group by customer_id 
limit 1;
ERROR 1055 (42000): Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'sakila.rental.inventory_id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

In this case, this SQL fails entirely as it would on Db2 because the mode ONLY_FULL_GROUP_BY is in use (which is the default). It fails because we include inventory_id in the SELECT list without either specifying it in the GROUP BY or specifying an aggregate function for it.

If I run this in an environment that isn’t using the mode ONLY_FULL_GROUP_BY, I get something like this:

+--------------+-------------+----------------------------+
 | inventory_id | customer_id | max(rental_date)           |
 +--------------+-------------+----------------------------+
 |         4020 |           1 | 2005-08-22 19:41:37.000000 |
 +--------------+-------------+----------------------------+ 

Note that I get an answer, and while that rental date is correct, the inventory_id is not. In this case, MySQL chooses any value from the column (often the first) and uses it. The value it chooses is not deterministic, and may vary.

This is SQL that actually works in 5.7, even without ONLY_FULL_GROUP_BY, to get a correct answer:

select inventory_id, customer_id, rental_date   
from rental a   
where rental_date=(  select max(rental_date)  from rental b  where a.customer_id=b.customer_id  and staff_id=1)  and a.customer_id=1;
+--------------+-------------+---------------------+
| inventory_id | customer_id | rental_date         |
+--------------+-------------+---------------------+
|         1446 |           1 | 2005-08-22 19:41:37 |
+--------------+-------------+---------------------+
1 row in set (0.00 sec)

This is a correct answer to the question, though I’m eliminating joins so I can use the query for demonstrating the very specific point I’m trying to illustrate. A full answer to the question is:

select title, customer_id, rental_date   
from rental a join inventory i on a.inventory_id=i.inventory_id 
join film f on f.film_id=i.film_id  
where rental_date=(  select max(rental_date)  from rental b  where a.customer_id=b.customer_id and staff_id=1)  
and a.customer_id=1 
and staff_id=1;
+-----------------------+-------------+---------------------+
| title                 | customer_id | rental_date         |
+-----------------------+-------------+---------------------+
| FIREBALL PHILADELPHIA |           1 | 2005-08-22 19:41:37 |
+-----------------------+-------------+---------------------+
1 row in set (0.01 sec)

Enter ChatGPT

Now keep in mind that I’ve never even played with ChatGPT before. I thought I’d just try it. I entered this prompt:

Can you write a valid SQL statement for MySQL 5.7 that uses the Sakila database to find the name of the latest movie that customer id 1 rented from employee id 1?

And here is the result:

I figured my starting try would be too simple, that I’d have to refine and maybe provide data model details. Nope, that’s a correct answer from where I started:

SELECT film.title
FROM rental
JOIN inventory ON rental.inventory_id = inventory.inventory_id
JOIN film ON inventory.film_id = film.film_id
WHERE rental.customer_id = 1 AND rental.staff_id = 1
ORDER BY rental.rental_date DESC
LIMIT 1;
+-----------------------+
| title                 |
+-----------------------+
| FIREBALL PHILADELPHIA |
+-----------------------+
1 row in set (0.09 sec)

I am really impressed that it not only wrote a correct SQL statement, and knew the Sakila data model, but that it both did not make mistakes I’ve seen humans make, and also gave me the code in monospaced, easy to copy text.

I started with a simplified version of the query, and ChatGPT gave me a correct answer, but writing this for one customer for me is just a precursor to writing it for multiple customers, and that’s why I chose the correlated subquery method of solving it (well that, and neither CTEs nor window functions work in MySQL 5.7). Let me try to get Chat GPT to write the more complicated form of the query:

It’s a correct answer. Not only that, I’ve never seen that particular formatting for join syntax –

FROM
    customer c
        JOIN
    rental r ON c.customer_id = r.customer_id

It really calls out the table names, and I love it. The only criticism I could come up with is that it did not enclose identifiers in tick marks, but I’m pretty inconsistent on that myself.

Finally, I wondered if the SQL would change if I specified MySQL 8, because if I were writing this query for MySQL 8, I would use a CTE. And dang it, ChatGPT is right there with me:

More Challenges for ChatGPT

I’m impressed. I expected minor problems, and having to give hints, including having to specify more about the data model. I don’t know that it’s a tool I would use myself to write SQL. I do think that there are limitations to ChatGPT, and some of those are human. Humans aren’t always great at formatting a request properly or frankly even at knowing what they want. I want to experiment more!

I get the impression that many responses to ChatGPT results are “Well, that’s not wrong”, so I decided to ask it to write a whole blog entry for me. I can see several areas I would cover differently, but most of what it has there is not wrong. Here’s the prompt I gave it:

please write a blog entry of 500 words or more in the style of https://datageek.blog describing the pitfalls of not using ONLY_FULL_GROUP_BY with MySQL and how it can lead to unexpected query results.

And here’s what it came up with:

As a data geek, one of the most important things to understand when working with SQL databases is the concept of a “group by” clause. The group by clause is used to aggregate data, allowing you to perform calculations on multiple rows of data and return a single result. For example, you might use the group by clause to calculate the average salary of all employees in a specific department.

But what happens when you’re working with a MySQL database and you forget to include the “ONLY_FULL_GROUP_BY” mode? The answer is that you may end up with unexpected query results.

The ONLY_FULL_GROUP_BY mode is a MySQL setting that enforces the SQL standard for the group by clause. When this mode is enabled, the query must include all non-aggregated columns in the group by clause. This means that if you’re trying to group data by a specific column, you must also include that column in the group by clause.

For example, let’s say you have a table called “employees” that contains the following columns: “employee_id”, “first_name”, “last_name”, and “salary”. If you want to group the data by the “last_name” column, your query should look something like this:

SELECT last_name, AVG(salary)
FROM employees
GROUP BY last_name;

If you forget to include the “last_name” column in the group by clause, you’ll get unexpected results. Instead of calculating the average salary for each last name, you’ll get an average salary for the entire table.

But the problem is not only the query but also the default mode of MySQL, before version 8.0, that is set to not enable this mode by default. This means that if you’re working with an older version of MySQL, you might not even realize that you’re not including all non-aggregated columns in the group by clause. And this can lead to unexpected results.

Another pitfall is when you’re trying to select non-aggregated columns that are not included in the group by clause. In this case, MySQL will return an arbitrary value for that column. This can lead to even more confusion and unexpected results.

So, how can you avoid these pitfalls? The first step is to make sure that you have the ONLY_FULL_GROUP_BY mode enabled. You can do this by adding the following line to your MySQL configuration file:

sql_mode = 'ONLY_FULL_GROUP_BY'

If you are using version 8.0 or later, this mode is enabled by default.

The second step is to be mindful of the columns that you’re including in your group by clause. Make sure that you’re including all non-aggregated columns, and be aware of the potential pitfalls of selecting non-aggregated columns that are not included in the group by clause.

In conclusion, the group by clause is a powerful tool for aggregating data in MySQL. However, it’s important to understand the potential pitfalls of not using the ONLY_FULL_GROUP_BY mode. By enabling this mode and being mindful of the columns that you’re including in your group by clause, you can avoid unexpected results and ensure that your queries are returning accurate data.

Not bad. I’d like to think it’s not as good as I’d do, but you’ll have to wait for me to finish writing my version of that one to see!

Are you using or experimenting with ChatGPT at all? Share your experiences in the comments below, or with me on Mastodon – I’m @ember@hachyderm.io.

Ember Crooks
Ember Crooks

Ember is always curious and thrives on change. She has built internationally recognized expertise in IBM Db2, spent a year working with high-volume MySQL, and is now learning Snowflake. Ember shares both posts about her core skill sets and her journey learning Snowflake.

Ember lives in Denver and work from home

Articles: 556

3 Comments

  1. Hi Ember hope you’re well,

    wow!! it blew my mind — while reading your blog I was already thinking of potential uses to my daily work 🙂
    I heard in the news about it but didn’t think it was that good… i tried to test it using this link https://chat.openai.com but the site mentions it is unavailable at the moment… is that the link you used to test it?

    Best regards
    Isaac

  2. Hi Ember,
    This is so awesome! I wonder if Chat knows the schemas for other databases such as Db2 or PostreSQL. I’m going to see if it can write a query using the Db2 Sample database. If not, can it import the schema from db2look?

    All the best,
    Larry

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.