This post is cross-platform and applies to all RDBMSes.
It is interesting to see, as a person experienced with a technical tool, some of the disadvantages of the newer ways of learning technical topics. When I started in IT, you learned some of the concepts in college, you learned some things working on the job with other DBAs who knew what they were doing, and you filled in the gaps by going to week-long classes once a year to get some detailed and hands-on official training. If you were lucky, you made your mistakes in non-prod environments before you could make them in production ones, but even with that, most DBAs I know managed to make a big mistake in prod every couple of years for at least their first decade of working as a DBA. I learned so much from the mistakes others made, and made mistakes of my own to learn from.
Today’s approach to learning seems to be much more focused on figuring out what you need from online sources – YouTube and Stack Overflow. This sometimes means you learn the exact thing you’re looking for, but you may not realize its greater implications.
Don’t get me wrong, I learn this way myself today. I can’t imagine the mistakes I’m making that I don’t even realize in Python and R as I learn them on my own. I’m not lamenting the “good old days” or advocating a return to an obsolete training model. What I would like to do is address some of the gaps in my field of technical expertise when I see them. Hopefully this is helpful.
What is a Commit?
I tend to lurk in database and SQL-related forums on Reddit, and one day someone described a problem they were seeing when using DBeaver to access a PostgreSQL database. Sometimes the results of CURRENT_TIMESTAMP were current, and other times they were up to 30 minutes old. As a DBA, I know that CURRENT_TIMESTAMP remains constant across a transaction, so I asked them about committing. They replied, “I’m relatively new to sql, what do you mean by commit?”. As a DBA, this is one of the last things you want to hear from a developer, and yet I would estimate that about 25% of the pages my team received last year were due to long-running transactions that developers had failed to commit or roll back.
When you retrieve or change data in a relational database, multiple statements are often grouped into a transaction. The work you ask the database to do is not externalized until you either explicitly or implicitly issue a COMMIT. Some tools used to access databases take care of this for you by using autocommit. Autocommit means that every statement you run against the database is committed immediately after you issue it. It is a setting that can be enabled in many tools, including at the command line. If you’re not coming from a database background, this is probably what you expected to happen to begin with. However, some access methods, particularly GUIs, do not use this behavior by default; DBeaver seems to be one of them. If you’re using SQL Magic with Python, it always autocommits, and there’s no way to change that. With DBeaver and other tools, it is just a matter of finding the right checkbox in the settings to enable autocommit.
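You can see the difference between autocommit and manual commit with nothing more than Python’s built-in sqlite3 module. This is an illustrative sketch, not any particular tool’s behavior: the table name `t` and the scratch database file are made up for the demo, and SQLite’s locking details differ from enterprise RDBMSes, but the visibility rule is the same – a second connection cannot see your change until you COMMIT.

```python
import sqlite3, tempfile, os

# Scratch database file so two independent connections can share it.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(path)  # default mode: DML starts a transaction you must commit
reader = sqlite3.connect(path)

def count(conn):
    # fetchall() fully drains the statement so no read lock lingers
    return conn.execute("SELECT COUNT(*) FROM t").fetchall()[0][0]

writer.execute("CREATE TABLE t (x INTEGER)")
writer.commit()

writer.execute("INSERT INTO t VALUES (1)")  # opens an implicit transaction
before = count(writer is reader and writer or reader)  # other connection can't see it yet
writer.commit()                             # the COMMIT externalizes the change
after = count(reader)

auto = sqlite3.connect(path, isolation_level=None)  # autocommit mode
auto.execute("INSERT INTO t VALUES (2)")            # visible immediately, no commit needed
final = count(reader)

print(before, after, final)  # 0 1 2
```

The uncommitted insert is invisible to the reader; the committed one is visible at once; and in autocommit mode every statement behaves as its own committed transaction.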
Why Group Statements into Transactions?
The first thing to understand is why databases even work this way. I ran the command, I wanted it to happen, right?
The first answer is the A in ACID: Atomicity. Because relational databases break up information about related entities into different tables, a logical transaction often has to affect more than one table, and therefore consists of more than one statement. The simplest example of this is a banking transaction. Let’s say a simple transaction consists of taking money out of Paul’s account and putting that money into Melanie’s account – a transfer. Each of those actions is a separate statement, but if I take the money out of Paul’s account, and then a failure happens and the money is never placed into Melanie’s account, that is not acceptable to the bank, or to Paul and Melanie!
To prevent that kind of failure, we group the two statements into a transaction and make sure that if one of the actions fails, both actions fail. This example is simplified; a real-world transaction may consist of dozens, hundreds, or even thousands of statements.
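The transfer example can be sketched in a few lines of Python with sqlite3. The account names, balances, and the simulated failure are all invented for illustration; the point is the shape of the pattern – both UPDATEs commit together, or the ROLLBACK undoes the debit too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # default mode: DML must be explicitly committed
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("Paul", 100), ("Melanie", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts: both updates succeed, or neither does."""
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        # Simulate a failure between the two statements:
        if dst == "nobody":
            raise RuntimeError("destination account not found")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        conn.commit()    # both changes become permanent together
    except Exception:
        conn.rollback()  # undo the debit as well
        raise
```

A successful `transfer(conn, "Paul", "Melanie", 30)` commits both sides; a failed transfer rolls back, and Paul’s balance is untouched even though the debit statement already ran.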
When you connect to a database without autocommit, a transaction begins. Every action you take from that point until the next COMMIT or ROLLBACK statement is part of that same transaction. When you disconnect, any outstanding work is either committed or rolled back, depending on the tool you’re using.
Locking and Isolation Levels
When you have executed statements, but you haven’t issued a COMMIT, other connections may or may not be able to see the changes you’ve made, depending on their isolation level. An isolation level controls whether you can see dirty (uncommitted) data, and also controls the kinds of locks you acquire on data as you query or change it.
You can lock data even by only selecting it, depending on your access tool and the settings/isolation levels used. For more on isolation levels, check out DB2 Basics: Isolation Levels and Concurrency Phenomena. While that blog entry is specific to Db2, it also includes a table that shows isolation level names across several enterprise RDBMSes. Different platforms differ a bit in the isolation levels and how they are implemented, but the concepts and concurrency phenomena are universal.
The idea of locking data by selecting it is most frequently associated with the Repeatable Read or Serializable isolation levels. Some database access tools default to these levels unless you change them.
Why would you want this behavior? In some code, you might query data before you change it, and you may not be able to tolerate that data changing between the time you query it and the time you change it. When you get into the nitty-gritty of concurrent access to the same data by thousands of different connections, it gets messy. The good news is that this is largely a solved problem in RDBMSes: if you need concurrent access with the ability to change the same data, a relational database management system will usually handle it better than any other data storage methodology out there. This is also the “I” in ACID – Isolation.
There is yet another way you can use this transaction control behavior, but you have to be careful with it. If you have autocommit off, you can run a statement, and then, from the same connection, query to see what the results of that statement are before you issue the COMMIT. If you discover that you don’t like the results, you can instead issue a ROLLBACK command to undo all the work you’ve done in that transaction. Once you’ve committed changes, they can’t be rolled back without features specific to one RDBMS or another.
The reason you have to be careful is that certain statements (DDL, on some platforms) may trigger an implicit commit, so get some practice before you rely on this technique.
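The check-then-decide pattern looks like this in an illustrative sqlite3 sketch (the `orders` table and its rows are made up for the demo). The delete affects more rows than intended, we inspect the result from the same connection, and a ROLLBACK puts everything back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # manual commit for DML
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "open"), (2, "open"), (3, "shipped")])
conn.commit()

cur = conn.execute("DELETE FROM orders WHERE status = 'open'")
affected = cur.rowcount  # 2 rows gone -- was that what we meant to do?

# The same connection sees its own uncommitted work:
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchall()[0][0]  # 1

conn.rollback()  # we didn't like the result, so undo the whole delete
restored = conn.execute("SELECT COUNT(*) FROM orders").fetchall()[0][0]   # 3
```

After the rollback, it is as if the delete never ran. Once `conn.commit()` has been issued instead, no ROLLBACK can bring the rows back.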
COMMIT has another main purpose: it is the point at which your data changes are written out to disk, or externalized. Some people who work with databases don’t realize it, but unlike other storage methodologies, where I can only restore to the last time a backup was taken, most relational databases can be restored to any given microsecond. They use varying logging and disk-writing methodologies to achieve this, and there are absolutely ways you can configure them so they don’t behave this way, but this is yet another clear advantage of an RDBMS over some other storage systems.
Transaction Size and Logging
Be careful about how much you do in one transaction. Because of logging, if you do too much in one transaction, you can fill up the log space for the database and make the database unavailable until the transaction is rolled back. One of my early DBA mistakes was issuing a huge delete that ran for three hours before the transaction logs were fully consumed. The delete then had to roll back before any update activity against the database would work, which took another two hours. This was for an e-commerce website and constituted a two-hour outage for that site. Deletes are particularly problematic, but updates or inserts can also run into issues in this area. Always check how many rows will be affected before you run a delete or update statement so you don’t learn this lesson the hard way, too.
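One common way to keep a big delete from consuming all the log space is to break it into batches and commit after each one. Here is a hedged sqlite3 sketch of the idea; the `events` table, the `expired` flag, and the batch size of 1,000 are all invented for the demo, and on a real platform you would tune the batch size to your log configuration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, expired INTEGER)")
conn.executemany("INSERT INTO events (expired) VALUES (?)",
                 [(1,) for _ in range(2500)])
conn.commit()

BATCH = 1000
batches = 0
while True:
    # Delete at most BATCH qualifying rows per transaction.
    cur = conn.execute(
        "DELETE FROM events WHERE id IN "
        "(SELECT id FROM events WHERE expired = 1 LIMIT ?)", (BATCH,))
    if cur.rowcount == 0:
        break          # nothing left to delete
    conn.commit()      # each batch is its own transaction: bounded log use
    batches += 1
```

Each COMMIT releases the log space the batch used, so a failure partway through rolls back only the current batch instead of hours of work.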
What do you need to do with this information? The first thing I, as a DBA, would ask is that you understand whether your application or activity can tolerate dirty data, and if you can, configure your isolation level properly. The second thing is to COMMIT frequently, and NEVER leave a transaction open when you step away from your computer. In most cases, autocommit is your friend, and you should enable it.
Terms and Further Reading
The language around managing transactions is part of the SQL standard, and is referred to as Transaction Control Language (TCL). TCL consists primarily of the COMMIT command and various forms of the ROLLBACK command. It is also possible to set SAVEPOINTs to roll back to, if there are multiple failure scenarios in the same transaction.
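SAVEPOINTs can also be demonstrated with sqlite3. In this sketch (the table and savepoint names are made up), rolling back to a savepoint undoes only the work since that point, while the rest of the transaction survives to be committed:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions by hand
conn.execute("CREATE TABLE t (x INTEGER)")

conn.execute("BEGIN")
conn.execute("INSERT INTO t VALUES (1)")
conn.execute("SAVEPOINT before_risky")    # a partial-rollback point
conn.execute("INSERT INTO t VALUES (2)")
conn.execute("ROLLBACK TO before_risky")  # undo only the second insert
conn.execute("COMMIT")                    # the first insert is committed

rows = [r[0] for r in conn.execute("SELECT x FROM t")]
print(rows)  # [1]
```

A plain ROLLBACK would have discarded both inserts; the savepoint lets you recover from one failure scenario without abandoning the whole transaction.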
For Db2, the most often failed section of the certification tests is the one on locking, concurrency, and isolation levels. This is one of the most critical things to understand for your platform if you are designing an application with any concurrency of data access at all. A few links by platform:
Let me know in the comments below if there are any other resources you have found useful when learning about concurrency, isolation levels, locking, and transaction control.