I have answered this question so many times and in so many different ways. Conceptually, my article Db2 Basics: The Role of Databases covers the concept of a database or database management system, why we need them and how we use them. But it never fails to amaze me how we in the database discipline use the same term in so many different ways. I could cite examples of nearly every term I use and how a different platform uses it in a different way. This overloading of terms may make it easier for those from the outside looking in, but I suspect it adds confusion to most who do more than scratch the surface.
There are things like the SQL language that we have standards to define, but there is still a lot of wiggle room within those standards. I’ve seen several things in recent years that make me aware of that wiggle room. The first one that comes to mind is when defining an index in MySQL (5.7), you can specify ASC or DESC for each column, but realistically all columns were at the time I was working with it in either ASC or DESC order. A mix of them was allowed in the syntax, but not actually respected in the index structure.
The general concept of a Table is fairly universal, but there are interesting directions that different data types and constraints can take that concept.
ACID is another example. I’ve learned over the years that you can’t just take a platform’s claim to be “ACID Compliant” as a fact, but if the details of transactions matter to you, you have to really dig in and understand all the asterisks that the various platforms and methodologies put on that claim. It is amazing how you can bend a term nearly to its breaking point and still use it. I can say from experience that most platforms do this.
IBM Db2 is no exception. You’ll hear/read IBM say that Db2 uses MVCC, but it is only really true for a specific isolation level and set of parameters, and even that may be for a narrow definition of the terms involved. A lot of the arguments around these terms get into details that I’m not sure even I am qualified to speak intelligently on without some research time and reaching out to a few friends first. These are complicated problems that people have been working on for decades.
My Experiences with the Term Database for Different RDBMS Platforms
I learned some Oracle in college. I had an excellent set of courses that actually included administrative tasks, and it was wonderful preparation for the role I landed at IBM straight out of college working with Db2, even if the actual facts learned were not directly applicable. I then spent a solid 8 years not even looking at another RDBMS. This allowed me to build a strong expertise with Db2, and it would be another 10 years before I really tried to learn another platform very deeply. Somewhere in there, I did learn a bit of Oracle, and for a time had my OCA certification. One of the really stark moments I remember is sitting in an Oracle class, and listening to the really stark line drawn between a database and an instance. In Oracle (at least at that time – 11 was current), a database was the files while an instance was the processes and memory.
My primary platform at the time – Db2 – had no such easy line to draw. Databases and instances both had files, configuration, processes, and memory areas. True, the actual data files belonged to the database, not the instance. This made Db2 more flexible in some ways, with multi-database instances being commonplace, and moving a database from one instance to another being relatively easy. But it also made it very hard to describe the difference between an instance and a database. Understanding when I needed an instance attachment as opposed to a database connection was very confusing when I was new to Db2.
With both Db2 and Oracle, connections are made to a specific named database, and when connected to that database, you cannot natively query other databases in the same instance or on the same server without establishing a separate connection or setting up some process like federation to do so.
I spent some time with MS SQL server and the line was still different. Now I connect to an instance, but databases were a bit more akin to schemas or tablespaces in Db2. There was a system database. But everything is just in dbo anyway. Honestly, I didn’t spend enough time with MS SQL Server to fully understand where they draw the lines. Most of my MS SQL Server experience is on RDS.
I spent an intense year working with MySQL in a high volume OLTP environment. For MySQL there are databases, but database is just a straight synonym for schema. You’re connecting to the server (which is essentially the same thing as the instance for Db2 or Oracle).
Snowflake’s Databases and Warehouses
Snowflake treats the term database as a layer of abstraction above the schema. From an IBM world, this is more akin to Netezza’s double dot notation, allowing the specification of the database when specifying an object in a query. I can very much query and join tables from different databases together without a problem if I want to. But if I want to join with data from another account or even organization in a single query, I need to set up data sharing. Snowflake comes from a base of developers with deep Oracle expertise, and as it fully decouples compute from storage, there is a need to use terms a bit differently. When learning Snowflake as someone with deep administrative and engineering experience with other database platforms, this takes quite a bit of adjustment. In Snowflake, the compute (processes and memory) is in the warehouse. The warehouse does not contain data. Data does not belong to a warehouse. Any warehouse can access any data as long as the role’s permissions allow it.
So What is a Database, Really?
Does any of this even trip up developers who just want a place to stash and retrieve data and not have to think about it? I am guessing that it mostly doesn’t even pop up on their radar. They need to be able to look up a database url format for a platform, and want to store and retrieve data as painlessly as possible. I suspect it does take some effort to understand where the boundaries are that may require data movement or configuration of federation.
This is a question that maybe strikes at my heart more than it may strike at the hearts of others. I’ve built a career on databases and yet, I can’t tell you what a database is without some context. I actually hold the controversial opinion that an MS Excel workbook is a database. Probably not a relational one. Definitely not one that performs optimally at scale. It’s certainly not a full-on RDBMS, and forget multi-user.
In the first full-time job I had as an accounting clerk, the database I used was a filing cabinet. MS Excel has largely replaced that filing cabinet, and is vastly more powerful and user friendly.
I’d love to hear your definition of a database in the comments below!
Could I describe what a database is in less than 1200 words? Maybe, but it wouldn’t be a very complete definition. Ask me again in 10 years and see if my answer has changed!