I’m a couple of months into working with MySQL, and thought I’d share some resources and thoughts on what I’m learning and what is working. At least half of that time was dedicated to a very in-depth onboarding process that barely mentioned a database at all in about a month of content, so it’s closer to a month in. I’m also working at a company which is running MySQL at an unbelievable scale, so I’m not sitting here spinning up servers or poking around databases on a daily basis. Manual interaction is a bug when you’re in a department of 40+ people supporting petabytes of data running on many hundreds of databases running on thousands of containers and VMS.
Keep in mind as you’re reading this that I’m still new to MySQL and my statements about it may be incorrect or use incorrect terminology. Please share anything you find there in the comments so I can correct it.
Information Sources
The absolutely most excellent source I’ve found to study MySQL as an expert in another RDBMS at the beginning has been the book High Performance MySQL. Yes, the book is focused on performance, but that ends up working out really well because it doesn’t start at “This is a table”. It makes assumptions about basic database knowledge, which is great if you understand some RDBMS. I find that I can read it and easily go off on side quests when I find a term I don’t understand or an area I need to understand better. I’m not yet done reading it, and I mostly expect to read it again later this year or next year, as I suspect I’ll get different things out of it without the side quests.
The official MySQL documentation is excellent for digging into topics deeper, and its organization makes sense to me coming from another platform.
On the topics where I want to go really deep or have an oddly specific question that often comes from a knowledge of Db2 internals, I go to the MySQL Internals Manual. This is way too deep for the basics, but really good at getting much more detailed than the Db2 documentation ever does.
Internet searching is good, and there is a lot of random content out there – more so than with Db2. The key, I’ve found is to use several sources and if there are significant conflicts to either ask someone who knows it or dig into the official documentation. There are a lot of people who are wrong on the internet and just because you CAN do something in MySQL (or any other software) doesn’t mean you SHOULD.
Open Source vs. Proprietary
One thing I’m going to be getting used to is the approach from an Open Source perspective and how that differs from the approach from a proprietary software approach. With IBM Db2, if we have a problem and work through everything we can think of and google on our own, we go to IBM and we open a case, and as painful as that process can be, often get an answer. Sometimes the answer is that you’ve done something stupid and you can just do it differently. Sometimes the answer is “working as designed”, meaning IBM has no intention of changing it. Sometimes the answer is to point you to an existing patch to fix the problem. And rarely the answer is a patch coded just for you to fix the problem. In any of those cases, there is some answer provided to you.
From my mentor, I’m learning that the approach in the MySQL world, even at the expert levels, is different. If we run into a sticky problem, we can dig into the code base and find the problem or suggest a code fix ourselves. What’s not different is that you’re still generally going to write up the issue really well including how to reproduce it and open an issue. In this case you’re hoping that your issue will be patched either with code you provide or some other way by Oracle. It’s not terribly likely that you’re going to fully fork the MySQL code and make your own modifications, as that’s tough to maintain over time.
I have obviously never seen the code that makes up Db2, so this is certainly a new way to look at things. In any case, I’m nowhere near deep enough just yet to do this. I’m still working through reconciling the basics.
Similarities
Conceptually, there are a lot of similarities. I don’t have to understand why we need storage abstraction and what tablespaces are. The basic syntax for DML and DDL is relatively similar. Buffer Pools still make sense in general terms. I don’t need instruction on the concepts of data types and some of the data type choices – I only need to look at the edge cases. ENUM, I’m looking at you.
Differences
From what I’ve seen, MySQL, or perhaps how my organization uses MySQL, seems to have an extreme emphasis on high availability. There is an expectation that we never take a full database outage, even on upgrade. The structure of MySQL tables (clustered on Primary Key, with data stored on the leaf pages of the PK index) leads to not needing reorganization, but instead needing rebuilding when a certain narrower set of types of changes are made. There are tools like LHM out there to help make fully online schema changes more possible. While that’s specific to Ruby on Rails, I’m not aware of any similar tools for Db2 beyond the very manual use of things like admin_move_table.
Reads on read-only replicas are much more stable in MySQL than they are on a Db2 HADR standby, and MySQL replication, if done correctly (for everything, not for specific databases/schemas), can handle structural changes and work on everything under MySQL’s control. There are great tools like ProxySQL to spread reads across large numbers of read replicas. I’ve never heard of someone directing read traffic and write traffic from the same application to different Db2 servers. Specific read-only traffic, yes, but usually that’s reserved for something like reporting, and HADR’s limitations have historically made it fairly universally hated for that purpose, even with a VIP that gets redirected if the standby is down. We’ve pushed IBM on continuous availability in Db2 for years, and while it’s better than it used to be, it’s just not where it could be.
There are things they’re doing with MySQL here that I can picture doing with Db2, but that I’d have to wrap in a LOT of custom code to make work. I’m not sure I can picture a petabyte scale Db2 database with hundreds of shards and at least 3 replicas per shard actually working and scaling the way we’re doing it with MySQL. I could probably shard the databases the same way we do here, at the application layer, but the code I’d have to wrap around replicas and schema changes using any flavor of Db2 replication – I shudder to think of it.
One area that stood out to me with differences was the optimizer. Not surprisingly, the Db2 optimizer seems much more advanced than the MySQL optimizer, along with some of the options on indexing. I’m working with MySQL 5.7, so there are some advancements in 8 that I haven’t seen yet, particularly related to the optimizer. In 5.7, when MySQL says it’s a cost-based optimizer, the main thing it looks at is cardinality. Distribution statistics and frequent values are not a part of the picture. I read that for statistics gathering, MySQL samples 8 pages by default and nearly had a heart attack. Then I the realized if it’s only cardinality you’re looking at, that might not be so bad.
In MySQL 5.7 the only join type is nested loop, so it makes looking at explains entirely different in an interesting way. There’s nothing I can do to change the join type, so I’m largely focusing on each table access and how it is working or not working. Which when you come down to it isn’t all that different than Db2. Reading explains isn’t that hard to pick up. Selectivity is still a thing. Table Scans are still bad. Column order in indexes still matters. Too many indexes still slows down inserts, updates, and deletes. You can still tell the optimizer what to do in comments (and this seems a more common and accepted technique in MySQL). But odd things like the primary key being in every index instead of a RID still take thought.
Some of the advanced query performance techniques aren’t there. Column group statistics aren’t a thing in MySQL. I can’t use statistical views. Even little things like I can’t have column A of the index in ascending order while having column B of the index in descending order.
I am appalled by some of the ways MySQL will let you shoot yourself in the foot. Some of the storage engines are just insane (looking at you, BLACKHOLE), and not using strict mode – why are you even using an RDBMS?
My Journey So Far
It somehow manages to be both easier and harder than I thought it would be. A lot is similar, and I love the concepts and problems that still apply. A lot of the details are different. I have a hard time determining how deep to go on side quests and when to go on them. I’ve started verbally, out loud, alone in my home office, firmly telling myself “NO” when I discover yet another thing I know I don’t understand, but also that I don’t have to understand to address the current issue or understand the current concept. I have a list of links to read a mile long to help with this, and haven’t done much yet of going back and actually reading the list. I have an awesome mentor who has endless enthusiasm to answer my endless questions, and a willingness to very slowly work through real world problems while I ask the most basic questions.
As someone who thrives on change and loves learning, I am loving drinking from the firehose and quite happy with the change I’ve made.
I plan to start writing on some MySQL specific topics soon, and am also toying with the idea of a separate site/blog that can help translate concepts between different RDBMS platforms. I have so much to write about!
Thanks for sharing!
I hope someone working in Markham reads your blog and someday Db2 will also be able to do parallel reads on multiple shards. But maybe it will just be replaced in a few years like OS/2 had to go down that alley, because it didn’t fit anymore in.
Hi. I liked so much this article. I have a similar reality. So I could find some help to understand better how I have to use a new RDBMS. Thanks.
When you read stuff like this, we can understand why the market share of DB2 is getting less and less every day.
Thanks for sharing!
This is reminiscent of my experiences as a long-time Oracle DBA having to take tangents into other RDBMSs for one reason or another – including into DB2 when IBM bought the company where I was working. Thanks, Ember, as you were my go-to for several things I did not understand about DB2. My general overview is that the databases are the same, but different. Ultimately, the goal is maximize the RDBMS that you are given, warts and all. Interesting to learn some parts, painful to learn others, and some are firmly in the “Wha?!?” category!
Good article Ember! I am left now wondering how IBM’s other RDBMS, Informix, matches up. From my own experience, which admittedly does not include MySQL, it has most of the features that MySQL offers that you missed in DB2 as well as those included in DB2 that you are missing in MySQL including sharding, multiple secondaries for each server node in the cluster, querying the entire cluster from any node including read-only secondaries, multiple join plans (nested-loop join, hash-join, star schema join). detailed and configurable data distributions, advanced optimizer, in-place table alters, server upgrades with no downtime, database migration with no downtime, etc.
I have worked a tad with Informix, but not as an administrator. Have stored/queried Db2 performance data in it for a few years, so I don’t have a lot of insight. I do know that I needed the LAG function and it came to Informix later than it was in Db2. Also, I know that HCL is doing some stuff with Informix, and that their OneDB product looks to be essentially a repackage of Informix. It was not at the top of my list of systems to learn, as I was looking for something with wider new adoption than either Db2 or Informix, both of which aren’t generally even in the conversation for new applications, outside of companies that are really dedicated to big blue.
I can say that working with MySQL has me more curious about other relational database platforms. So many people ask “Which RDBMS is best?”, which is truly an unanswerable question. For each platform, I could find someone who loves it and someone else who hates it. It is amazing how many of are perceptions are shaped by the technology we work with early in our careers.
I am working with MySQL for 20 years now, and reading your outside perspective is extremely enlightening. Booking.com uses MySQL at a smaller scale that where you are at now: We have only a few 1000 instances, and a few hundred replication trees, but we also are also fewer DBAs — around 18 at peak, a lot less + consulting now.
Everything is automated, we are running databases on unraided frontend blades with local NVME storage, which come in at around 120 Euro/month total cost. We leverage redundance from scaleout in replication to achieve High Availability, and we treat instances like cattle. I have some articles on that in the mysql tag at my Blog (“DevOps meets Databases” and “Databases on un-RAID-ed storage?”).
When experimenting with RDS in Amazon, I was shocked to see how backwards operations there are. No proxy, file based storage redundancy, and downtimes. Things that we left behind when GTID based replication became a thing.
MySQLs executor and optimizer are primitive, I agree. We usually manage to make the data fit, but some queries are slow when they need not be.
“Some of the storage engines are just insane (looking at you, BLACKHOLE)” There are a few use-cases for BLACKHOLE: If you just want the binlog from replication, ie to drive a CDC data source such as Debezium, there is no need to actually store the data. It comes very handy in such cases. Otherwise it is just weird, yes.
“not using strict mode – why are you even using an RDBMS” It wasn’t. With old MyISAM tables, there is no such concept as transactions, and therefore also no rollback. When you are doing a multi-row insert or similar stuff, once you have started there is no stopping, because you cannot roll back. Strict Mode is a leftover from these times, and we are still trying to forget. Set it as strict as you can (it isn’t strict enough even now), and enforce that. Also, death to nontransactional storage engines!