Upgrading MySQL using Replica Promotion
(co-authored by Sean Hickey and Nick Fisher)
Introduction
At Skillz, we use MySQL hosted in AWS RDS (Amazon Web Services’ Relational Database Service) as the primary datastore for our backend services. Our platform is used by people around the world at all hours of the day, so there isn’t an easy choice for a maintenance window in which we’d intentionally cause downtime to upgrade our databases.
Why Upgrade?
This may seem obvious, but why should we upgrade our databases? There are some decent, pragmatic reasons not to upgrade. By not upgrading, we don’t incur downtime or expend any engineering effort preparing for the upgrade. Indeed, this line of thinking could make a lot of sense, especially for MySQL, which requires downtime to upgrade.
However, we would be missing out on all the new improvements. MySQL has a strong community that handles patching security vulnerabilities, improving performance, and increasing stability. In our case, we experienced some bugs with MySQL replication that were fixed in a newer version. Occasionally, our replicas would experience a hardware failure, reboot, and fail to restart replication with their primary. The performance improvements, along with this bug fix, provided the return on investment we were seeking to justify the downtime.
Upgrading is Challenging
Our databases are the most critical part of our system. Upgrading them means downtime for the services that depend on them. Rebooting a MySQL database normally takes less than five minutes, so rebooting for an upgrade should only take a few minutes longer, right? Well, according to Percona, upgrading MySQL in RDS can have an unknown amount of downtime due to the internal workings of the InnoDB storage engine. Observations in our load testing environment confirmed this. For us, an in-place upgrade took over an hour to complete because InnoDB had to flush its cache completely to disk before shutting down for the upgrade.
When we tell RDS to upgrade a MySQL instance, the first thing RDS does is take a backup of the database. The backup itself doesn’t cause downtime, but the upgrade’s downtime only starts after the backup finishes. It’s difficult to predict when that will be, which makes it hard to announce an exact maintenance window to our users.
Next, RDS changes a parameter telling InnoDB to perform a slow shutdown. By default, InnoDB does a fast shutdown, controlled by the innodb_fast_shutdown parameter, which lets a MySQL instance running the InnoDB storage engine rely on InnoDB’s recovery mechanism to reload the buffered changes that haven’t been fully persisted yet. When RDS changes this parameter for a slow shutdown, a reboot causes InnoDB to flush all changes to disk, guaranteeing everything is fully persisted before the upgrade runs. This can take an unknown amount of time depending on how many buffered changes there are. At Skillz, we have some large databases with a lot of buffered changes, so this was certainly at the forefront of our minds.
This uncertainty about the downtime of an in-place upgrade led us to look at the recommended alternative: promoting an already upgraded read-replica to become the new primary database. Amazon’s documentation has a section called “Upgrading a MySQL database with reduced downtime,” which describes this process. Conceptually, this replica promotion upgrade matched what we were after and promised the minimal downtime for our users that we wanted. We chose the replica-promotion approach for both this upgrade and future upgrades.
Promoting a Read Replica
The process for promoting a read-replica to become the new primary sounds relatively simple. We first create a new read-replica off of the existing primary database. We can safely upgrade this replica in place; even if that upgrade takes an unknown amount of time, no services are using it yet. We then set the old primary database into read-only mode, promote the replica to become a primary database, and point all our services at the new database. At that point, we can delete the original database, since we have no further use for it.
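For reference, the core RDS operations in this process map to a couple of boto3 calls. The sketch below is illustrative only: the instance identifiers and engine version are placeholders, and the real procedure wraps these calls in many more checks, as described later.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")  # placeholder region

# 1. Create a read replica off the existing primary.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="new-primary",        # placeholder names
    SourceDBInstanceIdentifier="old-primary",
)

# 2. Upgrade the replica in place (safe: nothing is using it yet).
rds.modify_db_instance(
    DBInstanceIdentifier="new-primary",
    EngineVersion="5.7.33",                    # placeholder target version
    AllowMajorVersionUpgrade=True,             # only needed for major upgrades
    ApplyImmediately=True,
)

# 3. Later, once the old primary is read-only and replication has caught up,
#    promote the replica; it stops replicating and becomes a standalone primary.
rds.promote_read_replica(DBInstanceIdentifier="new-primary")
```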
Describing the process is simple, but we want to do this while avoiding any potential errors and minimizing downtime. Since most of our backend systems rely on MySQL, this is a very high-risk operation. We also have an extra bit of complexity not covered in the documentation: our primary database already has several read replicas. This means that before we promote our replica to become the new primary, we’re going to have a multi-tiered replication setup. This isn’t difficult to create, but it introduces an extra challenge of replica lag when replicating through an intermediate database.
This blog post by Jean-François Gagné describes working effectively with this type of intermediate replica setup. However, in our case the setup is temporary, and we specifically want all of our replicas to use the intermediate database as though it were their primary. We might have been able to work around the intermediate replication if we were using Global Transaction IDs (GTIDs), but enabling GTIDs was one of the reasons we needed to upgrade in the first place.
One particularly scary problem with the replica promotion upgrade is data loss. Losing data is not an option. We don’t want some of our services to accidentally write data to the old primary after we stop replicating from it. We want to be absolutely sure that the old primary is read-only, and that all changes have been replicated to the new primary before promoting it. For similar reasons, this makes it very difficult to revert if something goes wrong. If we start writing data to the new primary, we can’t revert back to the old primary without losing those new writes. To reduce these risks, we created a script to handle the replica promotion process for us, and we tested our upgrade procedure multiple times in our load testing environment.
In addition to the risks, another downside of the replica promotion upgrade is the cost. We’d be temporarily running a complete duplicate set of databases, effectively doubling our costs. We also can’t instantly create these new databases and switch to them, because it takes a substantial amount of time to create the new replicas. There is a huge amount of data stored in these databases, and they are under heavy use. Creating a new replica takes several hours, followed by almost an entire day of letting it catch up on the replica lag that built up while it was being created. This extra cost is only temporary, but it’s worth noting that a replica promotion upgrade carries an infrastructure cost that an in-place upgrade does not. So the justification has to cover both the engineering time investment and the AWS cloud bill.
On top of all of this, at Skillz we had started splitting out our databases to separate responsibilities as we moved toward our new microservice architecture. As part of this, we actually have two sets of databases that need to be upgraded. If either set experiences downtime, our overall system experiences downtime. To minimize overall system downtime, we decided to upgrade both sets of databases at the same time. Doing two simultaneous replica promotions didn’t add much complexity to our process, but it did double the amount of setup required before doing the promotions.
Reducing the Risks with Automation
Swapping out our production databases is an extremely risky situation, so we did everything we could to guarantee success. The replica promotion and switchover step seemed like the highest risk, especially for human error. So we wrote a script to automate that process, in order to minimize our overall downtime and to help prevent human error from being a problem.
The script uses the Python Boto3 library to talk to the AWS API. Additionally, we use some direct connections to the MySQL databases to get better visibility on some of the steps. One of the issues with using the AWS API is that the commands will return instantly and handle the request asynchronously. For this script, we need each step to occur sequentially and to guarantee that each step succeeded. If we simply move on and call all the commands, we’ll get errors complaining that the database state is not set to “available.” So our workaround for this was to poll and wait, sleeping until the database state was “available” before moving on to the next step.
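In essence, the polling boils down to something like the sketch below (instance identifiers are placeholders; boto3 also provides a db_instance_available waiter that does much the same thing):

```python
import time
import boto3

rds = boto3.client("rds", region_name="us-east-1")

def wait_until_available(instance_id, poll_seconds=15):
    """Sleep-and-poll until the RDS instance reports an 'available' status."""
    while True:
        desc = rds.describe_db_instances(DBInstanceIdentifier=instance_id)
        status = desc["DBInstances"][0]["DBInstanceStatus"]
        if status == "available":
            return
        print(f"{instance_id} is '{status}', waiting {poll_seconds}s...")
        time.sleep(poll_seconds)
```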
At a high level, the script executes these steps (a sketch of the read-only and replica-lag checks follows the list):
- Ensures both the old primary and new primary are “available”
- Sets the old primary into read-only mode (downtime starts)
- Verifies read-only mode with a direct connection
- Waits for replica lag to catch up (this should be very fast, under one second)
- Verifies the latest entries match on an active table
- Promotes the new primary database
- Waits for the new primary to not be “available” (the promotion has started)
- Waits for the new primary to be “available” (the promotion is done)
- Changes the DNS to point to the new primary (downtime ends)
- Reboots the old primary to kill existing connections
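The script itself isn’t shown here, but the database-side checks look roughly like the sketch below. It assumes direct pymysql connections with placeholder endpoints and credentials; note that on RDS, read_only is typically flipped through the DB parameter group rather than SET GLOBAL, since the master user lacks the SUPER privilege.

```python
import os
import time
import pymysql

def connect(host):
    return pymysql.connect(host=host, user="admin",
                           password=os.environ["DB_PASSWORD"])

def verify_read_only(old_primary_host):
    """Confirm the old primary is actually rejecting writes before moving on."""
    with connect(old_primary_host).cursor() as cur:
        cur.execute("SELECT @@global.read_only")
        assert cur.fetchone()[0] == 1, "old primary is not read-only!"

def wait_for_replication_to_catch_up(new_primary_host, poll_seconds=1):
    """Wait until the soon-to-be primary has applied everything from the old one."""
    conn = connect(new_primary_host)
    while True:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
        if status and status["Seconds_Behind_Master"] == 0:
            return
        time.sleep(poll_seconds)
```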
Writes to the database will fail starting when we set the old primary into read-only mode, until we switch the DNS to point to the new primary – so those two steps are the full database downtime window. However, that’s just the time that the database is unavailable. Our backend services still take some additional time to update their DNS cache and connect to the new database.
One really nice thing about this replica promotion upgrade is that reads are never down. We can continue reading from the old primary database while the new primary is being promoted, because it’s still up and running in read-only mode. Then once the DNS entry switches to the new primary, we can start reading from it instead of the old primary. Even if the DNS takes a little bit of extra time to propagate, our services can continue to read from the old primary in the meantime. While we’ve labeled those steps as “downtime,” technically only writes to the database will fail, so parts of the overall system will continue to operate normally even during this “downtime” window.
Testing the Upgrade
Of course, we’d want to test the whole process before actually doing it on our production databases. We initially did some small-scale tests with throw-away databases to make sure that our promotion/switchover script worked correctly. However, these test databases were tiny, and we were worried that the sheer size of our production databases might cause some unforeseen issues. Luckily for us, we have a load testing environment that is nearly identical to our production environment, so we could do a full-scale test of the upgrade procedure there. We have invested heavily over the years in a Gatling-based load testing framework, which we have integrated into the AWS EC2 ecosystem. This load testing framework allows us to effortlessly simulate any amount of production traffic to give us a realistic setup to test against.
At a high level, this was the whole procedure:
- Create a new replica (“new primary”)
- Upgrade that “new primary” replica
- Create additional replicas based on “new primary”
- Point our DNS entries to use the new replicas (no downtime)
- Run our script to promote and switch to use the “new primary”
- Delete the original databases
We started creating the new replicas in our load test environment, and almost immediately ran into a significant difference from our small-scale tests. Because of the size of our databases, newly created replicas started with a large amount of replica lag that took at least a day to clear. This wasn’t a functional problem, but it did affect the planning around this upgrade: if we had to wait a full day for replica lag to catch up, we’d need to create the new databases several days before they could be used.
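For reference, RDS exposes a ReplicaLag metric in CloudWatch, which is one way to watch that catch-up; a quick sketch, with placeholder identifiers:

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def current_replica_lag(instance_id):
    """Return the most recent average ReplicaLag (in seconds) for an RDS replica."""
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="ReplicaLag",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=datetime.utcnow() - timedelta(minutes=10),
        EndTime=datetime.utcnow(),
        Period=60,
        Statistics=["Average"],
    )
    datapoints = sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else None

print(current_replica_lag("new-primary"))  # placeholder identifier
```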
Once the new replicas were ready, we switched over to them by updating our DNS entries in Route53. The main thing we were worried about with this step was the additional replica lag of replicating through an intermediate database. This was indeed an issue in our load test environment, as our load tests generated a large volume of writes. Having discovered this problem, we didn’t want to run the multi-tiered replication setup in production for an extended period of time. Since switching the DNS entries was relatively quick, we decided to do this step immediately before promoting the new primary in order to minimize the risk of extra replica lag.
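Assuming each replica is reached through a CNAME in a Route53 hosted zone (the actual record layout isn’t described here, so the names below are placeholders), the switch is a single UPSERT per record:

```python
import boto3

route53 = boto3.client("route53")

def point_cname_at(record_name, target_endpoint, hosted_zone_id, ttl=60):
    """UPSERT a CNAME so services resolve record_name to a new RDS endpoint."""
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": "Switch replica DNS for MySQL upgrade",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target_endpoint}],
                },
            }],
        },
    )

# Example (placeholder values):
# point_cname_at("replica-1.db.internal.",
#                "new-replica-1.abc123.us-east-1.rds.amazonaws.com",
#                hosted_zone_id="Z0123456789EXAMPLE")
```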
As we were fighting replica lag, we had a few replica-specific MySQL parameters enabled on our “new primary” replica, specifically the slave_parallel_workers and slave_parallel_type parameters. We did not know if these parameters would cause a problem if we left them enabled while promoting the replica to become a primary database. Rather than guess, we decided to just try it (in our load testing environment, of course).
We had everything set up and ready to run the replica promotion and switchover script in our load test environment. The new replicas were already in use as read-replicas; we just needed to switch to the new primary. So we ran the script.
The old primary database was set to read-only mode, and our services started throwing errors as expected. The new primary was successfully promoted to a primary database and stopped replicating with the old primary. The promotion/switchover script completed successfully after almost ten minutes, and then we monitored our services waiting for the errors to stop. Our services all recovered except for one, which didn’t pick up the DNS change for the new primary. Once we replaced the instances for that service, the errors cleared up and everything was looking mostly okay.
Then, we noticed a significant performance problem after the upgrade: our database inserts were taking a few seconds each. We checked the database logs and found the issue: incompatible parameters. That answered our question about the slave_parallel settings. We modified our promotion script to reset these parameters right before the promotion step.
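On RDS, that kind of reset happens through the DB parameter group. The sketch below assumes the new primary has its own parameter group (the name is a placeholder) and resets the two parameters to their MySQL 5.7 defaults, which is an assumption on our part; the exact target values aren’t spelled out above.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Dynamic parameters applied "immediate" take effect without a reboot.
rds.modify_db_parameter_group(
    DBParameterGroupName="new-primary-mysql-params",  # placeholder name
    Parameters=[
        {"ParameterName": "slave_parallel_workers",
         "ParameterValue": "0",          # MySQL default (assumed target value)
         "ApplyMethod": "immediate"},
        {"ParameterName": "slave_parallel_type",
         "ParameterValue": "DATABASE",   # MySQL default (assumed target value)
         "ApplyMethod": "immediate"},
    ],
)
```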
One other parameter we modified was the binlog_group_commit_sync_delay setting. In our load testing environment, we found that we needed to set binlog_group_commit_sync_delay=4000 on the primary instance, but setting it on the intermediate “new primary” caused our second tier of replicas to fail to catch up on replica lag. After setting binlog_group_commit_sync_delay=0 on the “soon-to-be-primary”, we were able to wrestle replica lag back down to our sub-second SLA.
This article describes how that parameter helps, but it effectively slows down the primary to help speed up the replicas. We need this parameter set to 0 while the “new primary” is a replica database, but right before we promote it, we need to set it to 4000. So, we added this step to the script right alongside the slave_parallel parameters that needed to change before the promotion.
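That flip from 0 to 4000 happens in the same pre-promotion parameter change; continuing the earlier sketch (the parameter group name is still a placeholder, the values are the ones from our testing):

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_parameter_group(
    DBParameterGroupName="new-primary-mysql-params",  # placeholder name
    Parameters=[
        {"ParameterName": "binlog_group_commit_sync_delay",
         "ParameterValue": "4000",  # value wanted once the instance is a primary
         "ApplyMethod": "immediate"},
    ],
)
```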
After fixing up the promotion/upgrade script, we went through the whole upgrade process again in our load test environment. We added a manual step to replace the instances for that one service that didn’t update its DNS. The upgrade ran smoothly, and it resulted in only ten minutes of downtime.
The Production Upgrade
After testing the whole procedure multiple times, we were ready to actually do this in our production environment. We created all the new replica databases to get our full new set of databases ready. We gave them plenty of time to catch up on replica lag to make sure that wasn’t an issue. We planned a maintenance window during a time when we have the lowest amount of traffic, to disrupt the smallest number of users. For us, this was at 2am PT, since most of our users are located in North America. Based on our tests, we planned on a 15-minute maintenance window to be safe. We assembled a small team of people representing multiple teams across the company, including our server team, infrastructure team, SDK team, data team, and player communications team, all to help verify that every part of the system recovered gracefully. We also had AWS support on-call and ready to help if anything went wrong.
Then we executed the plan. At around 1am PT, we began moving the DNS entries for our read-replicas over to the new set of replicas. We made sure other dependent services successfully switched to the new replicas once the DNS was changed. We sent out a Skillz News notification to warn players about the downtime. Then we ran the promotion/switchover script to actually do the promotion upgrade. As we were monitoring our services, we watched the downtime start as expected and our services started throwing lots of errors. After an anxiety-filled couple of minutes, the script finished, and the errors started to go away. We replaced the instances for the one service we knew wouldn’t pick up the DNS change on its own. We had everybody verify their part of the system was working as expected, and we sent a follow-up notification letting players know the downtime was complete.
Surprisingly, the production downtime lasted only six minutes, even better than our tests! All our preparation and testing paid off, making this upgrade go even more smoothly than expected.
If you’re interested in solving large-scale challenges like this, we’re hiring!