As a startup CTO, one of the most challenging parts of my job is being on call. As the team's technology lead, I am responsible for all aspects of our technology, and that covers DevOps.
When the site is experiencing abnormally high load and you get paged, it's time to jump into the fray. Into the hot seat. Where every second counts. I often chat with DevOps engineers at meetups; each has a story of an extraordinarily complex, time-sensitive production issue they had to conquer. They read like battle reports. While the details of our own system are highly specific, it's encouraging to know that I'm not alone in going into battle.
I’d like to recount a battle report of my own.
I was paged on a Monday morning at 9:15am because the logs were full of MySQL errors. I tailed the logs and saw the same. I checked the production site; it was down. Upon opening mytop on all of the DB servers, I saw that the slave lag on [Server 2] and [Server 4] was very high. They were hovering at [redacted] seconds, which is the cutoff for whether our cron jobs will still query them. Since our background batch jobs were running, I intuited that they were pushing too many writes to the master server.
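That cutoff rule is simple enough to sketch. Here's a minimal Python version of the idea; the threshold value and server names are hypothetical, not our actual config:

```python
# Sketch of a replica-lag cutoff: route reads to a replica only when its
# reported lag is under a threshold, otherwise fall back to the master.
# LAG_CUTOFF_SECONDS and the server names are made up for illustration.

LAG_CUTOFF_SECONDS = 60  # hypothetical cutoff

def pick_read_server(replica_lags, master="server1"):
    """replica_lags: dict of server name -> Seconds_Behind_Master
    (None means replication is broken on that replica)."""
    healthy = [
        name for name, lag in replica_lags.items()
        if lag is not None and lag < LAG_CUTOFF_SECONDS
    ]
    # Prefer the least-lagged healthy replica; otherwise hit the master.
    if healthy:
        return min(healthy, key=lambda name: replica_lags[name])
    return master
```

The failure mode in this story falls out of that last line: once every replica is past the cutoff, every read lands on the master.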
Since I've battled slave lag before, I know that it is caused by too many MySQL writes, writes that are too slow, or both. In this case, it was a combination. Because the MySQL slave SQL thread applies replicated queries sequentially, the slaves got way behind. You can see the slave lag thread in this screenshot of mytop as ID #5453272.
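To see why a single-threaded applier digs itself into a hole, here's a toy Python model; the rates are made-up numbers, not measurements from our system:

```python
def simulate_backlog(writes_per_sec, applies_per_sec, seconds):
    """Toy model of a single-threaded slave applier: writes arrive at
    writes_per_sec, the applier drains at most applies_per_sec each
    second. Returns the backlog (unapplied writes) after `seconds`."""
    backlog = 0
    for _ in range(seconds):
        backlog = max(0, backlog + writes_per_sec - applies_per_sec)
    return backlog
```

At 60 incoming writes/sec against an applier that can only keep up with 50/sec, the backlog grows by 10 writes every second and never recovers until the write rate drops. That's the shape of the lag we were watching climb in mytop.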
In any case, I watched the slave lag thread for a few minutes and intuited that [Feature 1] and the [Feature 2] job were both causing issues. Fixes here and here. There was also a rogue MySQL query being executed on the command line by the root user: a month ago I had set up a job to migrate the [Feature 3], and it never exited after they were all migrated.
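For the curious, this is roughly how you'd spot that kind of runaway statement programmatically rather than by eyeballing mytop. A Python sketch over rows shaped like `SHOW FULL PROCESSLIST` output; the field names mirror MySQL's columns, but the threshold and rows are hypothetical:

```python
LONG_RUNNING_SECONDS = 3600  # hypothetical threshold

def find_runaway_queries(processlist):
    """processlist: list of dicts shaped like rows from
    SHOW FULL PROCESSLIST (User, Time in seconds, Info = SQL text).
    Returns rows that look like long-running ad-hoc root statements."""
    return [
        row for row in processlist
        if row["User"] == "root"
        and row["Time"] > LONG_RUNNING_SECONDS
        and row["Info"]  # skip idle connections (Info is None)
    ]
```

Anything this flags is a candidate for `KILL <id>` after a human sanity check; I wouldn't automate the kill itself.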
[Engineer] then put the site in maintenance mode. It's actually a really simple change that we make to the app code. If I remember correctly, there's also a way to modify the nginx config to force this condition, but we find app-level changes to be a little more straightforward.
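The app-level switch amounts to a flag check at the top of request handling. A minimal sketch in Python (our app is PHP; the flag-file path and response text here are invented for illustration):

```python
import os

MAINTENANCE_FLAG = "/tmp/maintenance.flag"  # hypothetical path

def handle_request(path, flag_file=MAINTENANCE_FLAG):
    """If the maintenance flag file exists, every request gets a 503
    and a static page; otherwise the request proceeds normally."""
    if os.path.exists(flag_file):
        return 503, "We'll be right back."
    return 200, f"normal response for {path}"
```

Returning 503 (rather than 200) matters: it tells load balancers and crawlers the outage is temporary. Dropping or deleting one file toggles the whole site, with no deploy required.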
After I committed those fixes, I noticed that [Server 1] (the MySQL master) was still overloaded. It seemed past the point of no return. Since the slaves were all past their acceptable slave lag, the application was querying only the master server, which was the root cause of the overload. [Server 1] had a load average of 20! No good, especially since the MySQL master is one of the single points of failure in our application. I then restarted the MySQL daemon on [Server 1]. That didn't work; the process had run away and was frozen. I then executed a hard reboot on [Server 1]. It was taking forever to come back online, so I had [Engineer 1] live chat with [Web Host], and they told me it was executing a file system check to prevent data corruption.
At this point, [my boss] was gchatting the hell out of me, telling me he was in an important meeting with [important industry bigwig] and he needed to see the site. Major facepalm! You do not want to fuck up your boss's important meeting. So I switched [Server 2] to be the master server (the line of code that I changed is here). Note that [Server 1] has a master-master relationship with [Server 2]: all queries executed on [Server 2] will eventually be executed on [Server 1] and vice versa. All of the other servers ([Server 3], [Server 4], etc.) have a master-slave relationship, so you should never ever ever ever set them as the master server, or else the databases will get out of sync.
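Encoding that "never promote a plain slave" rule in the failover path is cheap insurance against a 9:15am mistake. A Python sketch with a hypothetical topology map (the real topology lives in our DB config, not in code like this):

```python
# Hypothetical topology map: True means the server is in a master-master
# pair with the current master and can safely be promoted; False means
# it is a plain read slave, and promoting it would de-sync the databases.
TOPOLOGY = {
    "server1": True,   # current master
    "server2": True,   # co-master (master-master with server1)
    "server3": False,  # read slave
    "server4": False,  # read slave
}

def promotable(server):
    """Refuse to promote anything that isn't a co-master."""
    if not TOPOLOGY.get(server, False):
        raise ValueError(
            f"{server} is a plain slave; promoting it would de-sync the databases"
        )
    return server
```

A guard like this turns a catastrophic typo in a failover change into a loud, immediate error.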
At this point, [Server 1] came back online. I reverted the commit to the DB layer.
Oh, and also: php-fpm and nginx were restarted on the app servers a few times. We needed to do this because, as MySQL queries get backed up, php-fpm and nginx processes pile up on the application servers, which then become overloaded themselves.
And that was that. Site came back online and everything was sunshine, unicorns, and rainbows.
For future reference, you can find a list of 'on call resources' in our documentation, here.
One thing I’ve been concentrating on recently is providing adequate information to my on-call engineers so that they can make the right decisions when they are in the hot seat. The hot seat requires a combination of experience, critical thinking, and mindful action. It requires you to use a scalpel, not a knife. It requires you to communicate timelines to your clients. It’s live surgery, and you are the surgeon.