Last week, Zoosk’s operations team said goodbye to the data center that has been our home for the past seven or so years. This migration was the product of months of planning that culminated with a rather tense hour of down time as we moved to our new home. Not everything went according to plan, but we counted on that, and at the end of the day the move went better than we could have possibly hoped.
Our goal was to retire our primary production data center and move all services over to a newer facility. Fortunately, the new facility had already been built out a couple of years back as a disaster recovery site, so the hardware was there, but we had never passed real traffic through it and we knew we had a lot of work to do to make it ready. We made some major decisions along the way: Zoosk had never taken intentional down time in over eight years of service. We took some calculated risks: it is impossible to know for certain how real users will respond to a down page. And we practiced. Oh, did we practice.
From the beginning, Zoosk has had a philosophy that “down time” has no place in modern web application development. Sure, we have had a few outages over the years, but they have fortunately been very brief and mostly due to outside forces. We do not take the site down to deploy code or change things. We do rolling restarts, we swap symlinks, or we drain and undrain pools of Docker containers.
So when we proposed 25 minutes of down time to transfer our entire production environment from one data center to another, eyebrows were raised. Could we do it without an interruption? Yes, of course we could, but we chose not to because the cost to migrate without down time would have been greater.
To migrate without down time would have meant moving piece by piece. We would have had to pick manageable chunks of infrastructure and move them one at a time to their new home. The problem is that this process would have created a large number of unknown scenarios. At some point, we would have had to move a database that needed to be accessed from both the old and new DCs. That raised a concern about what would happen when database call latency went from a fraction of a millisecond to multiple milliseconds. So then we considered replicating the database. But that raised a concern about the impact of multiple milliseconds of replication delay. Either effect would have been easy to reason about in an individual case, but we weren’t sure what would happen with hundreds of tables, stored procedures, and service tiers interacting. We would have introduced a lot of performance risk, and a high risk of subtle race conditions.
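To make the latency concern concrete, here is a rough back-of-the-envelope sketch. The numbers are purely illustrative assumptions (50 sequential database calls per request, 0.5 ms round trips inside one data center, 4 ms across two); they are not measurements from our systems, and `request_db_time_ms` is a hypothetical helper.

```python
def request_db_time_ms(sequential_calls: int, round_trip_ms: float) -> float:
    """Time one request spends waiting on sequential database round trips."""
    return sequential_calls * round_trip_ms


# Illustrative assumptions only, not measured values.
CALLS_PER_REQUEST = 50
SAME_DC_ROUND_TRIP_MS = 0.5
CROSS_DC_ROUND_TRIP_MS = 4.0

same_dc = request_db_time_ms(CALLS_PER_REQUEST, SAME_DC_ROUND_TRIP_MS)
cross_dc = request_db_time_ms(CALLS_PER_REQUEST, CROSS_DC_ROUND_TRIP_MS)

print(f"same DC:  {same_dc} ms")   # same DC:  25.0 ms
print(f"cross DC: {cross_dc} ms")  # cross DC: 200.0 ms
```

Under those assumptions, a few extra milliseconds per call becomes a large slowdown per request, and that is before accounting for how the slower calls interact with locks, queues, and timeouts elsewhere in the stack.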
If we moved everything at once, there would be more risk during that one move, but the new infrastructure would be roughly the same logical layout. And that was a big upside. If anything, some parts of the new infrastructure would be faster. So, at all times, the relative performance characteristics of the system would be familiar.
After deciding to move everything at once, we knew that the risk of something going wrong was high, so we did a number of things to compensate.
Anything that needed doing during the process had a single owner. That person may not have been able to complete the task alone, but the buck stopped with them. We were very clear about who owned each task and wrote it down. I liked to ask each person to repeat their deliverable back to me, just to make sure it was clear. A little annoying maybe, but this really helped to ensure that small tasks did not fall through the cracks.
We checked everything in the new environment from physical hardware all the way through the application. This included physical checks, hardware comparisons, OS and package validation, network layout and config review, firewall rules, deployed code checks, and application behavior validation. This turned into a huge punch list that we could track. Punch list items rolled up to leads, who were responsible for signing off on their respective layers.
Pairing daily status updates with weekly check-ins gave us a good combination of daily awareness and regular opportunities for course corrections. The weekly check-ins were a sit-down hour, and they gave us a good opportunity to discuss upcoming work and work in progress. Every one of these meetings, up to the last week, uncovered at least one important thing we had not considered.
It occurred to us early on that the process we were envisioning bore a lot of similarities to a shuttle launch. In both cases, the critical event is actually a series of many individual tasks that are carried out rapidly and in a careful order. The status of those tasks needs to be bubbled up to a central decision maker in a clear and concise manner, decisions need to be made quickly and clearly, and alternative scenarios need to be thought through and planned out in detail.
So we read everything we could find about how NASA creates and conducts their launch procedures, and what we ended up with looked very much like a shuttle launch.
We created a Trello board with a column for each time window, starting at T -8 hours, running through T -60 minutes and T -0, when the site would come down, and ending at T +12 hours, when we would be completely done cleaning up. Each column contained a number of cards, each with an owner and a checklist. Only when all of the checklists were complete, or the flight director had approved an exception, could we proceed. The transition would then be announced, and we would move into the next time window. This allowed dozens of people’s work to be timed and managed efficiently.
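The go/no-go rule for each time window can be sketched in a few lines of Python. This is only a model of the rule as described above, not the tooling we used (the real thing was a Trello board and people talking to each other); the class names, field names, and sample cards are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Card:
    """One task on the board: a single owner plus a checklist."""
    title: str
    owner: str
    checklist: dict[str, bool] = field(default_factory=dict)
    exception_approved: bool = False  # set only by the flight director

    def is_clear(self) -> bool:
        # Clear when every checklist item is done, or an exception was granted.
        # A card with an empty checklist counts as clear.
        return self.exception_approved or all(self.checklist.values())


@dataclass
class Window:
    """One column on the board, e.g. "T -60 minutes"."""
    name: str
    cards: list[Card] = field(default_factory=list)

    def blockers(self) -> list[Card]:
        """Cards that still prevent moving to the next window."""
        return [card for card in self.cards if not card.is_clear()]


def can_proceed(window: Window) -> bool:
    """Go/no-go: proceed only when no card in the window is blocking."""
    return not window.blockers()


# Hypothetical example cards.
window = Window("T -60 minutes", [
    Card("Post maintenance notice", "alice", {"notice staged": True}),
    Card("Confirm database sync", "bob", {"sync verified": False}),
])
print(can_proceed(window))                   # False
print([c.owner for c in window.blockers()])  # ['bob']

window.cards[1].exception_approved = True    # flight director signs off
print(can_proceed(window))                   # True
```

The useful property is that a blocked window always points at a named owner, which is the same single-owner idea applied to the launch sequence.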
Nearly every aspect of our project had a backup plan. We had a second launch day picked out in case we had to completely abort. We had another Trello board detailing our fail-back steps in the event that we needed to abort backwards to the original data center. We had multiple people who understood each step in case someone turned out to be unavailable. We had support vendors on stand-by in Slack, by phone, or physically in the data centers.
At one point, someone asked what our plan would be if Trello went down. We laughed a little bit, because how could our luck be that bad, but we copied the board over to a Google spreadsheet anyway—and guess what? Something happened, and we weren’t able to access Trello at one point. If nothing else, it gave everyone more confidence to move forward, knowing that we had a solid backup plan every step of the way.
Have tabletop exercises.
I distinctly remember knowing that we were ready when I woke up one morning and realized I had been dreaming about the cutover process. Our practice sessions were two-hour affairs with between six and twenty people in a conference room. Some of these covered the full procedure end to end. Some of them focused on critical windows like T-20 through T+60. Some of them involved different groups. But we kept doing them until I felt like any more and we risked a mutiny.
When game day rolled around, we ended up having a small network configuration problem during our critical T+2 minute window, and the drilling paid off. Everyone remained calm, worked the problem, and we were able to adapt the procedure on the fly with a minimum of delay.
Get ready to cringe here: during all of our meetings and the two-hour tabletop sessions, we had a strict no-cellphones, no-laptops rule. It was painful. It was a psychological experiment. A number of people discovered that they possessed an involuntary reflex to check their phone that they were literally unable to suppress without physical barriers. It was cruel, and it was unusual. But there was a point, and the point was that we wanted everyone to know what everyone else was doing. We all have a very human tendency to tune out information that is not directly relevant to us, but in this case we got enormous value from having everyone engaged. People raised extremely good questions about parts of the process that had nothing to do with their own work, and those questions absolutely saved our bacon. And when our plans had to change on game day, it was fine, because everyone knew the plan like the back of their hand.
Ultimately, our goal was 25 minutes of down time and we ended up taking just over an hour. That might sound discouraging, but it was entirely expected. We picked the lowest traffic part of our day, and we expected to run into unknowns. Three hours was roughly our hard-abort window since traffic would begin climbing then, and we came in well under that number.
Outside of that small network config delay, we had literally zero significant issues. It was unnerving. It took me three days to believe that some queue was not backing up and waiting to explode. I checked multiple times to make sure that we really had flipped traffic to the new data center and were not somehow still serving through the old one. We had planned for a week or two of fallout, and we suddenly had to find something else to do. It was as close to 100% successful as any of us are likely to ever experience in a project of this size and complexity.
On a personal note, this was my last major project with Zoosk. I have been a part of the team here for seven years and two months to the day, and I have had the opportunity to participate in a lot of challenges over that time. I have played some part in launching most of our major products and features. There are some big standouts, but as far as teamwork and execution go, this project stands above anything else we have done. I really cannot say enough about the team we have here and what they are able to do. I want to thank them for their trust in me over the years, and for all of the hard work they put in to make this move a success, allowing me to leave on a high note. It has been amazing to be a part of such an awesome team. Thanks, guys.