Tech Disaster Scenario Planning

A few months ago the internet went down at our warehouse. It was in the afternoon, after orders had shipped. Since we’re a business customer, our internet provider was able to send out a technician the following afternoon. No big deal: in the meantime we’d use our backup MiFi.

Except that we hadn’t used it in a long time. The day-pass plan that we had been grandfathered into wasn’t working. Eventually we figured it out, but it took effort from a lot of people to get it up and running. Afterward, we improved our documentation and upgraded to an ongoing plan to make it much simpler next time.

In discussing all of this with my partner after the fact, I expressed how stressful it is for me to get thrust into these situations, since I’m the technology point of contact for everything: our computers, our shipping station, our website, our server. But then I thought about it some more and realized that over the past 5 years or so we’ve been improving our backup options as each mini-disaster occurred. We had them all documented on internal wikis, but not all in one place.

So I spent some time pulling it all into one place. With the team, I brainstormed every technology-related disaster we could think of, and then created a single-page “cheat sheet” with a few bullets outlining what to do and linking out to guides with more details. This takes the stress off: if our credit card processor is down or a printer dies, we have backup plans that multiple people in the company can execute quickly and easily.
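For anyone who prefers to keep this kind of page in a small script rather than a wiki, here is a minimal sketch of how a cheat sheet like ours could be structured. Everything in it is an assumption for illustration: the `Scenario` class, the `render` function, the placeholder steps, and the `wiki.example.com` links are all hypothetical and are not our actual documentation.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """One row of the cheat sheet: a disaster, a few bullets, one guide link."""
    name: str
    steps: list[str]
    guide: str  # hypothetical link to the detailed internal guide


# Illustrative entries only; the steps and URLs are placeholders.
CHEAT_SHEET = [
    Scenario(
        name="Internet is down",
        steps=["Switch on the backup MiFi", "Call the internet provider"],
        guide="https://wiki.example.com/internet-outage",
    ),
    Scenario(
        name="Credit card processor is down",
        steps=["Placeholder: follow the backup steps in the guide"],
        guide="https://wiki.example.com/card-processor-outage",
    ),
    Scenario(
        name="Printer dies",
        steps=["Placeholder: follow the backup steps in the guide"],
        guide="https://wiki.example.com/printer-failure",
    ),
]


def render(scenarios: list[Scenario]) -> str:
    """Turn the scenarios into a single plain-text page."""
    lines = []
    for scenario in scenarios:
        lines.append(scenario.name)
        for step in scenario.steps:
            lines.append(f"  - {step}")
        lines.append(f"  Guide: {scenario.guide}")
        lines.append("")  # blank line between scenarios
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(render(CHEAT_SHEET))
```

The point of the structure is the same as the page itself: each scenario gets only a few bullets and exactly one link, so anyone on the team can act on it without reading the full guide first.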

It’s funny how your mindset can be stuck on one set of assumptions (i.e. when there’s a disaster I need to be involved, therefore I need to always be available) even though it’s been a while since that was a reality.

Having the cheat sheet also helps my thinking when we’re in a crisis. A few days ago our server went down completely and became unreachable out of nowhere. Instead of my mind racing through a hundred different scenarios, I recalled the server section I had just written and pretty quickly determined it was likely either a network error on our hosting provider’s end or a hardware failure on our machine. It turned out to be a faulty power supply, which was replaced immediately, and we were back up and running with less than an hour of total downtime.