Regaining Control of Technology-Related “Emergencies”

About two months ago I reached a breaking point. Since the pandemic began, the frequency of technology-related problems that required my immediate attention had increased from the “occasional issue” to a near-daily occurrence. Being on call, more or less, for the past 12 years (since we migrated to Liquid Web) was only sustainable because I had set up good notification systems and because the issues were rare. I’d get a wave of alerts related to a particular problem, and then have several months with nothing.

This year has been totally different. We’ve experienced more growth than we had planned for, which means we experienced a few years’ worth of volume-related problems condensed down into six months. Add external factors like our payment processor PayPal having more downtime, an uptick in spam/fraud behavior, and some bad luck, and you have the perfect recipe for chaos.

My partners were well aware of what was going on, but they – and everyone else at the company – were also busy fighting their own battles related to the unplanned growth. We talked through potential solutions, and eventually decided on an on-call team to alleviate most of my stress on nights and weekends. The idea was that we’d have a team that could handle a variety of situations with basic debugging tools and, if necessary, start a ticket with Liquid Web or PayPal. They’d bridge the gap until I could step in the following day.

Executing on this idea would be no small undertaking. I’d have to find the team, either internally or by contracting with an external company, set up a bunch of systems and procedures, and then train everyone. And it would all come at a big expense, one that I’m thankful my partners had deemed worth it to keep me sane.

I spent a few full days reviewing the issues over the past few years and then speccing out the entire system. Pretty quickly, I came to a few conclusions that adjusted the course of the project:

  1. Almost all issues resolve themselves within a few hours
  2. Some issues would still require my attention even with a capable on-call team
  3. There are patterns to the issues that I previously wasn’t seeing

I decided to take a step back to see if I could solve the problem without the need for an on-call team, at least for now.

The first step was engaging the team over at Liquid Web for more specifics on what they cover for us and when. With our server cluster we get their “enterprise level” support, which has been fantastic. In the past, they had jumped in when they noticed an issue on the server, above and beyond what’s outlined in their scope of support. After a few emails back and forth, I confirmed what I was looking for. They were monitoring a few things closer than I had realized, meaning that I could scale back my alerts and monitoring efforts a bit.

The second step was to rely a little more on our customer service team. We have coverage during business hours and 5 nights per week (Sunday – Thursday), which we are in the process of expanding as soon as a new team member is fully trained. They are well versed in the types of issues you can run into with PayPal in particular, so I decided to send them a twice-daily digest of any potential issues and leave it in their hands to follow up. These are one-off problems, like a customer who may have paid twice, or a customer who may have paid but the payment wasn’t posted to our system. In theory, these should never happen. We’ll go months without an issue, and then they’ll happen repeatedly for a while.
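As a rough illustration of the kind of check that could feed such a digest, here is a minimal Python sketch. It is not the actual system described here: the `Payment` record, the ten-minute duplicate window, and the function names are all assumptions made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Payment:
    """Hypothetical record of a payment reported by the processor."""
    txn_id: str
    customer_id: str
    amount_cents: int
    paid_at: datetime


def find_possible_duplicates(payments, window=timedelta(minutes=10)):
    """Pair payments from the same customer for the same amount made within
    `window` of each other. The ten-minute default is an assumed value."""
    groups = defaultdict(list)
    for p in payments:
        groups[(p.customer_id, p.amount_cents)].append(p)

    pairs = []
    for group in groups.values():
        group.sort(key=lambda p: p.paid_at)
        # Compare each payment with the next one in time order.
        for earlier, later in zip(group, group[1:]):
            if later.paid_at - earlier.paid_at <= window:
                pairs.append((earlier.txn_id, later.txn_id))
    return pairs


def find_unposted(payments, posted_txn_ids):
    """Payments the processor reports that are missing from our own records."""
    return [p.txn_id for p in payments if p.txn_id not in posted_txn_ids]


def build_digest(payments, posted_txn_ids):
    """Build the plain-text body of a digest email for customer service."""
    lines = []
    for first, second in find_possible_duplicates(payments):
        lines.append(f"Possible double payment: {first} and {second}")
    for txn_id in find_unposted(payments, posted_txn_ids):
        lines.append(f"Paid but not posted: {txn_id}")
    return "\n".join(lines) if lines else "No potential issues found."
```

A scheduled job could call `build_digest` twice a day and email the result. The checks only flag candidates for a person to review. They do not refund or post anything on their own.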

Which leads me to the third and most difficult step: automating a few systems for which I was receiving alerts but still had to intervene manually to resolve. The goal would instead be to have the system take action and for me to receive an email summarizing the action. I’d review that email the following business day. No notification would be sent to my phone unless very specific criteria were met.

A good example, again, is PayPal. If you take a peek at the PayPal Status History you’ll see that they rarely go a few days without an issue. Previously, I’d get an alert on my phone and then monitor all payment attempts on our site until the issue was resolved. As these alerts became far more frequent, so did my frustration. When I looked at the historical data, though, I realized that almost all of these issues either didn’t affect us or affected only a few customers, whose problems our customer service team could resolve. In our history, only one issue absolutely required my intervention. Two others were very serious, but appeared to resolve within a few hours regardless of whether we jumped in and started a ticket. With that information, it was an easy decision to disable the phone notifications for PayPal’s emails and create my own custom email notification that alerts my phone only if a very specific set of criteria is met. Instead of a once-weekly interruption, I might get a once-yearly interruption.
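To make the idea concrete, here is a minimal Python sketch of that kind of routing rule. The criteria in it are purely hypothetical (the real ones aren’t spelled out here): the `PaymentWindow` fields and all three thresholds are assumed values chosen only to show the shape of the decision.

```python
from dataclasses import dataclass


@dataclass
class PaymentWindow:
    """Hypothetical summary of recent payment attempts on the site."""
    attempts: int
    failures: int
    minutes_failing: int  # how long failures have been occurring


# Assumed thresholds, for illustration only.
MIN_ATTEMPTS = 20             # ignore windows with too little traffic to judge
FAILURE_RATE_THRESHOLD = 0.5  # at least half of attempts failing
MIN_MINUTES_FAILING = 120     # most issues clear up on their own within hours


def route_alert(window: PaymentWindow) -> str:
    """Return "phone", "email", or "none" for a payment window."""
    if window.failures == 0:
        return "none"

    failure_rate = window.failures / window.attempts if window.attempts else 0.0
    severe = (
        window.attempts >= MIN_ATTEMPTS
        and failure_rate >= FAILURE_RATE_THRESHOLD
        and window.minutes_failing >= MIN_MINUTES_FAILING
    )
    # Everything short of a sustained, widespread failure waits for the
    # next-business-day email summary.
    return "phone" if severe else "email"
```

With these assumed numbers, `route_alert(PaymentWindow(100, 3, 15))` returns `"email"`, while a window with 80 failures out of 100 attempts that has lasted three hours returns `"phone"`. Requiring every condition to hold at once is what keeps the phone quiet for short blips and low-volume noise.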

The last part in all of this is taking a step back and realizing that these aren’t true emergencies. As I noted above, most will resolve themselves in a few hours without my intervention. Liquid Web will restart the server. PayPal will fix whatever is wrong. At the end of the day, having my phone go off at 3 AM just isn’t worth it anymore. I’ll figure it out in the morning. We’ll take care of any affected customers. Our business will survive.

It’s been a few weeks now since all of this was put in place. Not to sound overly dramatic, but it has been life changing. I don’t feel that constant low-level stress that something might go wrong at any time. I’m able to leave the house for a few hours without having to arrange a contingency plan (which is either requesting that my business partner Mike cover for me, or taking my Surface Pro with me and hoping the hotspot on my phone works well enough). We very well may still need an on-call team at some point down the road, but this buys us some time on that decision and has brought much more immediate relief.