FAA NOTAM outage should scare all of us into (finally) testing our DRPs

The U.S. air travel system experienced its worst meltdown in years last week after a database in the Notice to Air Missions (NOTAM) service – a system designed to advise pilots of conditions at their destination airports before they are allowed to depart – failed. While initial fears that this was a cyberattack were soon quashed, the actual root cause could prove to be just as disturbing – because it suggests structural weaknesses in how the information technology needs of the world’s busiest airspace are being met.

ONE BAD FILE

The Federal Aviation Administration has confirmed the event was triggered by an employee error that caused a corrupt file in the NOTAM system’s database. Worse, the backup system that would have been used to restore the database also contained the same corrupt file.

This should make IT practitioners cringe – I know I did. The presence of a corrupt backup strongly suggests the FAA's disaster recovery plan (DRP) had not been tested in the lead-up to this failure, because the only way IT leaders would have discovered the bad backup is by walking through a test-recovery scenario.
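To make that concrete, here's a minimal sketch of what an automated restore check might look like. It assumes a single-file SQLite database purely for illustration (the NOTAM system is, of course, nothing like this), and it verifies two distinct things: that the backup file is intact, and that it can actually be restored and queried.

```python
import hashlib
import shutil
import sqlite3
import tempfile
from pathlib import Path

def backup_is_restorable(backup_path: Path, expected_sha256: str) -> bool:
    """Return True only if the backup matches its recorded checksum
    AND can actually be restored and queried."""
    # Step 1: integrity of the copy. A backup corrupted in transit
    # or at rest fails here.
    digest = hashlib.sha256(backup_path.read_bytes()).hexdigest()
    if digest != expected_sha256:
        return False

    # Step 2: a real restore into a scratch location. A checksum can't
    # tell you the contents were already bad when the backup was
    # written, so we have to open it and query it.
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restore_test.db"
        shutil.copy(backup_path, restored)
        try:
            conn = sqlite3.connect(str(restored))
            ok = conn.execute("PRAGMA integrity_check;").fetchone()[0]
            tables = conn.execute(
                "SELECT count(*) FROM sqlite_master WHERE type='table'"
            ).fetchone()[0]
            conn.close()
            # An empty restore is as useless as a failed one.
            return ok == "ok" and tables > 0
        except sqlite3.DatabaseError:
            return False
```

The second step is the one that matters here: a checksum alone can't tell you the data was already bad when the backup was written, which is essentially the trap the FAA fell into. Only a real restore proves a backup is usable.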

In this case, the undiscovered corrupt backup guaranteed a much longer recovery in the event of an outage – which is precisely what played out.

In the weeks and months ahead, FAA officials will undoubtedly be held accountable as the U.S. Department of Transportation moves ahead with its investigation. And rightly so, as this colossal debacle sidelined thousands of flights in the U.S. and slowed service to and from dozens of countries worldwide.

But we don’t have to be working in the aviation industry to appreciate why understanding this event is so critical to any organization that uses technology.

THE UNIVERSAL TRUTH

We all (should) have DRPs – or business continuity plans (BCPs) – in one form or another. Yet in a report from iLand and Zerto, only 54% of survey respondents said they had a full, company-wide DRP in place.

That’s a frightening figure. In 2023, we’re long past the point where technology is a luxury: it is a critical pillar of every organization’s ability to deliver, compete, and survive. Organizations that choose to fly without a DRP are playing with fire.

With this in mind, it’s safe to say we can all be doing more to incorporate the potential for failure into our technology and business planning processes, and to ensure the DRP is a core piece of our IT strategy and culture.

A well-formed DRP identifies what can fail, how it can fail, and in what context. But documenting failure modes, while crucial, isn’t enough on its own: the plan can’t be a document that gets tossed onto a shelf and dusted off every once in a while, if at all.

Any reasonable DRP must incorporate recovery, as well – steps the organization must take to ensure continuity in the event of a natural, accidental, or human-caused disaster. The recovery component should identify specific accountabilities, communication flows, and checkpoints. Ideally, it will include a detailed schedule for real-world recovery testing, with recommended intervals to ensure key leaders, staff, contractors and other stakeholders are conditioned to execute the plan if and when an actual disaster occurs.
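As a purely hypothetical sketch of what that testing schedule can look like in practice, here's one way a team might encode each drill as data (owner, escalation path, required interval) so that overdue drills surface automatically rather than quietly slipping. Every system and team name below is invented.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class RecoveryDrill:
    system: str          # what gets restored in this drill
    owner: str           # the single accountable person or team
    escalation: str      # who is notified if the drill fails or lapses
    interval: timedelta  # maximum allowed gap between drills
    last_run: date       # when the drill last actually happened

    def is_overdue(self, today: date) -> bool:
        return today - self.last_run > self.interval

# Illustrative entries only; all names are invented.
schedule = [
    RecoveryDrill("flight-ops-db", "dba-team", "it-director",
                  timedelta(days=90), date(2022, 6, 1)),
    RecoveryDrill("messaging-gateway", "platform-team", "it-director",
                  timedelta(days=180), date(2022, 11, 15)),
]

for drill in schedule:
    if drill.is_overdue(date.today()):
        print(f"OVERDUE: {drill.system} (owner: {drill.owner}, "
              f"escalate to: {drill.escalation})")
```

Encoding the schedule this way makes accountability explicit and the check mechanical: if a drill hasn't run within its interval, someone with a name gets flagged.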

The FAA learned the hard way that recovery testing, like a good insurance policy, can save heartache later on. Any organization runs an elevated risk of losing revenue, alienating customers, and damaging the brand if it fails to have the appropriate disaster response resources and capabilities in place. 

THE BEST ADVICE I EVER RECEIVED

Earlier in my career, I was an IT project manager and application development support lead. My mentor at the time said we had two choices when considering how we wanted to support the business areas that relied on us: We could either plan to fail, or we could fail to plan.

The FAA now finds itself on the wrong side of my mentor’s guidance. It needs a better plan, and it needs to test that plan to ensure it’s ready for the inevitable. That means looking at the 30-year-old system at the center of this debacle and fast-tracking the update to current-century standards – not six years from now, as currently planned, but now.

This episode has implications for the rest of us, too, in any organization and in any industry, because we need to test our own plans as well. Eventually, we’ll all find ourselves needing to recover from an outage or an attack, and no one ever wants to find out the hard way that their backups can’t be restored.

If you’re not sure where your own DRP stands, we’re always here to help you get started.