"Check, recheck, are we safe to go ahead?"
It was our job to act with due diligence; plan well, test well, implement with caution. We had to know when to go ahead and when to back out.
That's what a telecom engineer is trained to do as they work on the "big pipes" that route internet data around the nation, and around the world.
I can't tell you the number of nights that I've sat in dim, cold rooms, in obscure locations, laptop plugged in to a control module of the internet superhighway. No matter how cold, or tired, or wired on cheap coffee, we were drilled with the responsibility we commanded.
Cut a finger, and you bleed, but stab an artery and you haemorrhage. It's like that working the big pipes.
For front-line engineers there is immense personal responsibility, backed up by layers of team responsibility, with every participant contributing a vital part to the stability and resilience of critical infrastructure. There are few, if any, that "know it all" in telecoms.
Generalists rely on experts. Experts achieve little without teams of knowledgeable doers at the front line of operations. It only works because of planning, co-ordination, rigorous process, incredibly smart people doing what they do best, and very practical people bringing it all together with precision.
Last week's Optus outage was a true poor-bugger moment for those of us who have been there. We've all feared it, especially the idea of being the engineer that hit go when it all went wrong. The cause of the Optus network failure is blurry, with alleged involvement of a third-party peering network, and/or a flawed network upgrade cited. Regardless, networks have redundancy, and failure modes, that minimise the chances of, and the scale of, outages. But it's never perfect.
Engineers pursue the holy grail of maintaining "five nines" reliability, which translates to just 5.26 minutes of annual downtime, an impressively high bar to clear. In practice, many telcos offer a 99.9 per cent uptime undertaking, which equates to roughly 8.76 hours of permissible annual downtime.
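The downtime figures quoted above follow directly from the uptime percentages. A quick sketch of the arithmetic (assuming the 365-day year that matches the article's figures):

```python
# Annual downtime budget implied by an uptime percentage,
# assuming a 365-day year (which matches the figures quoted above).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes(uptime_percent: float) -> float:
    """Minutes of permissible downtime per year for a given uptime %."""
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR

print(round(downtime_minutes(99.999), 2))     # "five nines": 5.26 minutes
print(round(downtime_minutes(99.9) / 60, 2))  # 99.9%: 8.76 hours
```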
Optus blew that bar in a day. The duration of that outage was undoubtedly shocking with its implications for businesses and citizens (especially vulnerable citizens), but we should take stock of other issues at play.
While it's natural to expect high reliability from telcos, we must acknowledge that achieving 100 per cent uptime is practically unattainable. Service interruptions can result from a variety of factors, including human and technical errors, natural disasters, and, increasingly, cyber breaches. And with increasingly advanced nefarious actors targeting telecom operators, disruption is likely to become more frequent and more impactful.
Beyond individuals, teams, and leadership all working to best endeavours, there's a national level view which deserves more attention. It's not just telcos under increasing pressure from disruptive forces. It's all of our critical infrastructures. This is acknowledged by government and is addressed in legislation such as the Security of Critical Infrastructure (SOCI) Act.
Which, whilst worthy and needed, sometimes feels like a slow burn in the right direction (perhaps an inappropriate metaphor for infrastructure at risk from natural disaster).
Are there lessons we can take from history? Those of us who worked in telecoms at the turn of this century would tell you that there are.
Twenty-five years ago, when I joined Nortel as a graduate engineer, the world was gripped in anticipation of the rollover from 1999 to the year 2000, and the fear it would herald catastrophic systems failures. The Y2K bug had a simple cause: year dates were represented by their last two digits, so the rollover from 99 back to 00 was going to cause havoc. Except it didn't. Often referred to as the Y2K bug hoax, this was no hoax; it was a success story.
The Y2K bug threat was averted because the risk was proactively recognised, and countermeasures were prioritised, engaged with by leaders, and championed by governments. There was widespread collaboration and cooperation, with ample planning and resourcing. And all this was wrapped in effective communication leading to global awareness. For those of us who participated, it has become one of the greatest examples of global risk management of our time.
Optus is not an island. By several definitions, its network is part of a global continuum. And Optus is not alone in its responsibility to citizens and businesses. Australia has many systems whose disruption, as we've seen with the DP World docks disruption, will cause immediate and widespread detrimental impact. At some level all these systems intertwine and become co-dependent. Telcos need electricity, electricity needs fuel, fuel needs transportation, transportation needs freely operating transit routes; and so on.
With an increasing threat landscape targeting increasingly complex and intertwined systems of infrastructure, we face cumulative and cascading impacts from disruption. To mitigate these potential impacts, Australia needs to back itself, building integrated risk management approaches which absorb the best of operations (people, processes, and systems) from across all critical infrastructures, and treat them with a holistic approach.
To analyse, optimise and contingency plan across this level of complexity will take our best brains applying our most advanced analytical technologies, such as artificial intelligence. It's not futuristic, it's here and now; Australia has these capabilities, in the domain of the CSIRO, our universities, and in private sector sovereign capabilities like Sentient Hubs.
Investing in AI technology capable of modelling and predicting critical infrastructure outages or impacts, must be a core priority of this government's resilience agenda. Whilst Optus and DP World will have questions to answer, they will not be the last businesses hit with downtime, but they should be another catalyst for an advanced, analytically informed, nationally integrated approach to critical infrastructure risk management. The next time is coming, let's be ready as a nation.
- Alison Howe is the chief executive officer of the National Institute of Strategic Resilience.