
Restore Order from Chaos After Devastating Software Crashes: Our Methods

November 11, 2024

The Moment of Crisis: Chaos Consumes a Shipping Giant

It was a warm afternoon in Copenhagen when Maersk, the world’s largest shipping conglomerate, began to lose control. Confusion swept through the company’s gleaming headquarters overlooking the harbor as employees raced to unplug their computers, yelling for colleagues to do the same. Within hours, Maersk’s global network was crippled, its 76 ports and 800 vessels rendered helpless.

The culprit? A vicious piece of malware known as NotPetya, unleashed by Russian military hackers through a compromised Ukrainian accounting program. In a matter of hours, the worm had jumped borders and spread indiscriminately, crippling multinationals worldwide and causing more than $10 billion in damage, which makes it one of the most destructive cyberattacks in history.

For Maersk, the fallout was catastrophic. With its systems wiped, the shipping giant found itself dead in the water, unable to receive or process cargo, manage its fleet, or even allow employees to access email. The company’s IT staff scrambled to rebuild its entire global network from the ground up, working around the clock for weeks to restore order from the chaos.

This was no ordinary ransomware attack. NotPetya’s goal was pure destruction, not financial gain. And Maersk was merely collateral damage – an unfortunate casualty in a twisted game of geopolitical one-upmanship between Russia and Ukraine.

The NotPetya outbreak did more than just paralyze Maersk; it exposed the fragility of our increasingly interconnected world. In an era where digital systems underpin nearly every aspect of our lives, a single piece of malicious code can ripple outward, crippling critical infrastructure and wreaking havoc on a global scale.

Responding to the Crisis: Maersk’s Herculean Recovery Efforts

As NotPetya swept across Maersk’s global network, the company activated an emergency response plan, assembling a crisis team in its IT headquarters in Maidenhead, England. This became ground zero for the shipping giant’s painstaking recovery process.

With every Maersk computer rendered unusable, the team faced an enormous challenge: rebuilding the company’s entire digital infrastructure from scratch. Their first priority was to restore the domain controllers, the servers that map the company’s sprawling network and decide which users may access which systems. That task was made far harder by the fact that only one remote domain controller, in Ghana, had survived the attack unscathed.

Securing that lone backup was a logistical feat in itself, requiring a complex handoff between Maersk employees in Nigeria and London. But once that was accomplished, the real work began. Over the next several weeks, the Maidenhead team worked tirelessly, using newly purchased laptops and Wi-Fi hotspots to reinstall Maersk’s core systems and applications.

The scale of the effort was staggering. Thousands of people, both Maersk staff and contractors from the consulting firm Deloitte, toiled around the clock to restore order. Every single computer had to be wiped and reinstalled, with IT staff lining up rows of devices in the company cafeteria and meticulously configuring them one by one.

“It was a heroic effort,” Maersk’s chairman Jim Hagemann Snabe later told an audience at the World Economic Forum. “In just 10 days, we rebuilt our entire network of 4,000 servers and 45,000 PCs.”

But the company’s recovery went far beyond just technical fixes. Maersk also had to find creative ways to keep its global operations afloat in the meantime, resorting to manual processes and makeshift solutions to continue serving customers. Freight forwarders like Pablo Fernández were left scrambling to reroute cargo and find alternate transportation, while Maersk’s port terminals worldwide struggled to function without access to crucial inventory data.

“Maersk was like a black hole,” Fernández lamented. “It was just a clusterfuck.”

Yet, through sheer determination and resourcefulness, the company slowly clawed its way back. Within two weeks, Maersk had restored basic functionality, allowing it to begin accepting new bookings and gradually bring its ports back online. The full recovery, however, would take months of painstaking work.

Learning from Disaster: Validating Disaster Recovery Plans with Chaos Engineering

The NotPetya attack exposed critical vulnerabilities in Maersk’s disaster recovery planning – vulnerabilities that the company was determined to address. In the aftermath, Maersk leadership recognized the need to thoroughly test and validate their disaster recovery capabilities, rather than relying on untried plans.

This is where Chaos Engineering comes into play. By deliberately injecting failures and disruptions into its systems, Maersk can now proactively identify weaknesses and validate its ability to withstand and recover from catastrophic events.

Chaos Engineering is the practice of purposefully introducing controlled failures into a system to test its resilience. Chaos experiments can simulate a wide range of disruptive scenarios, from network outages and server failures to full-scale regional disasters. By running these experiments in a safe, contained environment, organizations can uncover hidden dependencies, test their disaster recovery procedures, and fine-tune their response plans before a real crisis strikes.
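The shape of such an experiment can be sketched in a few lines of Python. This is only an illustration, not Maersk’s tooling: the function names (run_experiment, check_steady_state, inject_fault, rollback) and the toy two-server system are assumptions made up for this example. The point it shows is the usual order of operations: confirm the system is healthy, inject one controlled fault, observe, and always undo the fault afterwards.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_during: bool  # did the system stay healthy while the fault was active?
    steady_after: bool   # was the system healthy again after rollback?


def run_experiment(
    check_steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one chaos experiment (hypothetical helper for illustration)."""
    # Never inject a fault into a system that is already unhealthy.
    if not check_steady_state():
        raise RuntimeError("system is not in steady state; aborting experiment")

    steady_during = False
    try:
        inject_fault()
        steady_during = check_steady_state()
    finally:
        # The fault is always removed, even if the check above raises.
        rollback()
    return ExperimentResult(steady_during, check_steady_state())


# Toy system: the service counts as healthy while at least one server is up.
servers = {"primary": True, "standby": True}

result = run_experiment(
    check_steady_state=lambda: any(servers.values()),
    inject_fault=lambda: servers.update(primary=False),
    rollback=lambda: servers.update(primary=True),
)
print(result)
```

In this toy run the standby keeps the service healthy while the primary is down, so the hypothesis holds. Remove the standby from the dictionary and the same experiment reports a failure during the fault, which is exactly the kind of hidden single point of failure these experiments are meant to surface.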

Maersk has embraced this approach, using Chaos Engineering to validate its disaster recovery capabilities across its entire global IT infrastructure. The company now routinely runs “FireDrills” – planned chaos experiments that test its ability to withstand and recover from major disruptions.

“Chaos Engineering has saved us hundreds of engineering hours and reduced our per-dependency test times from hours to just minutes,” says a Maersk IT executive. “It’s an essential part of our efforts to build a more resilient and reliable global operation.”

By proactively stress-testing its systems, Maersk can identify vulnerabilities, optimize its disaster recovery procedures, and ensure that it is prepared to handle even the most severe disruptions. This approach allows the company to move beyond static disaster recovery plans and towards a more dynamic, adaptive model of resilience.

Chaos Engineering in Action: Simulating a Regional Outage

To demonstrate the power of Chaos Engineering, let’s walk through a hypothetical FireDrill scenario that Maersk might run to validate its disaster recovery capabilities.

Maersk operates a global network of 76 shipping terminals, any one of which could be crippled by a regional disaster. To test its readiness, the company creates a Chaos Engineering experiment that simulates the complete shutdown of a major terminal, such as the one in Elizabeth, New Jersey.

The experiment begins by using a “blackhole” attack to block all incoming and outgoing network traffic to three servers in the Elizabeth terminal. This effectively takes the terminal offline, mimicking the devastating impact of a natural disaster or cyberattack.
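For readers who want to see what a blackhole amounts to at the network level, here is a minimal sketch. It assumes the fault is injected on a Linux gateway that routes the terminal’s traffic and that iptables is the firewall in use; commercial chaos tools implement this differently, and the helper name blackhole_rules and the 192.0.2.x addresses (a range reserved for documentation) are hypothetical. The function only builds the commands as text and never executes them, so it doubles as a dry run that can be reviewed before a FireDrill.

```python
import ipaddress


def blackhole_rules(targets: list[str]) -> tuple[list[str], list[str]]:
    """Build (apply, rollback) iptables commands that drop all forwarded
    traffic to and from each target address. Nothing is executed here."""
    apply_cmds: list[str] = []
    rollback_cmds: list[str] = []
    for raw in targets:
        # Validate the address; raises ValueError on malformed input.
        ip = str(ipaddress.ip_address(raw))
        # -s matches traffic coming from the server, -d traffic going to it.
        for flag in ("-s", "-d"):
            rule = f"FORWARD {flag} {ip} -j DROP"
            apply_cmds.append(f"iptables -I {rule}")     # insert the DROP rule
            rollback_cmds.append(f"iptables -D {rule}")  # delete the same rule
    return apply_cmds, rollback_cmds


# Three hypothetical servers standing in for the terminal's systems.
apply_cmds, rollback_cmds = blackhole_rules(
    ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
)
for cmd in apply_cmds:
    print(cmd)
```

Generating the rollback commands at the same time as the apply commands is a deliberate choice: the experiment should never start without a known way to end it.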

As the chaos unfolds, Maersk’s IT team springs into action, following the steps outlined in their disaster recovery playbook. They start by notifying key stakeholders – including executives and customers – about the simulated outage, ensuring that everyone is aware of the situation and its potential impact.

Next, the team begins the process of restoring service. They locate recent backups of the terminal’s servers and work to bring those systems back online. Crucially, they also need to recover the domain controller data that maps the overall network – a step that proved so critical in the real-life NotPetya recovery.

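Throughout the process, the team measures its performance against two targets set in advance. The recovery time objective (RTO) is the longest acceptable time between the start of the outage and the return of service. The recovery point objective (RPO) is the largest acceptable window of lost data, measured back from the outage to the most recent usable backup. The goal is to restore full functionality within the RTO while keeping data loss within the RPO.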
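The arithmetic behind these two measurements is simple enough to sketch. The timestamps and targets below are invented for illustration, and score_recovery is a hypothetical helper, not part of any Maersk playbook. It compares the actual recovery time and the actual data-loss window against the RTO and RPO.

```python
from datetime import datetime, timedelta


def score_recovery(
    last_backup: datetime,
    outage_start: datetime,
    service_restored: datetime,
    rto: timedelta,
    rpo: timedelta,
) -> dict:
    """Compare one drill's actual results against its RTO and RPO targets."""
    # Time the service was unavailable.
    recovery_time = service_restored - outage_start
    # Data written after the last backup and before the outage is lost.
    data_loss_window = outage_start - last_backup
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "met_rto": recovery_time <= rto,
        "met_rpo": data_loss_window <= rpo,
    }


# Invented drill: backup at 08:00, outage at 09:30, service back at 12:15.
report = score_recovery(
    last_backup=datetime(2024, 11, 11, 8, 0),
    outage_start=datetime(2024, 11, 11, 9, 30),
    service_restored=datetime(2024, 11, 11, 12, 15),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
)
for name, value in report.items():
    print(f"{name}: {value}")
```

In this invented drill the team recovers in 2 hours 45 minutes, inside a 4-hour RTO, but the last backup is 90 minutes old against a 1-hour RPO. That is a typical finding: the restore procedure works, but backups need to run more often.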

Once the experiment is complete, the Maersk team gathers for a comprehensive post-mortem. They analyze their response, identify areas for improvement, and update their disaster recovery playbook accordingly. This ensures that the next time a real crisis strikes, they’ll be better prepared to restore order from chaos.

Building Resilience in an Unpredictable World

The NotPetya attack served as a wake-up call for Maersk and countless other organizations worldwide. It exposed the fragility of our interconnected, technology-dependent world and the devastating consequences that can result from even a single point of failure.

But rather than simply accepting this new reality, Maersk has chosen to proactively fortify its defenses. By embracing Chaos Engineering and continuously validating its disaster recovery capabilities, the company is taking a more dynamic, adaptive approach to resilience.

In an age of increasing uncertainty, this kind of forward-thinking mindset is essential. As the pace of technological change accelerates and new threats emerge, organizations must be prepared to weather any storm – whether it’s a global pandemic, a devastating cyberattack, or an unforeseeable black swan event.

By practicing Chaos Engineering, Maersk and other organizations can identify vulnerabilities, optimize their recovery procedures, and ensure that they’re ready to restore order from chaos, no matter what the future may hold.