It’s being called one of the worst IT outages in history.
Millions of people around the world were affected. Banks, hospitals and airlines were among the worst-hit businesses, with some still struggling to fully restore their systems four days on.
In the aviation industry alone, over 1,400 flights were cancelled.
It was a different story for IT Naturally’s customer – one of the world’s leading holiday airlines. They were able to keep flights running smoothly for their passengers.
Here’s what happened:
The Incident
We now know the approximate sequence of events:
- 04:09 UTC: CrowdStrike pushed out a content update that inadvertently caused Windows computers to crash and display a blue screen.
- 05:27 UTC: CrowdStrike reverted the change, updating and correcting the faulty file.
The Challenge
Although CrowdStrike released a corrected update quickly, it could not repair the damage already done: once a Windows computer has blue-screened, it must be rebooted. Some computers came back up after a straightforward reboot, but many did not and required physical intervention. The challenge was how to recover those machines quickly.
Working with a Proactive Partner
Initial Response
Within 10 minutes of receiving the Priority 1 alert, IT Naturally initiated a Major Incident bridge call to assess the situation. Initial reports were unclear, with rumours suggesting a widespread Microsoft outage. IT Naturally quickly determined that the incident affected their airline customer and that CrowdStrike’s faulty update was the cause.
Action Plan
- IT Naturally devised and documented a clear process for rebooting the machines, then coordinated with field personnel to implement it swiftly.
- Knowledge of the airline’s IT infrastructure allowed IT Naturally to prioritise critical systems, ensuring no disruption to flight operations.
- The 24/7 Service Desk was briefed and able to provide remote support and instructions to field personnel who needed to physically restart affected devices.
- IT Naturally’s partnership with CrowdStrike enabled quick access to crucial information and updates.
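The prioritisation step above can be illustrated with a minimal sketch. It is not IT Naturally's actual tooling: the `Host` record, the `tier` field and the host names are hypothetical, and the rule that any machine online during the update window counts as exposed is a simplifying assumption. The window times come from the incident summary above; the date (19 July 2024) is the date of the outage.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Window during which the faulty content update was being distributed
# (04:09 to 05:27 UTC, per the incident summary).
WINDOW_START = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
WINDOW_END = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)


@dataclass
class Host:
    name: str
    tier: int  # hypothetical ranking: 1 = flight-critical, higher = less critical
    online_from: datetime
    online_until: datetime


def was_exposed(host: Host) -> bool:
    """Assume a host is affected if it was online at any point in the window."""
    return host.online_from < WINDOW_END and host.online_until > WINDOW_START


def reboot_queue(hosts: list[Host]) -> list[Host]:
    """Return exposed hosts, most critical first (ties broken by name)."""
    exposed = [h for h in hosts if was_exposed(h)]
    return sorted(exposed, key=lambda h: (h.tier, h.name))
```

Sorting by tier first reflects the idea that always-on, flight-critical systems are recovered before office devices that were powered off overnight and never received the faulty file.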
IT Naturally’s Timeline
- 6:30 AM: IT Naturally received a Priority 1 alert indicating a complete business-down situation for their airline customer, which could have had a high financial impact if not resolved quickly.
- 6:40 AM: A bridge call was set up with the customer.
- 7:00 AM: The team identified that systems switched on overnight (around 4 AM) were affected, particularly servers and the airline’s Operations Control Center (OCC), all of which operate 24/7.
- 11:29 AM: The issue was resolved and systems were back up and running, with no flights impacted. Some work on non-critical business systems continued afterwards.
Key Factors for Success
- A quick response, early involvement and a documented process enabled IT Naturally to address and resolve the issue.
- The team’s dedication to resolving the issue swiftly and effectively ensured the airline’s operations continued smoothly.
- Emergency procedures, including authorised emergency access protocols, were in place to handle such critical situations.
- Ongoing 24/7 support and monitoring helped mitigate further risks and ensured all systems were fully operational by lunchtime.
The Outcome
The global IT outage caused by CrowdStrike’s faulty update, believed to be the result of human error, highlighted the vulnerability of interconnected systems. However, IT Naturally’s quick and efficient response meant that their airline customer experienced no disruption to flight operations, even though other businesses were affected for days.
IT Naturally believes in being a proactive partner with the people and processes in place to coordinate and communicate effectively during a crisis.
Our proactive approach and commitment to service excellence were key to successfully navigating this unprecedented global outage.