Institution of Engineering and Technology experts respond to worldwide IT outage, 19 July 2024


Dr Junade Ali, Cyber Security Expert and IET Fellow, said: “The recent software update from CrowdStrike has resulted in a significant global outage, affecting computers running the Microsoft Windows operating system that use the CrowdStrike Falcon security product. This issue has led to widespread disruptions, including air travel delays, interruptions in television broadcasting, and halted supermarket transactions. The NHS, which relies heavily on Windows computers, is also experiencing outages in critical systems used by GP practices. The root of the problem seems to be a defective system file included in the update.

“The scale of this outage is unprecedented, and will no doubt go down in history, potentially surpassing the 2017 WannaCry attacks. Unlike some previous outages that targeted internet infrastructure, this situation directly impacts end-user computers and could require manual intervention to resolve, posing a significant challenge for IT teams globally.

“CrowdStrike are investigating the issue as a P0 incident, indicating the highest level of urgency in addressing the problem. The long-term implications of this outage are yet to be fully understood, but they could be substantial, affecting timely uptake of critical security updates in the future. This incident will provide key learning for the software engineering profession to consider the safety and security implications of software updates.”

Beth Clarke, Digital expert at the Institution of Engineering and Technology, Committee Member for the BCS Special Interest Group in Software Testing, and PhD researcher, said: “It’s too early to know what factors led to this defect making it into the update, but the cause is probably more complex than just one single point of failure, and the teams at CrowdStrike will likely be investigating this in depth.

“Incidents like this highlight the importance of thorough software testing, and the critical role that software testers still play in the technology sector. In the current market, many companies are reducing the size of, or altogether removing, their software testing and quality assurance teams to reduce costs. In light of this morning’s outage and the global impact it has had, I would hope companies and organisations reflect on their testing strategies and see the immense value in having dedicated testing teams.”

Professor Ian Corden, PhD CEng, Fellow at the Institution of Engineering and Technology, said: “The major IT outages that are occurring around the world today highlight the ever-increasing dependence of national and regional economies, defence and national security, and private individuals on digital services, and hence their security and resilience. With software-based systems becoming ever-more prevalent in our daily lives, the importance of reliably-engineered software and IT systems is now paramount, especially where critical national infrastructure (CNI) is impacted.

“The cause of the outage appears to be a problematic update to CrowdStrike’s Falcon, an endpoint detection and response platform. This update has affected systems, especially those running Microsoft software, leading to widespread service interruptions.

“CrowdStrike Falcon is an endpoint detection and response (EDR) platform designed to protect computers and other devices from cyber threats. It monitors systems for intrusions and responds by blocking malicious activities. Falcon’s software is highly privileged, allowing it to influence computer behaviour to prevent security breaches. It is widely used across many industries to enhance cybersecurity measures.

“Several large-scale IT outages similar to the recent CrowdStrike Falcon incident have occurred in the past. Notable examples include the British Airways IT failure in 2017, which grounded flights globally due to a power supply shutdown; Amazon Web Services’ S3 outage the same year, impacting numerous major websites; Fastly’s 2021 global outage caused by a software bug; and AT&T’s 2024 nationwide outage resulting from a problematic software update.

“To mitigate IT outages, companies should implement backup systems, deploy redundant infrastructure, conduct regular disaster recovery testing, and develop stringent software update protocols. They should also use advanced monitoring tools, train IT staff in outage response, and work closely with third-party vendors to ensure robust security and mitigation strategies.”

Ian Golding, Digital expert at the Institution of Engineering and Technology, said: “It’s too early to know precisely what has happened, although an update to critical cyber security elements in the ecosystem of various providers and systems appears to have malfunctioned, causing mass failure of the computers relied upon for delivering services across these organisations.

“Despite organisations using well known and carefully chosen global IT providers, they all must work seamlessly together. This interoperability is usually extremely well managed and tested with great skill and diligence, but it is complex, and as we see, it can occasionally fail – today the failure and its impact appear to be widespread, affecting all sectors from transportation to healthcare. Organisations will be looking at their IT architecture, their dependencies and assets and the associated key risks, including the risks that they expect their trusted providers to manage actively on their behalf.

“Whatever the weak links in the chain that are discovered from today’s outage, the organisations affected will become better prepared with their Plan B for a scenario like this in the future – understanding risks and putting in place resilience and recovery plans are key for these operational platforms affecting so many people today.”

David Smith, Head of Technology Strategy at the Institution of Engineering and Technology, said: “When cloud services go wrong, a large number of customers are affected by the issue. These types of services are updated constantly – a feature of the modern world and how we use technology at a global scale. The likelihood is that an error has occurred in the process of making changes to these services.

“We have seen incidents like this in the past – again where a change to a live service went wrong and had to be rolled back in order to fix it. Organisations should learn from every incident like this, no matter the size, in order to become more resilient to events that affect so many customers around the world.

“All organisations should have a business continuity plan when an event like this occurs, so that they can take the steps needed for their organisation – tackling what their workaround and mitigation approaches might need to be.

“A situation like this also illustrates the effects that all organisations face when building and using technology that relies on these cloud technology services that exist at scale. When a key component of an organisation’s supply chain ceases to work, it can stop that organisation from functioning. The trade-off of course is that this technology at scale allows organisations of all sizes access to capabilities that were once only available to the largest multinationals. In commoditising access to high technology, we must understand and be able to live with the trade-offs when those services suffer interruptions – unfortunately this will happen again.”

