Mid-July 2024 brought us two significant software failures. The lesser of the two saw substantial portions of Microsoft’s cloud services become unavailable, while the more serious one saw a faulty update to CrowdStrike’s security platform take down over 8 million devices. Together, the two outages were enough to shut down banks, ground flights at most major airlines, and force hospitals to turn away all but the most urgent emergencies.
Yet by Microsoft’s own estimate, CrowdStrike took down less than 1% of all Windows machines, roughly 8.5 million devices, and Microsoft claims that the Azure outage affected only a small number of customers. Still, each outage likely caused economic damage in the hundreds of millions, if not billions. Thus, the postmortem for everyone, not just the affected organizations, must look at how we got into this situation and how we can move forward.
Cloud Consolidation and Monopolies
The list of Microsoft acquisitions on Wikipedia is long and fascinating. It contains well-known names in computing, such as GitHub. Yet, for every big name, you find ten small technological challengers. These smaller companies, such as Kubernetes specialist Kinvolk, helped Microsoft catch up to Amazon’s AWS. Each acquisition also eliminated a player from the game who, in time, might have challenged the technological leaders.
In contrast, CrowdStrike is more targeted with its acquisitions and focuses on maintaining its technological edge and eliminating competitors. Still, both companies have used acquisitions to build and sustain leadership positions and to corner customers into their ecosystems.
Data Portability and Open Standards
With different operating systems and security services available, why don’t companies simply switch? It comes down to the difficulty of migrating data between operating systems and cloud platforms. While many believe we own our data, the truth is more complicated. We think of data as similar to a book: once we create or buy it, it’s ours to do with as we please.
The reality is startlingly different. For example, many photographers vowed to switch products when Adobe moved its Creative Suite, including Photoshop, to a subscription-based model. In the end, the need to keep working on existing projects and to open their own photos brought many creatives back to pay the monthly fee. They discovered that they don’t truly own their data but are beholden to the decisions of the software manufacturer.
Some governments and regulators, such as the EU, have started mandating open standards. Yet, until consumers and businesses demand portability and control of their data, questionable implementations and vendor lock-in will continue. Until then, we will all remain beholden to the few big corporations whose deep pockets allowed them to corner the market.
Different Cloud Risk Management
Unfortunately, the problems that arose also showcase two prominent failures in risk management. On the technical level, we have accepted the word of cloud vendors about the safety and viability of their services without question. If your crucial office suite goes down for 5 hours or all your machines get locked up, that can quickly become an existential threat to the business. Yet, we haven’t subjected cloud services to the same vetting we apply to on-premises software. We relied on big names and marketing promises instead of weighing the legalese of the terms and conditions against the risks of self-hosted software.
This obliviousness has to change. The IT department cannot assume all vendors will be 100% available. We need to go back to building contingencies into the IT systems to handle problems caused by catastrophic failures of a single vendor.
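To make the availability point concrete, here is a minimal back-of-the-envelope sketch in Python. The function names and the availability figures are hypothetical and chosen purely for illustration; they are not taken from any vendor’s actual terms. The sketch also assumes that vendors fail independently of each other and that a fallback takes over instantly, which real systems rarely achieve.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime per period that an availability promise still permits."""
    return (1.0 - availability) * period_hours


def serial_availability(availabilities: list[float]) -> float:
    """Availability of a process that needs every listed vendor up at the same time.

    Assumes independent failures, which is a simplification.
    """
    return prod(availabilities)


def with_fallback(primary: float, fallback: float) -> float:
    """Availability when a fallback can take over for the primary.

    The process is down only if both are down at once. Assumes independent
    failures and instant failover, which is optimistic.
    """
    return 1.0 - (1.0 - primary) * (1.0 - fallback)


if __name__ == "__main__":
    # Hypothetical figures: office suite, identity provider, endpoint security.
    vendors = [0.999, 0.999, 0.9995]
    combined = serial_availability(vendors)

    print(f"One vendor at 99.9%: {allowed_downtime_hours(0.999):.2f} h/year of permitted downtime")
    print(f"All three required:  {combined:.4%} available, "
          f"{allowed_downtime_hours(combined):.2f} h/year of permitted downtime")
    print(f"99.9% primary with a 99% fallback: {with_fallback(0.999, 0.99):.4%} available")
```

Measured over a year, a 99.9% promise still permits about 8.76 hours of downtime, so a 5-hour outage like the one described above would not even break it. Chaining several such vendors lowers the combined figure further, while even a modest fallback raises it considerably.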
Yet, it isn’t only the IT department that failed in its risk management. The lack of business continuity planning points to an overarching failure. Boards have been oblivious to IT failures for far too long and have not considered the impact they could have on company operations and the bottom line. Cybersecurity incidents had already hinted at gaps in deeper technical understanding. The current failures show that many company leaders lack even a basic understanding of the IT functions their companies depend on. Our economy runs on computers. Until we give IT risk the same attention we give to legal, operational, and financial risks, we will remain only a single step away from a meltdown.
A New Cloud Computing
After the meltdowns of CrowdStrike and Azure, we must rethink how we do cloud computing. We cannot rely on vendor assurances and marketing speak to protect our businesses or personal lives. We must take risk management for cloud services seriously and consider the implications of a single vendor’s failure. We must also ensure that we can switch vendors if that risk becomes a threat to our business. Without proper risk management and open data, it is only a question of time until the next event of this magnitude hits us.