Bad experts on the CrowdStrike failure

We largely don't know what we are talking about

Jul 21, 2024

The news quoting experts on this CrowdStrike failure has gotten absurd. Experts don’t know much: at best they are just guessing based on assumptions, at worst they are pushing a political agenda. That includes me. I’m not sure the reporters paid attention to my repeated assertions that my statements were just expert guesses. No amount of expertise makes a guess anything more than a guess.

Anyway, I thought I’d write up some things “experts” are saying that are wrong, or at least more nuanced than they make them sound.

The best example is the claim “don’t deploy on Friday”. People assume that CrowdStrike pushed out “code” changes, the sort of thing they do every few months, something you shouldn’t do on Friday. That’s false: they pushed out “content” changes, the sort of thing they do several times a day. Nobody would say such content changes shouldn’t be pushed on Fridays.

Things may be more complicated than this, the “content” may also contain “code”. But with the current knowledge, we can’t assert this was a “deploy on Friday” failure.

This headline about experts saying CrowdStrike “likely skipped checks” is meaningless. Of course a check that would’ve found the failure wasn’t done. But that’s a tautology, like “experts say the failure was caused by a flaw”. Duh! Every company has a testing regime, and when bugs escape to the real world, they often go back and improve their testing to make sure it doesn’t happen again. This doesn’t mean they were negligent, lazy, slothful, or apathetic. It means testing is hard, and sometimes things slip through.

Hopefully CrowdStrike will make public their analysis of the event and tell us which tests were skipped, but otherwise, we experts are just speculating wildly. I say “we experts” because I got trapped into saying something like this to a journalist and immediately tried to backtrack. It’s so easy to fall into facile statements.

A lot of experts blame Microsoft. This is politics. They hate Microsoft anyway, and are simply looking for reasons.

The dumb techies work on the logic that any “blue screen of death” (BSoD) is a Microsoft bug. After all, alternatives like macOS and Linux don’t have BSoDs. The reality is they do indeed have kernel crashes, but record them in different ways. My macOS devices silently reboot without showing a screen, and my Linux devices don’t even have a screen. I only see in the logs that these things happened.

Another Microsoft item is the discussion of drivers or kernel modules. They are inherently dangerous because they can crash the kernel, making the system unreachable from the network. macOS and Linux (the alternatives to Windows) have been trying to move code out of the kernel to improve robustness. Maybe that’s an inherent Microsoft problem, that they depend more on third-party drivers.

But mostly, it’s a CrowdStrike and hacking problem. CrowdStrike’s drivers are apparently bloated, doing things in the kernel that they could do outside the kernel. Also, Windows is the most hacked platform — and consequently, needs the most complex defenses. The complexity of defensive products on Windows platforms is fundamentally higher than on non-Windows, meaning, you are going to need drivers.

Somebody posted details of the crash, showing that it was a null pointer reference, something that’s supposedly impossible with a memory-safe language like Rust. This is a stupid statement, as this tweet ably explains:

When I was in college another student was struggling with a programming assignment because their program was crashing with a “divide by zero” bug. The student complained “that can’t be, because divide by zero is impossible!!!”. This is the same logic with memory safe languages — they still have the same bugs.
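To make the point concrete, here is a minimal sketch in Python, which stands in here for any memory-safe language. The function names and inputs are made up for illustration and have nothing to do with CrowdStrike's actual code. The language prevents memory corruption, but a missing value or a zero divisor still ends the program unless somebody wrote the check.

```python
from typing import Optional


def pattern_length(pattern: Optional[str]) -> int:
    # Bug: assumes a pattern is always present. With None this raises
    # TypeError instead of corrupting memory, but it is still a crash.
    return len(pattern)


def average(total: int, count: int) -> float:
    # Bug: no check for count == 0, so this raises ZeroDivisionError.
    return total / count


def run(task, *args) -> str:
    """Run a task and report how it ended, standing in for a crash handler."""
    try:
        task(*args)
        return "ok"
    except Exception as exc:  # left unhandled, this would end the program
        return type(exc).__name__


if __name__ == "__main__":
    print(run(pattern_length, "abc"))
    print(run(pattern_length, None))
    print(run(average, 10, 0))
```

Rust behaves similarly in spirit: calling `unwrap()` on a `None` value panics, and a panic in the wrong place is still an outage.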

In any case, the NULL-pointer reference appears not to be true. We don’t know what bug caused the crash. Techies are just speculating.

A lot of experts say this is the “biggest IT failure” ever, bigger even than the notPetya worm from 2017. This is just a “best guess”, of course; I’m not sure we can measure this.

The notPetya worm caused actual damage, deleting data, keeping systems offline for weeks, disrupting global trade. In contrast, this CrowdStrike failure was quickly fixed in hours. This disruption was broad, but not deep.

CrowdStrike has said they estimated only 8.5 million machines were impacted, not the “billions” people claim. This makes it a smaller event than the “Blaster” and “Sasser” worms of 20 years ago, each of which infected well over 10 million systems. But that’s not a proper comparison. Those worms only infected systems that weren’t important enough to patch, where defenders did everything wrong with cybersecurity. In contrast, this flaw impacted those who did everything right, the most important mission-critical systems in enterprises.

I have faith in describing this as the biggest IT failure, but it’s an IT failure rather than a security failure. There will be bigger ones in the near future.

Hours after the failure, we already had punditry claiming we need more resiliency in software. It’s the sort of punditry that needs no technical expertise in understanding the problem, and which proposes no technical details for a solution. It’s just a bunch of furious handwaving that has absolutely nothing to do with this incident. It’s no more valuable than people loudly proclaiming “bugs are bad”. A lot of silly people (like Biden) are repeating the “software resilience” line, where solutions are political and ignorant, without any real technical exploration of the problem.

Some decry the monoculture. It’s an analogy for catastrophic failure, that when everything is the same, everything will get wiped out together, such as in the Irish potato famine.

But there is no monoculture here. There are a billion Windows machines in the world, billions of Linux machines, billions of Apple machines. CrowdStrike’s “Falcon” runs on less than 1% of these. 1% is not a monoculture.
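The arithmetic is easy to check. The sketch below assumes the reported estimate of 8.5 million impacted machines as a rough stand-in for Falcon's Windows footprint (that substitution is my assumption, not a measured install count) and the round figure of a billion Windows machines.

```python
# Rough figures only: the impacted count is the reported 8.5 million estimate,
# used here as an assumed proxy for Falcon's footprint on Windows.
IMPACTED_MACHINES = 8_500_000
WINDOWS_MACHINES = 1_000_000_000  # "a billion Windows machines"


def share_percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


windows_share = share_percent(IMPACTED_MACHINES, WINDOWS_MACHINES)
print(f"{windows_share:.2f}% of Windows machines")
```

That is under 1% of Windows alone, and the share only shrinks once Linux and Apple machines are added to the denominator.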

Big failures happen simply because so many things are big: Java (log4j), SolarWinds, Adobe (PDF), Fortinet, Atlassian, and Microsoft Exchange (which has nothing to do with Windows). There are so many things where a failure in an alternative with only a 0.1% share can cause catastrophic failures. We have long supply chains where a disruption at any step can impact the entire chain.

There is not, and never has been, anything like a monoculture on the Internet. Even if you make your corporate servers/desktops 33% Windows, 33% macOS, and 33% Linux, when something breaks one third of your system, your entire enterprise is affected, as they are all dependent on each other.
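Here is a toy model of that dependency argument. The service names, the operating system each one runs on, and the dependency edges are all hypothetical, chosen only to show how a failure on one platform propagates to services on the others.

```python
from typing import Dict, List, Set

# Hypothetical services and the OS each one runs on (illustrative names).
RUNS_ON: Dict[str, str] = {
    "directory": "windows",
    "database": "linux",
    "web": "linux",
    "design": "macos",
    "billing": "windows",
}

# Hypothetical dependencies: service -> services it needs in order to work.
NEEDS: Dict[str, List[str]] = {
    "directory": [],
    "database": ["directory"],
    "web": ["database"],
    "design": ["web"],
    "billing": ["database"],
}


def working_services(failed_os: str) -> Set[str]:
    """Return the services that still work when every failed_os machine is down."""
    down = {name for name, os_name in RUNS_ON.items() if os_name == failed_os}
    changed = True
    while changed:  # keep propagating until no new service breaks
        changed = False
        for service, deps in NEEDS.items():
            if service not in down and any(dep in down for dep in deps):
                down.add(service)
                changed = True
    return set(RUNS_ON) - down


if __name__ == "__main__":
    for os_name in ("windows", "linux", "macos"):
        print(os_name, "down ->", sorted(working_services(os_name)))
```

In this made-up graph, losing the Windows machines takes everything down and losing Linux leaves only the directory service, while losing macOS hurts only the one service running on it. Real dependency graphs will differ, but mixing operating systems does not by itself isolate failures.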

There are many EDR competitors to CrowdStrike that also have millions of customers, as well as Microsoft’s own solutions. There is no “monoculture” involved here. It just means different airlines get grounded when the next EDR vendor blue-screens their customers.

Conclusion

All news stories about catastrophes have the same shape. They insist there was some sort of simple thing, some moral weakness, that’s the underlying cause. It’s something the readers can understand when ignorant of technical details. This is the shape of stories whether they are about Boeing, the Secret Service, or CrowdStrike.

The reality is that failures are complex and you don’t know what’s really wrong without a thorough technical analysis. The actual “root cause” will probably be something complex enough that the public can’t understand it.

I talked to a number of reporters, as an “expert”. Some were tech reporters who got the details right and didn’t really need an expert, except as somebody to double-check things. Some were non-tech reporters who got dragged in to write a story and had a ton of misconceptions, needing an expert to explain even basic details.

But “expert” discussion beyond basic facts is just speculation, still not credible even though an “expert” said it.

Cybersect
