The situation
Imagine you take delivery of a brand new car. As you’re handed the keys, the salesperson holds up a shiny box and says, ‘Before you go, you might like to consider installing this third-party package.’
‘What does that do?’ you ask.
‘It boosts the headlights and windscreen wipers.’
‘You mean the car doesn’t have headlights and windscreen wipers built in?’
‘It does, but the headlights are really feeble, and the wipers only wipe once a second. You really shouldn’t head out on the highway with the car as it is. You need this package to keep you safe.’
At that point, most sensible buyers would reconsider their purchase. But software buyers don’t. The core operating system that runs most of the world is of such poor quality that only a fool would consider connecting it to the information superhighway without some third-party protection. And that’s precisely what thousands of corporate users did by installing CrowdStrike. Unfortunately, the only thing CrowdStrike didn’t defend against was CrowdStrike itself.
The incompetence: Part 1
First, CrowdStrike. The global crash was caused by an update to an innocuous-looking .SYS file. Clearly, it hadn’t been tested. You might think that a company that boasts of serving:
- 298 of the Fortune 500 companies
- 8 out of the top 10 financial service companies
- 8 out of the top 10 food & beverage companies
- 6 out of the top 10 healthcare providers
- 8 out of the top 10 manufacturers
- 8 out of the top 10 auto companies
- 8 out of the top 10 technology companies
might take a little more care with its updates. Apparently not.
But SaaS (Software as a Service) stuff is really hard to test, CrowdStrike apologists tell me.
Really? You can’t set up a ‘sacrificial’ server farm representing common configurations and test each update on that first? Apparently not. It seems the only real alternative is to release the patches and let the users test them for you. Or, to put it another way, the people who are paying you to protect them are simply guinea pigs for your changes.
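For the avoidance of doubt about what a ‘sacrificial’ server farm buys you, here is a minimal sketch of a canary gate in Python. Everything in it is hypothetical (the host names, the deploy-tool command, the ping-based health check), and it is emphatically not how CrowdStrike actually ships updates; the point is only how little machinery a ‘push to the guinea-pig ring first, promote only if it survives’ policy needs.

```python
import subprocess
import time

# Hypothetical canary gate: none of this is CrowdStrike's real pipeline.
# The idea is simply "ship to a sacrificial ring first, check it survives,
# and only then promote". Host names and the deploy-tool command are made up.

TEST_RING = ["win2019-dc", "win2022-sql", "win11-laptop", "win10-kiosk"]

def push_update(host: str, package: str) -> None:
    """Placeholder for whatever mechanism actually delivers the update."""
    subprocess.run(["deploy-tool", "--host", host, "--package", package], check=True)

def still_alive(host: str) -> bool:
    """Crude health check: does the host still answer a ping after the update?"""
    return subprocess.run(["ping", "-n", "3", host], capture_output=True).returncode == 0

def canary_release(package: str) -> bool:
    for host in TEST_RING:
        push_update(host, package)
    time.sleep(600)  # give the ring time to reboot, blue-screen or settle
    return all(still_alive(h) for h in TEST_RING)

if __name__ == "__main__":
    if canary_release("C-00000291.sys"):
        print("Test ring survived - safe to promote to the next ring")
    else:
        print("Test ring fell over - halt the rollout")
```

Even a health check that crude would have caught this particular update, since affected hosts blue-screened as soon as the faulty file was loaded.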
The incompetence: Part 2
The fix was actually very simple:
- Boot into Safe Mode or the Windows Recovery Environment
- Go to the C:\Windows\System32\drivers\CrowdStrike directory
- Delete the file matching C-00000291*.sys
- Reboot as normal
But time and time again I read reports that the servers concerned couldn’t be booted into Safe Mode because they were encrypted using BitLocker, and that the keys needed to decrypt BitLocker were stored… on another server which had also been taken down by the automatic update!
I mean, seriously: WTF? Doesn’t anyone understand the very basis of Disaster Recovery?
User GeekyOldFart, writing on The Register website, clearly does [my emphasis]:
On my site I opened the (hard copy, in the safe) “oh shit” file to get the relevant local admin password, left my office, walked briskly down the corridor to the onsite server room and got one DC up in safe mode to do the fix. Then I changed that single-use local admin password before heading back to my office and updating the hard copy file with the new password before locking it away again.
Meanwhile the rest of my team were making use of the one DC I’d resurrected to get all the other impacted servers up into “safe mode with networking” now that they could talk to a DC, allowing them to login with their domain admin accounts AND access the BitLocker keys and perform the fix.
Once we had the AD infrastructure up and running the desktop support folks went into high gear busily fixing all the impacted workstations and laptops
A similar story played out on all my employer’s sites worldwide and we had pretty much every server – even the non-critical ones – back online before noon UTC and 99% of workstations and laptops fixed by mid-afternoon. None of which would have happened that fast without that hard copy file. Sometimes the best tech solution is decidedly low-tech 🙂
It’s a relief to see some IT execs get it. Most don’t.
Oh, to be a fly on the wall on Monday morning as these folks explain to their various boards why they had never put the words Disaster and Recovery together in the same sentence, let alone planned for one.
It ain’t just the airlines
The media make a big deal about people queuing at airports, but that’s just a fraction of the problem. The depth of this software calamity was brought home to me by a post from a user called nimbus writing on ycombinator.com:
i work for a diesel truck maintenance and repair shop and its been hell on earth this morning.
– our IT wizard says the fixes wont work on lathes/CNC systems. we may need to ship the controllers back to the manufacturer in Wisconsin.
– AC is still not running. sent the apprentice to get fans from the shop floor.
– building security alarms are still blaring, need to get a ladder to clip the horns and sirens on the outside of the building. still cant disarm anything.
– still no phones. IT guy has set up two “emergency” phones…one is a literal rotary phone. stresses we still cannot call 911 or other offices. fire sprinklers will work, but no fire department will respond.
– no email, no accounting, nothing. I am going to the bank after this to pick up cash so i can make payday for 14 shop technicians. was warned the bank likely would either not have enough, or would not be able to process the account (if they open at all today.)
One little workshop in one little corner of the States: no lathes, no air-con, no security, no phones, no email, no accounting, and a forlorn hope that the bank might be open—and have enough cash—to pay the staff.
One little software glitch and that’s what we’ve come to.
Are we going to take this as a wake-up call and a warning? Or are we going to accept the corporate blather that this will never, ever happen again—until it does?
We’ve built a global system on a corporate monoculture that—if judged in car safety terms—is unsafe at any speed.