5-Minute DevOps: How to Avoid the CrowdStrike Mistake Pt. 2
After CrowdStrike published its preliminary postmortem, I wrote my impressions of the findings and what we might learn to improve our processes and avoid similar problems. Since then, CrowdStrike has released its final report, and my jaw dropped at the things they will start doing that they were not doing before.
To recap, CrowdStrike deploys applications that run as Windows drivers with elevated access to the kernel. Due to Windows’ design, this access is necessary for CrowdStrike’s sensors to detect security threats. Because of that level of access, Microsoft requires vendors to submit their drivers for certification before they can run in the kernel. That certification process takes time; too much time to respond appropriately to new threats. CrowdStrike resolves this by deploying “Rapid Response” configuration changes. Since these are configuration changes rather than application changes, Microsoft does not require them to be certified.
On Friday, July 19, at 04:09 UTC, they deployed a configuration change that caused 8.5 million Windows machines to enter an infinite reboot loop. What can we learn from their changes to quality and engineering practices? A lot.
They defined a contract between the sensor and the configuration files. The contract defined 20 properties, while the new configuration file had 21 properties. Their tests did not validate that contract. People frequently ignore contract testing. If we are producing a contract, we should be testing that we are not making breaking changes. If we are consuming a contract, we should test that we only consume what we need. Contract tests should be table stakes for any interface we have.
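As a sketch of what that could look like, here is a producer-side contract test that fails the build if a new configuration file declares fields the deployed sensor’s contract doesn’t know about. The field names and the 20-parameter contract are hypothetical stand-ins, not CrowdStrike’s actual format.

```python
# Hypothetical producer-side contract test. The field names and the
# 20-parameter contract are illustrative stand-ins.
SENSOR_CONTRACT = {f"param_{i}" for i in range(20)}  # fields the deployed sensor understands


def contract_violations(config: dict) -> list[str]:
    """Return the ways a candidate configuration breaks the consumer contract."""
    violations = []
    unknown = set(config) - SENSOR_CONTRACT
    if unknown:
        violations.append(f"declares fields the deployed sensor does not understand: {sorted(unknown)}")
    missing = SENSOR_CONTRACT - set(config)
    if missing:
        violations.append(f"is missing fields the sensor requires: {sorted(missing)}")
    return violations


def test_new_config_matches_deployed_contract():
    # A 21-field config against a 20-field contract should fail here,
    # long before it reaches a customer machine.
    candidate = {f"param_{i}": "*" for i in range(21)}
    assert contract_violations(candidate) == [], "breaking change: do not ship"
```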
They had a fixed-length array with no bounds checking. It’s a really good idea to throw garbage data at your application during testing. I don’t trust any input from anyone, even if I’m creating the input. Combined with the issue above, I wonder if they’ve heard of Postel’s Law, AKA the Robustness Principle.
Be liberal in what you accept and conservative in what you send.
Following this principle, I would look into redesigning how the interface works to be less fragile when receiving unexpected input. We should design our interfaces to ignore things we don’t expect and only create errors if something we need is missing. In their case, it’s even more fundamental. If your array can be overrun, you might want to add some error handling.
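Here’s a minimal sketch of that kind of liberal, bounds-checked consumer, in Python for readability (the real sensor is native code, and the parameter counts are illustrative):

```python
import logging

EXPECTED_PARAM_COUNT = 20  # hypothetical: the count this sensor version was built to read


def load_input_params(raw_params: list) -> list:
    """Liberal consumer: take the fields we know about, ignore the rest, never overrun."""
    if len(raw_params) < EXPECTED_PARAM_COUNT:
        # Only fail when something we actually need is missing.
        raise ValueError(f"expected {EXPECTED_PARAM_COUNT} params, got {len(raw_params)}")
    if len(raw_params) > EXPECTED_PARAM_COUNT:
        # Extra fields from newer content: note it and move on instead of crashing.
        logging.warning("ignoring %d unexpected params", len(raw_params) - EXPECTED_PARAM_COUNT)
    return raw_params[:EXPECTED_PARAM_COUNT]  # the bound is enforced here, not assumed


def test_garbage_input_never_crashes():
    # Throw junk at the parser: a controlled error is acceptable, an unhandled crash is not.
    for garbage in ([], ["x"], ["x"] * 21, ["x"] * 500):
        try:
            load_input_params(garbage)
        except ValueError:
            pass
```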
“Problematic Template Instance” seems like an understatement. However, based on the following, it sounds like an error in contract versioning and the fundamental CD practice of “production-like test environments.”
On July 19, 2024, two additional IPC Template Instances were deployed. One of these introduced a non-wildcard matching criterion for the 21st input parameter. These new Template Instances resulted in a new version of Channel File 291 that would now require the sensor to inspect the 21st input parameter. Until this channel file was delivered to sensors, no IPC Template Instances in previous channel versions had made use of the 21st input parameter field. The Content Validator evaluated the new Template Instances, but based its assessment on the expectation that the IPC Template Type would be provided with 21 inputs.
If I am parsing this correctly, they built the next version of the sensor against a new contract definition, and the new configuration file matched that new contract. Their testing was done against the new contract, and because the deployed sensor was not built as a liberal consumer, the version mismatch made the production sensors fail. This is a great reminder to ensure our test environments match production configuration(s) as closely as possible. Then, we should test our tests by deploying them to production canaries. Test in production… carefully.
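One way to make that concrete is a compatibility check that validates new content against every sensor version actually running in the fleet, not just the version the content was built for. Everything here is illustrative; the version numbers and parameter counts are made up.

```python
# Hypothetical fleet inventory: how many parameters each deployed sensor
# version can actually consume. Versions and counts are made up.
DEPLOYED_SENSOR_VERSIONS = {
    "7.10": 20,
    "7.11": 21,
}


def test_new_content_against_deployed_fleet():
    new_content_param_count = 21  # what the new channel file expects the sensor to read
    incompatible = [
        version
        for version, supported in DEPLOYED_SENSOR_VERSIONS.items()
        if new_content_param_count > supported
    ]
    assert not incompatible, f"content would break deployed sensor versions: {incompatible}"
```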
Start doing staged deployments? Yeah, that’s probably a good idea. They deliver to millions of machines without visibility into how they are configured. Limiting the blast radius of potential problems (because we will all break something sometime) seems like an obvious choice. In my previous article, I wrote about understanding the deployment risk and designing your delivery process for that risk. It’s astounding that blast radius didn’t seem to be a concern for this kind of delivery scenario. On the plus side, they will NOW start using canary deploys, hopefully starting with their internal servers.
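A staged rollout doesn’t need to be fancy. A rough sketch of the idea, with hypothetical ring names and injected deploy and health-check functions:

```python
import time

# Hypothetical rollout rings, smallest blast radius first.
ROLLOUT_RINGS = [
    ("internal", 0.001),  # our own machines eat the change before any customer does
    ("canary", 0.01),
    ("early", 0.10),
    ("broad", 1.00),
]


def staged_rollout(deploy, ring_is_healthy, soak_seconds=1800):
    """Push ring by ring; halt (leaving the rest of the fleet untouched) on any regression.

    `deploy` and `ring_is_healthy` are hooks injected by whatever delivery platform you use.
    """
    for ring, fraction in ROLLOUT_RINGS:
        deploy(ring, fraction)
        time.sleep(soak_seconds)  # let crash reports and telemetry accumulate
        if not ring_is_healthy(ring):
            raise RuntimeError(f"halting rollout: health regression in ring '{ring}'")
```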
Someone with more Windows Server background needs to help me with this one. Do Windows servers just wait for vendors to install things at their whim? Is there no way to control that? If so, that’s kinda terrifying (because I have data ingestion trust issues). If not, then companies like Delta that were impacted have a process problem. We shouldn’t be updating critical infrastructure in the middle of the night. Since this update was deployed at 00:09 Eastern time, I doubt that Delta’s Atlanta office was staffed for an “oops.”
While we are on the topic of Delta…
Delta to seek “at least $500 million” from CrowdStrike and Microsoft
Delta says it was impacted for at least five days and has reported a $550 million cost to regulators. However, what’s their disaster recovery plan? This isn’t a one-off issue for Delta.
- January 2023: Delta suffered a major outage that affected its entire network. The incident caused widespread delays and cancellations, disrupting flights globally. The root cause was identified as a failure in Delta’s internal data center operations, which led to cascading issues across various systems, including flight scheduling and customer service platforms.
- August 2019: Another major outage occurred due to a failure in Delta’s automated check-in and booking systems. This disruption led to delays and cancellations, affecting flights at major hubs like Atlanta and New York. The issue was traced back to a malfunction in Delta’s server network, which caused significant downtime for the airline’s digital services, including mobile apps and websites.
There appear to be gaps in Delta's DR capabilities. CrowdStrike's filing for discovery suggests they are curious about that, too.
The reality of software is that it’s “use at your own risk,” and CrowdStrike’s Terms and Conditions are explicit about that. They even put it in bold type to make sure you didn’t miss it.
THERE IS NO WARRANTY THAT THE OFFERINGS OR CROWDSTRIKE TOOLS WILL BE ERROR FREE, OR THAT THEY WILL OPERATE WITHOUT INTERRUPTION OR WILL FULFILL ANY OF CUSTOMER’S PARTICULAR PURPOSES OR NEEDS. THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION.
Companies like Delta share responsibility for their impact. It appears Delta had insufficient planning or practice for failures. It’s up to all of us to learn from that and to ensure our high-availability systems are actually high-availability. If a third-party vendor can take us down, that’s on us. What can we do to validate our resilience to upstream disasters?
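One option is a scheduled game-day drill that simulates exactly this failure mode: an upstream vendor knocking out a slice of hosts. A rough sketch below, where the RTO and the injected platform hooks are assumptions, not anyone’s actual numbers:

```python
import time

RTO_SECONDS = 15 * 60  # the recovery time objective your DR plan claims (illustrative)


def game_day_vendor_outage(take_hosts_offline, critical_service_is_up, restore_hosts):
    """Simulate an upstream vendor taking out hosts and verify failover meets the RTO.

    The three callables are hypothetical hooks into your own platform.
    """
    start = time.monotonic()
    take_hosts_offline(fraction=0.3)  # the "upstream disaster"
    try:
        while not critical_service_is_up():
            if time.monotonic() - start > RTO_SECONDS:
                raise AssertionError("failover missed the RTO; the DR plan needs work")
            time.sleep(30)
    finally:
        restore_hosts()
```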
CrowdStrike’s error cost its customers billions of dollars. The apparent misses in CrowdStrike's architecture, test design, and delivery process should give their customers pause but should also act as a reminder and warning to the rest of us to examine our own processes and make sure our houses are in order. It’s far too common for people to ignore risk and only plan for sunshine… until a tornado hits. What are you doing to keep the storms away?