5-Minute DevOps: The Knight Capital “CD Failure”

Bryan Finster


When continuous delivery comes up, CD naysayers often play “gotcha” with examples of when they think CD failed in mission-critical situations. They assert that if people depended less on automation and spent more time on human validation, these issues wouldn’t occur. One of their favorite examples is the 2012 Knight Capital incident.

In 2012, Knight Capital made a software change. Forty-five minutes later, they were $440 million in the hole. This has been written about ad nauseam, and I’ve no intention of repeating everything here. However, since the people using it as an example can’t be bothered to inform themselves before establishing strong opinions, I’ve created this tl;dr to help educate them.

The Fails

  • They had no automated regression testing.
  • They chose to repurpose an existing flag to control the production release. That existing flag previously controlled code that was never intended to run in production. It was designed to create test data to validate other trading programs in pre-production environments.
  • Their deployment process was manual.
  • They had no tested backout plan.
  • The development team was put under delivery pressure when given only 30 days by the CEO to design and deploy changes to integrate with a new NYSE electronic trading system.
  • They had eight servers that needed to be updated before turning on the flag; they only deployed to seven of them.
  • When they activated the new code, the flag also turned on the old code still sitting on the eighth server, which was meant only for generating test data. It bought high and sold low very quickly, which is usually a bad trading strategy.
  • When they noticed the excessive trading volume, their response was to revert to the previous version, but they left the flag on. Now, all eight servers were buying high and selling low.
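
Several of the fails above come down to a flag being turned on without anything checking what was actually deployed. Here is a minimal sketch of the kind of automated gate that would refuse to do that. Everything in it is hypothetical and for illustration only: the `Server` type, the function names, the version strings, and the flag name are assumptions, not a description of Knight's actual systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    # Hypothetical record of what a server reports it is running.
    name: str
    deployed_version: str


def servers_missing_release(servers, expected_version):
    """Return the names of servers that are not running expected_version."""
    return [s.name for s in servers if s.deployed_version != expected_version]


def enable_flag(flags, flag_name, servers, expected_version):
    """Turn the flag on only if every server runs the expected version."""
    stale = servers_missing_release(servers, expected_version)
    if stale:
        raise RuntimeError(
            f"refusing to enable {flag_name!r}; stale servers: {stale}"
        )
    flags[flag_name] = True
    return flags


if __name__ == "__main__":
    # Eight servers, one of which never received the new release.
    fleet = [Server(f"srv{i}", "2.0") for i in range(1, 8)]
    fleet.append(Server("srv8", "1.0"))

    flags = {}
    try:
        # A new, dedicated flag name rather than a repurposed one.
        enable_flag(flags, "new_order_routing", fleet, "2.0")
    except RuntimeError as err:
        print(err)
```

With one stale server in the fleet, the gate raises instead of flipping the flag, and `flags` stays empty. A check like this only helps if it runs on every release as part of an automated pipeline, not as a step someone has to remember.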

No, it wasn’t because feature flags are dangerous. That’s as silly as saying, “Programming is dangerous.”

No, it wasn’t because they relied on too much automation. On the contrary, they are an example of why you want to standardize things with automation and minimize human touchpoints.

The correct takeaway is that if you don’t focus on engineering excellence daily, you won’t have it when it really counts. Use your daily work to improve how you do your daily work.
