As I read it, the interesting part isn’t the unwrap operation itself. The interesting part is that a new version was being pushed out incrementally, and the infrastructure components that read from the new version were failing consistently. But awareness of that failure never propagated back into the component doing the push.
The point of an incremental rollout is to expose only a fraction of traffic to the new version, as a test of whether the new version is actually good. For that to work, failures caused by the new version have to be detected, and that awareness has to propagate back to the system doing the rollout and make it stop.
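To make that concrete, here’s a rough sketch of the kind of feedback loop I mean. Every name in it (`get_error_rate`, `set_traffic_fraction`, the thresholds) is hypothetical, standing in for whatever your deployment and monitoring systems actually expose:

```python
import time

ROLLOUT_STEPS = [0.01, 0.05, 0.25, 1.0]  # fraction of traffic on the new version at each step
ERROR_RATE_THRESHOLD = 0.02              # abort if the canary's error rate exceeds this
BAKE_TIME_SECONDS = 600                  # how long to watch each step before widening

def get_error_rate(version: str) -> float:
    """Placeholder: ask your monitoring system for the error rate of this version."""
    raise NotImplementedError

def set_traffic_fraction(version: str, fraction: float) -> None:
    """Placeholder: tell the load balancer / config push how much traffic goes to `version`."""
    raise NotImplementedError

def incremental_rollout(new_version: str) -> bool:
    """Push `new_version` out in steps, halting and rolling back if the canary looks bad."""
    for fraction in ROLLOUT_STEPS:
        set_traffic_fraction(new_version, fraction)
        time.sleep(BAKE_TIME_SECONDS)
        # The step that was apparently missing: the pusher itself looks at the
        # failures the new version is causing before widening any further.
        if get_error_rate(new_version) > ERROR_RATE_THRESHOLD:
            set_traffic_fraction(new_version, 0.0)  # roll back
            return False
    return True
```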
Yeah, doing an incremental rollout doesn’t save you if you’re not monitoring it.
Or you may be doing a rollout of a new abcserver, and the monitoring system tracks that the abcserver itself is successfully starting and taking queries … but doesn’t notice that the new abcserver’s responses are causing the jklserver (somewhere else in your infra) to crash.
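A sketch of what closing that blind spot might look like: the rollout’s health check also asks about the things that consume the new component’s output, not just the component itself. Again, all the names and thresholds here are made up:

```python
DOWNSTREAM_SERVICES = ["jklserver"]  # assumed: services that consume the abcserver's output
CRASH_RATE_SLACK = 1.5               # tolerate up to 1.5x the pre-rollout crash rate

def abcserver_is_serving(version: str) -> bool:
    """Placeholder: the check that was already happening -- does it start and take queries?"""
    raise NotImplementedError

def crash_rate(service: str) -> float:
    """Placeholder: current crashes per minute for `service`, from monitoring."""
    raise NotImplementedError

def baseline_crash_rate(service: str) -> float:
    """Placeholder: the same metric from before the rollout started."""
    raise NotImplementedError

def rollout_is_healthy(new_version: str) -> bool:
    # The check the monitoring system was already doing:
    if not abcserver_is_serving(new_version):
        return False
    # The check that was missing: is anything that reads the new responses now crashing?
    for service in DOWNSTREAM_SERVICES:
        if crash_rate(service) > CRASH_RATE_SLACK * baseline_crash_rate(service):
            return False
    return True
```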
One possible rule is, “If it looks like there’s a global sitewide outage happening, look up all the currently in-progress and recently-completed rollouts. Send an alert to all of those teams. Everyone who’s responsible for a recent rollout gets to ponder deep in their hearts whether there’s any possibility that their rollout could be breaking the world, and decide whether to roll back.”
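A sketch of what that first rule could look like as automation, with hypothetical stand-ins for the rollout registry and the paging system:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RECENT_WINDOW = timedelta(hours=6)  # assumed definition of "recently completed"

@dataclass
class Rollout:
    name: str
    owner: str
    status: str                       # "in_progress" or "completed"
    completed_at: Optional[datetime]  # None while still in progress

def list_rollouts() -> list[Rollout]:
    """Placeholder: your deployment system's record of rollouts."""
    raise NotImplementedError

def page_team(owner: str, message: str) -> None:
    """Placeholder: whatever paging/alerting system you use."""
    raise NotImplementedError

def on_global_outage(now: datetime) -> None:
    """When a sitewide outage is detected, alert everyone with a rollout in flight or just finished."""
    for r in list_rollouts():
        recently_completed = (r.status == "completed"
                              and r.completed_at is not None
                              and now - r.completed_at < RECENT_WINDOW)
        if r.status == "in_progress" or recently_completed:
            page_team(r.owner,
                      f"Sitewide outage in progress; could rollout '{r.name}' be responsible? "
                      "Decide whether to roll back.")
```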
Another possible rule is, “You can afford to have an agent (currently, probably a human) watching over each rollout; an agent who cares about whether the whole site is crashing hard, and who gets curious about whether their rollout might be responsible.”
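And a sketch of the second rule as a per-rollout watcher loop. The sitewide metric and the notification hook are stand-ins, and the “agent” on the receiving end is, for now, probably a human:

```python
import time
from typing import Callable

SITEWIDE_ERROR_THRESHOLD = 0.05  # assumed proxy for "the whole site is crashing hard"
POLL_SECONDS = 60

def sitewide_error_rate() -> float:
    """Placeholder: one top-level 'is the site okay?' metric."""
    raise NotImplementedError

def nudge_rollout_owner(rollout_name: str, message: str) -> None:
    """Placeholder: get the watcher's (currently, probably a human's) attention."""
    raise NotImplementedError

def watch_rollout(rollout_name: str, rollout_done: Callable[[], bool]) -> None:
    """Sit alongside one rollout and raise the question whenever the site looks unhealthy."""
    while not rollout_done():
        if sitewide_error_rate() > SITEWIDE_ERROR_THRESHOLD:
            # No need to prove causation here; the point is just to get curious.
            nudge_rollout_owner(rollout_name,
                                "The whole site looks unhealthy while your rollout is in flight. "
                                "Any chance it's responsible?")
        time.sleep(POLL_SECONDS)
```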