Or you may be doing a rollout of a new abcserver, and the monitoring system tracks that the abcserver itself is successfully starting and taking queries … but doesn’t notice that the new abcserver’s responses are causing the jklserver (somewhere else in your infra) to crash.
One possible rule is, “If it looks like there’s a global sitewide outage happening, look up all the currently in-progress and recently-completed rollouts. Send an alert to all of those teams. Everyone who’s responsible for a recent rollout gets to ponder deep in their hearts whether there’s any possibility that their rollout could be breaking the world, and decide whether to roll back.”
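As a rough illustration of that first rule, here is a minimal Python sketch. Everything in it is hypothetical: the `Rollout` record, the `teams_to_alert` helper, and the two-hour "recent" window are assumptions made for the example, not part of any particular rollout system. It only shows the selection step (which teams should hear about the outage); detecting the outage and actually sending the alert are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Rollout:
    """Hypothetical record of one rollout; field names are assumptions."""
    name: str
    team: str
    started_at: datetime
    finished_at: Optional[datetime] = None  # None means still in progress


def suspect_rollouts(
    rollouts: List[Rollout],
    now: datetime,
    recent_window: timedelta = timedelta(hours=2),  # arbitrary assumed window
) -> List[Rollout]:
    """Return rollouts that are in progress or finished within recent_window."""
    suspects = []
    for r in rollouts:
        if r.started_at > now:
            continue  # has not started yet, so it cannot be the cause
        in_progress = r.finished_at is None
        recently_done = (not in_progress) and (now - r.finished_at <= recent_window)
        if in_progress or recently_done:
            suspects.append(r)
    return suspects


def teams_to_alert(
    rollouts: List[Rollout],
    now: datetime,
    recent_window: timedelta = timedelta(hours=2),
) -> List[str]:
    """Return the de-duplicated, sorted list of teams owning suspect rollouts."""
    return sorted({r.team for r in suspect_rollouts(rollouts, now, recent_window)})
```

When the outage detector fires, you would call `teams_to_alert(all_rollouts, datetime.now())` and page each returned team. The filter is deliberately broad: it makes no attempt to guess which rollout is guilty, because the rule leaves that judgment to the people who own each rollout.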
Another possible rule is, “You can afford to have an agent (currently, probably a human) watching over each rollout; an agent who cares about whether the whole site is crashing hard, and who gets curious about whether their rollout might be responsible.”