Auth service silently split-brained in production after multicast discovery failed on Kubernetes
An engineering team discovered their three-replica auth server had been running in a split-brain state since day one, with each node incorrectly believing it was the sole cluster member. The root cause was a peer-discovery routine that silently disabled itself when multicast — unsupported on managed Kubernetes networks — failed to find peers. As a result, singleton background jobs such as data-retention sweeps and customer webhook dispatches ran three times simultaneously, once per node, without triggering any errors. The team resolved the issue by scrapping the gossip-based clustering protocol entirely and replacing it with a blob lease stored in a strongly-consistent cloud storage service. The new design eliminated peer discovery and quorum logic, making cluster leadership a directly readable value rather than something inferred from protocol behavior.
This is an AI-generated summary. ShortSingh links to the original source for the complete article.
Discussion (0)
Log in to join the discussion and vote.
Log in