Kubernetes Controllers Can Silently Fail on Stale Tokens While Appearing Healthy
Kubernetes controllers built on client-go have a silent failure mode: when the API server answers 401 Unauthorized, they retry indefinitely instead of restarting to pick up a fresh ServiceAccount token. The issue surfaced during a Helm upgrade of KAI-Scheduler, a CNCF Sandbox project, performed through ArgoCD. Deleting and recreating a Config CR invalidated the projected ServiceAccount tokens and left the scheduler stuck in a retry loop. Throughout the incident the pod stayed Running and Ready, and ArgoCD reported the application as Synced and Healthy, so nothing in the usual health signals revealed the problem. The proposed fix wraps the HTTP transport configured through rest.Config so that the process calls os.Exit on the first 401 response, which prompts the kubelet to restart the container with a valid token mounted. controller-runtime recently added opt-in support for custom watch error handlers, but controllers do not treat 401 errors as a reason to restart by default, which makes this a widespread gap worth auditing.
This is an AI-generated summary. ShortSingh links to the original source for the complete article.