Databricks BOOTSTRAP_TIMEOUT Fixed via AWS PrivateLink Without Adding New Subnets
A team running Databricks on AWS traced persistent BOOTSTRAP_TIMEOUT errors to a centralized egress firewall that was silently dropping traffic from cluster nodes to the Databricks control plane and the secure cluster connectivity (SCC) relay. Rather than simply adding a firewall rule, the team chose to take control-plane traffic off the public internet entirely by using AWS PrivateLink. The fix required two interface VPC endpoints, one for the workspace REST API and one for the SCC relay; clusters cannot start unless both are reachable. Because the existing VPC was a fully utilized /24 with no room for new subnets, the team placed the endpoint ENIs directly into the existing cluster subnets, at a cost of only a few private IPs per availability zone. This approach eliminated the bootstrap failure without new subnets, new CIDR blocks, or routing changes.
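The endpoint layout described above can be sketched as a small planning script. This is a minimal illustration, not the team's actual tooling: the VPC, subnet, and security group IDs are hypothetical, the two service names are placeholders (the real values are region-specific and come from the Databricks documentation), and the split of the /24 into two /25 cluster subnets is an assumption made only to show the IP arithmetic. The dictionaries use the parameter names of the EC2 CreateVpcEndpoint API, so they could be passed to boto3's `create_vpc_endpoint`; DNS options are left out.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5

# Placeholders: the real endpoint service names are region-specific.
SERVICE_NAMES = {
    "workspace_rest_api": "<workspace-endpoint-service-name>",
    "scc_relay": "<scc-relay-endpoint-service-name>",
}

# Hypothetical layout: one existing cluster subnet per availability zone.
CLUSTER_SUBNETS = {
    "subnet-aaaa1111": "10.0.0.0/25",
    "subnet-bbbb2222": "10.0.0.128/25",
}


def endpoint_requests(vpc_id, subnet_ids, security_group_id, service_names):
    """Build CreateVpcEndpoint keyword arguments, one dict per service."""
    return [
        {
            "VpcEndpointType": "Interface",
            "VpcId": vpc_id,
            "ServiceName": name,
            # Reuse the existing cluster subnets instead of creating new ones.
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": [security_group_id],
        }
        for name in service_names.values()
    ]


def usable_ips(cidr):
    """Addresses in a subnet that AWS leaves available for ENIs."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


plan = endpoint_requests(
    vpc_id="vpc-0123example",              # hypothetical
    subnet_ids=CLUSTER_SUBNETS.keys(),
    security_group_id="sg-0123example",    # hypothetical
    service_names=SERVICE_NAMES,
)

# Each interface endpoint places one ENI, and so one private IPv4 address,
# in every subnet it is attached to.
ips_per_subnet = len(plan)

for subnet_id, cidr in CLUSTER_SUBNETS.items():
    print(f"{subnet_id}: {ips_per_subnet} of {usable_ips(cidr)} usable IPs go to endpoint ENIs")
```

With boto3, each dictionary in `plan` could be submitted as `ec2_client.create_vpc_endpoint(**request)`. Under these assumptions the two endpoints take two addresses in each subnet, which is why the change fits inside an already full /24.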
This is an AI-generated summary. ShortSingh links to the original source for the complete article.