How a Misconfigured AWS Egress Firewall Caused Databricks BOOTSTRAP_TIMEOUT Errors
A Databricks cluster deployed on AWS inside a customer-managed VPC repeatedly failed to start, producing a BOOTSTRAP_TIMEOUT error after roughly 25 minutes despite all EC2 nodes passing health checks. The cluster was routed through a multi-hop egress path involving a Transit Gateway, an inspection firewall, and a NAT gateway before reaching the internet. The root cause was that the cluster nodes, which had no public IPs under secure cluster connectivity, could not establish outbound communication to the Databricks control plane's relay service. Unlike AWS-native services such as S3 or STS, the Databricks control plane and its secure cluster connectivity relay have no AWS VPC endpoint, meaning egress must be explicitly permitted through the firewall or routed via AWS PrivateLink. The investigation highlighted that a healthy EC2 instance combined with a cluster stuck in INSTANCE_INITIALIZING is a reliable signal of a broken outbound network path rather than an IAM or capacity issue.
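The failure mode described above (healthy instances, cluster stuck in INSTANCE_INITIALIZING, no path to the control plane relay) is easiest to confirm from inside the VPC with a plain TCP probe against each endpoint the cluster must reach. The following pre-flight check is an added example written for this page, not code from the article, from Databricks or from AWS; the hostnames in `REQUIRED_ENDPOINTS` are placeholders and must be replaced with the control plane, secure cluster connectivity relay and artifact endpoints that Databricks publishes for the workspace's region. It distinguishes a silent drop (connect timeout, the typical signature of a firewall deny or a black-holed route) from an active refusal and from DNS failure, which narrows the search to the right hop in the Transit Gateway / inspection firewall / NAT path.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# Placeholder endpoints (assumption): the real hostnames and ports for a
# workspace's control plane and secure cluster connectivity relay come from
# the Databricks documentation for that region, not from this script.
REQUIRED_ENDPOINTS = [
    ("control-plane.example.invalid", 443),  # placeholder: workspace control plane
    ("scc-relay.example.invalid", 443),      # placeholder: secure cluster connectivity relay
]


def check_endpoint(host, port, timeout=5.0):
    """Try one TCP connect and classify the outcome.

    Returns (host, port, ok, reason).  A timeout means nothing in the egress
    path answered (silent drop: firewall deny rule or black-holed route).  A
    refusal means something answered, typically a reject rule or wrong port.
    DNS failure points at resolver/egress-to-DNS problems rather than the
    firewall rule for the endpoint itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (host, port, True, "connected")
    except socket.gaierror as exc:
        return (host, port, False, f"dns: {exc}")
    except socket.timeout:
        return (host, port, False, "timeout: no reply (likely silent drop in egress path)")
    except OSError as exc:
        return (host, port, False, f"refused/unreachable: {exc}")


def check_egress(endpoints, timeout=5.0):
    """Probe all endpoints in parallel; returns {(host, port): (ok, reason)}."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda hp: check_endpoint(hp[0], hp[1], timeout), endpoints))
    return {(h, p): (ok, reason) for h, p, ok, reason in results}


def diagnose(results):
    """Return ('ok', []) if every endpoint is reachable, otherwise
    ('blocked_egress', [failing endpoints with reasons]).

    Mirrors the article's rule: healthy instances plus a cluster that never
    leaves bootstrap means the outbound path, not IAM or capacity, is broken.
    """
    failed = [f"{h}:{p} ({reason})"
              for (h, p), (ok, reason) in results.items() if not ok]
    if not failed:
        return "ok", []
    return "blocked_egress", failed


if __name__ == "__main__":
    # Run from a node in the data-plane subnet (e.g. as a cluster init script step)
    # before committing to a 25-minute BOOTSTRAP_TIMEOUT wait.
    verdict, failures = diagnose(check_egress(REQUIRED_ENDPOINTS))
    print(verdict)
    for line in failures:
        print("  ", line)
```

Run it from an instance in the same subnet and with the same security groups as the cluster nodes; a result of `timeout` on the relay endpoint while AWS-native endpoints (S3, STS via their VPC endpoints) succeed is exactly the asymmetry the article describes, and the fix is an explicit allow rule on the inspection firewall or a PrivateLink route for that endpoint.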
This is an AI-generated summary. ShortSingh links to the original source for the complete article.
