
Auth service silently split-brained in production after multicast discovery failed on Kubernetes


An engineering team discovered their three-replica auth server had been running in a split-brain state since day one, with each node incorrectly believing it was the sole cluster member. The root cause was a peer-discovery routine that silently disabled itself when multicast — unsupported on managed Kubernetes networks — failed to find peers. As a result, singleton background jobs such as data-retention sweeps and customer webhook dispatches ran three times simultaneously, once per node, without triggering any errors. The team resolved the issue by scrapping the gossip-based clustering protocol entirely and replacing it with a blob lease stored in a strongly-consistent cloud storage service. The new design eliminated peer discovery and quorum logic, making cluster leadership a directly readable value rather than something inferred from protocol behavior.
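
The lease-based design described above can be illustrated with a minimal sketch. The summary does not name the storage service or its API, so everything here is an assumption: `InMemoryLeaseStore` is a hypothetical in-process stand-in for a strongly consistent blob store, and `try_acquire`, `run_singleton_job`, and the 30-second TTL are illustrative names and values, not the team's implementation. The point it shows is that leadership becomes a single readable record, and a singleton job runs only on the node that currently holds it.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Lease:
    holder: str        # node id that currently owns leadership
    expires_at: float  # time after which another node may take over


class InMemoryLeaseStore:
    """Hypothetical stand-in for a strongly consistent blob store.

    A real service would provide an atomic conditional write or lease
    primitive; here a lock plays that role inside one process.
    """

    def __init__(self) -> None:
        self._lease: Optional[Lease] = None
        self._lock = threading.Lock()

    def read(self) -> Optional[Lease]:
        # Leadership is a value anyone can read directly.
        with self._lock:
            return self._lease

    def try_acquire(self, node_id: str, ttl: float, now: float) -> bool:
        # Atomically take or renew the lease unless another node
        # holds one that has not yet expired.
        with self._lock:
            current = self._lease
            held_by_other = (
                current is not None
                and current.holder != node_id
                and current.expires_at > now
            )
            if held_by_other:
                return False
            self._lease = Lease(holder=node_id, expires_at=now + ttl)
            return True


def run_singleton_job(
    store: InMemoryLeaseStore,
    node_id: str,
    job: Callable[[], None],
    now: float,
    ttl: float = 30.0,  # assumed value, for illustration only
) -> bool:
    """Run `job` only if this node holds the lease; report whether it ran."""
    if not store.try_acquire(node_id, ttl, now):
        return False
    job()
    return True


# Three replicas attempt the same job at the same instant.
store = InMemoryLeaseStore()
runs = []
for node in ("node-a", "node-b", "node-c"):
    run_singleton_job(store, node, lambda n=node: runs.append(n), now=0.0)
print(runs)  # ['node-a']
```

In this sketch the caller passes `now` explicitly to keep the example deterministic. In a real deployment the expiry decision would presumably rest with the storage service's own clock rather than with each node's.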

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.


Related stories

Programming · DEV Community

Developer builds access-aware WordPress search modal that hides gated content from guests

A developer has shared a detailed walkthrough of building a live search feature for a fitness membership site running WordPress, WooCommerce, LearnDash, and WishList Member. The key challenge was preventing default WordPress search from exposing titles and excerpts of member-only content to logged-out visitors. The solution uses a single custom REST API endpoint with access-aware filtering enforced at query time, ensuring gated content never appears in results for unauthorized users. The UI was implemented as an icon-triggered full-screen modal with debounced live results grouped by content type, chosen to avoid cluttering an already dense navigation bar. The backend integrates Relevanssi for relevance-ranked search while gracefully falling back to core WordPress search if the plugin is unavailable, following a "degrade, don't die" reliability principle.

Programming · DEV Community

How to Build a Full Log Monitoring Stack with Grafana, Loki, Promtail and Prometheus

A technical guide outlines how to set up a complete observability stack using four open-source tools: Loki for log storage, Promtail for log collection, Prometheus for metrics scraping, and Grafana for visualization. The entire stack is orchestrated via Docker Compose, with each service defined in a single configuration file and accessible locally through dedicated ports. Promtail is configured to collect logs from WildFly application server directories and forward them to Loki, while Prometheus scrapes metrics from a Spring Boot actuator endpoint at 15-second intervals. Grafana dashboards can then query both data sources to display real-time client status, filter logs for exceptions or specific keywords, and trigger alerts when services go offline for more than five minutes. The guide recommends a minimum of 2–4 vCPUs, 4 GB of RAM, and 10 GB of disk space, and advises using client labels to keep logs and metrics organized across environments.

Programming · Hacker News

Best Practices for Avoiding Fallback Failures in Distributed Systems

A technical article published on AWS Builder explores strategies for avoiding fallback mechanisms in distributed systems. The piece addresses how fallback patterns, while intended as safety nets, can introduce cascading failures and unexpected behavior. The article outlines design principles aimed at building more resilient distributed architectures without relying on fallback logic. It has garnered modest engagement on Hacker News, accumulating 5 points and 2 comments since its posting.

Programming · DEV Community

HiFX Builds DBSteward to Solve Per-Database Cost Allocation on Shared Cloud Instances

HiFX developed an open-source tool called DBSteward to address a common cloud billing problem: AWS RDS and most managed database services bill at the instance level, making it impossible to attribute costs to individual databases sharing that instance. This creates friction for finance teams trying to implement chargebacks, obscures noisy-neighbor performance issues, and leaves SaaS providers unable to calculate accurate per-tenant margins. The core challenge is that no single resource metric — CPU, storage, or I/O — fairly represents usage across all database types, and system overhead further complicates any simple cost-split formula. DBSteward sidesteps the billing system entirely, instead collecting granular metrics from within the database engine to build a defensible, weighted cost allocation model. The tool is designed to handle the technical nuances of counter versus gauge metrics and ensures tracked databases are not overcharged for capacity consumed by system-level processes.
