Garen: Reliable Cluster Management with Atomic State Reconciliation
This paper was published by FriendliAI at EuroSys 2026.
Abstract:
Modern cluster managers orchestrate large-scale services and resources through a set of controllers, each managing a specific part of the cluster by iteratively reconciling the cluster states into the desired states. However, controllers are prone to various state inconsistencies stemming from asynchrony, concurrency, and failures, posing significant challenges in reliable cluster operation. Our analysis of 51 consistency bugs in Kubernetes controllers reveals that many issues remain unresolved, and even proposed fixes are often rejected for backward incompatibility or performance loss.
We present Garen, a system that implements atomic state reconciliation (ASR), a technique that ensures atomicity and consistency of reconciliation to protect the cluster against state inconsistencies. To achieve full-fledged ASR, Garen addresses several challenges. First, to ensure high scalability, Garen detects conflicts only within the minimal set of states involved in the reconciliation process. To pinpoint this dynamically changing set of states, Garen decomposes the reconciliation logic into smaller, conditionally executed blocks. It then confines conflict detection to the states relevant to the blocks involved in a given reconciliation. Moreover, Garen ensures that all state transitions within ASR comply with dynamic cluster constraints by leveraging the dry-run feature to verify state transitions in advance, allowing for atomic execution. For stateful checks that can introduce races, Garen proposes an ASR-based execution model to prevent such races. Lastly, the Garen transpiler automatically instruments existing controllers to use Garen, enhancing usability. In real-world case studies, Garen resolves 18 previously unresolved consistency bugs without manual code changes. It also maintains scalability across various API request rates and cluster sizes, with a latency overhead under 3%.
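The following minimal Python sketch illustrates the general idea described in the abstract: conflict detection confined to the states a reconciliation actually read, followed by an advance check of every write before any is applied. It is not Garen's implementation. The names Store, Txn, reconcile, within_limit, and run_once, the key names, and the replica limit of 100 are all hypothetical, and the validate callback is only a stand-in for the dry-run feature mentioned above.

```python
class Store:
    """Toy versioned key-value store standing in for cluster state (hypothetical)."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def get(self, key):
        return self._data.get(key, (0, None))

    def put(self, key, value):
        version, _ = self.get(key)
        self._data[key] = (version + 1, value)

    def commit(self, read_versions, writes, validate):
        """Apply all writes, or none of them."""
        # Conflict detection covers only the keys this reconciliation read.
        if any(self.get(k)[0] != v for k, v in read_versions.items()):
            return False
        # Stand-in for a dry run: check every write before applying any.
        if not all(validate(k, v) for k, v in writes.items()):
            return False
        for k, v in writes.items():
            self.put(k, v)
        return True


class Txn:
    """Records the versions of the keys read and buffers the writes."""

    def __init__(self, store):
        self.store = store
        self.read_versions = {}
        self.writes = {}

    def read(self, key):
        version, value = self.store.get(key)
        self.read_versions.setdefault(key, version)
        return value

    def write(self, key, value):
        self.writes[key] = value


def reconcile(txn):
    """Reconciliation logic split into conditionally executed blocks."""
    desired = txn.read("spec/replicas")
    actual = txn.read("status/replicas")
    if actual < desired:
        # Scale-up block: the only block that reads the quota.
        quota = txn.read("quota/pods")
        txn.write("status/replicas", min(desired, quota))
    elif actual > desired:
        # Scale-down block: the quota is never read, so it cannot conflict.
        txn.write("status/replicas", desired)


def within_limit(key, value):
    # Assumed cluster constraint: at most 100 replicas.
    return key != "status/replicas" or 0 <= value <= 100


def run_once(store, interleave=None):
    txn = Txn(store)
    reconcile(txn)
    if interleave is not None:
        interleave(store)  # simulate a concurrent update before commit
    return store.commit(txn.read_versions, txn.writes, within_limit)


store = Store()
store.put("spec/replicas", 3)
store.put("status/replicas", 5)
store.put("quota/pods", 10)

# Scale-down while the quota changes concurrently: the quota is outside the read set.
unrelated_ok = run_once(store, lambda s: s.put("quota/pods", 8))

# Scale-up while the spec changes concurrently: the spec is inside the read set.
store.put("spec/replicas", 6)
conflict_ok = run_once(store, lambda s: s.put("spec/replicas", 4))

print(unrelated_ok, conflict_ok, store.get("status/replicas")[1])
```

In this sketch, the first reconciliation commits even though the quota changed underneath it, because the scale-down block never read the quota. The second reconciliation is rejected as a whole because a key in its read set changed before commit, so none of its writes become visible and the caller would retry against fresh state.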