October 28, 2025

In Search of an Understandable Consensus Algorithm (Extended Version)

Raft is a consensus algorithm designed as an alternative to Paxos that prioritizes understandability while producing equivalent results with comparable efficiency. It decomposes consensus into three subproblems (leader election, log replication, and safety) and relies on a strong leader to simplify the algorithm. The leader replicates log entries to followers and commits them once they are safely replicated, and the protocol supports safe cluster membership changes. Raft has been shown to be easier to learn and implement than Paxos, and open-source implementations are used by several companies. Its emphasis on understandability, correctness, and performance makes it a practical foundation for building systems.
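
As a rough illustration of what "safely replicated and committed" means in Raft, the sketch below shows how a leader might advance its commit index: an entry counts as committed once a majority of servers store it and it belongs to the leader's current term. The function name, argument names, and list-based log representation are hypothetical choices made for this example, not taken from the paper or any particular implementation.

```python
from typing import List


def advance_commit_index(
    match_index: List[int],   # highest log index known to be stored on each server (leader included), 1-based
    log_terms: List[int],     # log_terms[i - 1] is the term of the leader's entry at index i
    current_term: int,
    commit_index: int,
) -> int:
    """Return the new commit index for a leader (illustrative sketch only)."""
    majority = len(match_index) // 2 + 1
    # After sorting in descending order, the value at position majority - 1
    # is the largest index stored on at least a majority of servers.
    candidate = sorted(match_index, reverse=True)[majority - 1]
    # A leader only commits entries from its own term by counting replicas;
    # earlier entries become committed indirectly along with them.
    if candidate > commit_index and log_terms[candidate - 1] == current_term:
        return candidate
    return commit_index
```

For example, in a five-server cluster with `match_index = [5, 5, 4, 2, 1]`, index 4 is the highest entry stored on three servers, so it becomes committed if it was written in the leader's current term.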

October 28, 2025

Root cause analysis from AWS

The Amazon DynamoDB service disruption in the Northern Virginia (US-EAST-1) Region on October 19 and 20, 2025, had three distinct periods of impact on customer applications. It was triggered by a latent defect in DynamoDB's automated DNS management system, which caused increased API error rates, and was resolved by restoring the DNS information. The disruption also affected Amazon EC2 instance launches, which saw increased errors and latencies because of failures in the DropletWorkflow Manager system; recovery involved re-establishing leases with droplets and propagating network configurations. The Network Load Balancer service experienced connection errors caused by health check failures, which were resolved by disabling automatic health check failovers. Other AWS services, including AWS Lambda, Amazon Elastic Container Service, and AWS Security Token Service, were also impacted and recovered once their specific issues were addressed. AWS is implementing changes to prevent similar events and to improve service availability.

Amazon DynamoDB DNS management race condition endpoint resolution Availability Zones time to recovery. ROOT_CAUSE_ANALYSIS

October 22, 2025

Google Pro Tip Use Back Of The Envelope Calculations

This article explains how back-of-the-envelope calculations can be used to compare design alternatives, using the example of generating an image results page with 30 thumbnails. Following advice from Google's Jeff Dean, it stresses estimating performance from commonly known numbers and thought experiments. It contrasts a serial design with a parallel one, showing how knowledge of basic system performance metrics informs design decisions, and concludes that monitoring and measuring system components is essential for accurate projections.

Back-of-the-envelope calculations Best Design Performance estimation Distributed systems System design Measurement
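
The serial-versus-parallel comparison described in the summary above can be sketched as a few lines of arithmetic. The figures used here (10 ms per disk seek, 30 MB/s sequential read throughput, 256 KB per image) are assumed illustrative values and do not appear in the summary; the function names are likewise hypothetical. The parallel estimate also assumes that all reads can proceed at once without contending for the same disk.

```python
# Assumed illustrative numbers; substitute measured values for a real system.
SEEK_MS = 10.0          # time for one disk seek, in milliseconds
DISK_MB_PER_S = 30.0    # sequential read throughput, in MB/s
IMAGE_KB = 256.0        # size of one image, in KB
THUMBNAILS = 30         # thumbnails on the results page


def read_ms(size_kb: float, mb_per_s: float = DISK_MB_PER_S) -> float:
    """Time to read size_kb kilobytes sequentially, in milliseconds."""
    return size_kb / 1024.0 / mb_per_s * 1000.0


def serial_ms(n: int = THUMBNAILS) -> float:
    """Serial design: seek and read each image one after another."""
    return n * (SEEK_MS + read_ms(IMAGE_KB))


def parallel_ms() -> float:
    """Parallel design: all reads issued at once, so the page waits for
    roughly one seek plus one image read (assumes no disk contention)."""
    return SEEK_MS + read_ms(IMAGE_KB)


if __name__ == "__main__":
    print(f"serial:   {serial_ms():.0f} ms")
    print(f"parallel: {parallel_ms():.0f} ms")
```

Under these assumptions the serial design lands in the hundreds of milliseconds while the parallel design stays in the tens, which is the kind of order-of-magnitude gap such an estimate is meant to expose before anything is built.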

October 16, 2025

Branching in a Sapling Monorepo

Sapling is an open-source source control system used at Meta for managing a large monorepo. The system introduces directory branching as a solution to the challenges of managing multiple versions of code in a monorepo. Directory branching allows for branching at the directory level, enabling cherry-picking and merging changes between directories while maintaining a linear commit graph at the monorepo level. This approach addresses scalability issues associated with full-repo branching and provides a flexible solution for managing code versions. The system has been well-received by engineering teams at Meta, with various use cases identified for adopting directory branching. Future plans include integrating Git repositories into the monorepo using a lightweight migration mechanism.

Meta Open Source scalable user-friendly open-source source control system monorepo CI Git migrations repository migration directory branching implementation branching workflows