/Distributed System

A Distributed Systems Reading List

- Fred Hebert tl;dr: “This document contains various resources and quick definition of a lot of background information behind distributed systems. It is not complete, even though it is kinda sorta detailed. I had written it some time in 2019 when coworkers at the time had asked for a list of references, and I put together what I thought was a decent overview of the basics of distributed systems literature and concepts.”

featured in #487


Optimism vs Pessimism In Distributed Systems

- Marc Brooker tl;dr: Marc weighs optimistic against pessimistic assumptions in distributed systems. Optimistic designs assume an operation will succeed and skip up-front coordination, while pessimistic designs coordinate first to prevent potential conflicts. He illustrates both with distributed caches, Optimistic Concurrency Control, and leases. By identifying and categorizing these assumptions, developers can better understand and optimize system behavior.

featured in #459
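
Optimistic Concurrency Control, one of the examples named in the entry above, can be illustrated with a small sketch. This is not code from Marc's article: the `VersionedStore` class, the `optimistic_update` helper, and the retry limit are hypothetical names chosen for illustration. The idea shown is that a writer reads a value with its version, computes the update without holding a lock, and commits only if the version is unchanged, retrying otherwise.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    """Hypothetical in-memory store that tracks a version per key."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Missing keys are reported as version 0 with no value.
        return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, value):
        current_version, _ = self._data.get(key, (0, None))
        if current_version != expected_version:
            # Someone else committed first; the optimistic guess was wrong.
            raise VersionConflict(f"{key}: expected v{expected_version}, found v{current_version}")
        new_version = current_version + 1
        self._data[key] = (new_version, value)
        return new_version


def optimistic_update(store, key, fn, max_retries=5):
    """Apply fn to the current value, retrying if another writer got in first."""
    for _ in range(max_retries):
        version, value = store.read(key)
        try:
            return store.compare_and_set(key, version, fn(value))
        except VersionConflict:
            continue  # re-read and try again
    raise VersionConflict(f"{key}: gave up after {max_retries} attempts")


store = VersionedStore()
optimistic_update(store, "counter", lambda v: (v or 0) + 1)
optimistic_update(store, "counter", lambda v: (v or 0) + 1)
print(store.read("counter"))
```

A pessimistic version of the same update would acquire a lock before reading, which pays the coordination cost on every call rather than only when a conflict actually occurs.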


Hints for Distributed Systems Design

- Murat Demirbas tl;dr: “I have seen these hints successfully applied in distributed systems design throughout my 25 years in the field, starting from the theory of distributed systems, immersing into the practice of wireless sensor networks, and working on cloud computing systems both in the academia and industry ever since. These heuristic principles have been applied knowingly or unknowingly and has proven useful. I didn't invent any of these hints. These are collective products of distributed systems researchers and practitioners over many decades.”

featured in #455


Clocks And Causality - Ordering Events In Distributed Systems

- Giridhar Manepalli tl;dr: In distributed systems, logical clocks play a key role in the ordering of system events. What are the various logical clock designs, and how do they help with event ordering? This article answers these questions.

featured in #403
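
As a concrete instance of a logical clock, here is a minimal sketch of a Lamport clock. It is not taken from the article above, and the class and method names are assumptions made for illustration. Each process increments its counter on every local event, stamps outgoing messages with the counter, and on receipt advances to one more than the larger of its own counter and the message timestamp.

```python
class LamportClock:
    """Minimal Lamport logical clock for a single process."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event advances the clock by one.
        self.time += 1
        return self.time

    def send(self):
        # Sending is itself an event; the returned value is the message timestamp.
        return self.tick()

    def receive(self, message_time):
        # Jump past both the local clock and the sender's timestamp.
        self.time = max(self.time, message_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()                 # local event on process a
stamp = a.send()         # a sends a message to b
received_at = b.receive(stamp)
print(stamp, received_at)
```

The receive rule guarantees that a message's receipt is always timestamped later than its send, so causally related events are ordered consistently. The converse does not hold: a smaller timestamp does not by itself imply that one event caused another.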


Fallacies Of Distributed Systems

tl;dr: "Fallacies of distributed systems are a set of assertions made by L Peter Deutsch and others at Sun Microsystems describing false assumptions that programmers new to distributed applications invariably make."

featured in #371


The Distributed Computing Manifesto

- Werner Vogels tl;dr: "Today, I am publishing the Distributed Computing Manifesto, a canonical document from the early days of Amazon that transformed the architecture of Amazon’s e-commerce platform. It highlights the challenges we were facing at the end of the 20th century, and hints at where we were headed."

featured in #370


Resiliency In Distributed Systems

- Gergely Orosz tl;dr: "Understanding the ins and outs of distributed systems is important for both backend engineers and for anyone working with large-scale systems. Large-scale systems can mean systems with high load and high queries per second (QPS), storing a large amount of data, or ones built with low latency and high reliability. These systems are pretty common across both Big Tech and high-growth startups."

featured in #355


Fallacies Of Distributed Systems

- Mahdi Yusuf tl;dr: The eight fallacies are: (1) The network is reliable, (2) Latency is zero, (3) Bandwidth is infinite, (4) The network is secure, (5) Topology doesn't change, (6) There is one administrator, (7) Transport cost is zero, (8) The network is homogeneous.

featured in #323


Distributed Systems Shibboleths

- Joseph Lynch tl;dr: "Shibboleths are historically a word that indicate membership in a particular group or culture.... I have only studied and worked in the field for around a decade, but in that time I believe I have learned to recognize some key “distsys shibboleths” that help me recognize when I can trust what a vendor or other engineer is telling me."

featured in #314


Building Robust Distributed Systems

- Kislay Verma tl;dr: "I have written before on this blog about what distributed systems are and how they can give us tremendous scalability at the cost of having to deal with a more complicated system design. Let’s discuss how we can make a distributed system resilient to random failures which get more common as the system gets larger."

featured in #299