/Scale

Bottleneck: Resilience And Observability

- Punit Lad, Carl Nygard tl;dr: The authors delve into the intricacies of resilience and observability in the context of rapidly scaling systems. As systems expand, their complexity can lead to potential failures. Resilience isn't about averting these failures but adeptly managing them. Observability is pivotal for comprehending system behavior, with its three foundational pillars: Metrics, Logs, and Traces. The authors also highlight challenges posed by the vast data volume in observability and the role of automation.

featured in #442


The Perils Of Migrating A Large-Scale Service At Uber

tl;dr: Details of Uber's journey in migrating its invoice generation service, highlighting challenges and lessons learned. The initial service was written in Python and faced scalability issues due to early design choices, accumulated technical debt and a legacy software stack. The new service was developed in Go, chosen for its speed and flexibility. The migration strategy adopted was component-based, focusing on individual system components rather than entire flows. The migration led to a 97% reduction in computing requirements and enhanced self-serve capabilities, reducing engineers' support work from 60% to under 20%.

featured in #442


Optimizing Speed On eBay.com

- Addy Osmani tl;dr: Optimizations include: (1) Search results optimization: By sending the first 10 item images along with the header, eBay ensures quicker downloads, reducing the download start time for search result images. (2) Edge caching for autosuggestion data: Suggestions in the search box are cached and served from a CDN, reducing network latency and server processing time. (3) Edge caching for unrecognized homepage users: Content for unrecognized users is cached on eBay's edge network, allowing first-time users to receive content from a nearby server, reducing network latency and server processing time.

featured in #439
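
The edge caching of autosuggestion data summarized above can be illustrated with a minimal sketch. This is not eBay's implementation: the class `EdgeSuggestionCache`, the function `hypothetical_origin`, the sample catalog, and the TTL value are all assumptions made for illustration. It only shows the general idea of answering repeated prefixes from a nearby cache and calling the origin server on a miss or after expiry.

```python
import time
from typing import Callable, Dict, List, Tuple


class EdgeSuggestionCache:
    """Toy stand-in for a CDN edge node that caches autosuggest responses."""

    def __init__(
        self,
        origin: Callable[[str], List[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._origin = origin          # hypothetical origin lookup
        self._ttl = ttl_seconds        # how long an entry stays fresh
        self._clock = clock            # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, List[str]]] = {}
        self.origin_calls = 0          # counts trips to the origin

    def get(self, prefix: str) -> List[str]:
        # Normalize the key so " IP " and "ip" share one cache entry.
        key = prefix.strip().lower()
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and cached[0] > now:
            return cached[1]           # fresh hit: no origin round trip
        # Miss or expired entry: ask the origin and store the answer.
        self.origin_calls += 1
        suggestions = self._origin(key)
        self._entries[key] = (now + self._ttl, suggestions)
        return suggestions


def hypothetical_origin(prefix: str) -> List[str]:
    """Assumed origin service: prefix match over a tiny made-up catalog."""
    catalog = ["iphone case", "ipad stand", "lego set", "laptop bag"]
    return [item for item in catalog if item.startswith(prefix)]
```

The injectable clock is a design choice for testability; a real edge network would rely on HTTP cache headers rather than an in-process dictionary.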


In Defense Of Simple Architectures

- Dan Luu tl;dr: Dan discusses the effectiveness of simple architectures in software development, using Wave, a $1.7B company, as an example. Wave's architecture is a Python monolith on top of Postgres, allowing engineers to focus on delivering value to users. The article emphasizes that simple architectures can be created more cheaply and easily than complex ones, even for high-traffic apps. Despite the trend towards complex, microservice-based architectures, Dan argues for the "unreasonable effectiveness" of monoliths, detailing Wave's choices, mistakes, and areas of unavoidable complexity. Simplicity in architecture can lead to success, allowing companies to allocate complexity where it benefits the business.

featured in #439


How We Built The Canva Apps SDK

- Martin Cronjé tl;dr: Martin’s article outlines the development of the Canva Apps SDK, transitioning from a plugin model to a more flexible app-building platform. The process involved building a secure sandboxed environment, creating a new build-and-deploy pipeline, and designing APIs with a focus on simplicity, safety, evolvability, and consistency. Iterative development, continuous feedback, and a balance between alignment and empowerment were key technical strategies in the SDK's creation.

featured in #437


Building And Operating A Pretty Big Storage System Called S3

- Werner Vogels tl;dr: A repost of an article by Andy Warfield, VP of S3, reflecting on the vast complexity and operational scale of Amazon's storage software system. Andy discusses the significance of recognizing and mitigating organizational scaling issues, much as one would optimize a system. He also discusses how management's approach of fostering team ownership of problem-solving, instead of dispensing solutions, has led to more engaged and successful engineering outcomes.

featured in #436


Measuring Performance For iOS Apps At Uber Scale

tl;dr: This article discusses how Uber measures performance, focusing on app startup performance on iOS. Uber monitors critical metrics such as UI flow latency, memory usage, bandwidth, and UI jank. App launch time is highlighted as a crucial industry-standard metric that directly impacts the customer experience.

featured in #431


Meta Developer Tools: Working At Scale

- Neil Mitchell tl;dr: “Every day, thousands of developers at Meta are working in repositories with millions of files. Those developers need tools that help them at every stage of the workflow while working at extreme scale. In this article we’ll go through a few of the tools in the development process. And, as an added bonus, those we talk about below are open source so you can try them yourself.”

featured in #430


Upscaling LinkedIn's Profile Datastore While Reducing Costs

- Estella Pham, Guanlin Lu tl;dr: LinkedIn introduced Couchbase as a centralized storage tier cache to address scaling concerns. Challenges arose due to the cache not being backed by primary storage. The blog post discusses the decision, challenges faced, and solutions employed to achieve high cache hit rate, reduced latencies, and cost savings.

featured in #428
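
As a rough illustration of why cache hit rate matters for a profile store like the one summarized above, here is a minimal cache-aside sketch. It does not reflect LinkedIn's design or the Couchbase API: `CacheAsideReader`, `load_from_store`, and the in-memory dictionary are hypothetical stand-ins. It shows reads served from the cache when possible, falling back to the source-of-truth store on a miss, and explicit invalidation to avoid serving stale data.

```python
from typing import Callable, Dict, Optional


class CacheAsideReader:
    """Reads go to the cache first and fall back to the primary store."""

    def __init__(self, load_from_store: Callable[[str], Optional[dict]]) -> None:
        self._load = load_from_store       # hypothetical primary-store lookup
        self._cache: Dict[str, dict] = {}  # stand-in for a remote cache tier
        self.hits = 0
        self.misses = 0

    def get(self, profile_id: str) -> Optional[dict]:
        if profile_id in self._cache:
            self.hits += 1
            return self._cache[profile_id]
        # Miss: read the source of truth and populate the cache.
        self.misses += 1
        profile = self._load(profile_id)
        if profile is not None:
            self._cache[profile_id] = profile
        return profile

    def invalidate(self, profile_id: str) -> None:
        # Call after a write so the next read refetches fresh data.
        self._cache.pop(profile_id, None)

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Without the `invalidate` call after a write, the reader keeps returning the older cached profile, which is the kind of consistency problem any cache in front of a datastore has to address.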


Migrating Netflix To GraphQL Safely

tl;dr: “Doing this safely for 100s of millions of customers without disruption is exceptionally challenging, especially considering the many dimensions of change involved. This blog post will share broadly-applicable techniques (beyond GraphQL) we used to perform this migration. The three strategies we will discuss today are AB Testing, Replay Testing, and Sticky Canaries.”

featured in #423
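
Of the three strategies named above, replay testing is the easiest to sketch in a few lines. The following is a simplified illustration, not Netflix's tooling: `replay_and_diff`, both handlers, and the request shape are assumptions. The idea shown is to send the same recorded requests to the legacy and the candidate implementation and collect every response that differs.

```python
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]
Handler = Callable[[Request], Any]


def replay_and_diff(
    recorded: List[Request], legacy: Handler, candidate: Handler
) -> List[Dict[str, Any]]:
    """Replay recorded requests against both handlers and report mismatches."""
    mismatches: List[Dict[str, Any]] = []
    for request in recorded:
        expected = legacy(request)     # behavior currently in production
        actual = candidate(request)    # behavior of the migrated code path
        if expected != actual:
            mismatches.append(
                {"request": request, "legacy": expected, "candidate": actual}
            )
    return mismatches


def legacy_handler(request: Request) -> Dict[str, Any]:
    """Hypothetical legacy endpoint: trims whitespace from titles."""
    return {"id": request["id"], "title": request["title"].strip()}


def candidate_handler(request: Request) -> Dict[str, Any]:
    """Hypothetical migrated endpoint that forgot to trim titles."""
    return {"id": request["id"], "title": request["title"]}
```

Replaying in this way is only safe for idempotent, read-only requests; requests with side effects would need a different approach.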