/Scale

How DoorDash Standardized And Improved Microservices Caching

- Jason Fan & Lev Neiman tl;dr: DoorDash's expanding microservices architecture led to challenges in interservice traffic and caching. The article details how DoorDash addressed these challenges by developing a library that standardizes caching and improves performance without altering existing business logic. Key features include layered caches, runtime feature-flag control, and observability with cache shadowing. The authors also provide guidance on when to use caching.
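The layered-cache idea can be pictured as a read that checks each tier in order (local memory first, then a shared tier), backfills faster tiers on a hit, and only falls through to the original business logic on a full miss. A minimal Python sketch — the class and method names are invented, and this is not DoorDash's actual library:

```python
from typing import Callable


class LayeredCache:
    """Check cache tiers fastest-first; backfill faster tiers on a hit.

    Hypothetical sketch: plain dicts stand in for a local cache and a
    shared cache such as Redis.
    """

    def __init__(self, layers):
        # layers: list of dict-like tiers, ordered fastest to slowest
        self.layers = layers

    def get(self, key: str, fallback: Callable[[], str]) -> str:
        for i, layer in enumerate(self.layers):
            if key in layer:
                value = layer[key]
                # Backfill the faster tiers so the next read hits earlier
                for faster in self.layers[:i]:
                    faster[key] = value
                return value
        # Miss in every tier: run the original business logic once
        value = fallback()
        for layer in self.layers:
            layer[key] = value
        return value


local, shared = {}, {}
cache = LayeredCache([local, shared])
cache.get("user:42", lambda: "loaded-from-db")  # miss -> runs fallback
```

Because the fallback is passed in as a callable, existing call sites can be wrapped without changing their business logic, which mirrors the article's design goal.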

featured in #459


How GitHub Indexes Code For Blazing Fast Search & Retrieval

- Shivang Sarawagi tl;dr: “The search engine supports global queries across 200 million repos and indexes code changes in repositories within minutes. The code search index is by far the largest cluster that GitHub runs, comprising 5184 vCPUs, 40TB of RAM, and 1.25PB of backing storage, supporting a query load of 200 requests per second on average and indexing over 53 billion source files.”

featured in #458


Executing Cron Scripts Reliably At Scale

- Claire Adams tl;dr: Claire discusses the challenges of executing cron scripts reliably within large-scale infrastructure. “The Job Queue is an asynchronous compute platform that runs about 9 billion “jobs” or pieces of work per day.” Claire provides insights into techniques such as distributed execution, retries, and monitoring to ensure the dependable execution of cron jobs at scale, highlighting the need for a systematic approach to handling failures effectively.
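One of the techniques mentioned, retries, is commonly implemented with exponential backoff plus jitter so that many failing jobs don't all retry at the same instant. An illustrative sketch — the function names are invented and this is not the Job Queue's implementation:

```python
import random
import time


def run_with_retries(job, max_attempts=3, base_delay=0.1):
    """Run a flaky job, retrying with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure for monitoring
            # Exponential backoff with full jitter avoids thundering herds
            time.sleep(base_delay * (2 ** attempt) * random.random())


calls = {"n": 0}

def flaky_job():
    # Simulated transient failure: fails twice, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "done"

result = run_with_retries(flaky_job, max_attempts=3, base_delay=0.01)
```

Re-raising after the final attempt matters: a retry wrapper that swallows the last error hides exactly the failures that monitoring needs to see.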

featured in #457


From Big Data To Better Data: Ensuring Data Quality With Verity

- Michael McPhillips tl;dr: Michael emphasizes that "data quality is paramount for accurate insights," highlighting the challenge of ensuring data reliability. Michael introduces Lyft’s in-house data quality platform, Verity, whose flow starts with the following steps: (1) Data Profiling: incoming data is scrutinized for its structure, schema, and content, allowing Verity to identify potential anomalies and inconsistencies. (2) Customizable Rules Engine: enables data experts to define data quality rules tailored to their unique needs, encompassing everything from data format validations to more intricate domain-specific checks. (3) Automated Quality Checks: once the rules are set, they are applied to incoming data streams, scanning each data point for discrepancies.
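The rules engine in step (2) can be pictured as a list of named predicates applied to every record, with failures reported by rule name. A hypothetical sketch — Verity's real rule definitions and APIs differ, and the rules below are invented examples:

```python
def check_record(record, rules):
    """Apply each (name, predicate) pair; return the names of failed rules."""
    return [name for name, rule in rules if not rule(record)]


# Invented example rules: a format validation and a domain-specific check
rules = [
    ("id_is_int", lambda r: isinstance(r.get("id"), int)),
    ("fare_nonnegative", lambda r: r.get("fare", 0) >= 0),
]

check_record({"id": 1, "fare": 9.5}, rules)    # -> []
check_record({"id": "x", "fare": -2}, rules)   # -> ['id_is_int', 'fare_nonnegative']
```

Expressing rules as data (name plus predicate) is what makes the engine "customizable": teams can register their own checks without touching the engine itself.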

featured in #457


How Pinterest Scaled To 11 Million Users With Only 6 Engineers

tl;dr: In 2012, Pinterest reached 11.7 million monthly users with just six engineers. The article chronicles Pinterest's journey from its launch in 2010 with a single engineer to its rapid growth. Key lessons include using proven technologies, keeping architecture simple, and avoiding over-complication. Pinterest faced challenges like data corruption due to clustering and had to pivot to more reliable technologies like MySQL and Memcached. By January 2012, they simplified their stack, removing less-proven technologies and focusing on manual database sharding for scalability.
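Manual sharding of the kind mentioned above often embeds the shard id directly inside each generated object id, so a lookup can route straight to the right MySQL host without a separate directory service. A sketch with illustrative bit widths — not Pinterest's exact layout:

```python
# Illustrative 64-bit id layout: | shard (16) | type (10) | local id (38) |
SHARD_BITS, TYPE_BITS, LOCAL_BITS = 16, 10, 38


def pack_id(shard_id: int, type_id: int, local_id: int) -> int:
    """Embed the shard in the object id so routing needs no lookup table."""
    return (shard_id << (TYPE_BITS + LOCAL_BITS)) | (type_id << LOCAL_BITS) | local_id


def shard_of(obj_id: int) -> int:
    """Recover the shard from any object id."""
    return obj_id >> (TYPE_BITS + LOCAL_BITS)


pin_id = pack_id(shard_id=3, type_id=1, local_id=12345)
shard_of(pin_id)  # -> 3
```

The trade-off versus hash-based sharding is that records never move once created, which keeps operations simple at the cost of planning shard capacity up front.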

featured in #454


Storage Challenges In The Evolution Of Database Architecture

- Sujay Venaik tl;dr: “Sync service has been running since 2014, and we started facing issues related to physical storage on the database layer. For context, sync service runs on an AWS RDS Aurora cluster that has a single primary writer node and 3-4 readers, all of which are r6g.8xlarge. AWS RDS has a physical storage size limit of 128TiB for each RDS cluster… We were hovering around ~95TB, and our rate of ingestion was ~2TB per month. At this rate, we realized we would see ingestion issues in another 6-8 months.” The team devised a three-pronged strategy: eliminating unused tables, revising their append-only tables approach, and methodically freeing up space from sizable tables. This strategy successfully reclaimed about 60TB of space.
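The third prong, freeing space from sizable tables, is typically done in small batches so a single long-running DELETE doesn't hold locks or balloon transaction state. A sketch using SQLite for self-containment — the article's cluster is Aurora, and the table, schema, and predicate here are invented:

```python
import sqlite3


def delete_in_batches(conn, table, predicate_sql, batch_size=1000):
    """Delete matching rows in fixed-size batches, committing between batches.

    Illustrative sketch: table/predicate are interpolated for brevity, which
    is fine for a demo but not for untrusted input.
    """
    deleted = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE {predicate_sql} LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # short transactions keep locks and logs small
        if cur.rowcount == 0:
            return deleted
        deleted += cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, created_year INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, 2014 + i % 10) for i in range(5000)],
)
freed = delete_in_batches(conn, "events", "created_year < 2018", batch_size=500)
```

Note that on many storage engines deleting rows does not immediately shrink physical storage; reclaiming disk space usually needs a follow-up step such as rebuilding or optimizing the table.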

featured in #452


How Instagram Scaled To 14 Million Users With Only 3 Engineers

- Leonardo Creed tl;dr: Instagram scaled from 0 to 14 million users within a year (October 2010 to December 2011) with three engineers. The success was attributed to three guiding principles: simplicity, not reinventing the wheel, and using proven technologies. The article provides a detailed walkthrough of the tech stack. Instagram relied on AWS, using EC2 and Ubuntu Linux, with the frontend developed in Objective-C. They utilized Amazon’s Elastic Load Balancer, Django for the backend, PostgreSQL for data storage, Amazon S3 for photo storage, and Redis and Memcached for caching.

featured in #449


Building A ShopifyQL Code Editor

- Trevor Harmon tl;dr: “This approach enabled us to provide ShopifyQL features to CodeMirror while continuing to maintain a grammar that serves both client and server. The custom adapter we created allows us to pass a ShopifyQL query to the language server, adapt the response, and return a Lezer parse tree to CodeMirror, making it possible to provide features like syntax highlighting, code completion, linting, and tooltips. Because our solution utilizes CodeMirror’s internal parse tree, we are able to make better decisions in the code and craft a stronger editing experience. The ShopifyQL code editor helps merchants write ShopifyQL and get access to their data in new and delightful ways.”

featured in #448


Keeping Figma Fast

- Slava Kim & Laurel Woods tl;dr: The article chronicles Figma's journey in evolving its performance testing system as the company scaled. Initially, Figma used a single MacBook for all its in-house performance testing, but as the codebase grew more complex and the team expanded, this approach became unsustainable. The article outlines the challenges Figma faced, such as the need for more granular performance tests and the limitations of running tests on a single piece of hardware. To address these issues, Figma adopted a two-system approach: a cloud-based system for mass testing and a hardware system for more targeted, precise tests. Both systems are connected by the same Continuous Integration system and aim to catch performance regressions early in the development cycle.

featured in #444


8 Reasons Why WhatsApp Was Able To Support 50 Billion Messages A Day With Only 32 Engineers

tl;dr: (1) Single responsibility principle. (2) Tech stack. Erlang provides scale with a tiny footprint. (3) Leveraged robust open source and third party libraries. (4) A huge emphasis was given to cross-cutting concerns to improve quality. (5) Diagonal scaling to keep the costs and operational complexity low. (6) Critical aspects were measured so bottlenecks were identified and eliminated quickly. (7) Load testing was performed to identify single points of failure. (8) Communication paths between engineers were kept short.

featured in #443