Essential Reading For Engineering Leaders

From Big Data To Better Data: Ensuring Data Quality With Verity

- Michael McPhillips

Scale
Data

tl;dr: Michael emphasizes that "data quality is paramount for accurate insights," highlighting the challenge of ensuring data reliability. He introduces Verity, Lyft’s in-house data quality platform, whose workflow starts with three steps: (1) Data Profiling: Verity examines the structure, schema, and content of incoming data to identify potential anomalies and inconsistencies. (2) Customizable Rules Engine: Data experts define data quality rules tailored to their needs, ranging from data format validations to more intricate domain-specific checks. (3) Automated Quality Checks: Once the rules are set, they are applied to incoming data streams, scanning each data point for discrepancies.
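To make the rules-engine idea concrete, here is a minimal Python sketch of user-defined rules applied to incoming records. It is not Verity's API: `Rule`, `run_checks`, the rule names, and the sample ride records are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

Record = dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """A named data quality check (hypothetical; not Verity's API)."""
    name: str
    check: Callable[[Record], bool]  # returns True when the record passes


def run_checks(records: list[Record], rules: list[Rule]) -> dict[str, list[int]]:
    """Apply every rule to every record and collect failing row indexes per rule."""
    failures: dict[str, list[int]] = {rule.name: [] for rule in rules}
    for index, record in enumerate(records):
        for rule in rules:
            if not rule.check(record):
                failures[rule.name].append(index)
    return failures


# Hypothetical rules: one format validation and one domain-specific check.
RULES = [
    Rule("ride_id_present", lambda r: bool(r.get("ride_id"))),
    Rule(
        "fare_non_negative",
        lambda r: isinstance(r.get("fare"), (int, float)) and r["fare"] >= 0,
    ),
]

# Made-up sample data; row 1 has an empty ID and row 2 a negative fare.
SAMPLE = [
    {"ride_id": "a1", "fare": 12.5},
    {"ride_id": "", "fare": 8.0},
    {"ride_id": "c3", "fare": -4.0},
]

if __name__ == "__main__":
    print(run_checks(SAMPLE, RULES))
```

Keeping each rule as a plain named predicate means new checks can be added without touching the code that scans the data.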

featured in #457

How Pinterest Scaled To 11 million Users With Only 6 Engineers

Scale
Architecture

tl;dr: In 2012, Pinterest reached 11.7 million monthly users with just six engineers. The article chronicles Pinterest's journey from its launch in 2010 with a single engineer to its rapid growth. Key lessons include using proven technologies, keeping architecture simple, and avoiding over-complication. Pinterest faced challenges like data corruption due to clustering and had to pivot to more reliable technologies like MySQL and Memcached. By January 2012, they simplified their stack, removing less-proven technologies and focusing on manual database sharding for scalability.
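As a rough illustration of what manual database sharding can look like, the Python sketch below routes a user ID to a logical shard and then to a database host through a hand-maintained table. The shard count, host names, and `locate` function are assumptions made for this example and are not taken from Pinterest's actual design.

```python
# Hypothetical setup: 8 logical shards spread over 2 MySQL hosts.
NUM_LOGICAL_SHARDS = 8

# Hand-maintained mapping from logical shard to physical host.
# Shards 0-3 live on "mysql-0", shards 4-7 on "mysql-1".
LOGICAL_TO_HOST = {shard: f"mysql-{shard // 4}" for shard in range(NUM_LOGICAL_SHARDS)}


def locate(user_id: int) -> tuple[int, str]:
    """Return the (logical shard, host) pair that stores this user's rows."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    shard = user_id % NUM_LOGICAL_SHARDS
    return shard, LOGICAL_TO_HOST[shard]


if __name__ == "__main__":
    for uid in (13, 16):
        print(uid, locate(uid))
```

Because the application owns the mapping, moving a logical shard to a new host is an edit to `LOGICAL_TO_HOST` and does not require rehashing user IDs.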

featured in #454

Storage Challenges In The Evolution Of Database Architecture

- Sujay Venaik

tl;dr: “Sync service has been running since 2014, and we started facing issues related to physical storage on the database layer. For context, sync service runs on an AWS RDS Aurora cluster that has a single primary writer node and 3-4 readers, all of which are r6g.8xlarge. AWS RDS has a physical storage size limit of 128TiB for each RDS cluster… We were hovering around ~95TB, and our rate of ingestion was ~2TB per month. At this rate, we realized we would see ingestion issues in another 6-8 months.” The team devised a three-pronged strategy: eliminating unused tables, revising their append-only tables approach, and methodically freeing up space from sizable tables. This strategy successfully reclaimed about 60TB of space.

featured in #452

How Instagram Scaled To 14 million Users With Only 3 Engineers

- Leonardo Creed

tl;dr: Instagram scaled from 0 to 14 million users within a year (October 2010 to December 2011) with three engineers. The success was attributed to three guiding principles: simplicity, not reinventing the wheel, and using proven technologies. The article provides a detailed walkthrough of the tech stack. Instagram relied on AWS, using EC2 and Ubuntu Linux, with the frontend developed in Objective-C. They used Amazon’s Elastic Load Balancer, Django for the backend, PostgreSQL for data storage, Amazon S3 for photo storage, and Redis and Memcached for caching.

featured in #449

Building A ShopifyQL Code Editor

- Trevor Harmon

Scale
DeepDive

tl;dr: “This approach enabled us to provide ShopifyQL features to CodeMirror while continuing to maintain a grammar that serves both client and server. The custom adapter we created allows us to pass a ShopifyQL query to the language server, adapt the response, and return a Lezer parse tree to CodeMirror, making it possible to provide features like syntax highlighting, code completion, linting, and tooltips. Because our solution utilizes CodeMirror’s internal parse tree, we are able to make better decisions in the code and craft a stronger editing experience. The ShopifyQL code editor helps merchants write ShopifyQL and get access to their data in new and delightful ways.”

featured in #448

Keeping Figma Fast

- Slava Kim, Laurel Woods

tl;dr: The article traces how Figma evolved its performance testing system as the company scaled. Initially, Figma used a single MacBook for all its in-house performance testing. However, as the codebase grew more complex and the team expanded, this approach became unsustainable. The article outlines the challenges Figma faced, such as the need for more granular performance tests and the limitations of running tests on a single piece of hardware. To address these issues, Figma adopted a two-system approach: a cloud-based system for mass testing and a hardware system for more targeted, precise tests. Both systems are connected by the same Continuous Integration system and aim to catch performance regressions early in the development cycle.

featured in #444

8 Reasons Why WhatsApp Was Able To Support 50 Billion Messages A Day With Only 32 Engineers

Scale
Management

tl;dr: (1) Single responsibility principle. (2) Tech stack. Erlang provides scale with a tiny footprint. (3) Leveraged robust open source and third party libraries. (4) A huge emphasis was given to cross-cutting concerns to improve quality. (5) Diagonal scaling to keep the costs and operational complexity low. (6) Critical aspects were measured so bottlenecks were identified and eliminated quickly. (7) Load testing was performed to identify single points of failure. (8) Communication paths between engineers were kept short.

featured in #443

Bottleneck: Resilience And Observability

- Punit Lad, Carl Nygard

tl;dr: The authors delve into the intricacies of resilience and observability in the context of rapidly scaling systems. As systems expand, their complexity can lead to potential failures. Resilience isn't about averting these failures but adeptly managing them. Observability is pivotal for comprehending system behavior, with its three foundational pillars: Metrics, Logs, and Traces. The authors also highlight challenges posed by the vast data volume in observability and the role of automation.

featured in #442

The Perils Of Migrating A Large-Scale Service At Uber

Migration
Scale

tl;dr: The article details Uber's migration of its invoice generation service, highlighting challenges and lessons learned. The initial service was written in Python and faced scalability issues due to early design choices, accumulated technical debt and a legacy software stack. The new service was developed in Go, chosen for its speed and flexibility. The migration strategy adopted was component-based, focusing on individual system components rather than entire flows. The migration led to a 97% reduction in computing requirements and enhanced self-serve capabilities, reducing engineers' support work from 60% to under 20%.

featured in #442

Optimizing Speed On eBay.com

- Addy Osmani

Performance
Scale

tl;dr: Optimizations include: (1) Search Results Optimization: By sending the first 10 item images along with the header, eBay ensures quicker downloads, reducing the download start time for search result images. (2) Edge Caching for autosuggestion data: suggestions in the search box are cached and served from a CDN, reducing network latency and server processing time. (3) Edge caching for unrecognized homepage users: Content for unrecognized users is cached on eBay's edge network, allowing first-time users to receive content from a nearby server, reducing network latency and server processing time.
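The edge-caching idea in (2) and (3) can be sketched as a small time-to-live cache placed in front of an origin lookup. This Python example is a simplified stand-in and not eBay's or any CDN's implementation: `TTLCache`, the `fetch` callback, and the 60-second lifetime are assumptions chosen for illustration.

```python
import time
from typing import Callable


class TTLCache:
    """Serve cached responses until they expire, then refetch from the origin."""

    def __init__(
        self,
        fetch: Callable[[str], list[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # hypothetical origin call (e.g. suggestion service)
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, list[str]]] = {}

    def get(self, key: str) -> list[str]:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]          # cache hit: no origin round trip
        value = self._fetch(key)     # cache miss or expired entry: go to the origin
        self._entries[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    def origin(prefix: str) -> list[str]:
        # Made-up suggestion data standing in for a real backend.
        return [prefix + "hone", prefix + "ad"]

    cache = TTLCache(origin, ttl_seconds=60)
    print(cache.get("ip"))  # first call reaches the origin
    print(cache.get("ip"))  # second call is served from the cache
```

Repeated requests for the same key within the lifetime skip the origin entirely, which is where the reduction in server processing time comes from.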

featured in #439

/Scale