Architecture

uVitals – An Anomaly Detection & Alerting System

- Venki Appiah, Komal Raulkar tl;dr: "Every day, millions of people rely on Uber to move from place to place and have food and groceries delivered. Uber depends on the reliability of its internal systems and the accuracy of data to power its platform. A glitch in its systems can result in a poor user experience and/or a loss in revenue. Major system issues that affect the reliability of our services are detected and mitigated quickly. However, there are several minor issues that take a longer time to detect and mitigate. Such minor issues can collectively result in poor user experiences and revenue loss over time. This is where uVitals comes in, as it surfaces these issues and anomalies when they begin to occur."
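
As a rough illustration of the idea of surfacing anomalies in a metric as soon as they begin to occur, here is a minimal rolling z-score detector over a time series. It is a generic sketch, not uVitals' actual detection logic; the window and threshold values are arbitrary.

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(values, window=30, threshold=3.0):
    """Flag points that deviate strongly from the recent rolling baseline.

    Generic z-score sketch, not uVitals' actual algorithm.
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        if len(history) >= 10:  # need enough points for a stable baseline
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append((i, value))
        history.append(value)
    return anomalies

# Example: a mostly flat signal with one sudden spike.
signal = [100 + (i % 5) for i in range(50)] + [250] + [100 + (i % 5) for i in range(20)]
print(detect_anomalies(signal))  # -> [(50, 250)]
```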

featured in #476


How Meta Built The Infrastructure For Threads

tl;dr: The article gives examples of two existing components that played an important architectural role in building Threads: (1) ZippyDB, a distributed key/value datastore that provides scalability and flexibility across data centers. (2) Async, an asynchronous serverless function platform that processes trillions of function calls daily across over 100,000 servers. Async defers computing to off-peak hours and shortens the path from solution conception to production deployment by handling deployment complexities such as queueing, scheduling, scaling, and disaster recovery, which lets developers focus on business logic.
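
To make the deferral idea concrete, here is a toy sketch of queueing non-urgent work so it only runs once its scheduled time arrives. It is illustrative only and bears no relation to Async's real API; the job names and class names are invented.

```python
import heapq
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass(order=True)
class DeferredJob:
    run_at: float                      # earliest time the job may run
    fn: Callable = field(compare=False)

class DeferredQueue:
    """Toy version of deferring non-urgent work to a later (off-peak) time.

    Illustrative only; Meta's Async platform handles queueing, scheduling,
    scaling, and disaster recovery across over 100,000 servers.
    """
    def __init__(self):
        self._heap: list[DeferredJob] = []

    def submit(self, fn: Callable, delay_seconds: float = 0.0) -> None:
        heapq.heappush(self._heap, DeferredJob(time.time() + delay_seconds, fn))

    def run_due_jobs(self) -> None:
        now = time.time()
        while self._heap and self._heap[0].run_at <= now:
            heapq.heappop(self._heap).fn()

queue = DeferredQueue()
queue.submit(lambda: print("send follow notification"))          # run as soon as possible
queue.submit(lambda: print("recompute recommendations"), 3600)   # deferred by an hour
queue.run_due_jobs()  # only the first job runs now
```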

featured in #475


The Architecture Of Serverless Data Systems

- Jack Vanlightly tl;dr: Jack delves into the evolving landscape of serverless, multi-tenant data architectures. He highlights the increasing prevalence of these systems, such as Google’s BigQuery and Amazon’s DynamoDB, and their diverse implementations across various workloads. The article discusses common patterns like disaggregated architectures, where storage and compute are separated, and the challenges of balancing resource sharing with tenant isolation. Jack also explores the nuances of managing heat (load balancing) and achieving high resource utilization in these systems, emphasizing the importance of efficient hardware use while maintaining solid performance and isolation.
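
As a simple illustration of heat management, the sketch below moves tenant partitions off an overloaded node onto the coolest one. It is a generic toy, not how BigQuery, DynamoDB, or any system Jack discusses actually balances load; the node names, loads, and threshold are invented.

```python
def rebalance_heat(nodes: dict[str, dict[str, float]], threshold: float = 0.8):
    """Move tenant partitions off overloaded nodes onto the coolest node.

    `nodes` maps node -> {tenant: load fraction}. A generic illustration of
    heat management in multi-tenant systems, not any vendor's algorithm.
    """
    moves = []
    while True:
        totals = {node: sum(tenants.values()) for node, tenants in nodes.items()}
        hottest = max(totals, key=totals.get)
        coolest = min(totals, key=totals.get)
        if totals[hottest] <= threshold or hottest == coolest:
            break
        # Candidate: the hottest node's largest tenant.
        tenant = max(nodes[hottest], key=nodes[hottest].get)
        load = nodes[hottest][tenant]
        if totals[coolest] + load >= totals[hottest]:
            break  # moving it would not reduce the imbalance
        nodes[coolest][tenant] = nodes[hottest].pop(tenant)
        moves.append((tenant, hottest, coolest))
    return moves

cluster = {
    "node-a": {"tenant-1": 0.5, "tenant-2": 0.4},  # 0.9 total: over the threshold
    "node-b": {"tenant-3": 0.2},
}
print(rebalance_heat(cluster))  # -> [('tenant-1', 'node-a', 'node-b')]
```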

featured in #469


How We Organise Our Very Large Python Monolith

tl;dr: "I work on Kraken: a Python application which has, at last count, 27,637 modules. Yes, you read that right: nearly 28k separate Python files - not including tests. I do this along with 400 other developers worldwide, constantly merging in code. And all anyone needs to make a change - and kick start a deployment of the software that runs 17 different energy and utility companies, with many millions of customers - is one single approval from a colleague on Github." The author shares how this is an effective way of working and the monolith’s structure. 

featured in #466


The Architecture Behind A One-Person Tech Startup

- Anthony Simon tl;dr: "This is a long-form post breaking down the setup I use to run a SaaS. From load balancing to cron job monitoring to payments and subscriptions. There's a lot of ground to cover, so buckle up. As grandiose as the title of this article might sound, I should clarify we’re talking about a low-stress, one-person company that I run from my flat here in Germany. It's fully self-funded, and I like to take things slow. It's probably not what most people imagine when I say 'tech startup'."

featured in #465


Building Modern Web Applications: 5 Essential Frontend Architecture Principles

- Patrick Roos tl;dr: Principles you should always follow: (1) Load resources async or deferred, and consider the critical path. (2) Tree-shake, bundle consciously, and eliminate dead code. (3) Define and respect a performance budget. Principles you should follow when possible: (4) Stick to web platform APIs and web standards. (5) Use new-generation frontend frameworks. Patrick discusses each in this post.

featured in #464


How Pinterest Scaled To 11 Million Users With Only 6 Engineers

tl;dr: In 2012, Pinterest reached 11.7 million monthly users with just six engineers. The article chronicles Pinterest's journey from its launch in 2010 with a single engineer to its rapid growth. Key lessons include using proven technologies, keeping architecture simple, and avoiding over-complication. Pinterest faced challenges like data corruption due to clustering and had to pivot to more reliable technologies like MySQL and Memcached. By January 2012, they simplified their stack, removing less-proven technologies and focusing on manual database sharding for scalability.
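
As a minimal sketch of manual, ID-based sharding of the kind described, the snippet below maps a user ID to a fixed shard and that shard to a database host. The shard count and host names are invented, not Pinterest's actual scheme.

```python
# Minimal sketch of manual, application-level sharding (illustrative names
# and counts; not Pinterest's actual scheme).
SHARD_COUNT = 4096  # fixed up front so existing IDs never move
SHARD_TO_HOST = {shard: f"mysql-{shard // 512}.internal" for shard in range(SHARD_COUNT)}

def shard_for(user_id: int) -> int:
    """Pick a stable shard for a user; all of the user's data lives together."""
    return user_id % SHARD_COUNT

def connection_host(user_id: int) -> str:
    return SHARD_TO_HOST[shard_for(user_id)]

print(shard_for(11_000_000))        # -> 2240
print(connection_host(11_000_000))  # -> mysql-4.internal
```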

featured in #454


Storage Challenges In The Evolution Of Database Architecture

- Sujay Venaik tl;dr: “Sync service has been running since 2014, and we started facing issues related to physical storage on the database layer. For context, sync service runs on an AWS RDS Aurora cluster that has a single primary writer node and 3-4 readers, all of which are r6g.8xlarge. AWS RDS has a physical storage size limit of 128TiB for each RDS cluster… We were hovering around ~95TB, and our rate of ingestion was ~2TB per month. At this rate, we realized we would see ingestion issues in another 6-8 months.” The team devised a three-pronged strategy: eliminating unused tables, revising their append-only tables approach, and methodically freeing up space from sizable tables. This strategy successfully reclaimed about 60TB of space.
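
To show the shape of the capacity math and the "largest tables first" triage behind such a strategy, here is a small sketch. The table names, sizes, and growth figures are purely hypothetical, not the team's real numbers or tooling.

```python
TIB = 1024**4  # tebibyte, matching the 128 TiB per-cluster storage limit in AWS RDS

def months_of_runway(current_bytes: float, limit_bytes: float, growth_bytes_per_month: float) -> float:
    """Rough runway estimate: how long until the cluster hits its storage cap."""
    return (limit_bytes - current_bytes) / growth_bytes_per_month

def reclaim_candidates(table_sizes: dict[str, float], top_n: int = 3) -> list[tuple[str, float]]:
    """Rank the largest tables first; they dominate any space-reclamation effort."""
    return sorted(table_sizes.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical figures, purely to show the shape of the calculation.
tables = {"events_archive": 38 * TIB, "sync_log": 22 * TIB, "user_state": 9 * TIB}
print(reclaim_candidates(tables))
print(round(months_of_runway(current_bytes=110 * TIB, limit_bytes=128 * TIB,
                             growth_bytes_per_month=2 * TIB), 1), "months")  # -> 9.0 months
```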

featured in #452


How Instagram Scaled To 14 Million Users With Only 3 Engineers

- Leonardo Creed tl;dr: Instagram scaled from 0 to 14 million users within a year (October 2010 to December 2011) with three engineers. The success was attributed to three guiding principles: simplicity, not reinventing the wheel, and using proven technologies. The article provides a detailed walkthrough of the tech stack. Instagram relied on AWS, using EC2 and Ubuntu Linux, with the frontend developed in Objective-C. They utilized Amazon’s Elastic Load Balancer, Django for the backend, PostgreSQL for data storage, and Amazon S3 for photo storage, with caching handled by Redis and Memcached.
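
As a small illustration of the caching layer in a stack like this, here is a generic cache-aside read with an in-memory dict standing in for Redis/Memcached and a stubbed database call. It sketches the pattern only; it is not Instagram's code.

```python
import time

cache: dict[str, tuple[float, dict]] = {}   # in-memory stand-in for Memcached/Redis
CACHE_TTL_SECONDS = 60

def fetch_user_from_db(user_id: int) -> dict:
    # Stand-in for a PostgreSQL query; returns a fake row for the sketch.
    return {"id": user_id, "username": f"user{user_id}"}

def get_user(user_id: int) -> dict:
    """Cache-aside read: try the cache, fall back to the database, repopulate."""
    key = f"user:{user_id}"
    entry = cache.get(key)
    if entry and time.time() - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]                      # cache hit
    user = fetch_user_from_db(user_id)       # cache miss: hit the database
    cache[key] = (time.time(), user)         # repopulate with a timestamp
    return user

print(get_user(42))  # miss -> database
print(get_user(42))  # hit  -> cache
```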

featured in #449


Death By A Thousand Microservices

tl;dr: “It’s a simple question - what problem are you solving? Is it scale? How do you know how to break it all up for scale and performance? Do you have enough data to show what needs to be a separate service and why? Distributed systems are built for size and resilience. Can your system scale and be resilient at the same time? What happens if one of the services goes down or comes to a crawl? Just scale it up? What about the other services that are going to get hit with traffic? Did you war-game the endless permutations of things that can and will go wrong? Is there back pressure? Circuit breakers? Queues? Jitter? Sensible timeouts on every endpoint? Are there fool-proof guards to make sure a simple change does not bring everything down? The knobs you need to be aware of and tune are endless, and they are all specific to your system’s particular signature of usage and load.“

featured in #449