/Scale

How Uber Finds Nearby Drivers At 1 Million Requests Per Second

tl;dr: H3 is a hierarchical geospatial indexing system created at Uber. It divides Earth’s surface into hexagonal cells on a flat grid and identifies each cell with a unique 64-bit integer. Uber finds nearby drivers by identifying the cells that cover the rider's area and then listing the drivers in those cells, sorted by estimated time of arrival (ETA). H3 combines the benefits of a hierarchical indexing system with those of a hexagonal grid.
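
The lookup described above (find the cells around the rider, collect the drivers in them, sort by ETA) can be sketched roughly as follows. This is a minimal illustration and not Uber's implementation: it uses a plain square latitude/longitude grid as a stand-in for H3's hexagonal cells, and `DriverIndex`, `cell_id`, `neighbors`, `CELL_SIZE_DEG`, and the `eta_fn` callback are all hypothetical names introduced for the example.

```python
from collections import defaultdict

# Hypothetical cell size in degrees. Real H3 cells are hexagons at fixed
# resolutions; a square grid is used here only to keep the sketch short.
CELL_SIZE_DEG = 0.01


def cell_id(lat, lng):
    """Map a coordinate to a grid cell (stand-in for an H3 cell index)."""
    return (int(lat // CELL_SIZE_DEG), int(lng // CELL_SIZE_DEG))


def neighbors(cell, rings=1):
    """Return the cell itself plus every cell within `rings` steps of it."""
    row, col = cell
    return [
        (row + dr, col + dc)
        for dr in range(-rings, rings + 1)
        for dc in range(-rings, rings + 1)
    ]


class DriverIndex:
    """In-memory index from cell to the drivers currently inside it."""

    def __init__(self):
        self._by_cell = defaultdict(dict)  # cell -> {driver_id: (lat, lng)}
        self._cell_of = {}                 # driver_id -> current cell

    def update(self, driver_id, lat, lng):
        """Record a driver's latest position, moving them between cells."""
        old_cell = self._cell_of.get(driver_id)
        if old_cell is not None:
            self._by_cell[old_cell].pop(driver_id, None)
        new_cell = cell_id(lat, lng)
        self._by_cell[new_cell][driver_id] = (lat, lng)
        self._cell_of[driver_id] = new_cell

    def nearby(self, lat, lng, eta_fn, rings=1):
        """Return driver ids in the surrounding cells, lowest ETA first."""
        rider = (lat, lng)
        candidates = []
        for cell in neighbors(cell_id(lat, lng), rings):
            for driver_id, position in self._by_cell.get(cell, {}).items():
                candidates.append((eta_fn(position, rider), driver_id))
        return [driver_id for _, driver_id in sorted(candidates)]
```

The point of the cell index is that a request only inspects drivers in a handful of cells instead of scanning every driver. The `eta_fn` argument is left to the caller because the summary does not say how ETA is computed at this step.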

featured in #477


uVitals – An Anomaly Detection & Alerting System

- Venki Appiah, Komal Raulkar tl;dr: "Every day, millions of people rely on Uber to move from place to place and have food and groceries delivered. Uber depends on the reliability of its internal systems and the accuracy of data to power its platform. A glitch in its systems can result in a poor user experience and/or a loss in revenue. Major system issues that affect the reliability of our services are detected and mitigated quickly. However, there are several minor issues that take a longer time to detect and mitigate. Such minor issues can collectively result in poor user experiences and revenue loss over time. This is where uVitals comes in, as it surfaces these issues and anomalies when they begin to occur."

featured in #476


How Meta Built The Infrastructure For Threads

tl;dr: The article gives examples of two existing components that played an important architectural role in building Threads: (1) ZippyDB, a distributed key/value datastore that provides scalability and flexibility across data centers. (2) Async, an asynchronous serverless function platform that processes trillions of function calls daily on more than 100,000 servers. Async defers computing to off-peak hours and shortens the time from solution conception to production deployment by handling deployment complexities such as queueing, scheduling, scaling, and disaster recovery. This allowed developers to focus on business logic.

featured in #475


Data Quality Score: The Next Chapter Of Data Quality At Airbnb

- Clark Wright tl;dr: "With 1.4 billion cumulative guest arrivals as of year-end 2022, Airbnb’s growth pushed us to an inflection point where diminishing data quality began to hinder our data practitioners. Weekly metric reports were difficult to land on time. Seemingly basic metrics like “Active Listings” relied on a web of upstream dependencies. Conducting meaningful data work required significant institutional knowledge to overcome hidden caveats in our data." Clark discusses the implementation of a Data Quality Score.

featured in #471


How Uber Computes ETA At Half A Million Requests Per Second

tl;dr: "A single trip usually takes around 1000 ETA requests. Yet computing ETA is a difficult problem. Because the distance between the source and destination is not a straight line. Instead it consists of complex street networks and highways." Engineers split a route into smaller partitions and find the shortest path within each partition, factoring in variables such as traffic.
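
As a rough illustration of the inner step (shortest path within one partition, with traffic factored in), the sketch below runs plain Dijkstra over a small road graph whose edge costs are base travel times scaled by a traffic multiplier. It is an assumption-laden example, not the routing algorithm from the article: the graph layout, the `traffic` multiplier map, and the function name `shortest_travel_time` are all hypothetical, and the partitioning step itself is not shown.

```python
import heapq


def shortest_travel_time(graph, traffic, source, target):
    """Dijkstra over travel times in seconds.

    graph:   {node: [(neighbor, base_seconds), ...]}
    traffic: {(node, neighbor): multiplier}; missing edges default to 1.0
    Returns the lowest total travel time, or None if target is unreachable.
    """
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        elapsed, node = heapq.heappop(queue)
        if node == target:
            return elapsed
        if elapsed > best.get(node, float("inf")):
            continue  # stale queue entry, a faster route was already found
        for neighbor, base_seconds in graph.get(node, []):
            cost = elapsed + base_seconds * traffic.get((node, neighbor), 1.0)
            if cost < best.get(neighbor, float("inf")):
                best[neighbor] = cost
                heapq.heappush(queue, (cost, neighbor))
    return None


# Hypothetical four-intersection partition; times are in seconds.
roads = {
    "A": [("B", 60), ("C", 100)],
    "B": [("D", 60)],
    "C": [("D", 30)],
}
free_flow = shortest_travel_time(roads, {}, "A", "D")
congested = shortest_travel_time(roads, {("A", "B"): 2.0}, "A", "D")
```

With no traffic the route through B wins; doubling the travel time on the A to B segment makes the route through C faster, which is the kind of effect a traffic-aware ETA has to capture.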

featured in #470


Real-Time Analytics For Mobile App Crashes Using Apache Pinot

tl;dr: "At Uber, we have built a system called “Healthline” to help with our Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR) issues and to avoid potential outages and large-scale user impacts. Due to our ability to detect the issues in real time, this has become the go-to tool for release managers to observe the impact of canary release and decide whether to proceed further or to rollback. In this article we will be sharing details on how we are leveraging Apache Pinot™ to achieve this in real time at Uber scale."

featured in #463


Making Sure Your Auth System Can Scale

- James Hickey tl;dr: The balance between authentication security and performance is a perpetual challenge. This article dives into the heart of this issue, emphasizing the trade-off between stringent security practices and system scalability. You'll find practical tips to maintain secure auth while meeting customer demands, and discover strategies to make sure your systems remain secure and efficient.

featured in #462


Switching Build Systems, Seamlessly

- Patrick Balestra tl;dr: Patrick chronicles Spotify's shift to Bazel. The move was driven by the need for a scalable build system for their growing codebase. The transition, which began in earnest in 2020, involved running two build systems side by side, adapting existing tools, and extensive testing. By 2023, the iOS Spotify app was fully built with Bazel, resulting in significant improvements in build times and developer experience.

featured in #461


Automating Dead Code Cleanup

tl;dr: "SCARF contains a subsystem that automatically identifies dead code through a combination of static, runtime, and application analysis. It leverages this analysis to submit change requests to remove this code from our systems. This automated dead code removal improves the quality of our systems and also unblocks unused data removal in SCARF when the dead code includes references to data assets that prevent automated data cleanup."

featured in #460


Maxjourney: Pushing Discord's Limits With A Million+ Online Users In A Single Server

- Yuliy Pisetsky tl;dr: "With that growth, those servers started slowing down and creeping ever closer to their throughput limits. As that’s happened, we’ve continued to find many improvements to keep making them faster and pushing the limits out further. In this post we’ll talk about some of the ways we’ve scaled individual Discord servers from tens of thousands of concurrent users to approaching two million concurrent users in the past few years."

featured in #460