Genie: Uber’s Gen AI On-Call Copilot
tl;dr: “For building an on-call copilot, we chose between fine-tuning an LLM or leveraging Retrieval-Augmented Generation (RAG). Fine-tuning requires curated data with high-quality, diverse examples for the LLM to learn from. It also requires compute resources to keep the model updated with new examples.”
featured in #558
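The RAG pattern the post contrasts with fine-tuning can be sketched in a few lines: retrieve the most relevant documents for a query, then prepend them to the prompt, so knowledge lives in the retrieval corpus rather than in model weights. The sketch below is illustrative only (not Uber's implementation) and uses naive keyword-overlap scoring where a real system would use vector embeddings; all names are hypothetical.

```python
# Minimal RAG sketch: retrieve top-k docs, then build an augmented prompt.
# Scoring is naive word overlap; production systems use embeddings.

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank docs by shared-word count with the query and keep the top k."""
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Assemble the context-augmented prompt an LLM would receive."""
    context = "\n".join(retrieve(query, docs))
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

# Hypothetical on-call runbook snippets standing in for a real corpus.
runbooks = [
    "Restart the payments service when queue depth exceeds 10k.",
    "Rotate TLS certs every 90 days via the cert-manager job.",
    "Page the storage on-call if disk usage passes 85 percent.",
]
prompt = build_prompt("What do I do when the payments queue is backed up?",
                      runbooks)
```

Because the corpus can be re-indexed at any time, new runbooks are picked up without retraining — the trade-off the tl;dr highlights.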
Making Uber’s Experiment Evaluation Engine 100x Faster
tl;dr: “How we made efficiency improvements to Uber’s Experimentation platform to reduce the latencies of experiment evaluations by a factor of 100x (milliseconds to microseconds). We accomplished this by going from a remote evaluation architecture (client to server RPC requests) to a local evaluation architecture (client-side computation). Some of the terminology in this blog post (e.g., parameters, experiments, etc.) is referenced from our previous blog post on Uber Experimentation. To learn more, check out Supercharging A/B Testing at Uber.”
featured in #556
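The core of the remote-to-local shift is that the client holds a periodically synced snapshot of experiment configs and computes bucket assignment in-process, turning a network round trip into a hash computation. A toy sketch under that assumption (config names and the hashing scheme are illustrative, not Uber's actual engine):

```python
# Local experiment evaluation: deterministic bucketing from a config
# snapshot pushed to clients in the background -- no per-evaluation RPC.
import hashlib

CONFIG_SNAPSHOT = {  # hypothetical synced state
    "new_checkout_flow": {"salt": "v3", "treatment_pct": 20},
}

def bucket(experiment: str, user_id: str) -> str:
    """Assign a user to control/treatment entirely client-side."""
    cfg = CONFIG_SNAPSHOT[experiment]
    digest = hashlib.md5(f"{cfg['salt']}:{user_id}".encode()).hexdigest()
    point = int(digest[:8], 16) % 100  # stable 0-99 bucket per user
    return "treatment" if point < cfg["treatment_pct"] else "control"
```

Because the hash is deterministic, every evaluation for the same user and salt lands in the same bucket, so correctness survives the move off the server while latency drops to microseconds.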
Introducing Netflix’s Key-Value Data Abstraction Layer
tl;dr: “In this post, we dive deep into how Netflix’s KV abstraction works, the architectural principles guiding its design, the challenges we faced in scaling diverse use cases, and the technical innovations that have allowed us to achieve the performance and reliability required by Netflix’s global operations.”
featured in #552
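The abstraction the post describes models data as a two-level map: a record id pointing to a sorted map of item keys to values. A minimal in-memory sketch of that shape (illustrative only; Netflix's layer sits over distributed storage backends and this toy version elides all of that):

```python
# Two-level key-value model: record_id -> sorted map of item_key -> value.
from collections import defaultdict

class KVStore:
    """Toy in-memory stand-in for a two-level KV abstraction."""

    def __init__(self):
        self._data = defaultdict(dict)  # record_id -> {item_key: value}

    def put(self, record_id: str, item_key: str, value: bytes) -> None:
        self._data[record_id][item_key] = value

    def get_items(self, record_id: str) -> list[tuple[str, bytes]]:
        """Return the record's items in sorted item-key order."""
        return sorted(self._data[record_id].items())
```

The sorted inner map is what lets one API serve both simple point lookups and ordered range-style reads within a record.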
Should We Decompose Our Monolith?
- Will Larson tl;dr: “Even as popular sentiment has generally turned away from microservices, many engineering organizations have a bit of both, often the remnants of one or more earlier but incomplete migration efforts. This strategy looks at a theoretical organization stuck with a bit of both approaches, let’s call it Theoretical Compliance Company, which is looking to determine its path forward.”
featured in #550
Meet Chrono, Our Scalable, Consistent, Metadata Caching Solution
tl;dr: From the team at Dropbox, “If we wanted to solve our high-volume read QPS problem while upholding our clients’ expectation of read consistency, traditional caching solutions would not work. We needed to find a scalable, consistent caching solution to solve both problems at once. This article discusses Chrono, a scalable, consistent caching system built on top of Dropbox’s key-value storage system.”
featured in #536
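One way to get consistent reads from a cache is to stamp each cached entry with a version and serve it only while that version still matches the authoritative store, falling through to a full read otherwise. The toy model below illustrates that idea under those assumptions; it is not Chrono's actual protocol, and both classes are hypothetical:

```python
# Version-stamped cache reads: a hit is served only if its stored
# version matches the store's current version for that key.

class Store:
    """Toy authoritative store that bumps a version on every write."""
    def __init__(self):
        self._data, self._ver = {}, {}

    def write(self, key, value):
        self._data[key] = value
        self._ver[key] = self._ver.get(key, 0) + 1

    def current_version(self, key):
        return self._ver.get(key, 0)

    def read(self, key):
        return self._data.get(key)

class VersionedCache:
    def __init__(self, store):
        self.store = store
        self.cache = {}  # key -> (version, value)

    def read(self, key):
        version = self.store.current_version(key)  # cheap version check
        hit = self.cache.get(key)
        if hit and hit[0] == version:
            return hit[1]                          # consistent cache hit
        value = self.store.read(key)               # expensive full read
        self.cache[key] = (version, value)
        return value
```

A stale entry can never be returned: any write bumps the version, so the next read misses the cache and refetches, which is the consistency property the tl;dr says traditional caches failed to provide at Dropbox's read volume.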
Odin: Uber’s Stateful Platform
- Jesper Borlum, Gianluca Mezzetti tl;dr: “The Odin platform aims to provide a unified operational experience by encompassing all aspects of managing stateful workloads. These aspects include host lifecycle, workload scheduling, cluster management, monitoring, state propagation, operational user interfaces, alerting, auto-scaling, and automation. Uber deploys stateful systems at global, regional, and zonal levels, and Odin is designed to manage these systems consistently and in a technology-agnostic manner.” This post provides an overview of Odin’s origins, the fundamental principles, and the challenges encountered early on.
featured in #534
Building And Scaling Notion’s Data Lake
tl;dr: “In the past three years Notion’s data has expanded 10x due to user and content growth, with a doubling rate of 6-12 months. Managing this rapid growth while meeting the ever-increasing data demands of critical product and analytics use cases, especially our recent Notion AI features, meant building and scaling Notion’s data lake. Here’s how we did it.”
featured in #533
How Discord Uses Open-Source Tools For Scalable Data Orchestration & Transformation
- Zach Bluhm tl;dr: “Until recently, we’ve been using an in-house orchestration system that’s provided the foundation for Discord’s data analytics over the last five years. As our data organization grew, it became apparent that both self-service and top-notch observability would be key for our ability to effectively scale as a team. The team embraced an ambitious project: to overhaul our data orchestration infrastructure using modern, open-source tools, sharing the candid lessons learned along the way, and how our new system is powering over 2000 dbt tables today.”
featured in #532
How Canva Collects 25 Billion Events Per Day
- Long Nguyen tl;dr: “These use cases are powered by a stream of analytics events at a rate of 25 billion events per day (800 billion events per month), with 99.999% uptime. Our Product Analytics Platform team manages this data pipeline. Their mission is to provide a reliable, ergonomic, and cost-effective way to collect user interaction events and distribute the data to a wide range of destinations for consumption.”
featured in #531