Evolution Of ML Fact Store

- Vivek Kaushal tl;dr: "This post will focus on the large volume of high-quality data stored in Axion — our fact store that is leveraged to compute ML features offline. We built Axion primarily to remove any training-serving skew and make offline experimentation faster. We will share how its design has evolved over the years and the lessons learned while building it."

featured in #321

Rapid Event Notification System at Netflix

- Ankush Gulati, David Gevorkyan tl;dr: "In this blog post, we will give an overview of the Rapid Event Notification System at Netflix and share some of the learnings we gained along the way." Authors cover design decisions, architecture, observability and more.

featured in #300

The Four Innovation Phases Of Netflix’s Trillions Scale Real-time Data Infrastructure

- Zhenzhong Xu tl;dr: "I hope this post will help platform engineers develop their cloud-native, self-serve streaming data platforms and scale use cases across many business functions (not necessarily from our success but maybe more from our failures)."

featured in #294

Designing Netflix

- Ankit Sirmorya tl;dr: Ankit guides us through the architecture plan for a Netflix-style application, planning for the following scale: 100 million registered active users, 2500 MB uploaded every minute, 10 combinations of resolution and codec formats supported, 3 videos watched daily.

featured in #290

Fixing Performance Regressions Before They Happen

tl;dr: “This post describes how the Netflix TVUI team implemented a robust strategy to quickly and easily detect performance anomalies before they are released — and often before they are even committed to the codebase.”

featured in #288

OOPS Writeups

- Lorin Hochstein tl;dr: An Operational Surprise (OOPS) is when something unexpected happens in operations, presenting an opportunity to discover how the observed system behavior deviated from the mental model of how the system is supposed to behave. The template shared in this post is based on the template used at Netflix.

featured in #274

Building Confidence In A Decision

tl;dr: "This is the fifth post in a multi-part series on how Netflix uses A/B tests to inform decisions and continuously innovate on our products." This post covers helpful questions to ask when thinking through whether a test is conclusive enough.

featured in #270

What is an A/B Test?

tl;dr: How A/B tests are run at Netflix: the emphasis is on "building intuition." The post covers the basics of an A/B test, "why it’s important to run an A/B test versus rolling out a feature and looking at metrics pre- and post- making a change, and how we turn an idea into a testable hypothesis."

featured in #254

Edgar: Solving Mysteries Faster With Observability

- Elizabeth Carretto tl;dr: "Edgar helps Netflix teams troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata." A run-through of how it works.

featured in #204

Growth Engineering at Netflix - Accelerating Innovation

tl;dr: Insight into the signup architecture at Netflix. An overview of the UX on mobile / TV and how the stack is configured to support this flow. If you're paywalled, click the link in this tweet.

featured in #147