/Architecture

Real Time Presence Platform System Design

tl;dr: “In layman’s terms, the presence status shows whether a particular user is currently online or offline. The presence status is popular on real-time messaging applications and social networking platforms such as LinkedIn, Facebook, and Slack. The presence status represents the availability of the user for communication on a chat application or a social network.”

featured in #408


Tracing Notifications

- Suman Karumuri, George Luong tl;dr: The engineering team at Slack embarked on a project to improve the debugging of notifications. “Debugging notification issues within our systems was difficult because each system had a different logging pipeline and data format, making it necessary to look at data with different formats and backends. This process required deep technical expertise and took several days to complete.”

featured in #407


The Inner Workings Of Distributed Databases

- Alex Pelagenko tl;dr: “We analyze how several popular time-series / OLAP databases implement high availability to highlight the pros and cons of each approach.” Alex also reviews the fundamentals of distributed databases.

featured in #407


Real-time Messaging

- Sameera Thangudu tl;dr: From the engineering team at Slack, “we’ll describe the architecture that we use to send real-time messages at scale. We’ll take a closer look at the services that send the chat messages and various events to these online users in real time.”

featured in #406


You Want Modules, Not Microservices

- Ted Neward tl;dr: Ted dissects the concept of a microservice to “get to the real root of what's going on,” arguing there's a mismatch between its promise and what it actually delivers.

featured in #402


Pull The Andon Cord

- Taylor Pearson tl;dr: The Andon Cord was a rope that hung in Toyota factories and could instantly stop all work on the assembly line; workers were encouraged to pull it when they saw an issue. Once it was pulled, a manager came down to look at the issue, but the worker who pulled the rope was the one who came up with the solution. This process had two benefits: (1) It made workers feel trusted and part of the company’s output. (2) It dramatically increased quality, as workers had a lot of tacit knowledge that managers didn’t.

featured in #401


Automating Safe, Hands-Off Deployments

- Clare Liguori tl;dr: “In this article, we walk through the steps a code change goes through in a pipeline at Amazon on its way to production. A typical continuous delivery pipeline has four major phases - source, build, test, and production. We’ll dive into the details of what happens in each of these pipeline phases for a typical AWS service, and provide you with an example of how a typical AWS service team might set up one of their pipelines.”

featured in #401


Keeping The Cloudflare API 'All Green' Using Python-Based Testing

- Elie Mitrani tl;dr: This article discusses Scout, an automated system that runs Python tests to verify the end-to-end behavior of Cloudflare’s APIs. Scout evaluates APIs in production-like environments, green-lights production deployments, and monitors the behavior of APIs in production. The article dives deep into how it operates.

featured in #399


How Discord Stores Trillions Of Messages

- Bo Ingram tl;dr: “Our Cassandra cluster exhibited serious performance issues that required increasing amounts of effort to just maintain, not improve.” Bo discusses the troubles with Cassandra and the migration to ScyllaDB, a Cassandra-compatible database written in C++.

featured in #396


Scaling Media Machine Learning At Netflix

tl;dr: Netflix’s goal in building ML infrastructure is to reduce the time from ideation to productization for the company. The team built infrastructure to (1) Access and process media data (e.g. video, image, audio, and text). (2) Train large-scale models efficiently. (3) Productize models in a self-serve fashion. (4) Store and serve model outputs for consumption.

featured in #396