/Observability

So We Shipped An AI Product. Did it Work?

- Phillip Carter tl;dr: “Like many companies, earlier this year we saw an opportunity with LLMs and quickly but thoughtfully started building a capability. About a month later, we released Query Assistant to all customers as an experimental feature. We then iterated on it, using data from production to inform a multitude of additional enhancements, and ultimately took Query Assistant out of experimentation and turned it into a core product offering. However, getting Query Assistant from concept to feature diverted R&D and marketing resources, forcing the question: did investing in LLMs do what we wanted it to do?”

featured in #454


Bottleneck: Resilience And Observability

- Punit Lad, Carl Nygard tl;dr: The authors examine resilience and observability in rapidly scaling systems. As systems expand, their growing complexity makes failures more likely. Resilience isn't about averting these failures but about managing them well. Observability is pivotal for understanding system behavior and rests on three foundational pillars: metrics, logs, and traces. The authors also highlight the challenges posed by the sheer volume of observability data and the role of automation.

featured in #442


Service Delivery Index: A Driver for Reliability

- Matthew McKeen, Ryan Katkov tl;dr: The article introduces the Service Delivery Index – Reliability (SDI-R), a metric designed to measure and drive service reliability at Slack. As the company grew, the need to move from a reactive to a proactive approach to reliability became evident. SDI-R, a composite metric of successful API calls, content delivery, and user workflows, provides a common understanding of reliability across the organization. It helps spot trends, identify regressions, and set customer expectations. The article details how SDI-R evolved, the tools and processes that support it, and the lessons learned.

featured in #440