Data

Big Data Is Dead

- Jordan Tigani tl;dr: This post will make the case that the era of Big Data is over. It had a good run, but now we can stop worrying about data size and focus on how we’re going to use it to make better decisions. I’ll show a number of graphs; these are all hand-drawn based on memory. If I did have access to the exact numbers, I wouldn’t be able to share them. But the important part is the shape, rather than the exact values.

featured in #390


SQL Should Be Your Default Choice For Data Engineering Pipelines

- Robin Linacre tl;dr: "SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable. A new SQL engine - DuckDB - makes SQL competitive with other high performance dataframe libraries, making SQL a good candidate for data of all sizes."
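The article's point is that the SQL itself, not the engine, is the interface. As an illustrative sketch of a SQL-first pipeline step, here is a minimal example using Python's built-in sqlite3 module (the article recommends DuckDB; sqlite3 is used here only because it ships with Python, and the table and column names are invented for the example):

```python
import sqlite3

# A tiny "pipeline" step expressed as plain SQL. The transformation
# lives in the query, so it is portable and easy to test.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('alice', 10.0), ('bob', 5.5), ('alice', 4.5);
""")

# A testable transformation: total spend per customer.
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY customer
""").fetchall()

print(rows)  # [('alice', 14.5), ('bob', 5.5)]
```

The same query text would run largely unchanged on DuckDB or a warehouse engine, which is what makes SQL a robust default.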

featured in #387


How DoorDash Secures Data Transfer Between Cloud And On-Premise Data Centers

- Roger Zeng tl;dr: "In this post, we will discuss how we established a secure, stable, and resilient private network connection between DoorDash microservices and our vendor’s on-premise data centers by leveraging the network facilities from our cloud provider, AWS."

featured in #384


McDonald’s Event-Driven Architecture: The Data Journey And How It Works

- Vamshi Krishna Komuravalli & Damian Sullivan tl;dr: Here is a typical data flow of how events are reliably produced and consumed from the platform: (1) Initially, an event schema is defined and registered in the schema registry. (2) Applications that need to produce events leverage producer SDK to publish events. (3) When an application starts up, an event schema is cached in the producing application for high performance. The authors continue to discuss how data flows through the system.
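The three-step flow above can be sketched in a few lines. Note this is an illustrative toy, not McDonald's actual SDK: the class names, topic name, and fields are all invented for the example.

```python
# Sketch of the described flow: register a schema (step 1), publish
# through a producer SDK (step 2), with the schema cached in the
# producing application at startup (step 3).
class SchemaRegistry:
    def __init__(self):
        self._schemas = {}

    def register(self, topic, required_fields):
        self._schemas[topic] = set(required_fields)

    def fetch(self, topic):
        return self._schemas[topic]


class EventProducer:
    def __init__(self, registry, topic):
        # Cache the schema once at startup so publishes avoid a
        # registry round trip on the hot path.
        self._schema = registry.fetch(topic)
        self._topic = topic
        self.published = []

    def publish(self, event):
        # Validate against the cached schema before publishing.
        missing = self._schema - event.keys()
        if missing:
            raise ValueError(f"event missing fields: {missing}")
        self.published.append((self._topic, event))


registry = SchemaRegistry()
registry.register("order-events", ["order_id", "store_id"])

producer = EventProducer(registry, "order-events")
producer.publish({"order_id": "123", "store_id": "A7"})
```

Caching the schema at producer startup trades a small staleness window for per-event performance, which is the design choice the authors highlight.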

featured in #380


Reverse Engineering TikTok's VM Obfuscation (Part 1)

tl;dr: "The platform has implemented various methods to make it difficult for reverse engineers to understand exactly what data is being collected and how it is being used. Analyzing the call stack of a request made on TikTok can begin to paint the picture for us."

featured in #379


The Difficult Life Of The Data Lead

- Mikkel Dengsøe tl;dr: "My take on what’s the most common root cause for the strain on data managers is that it’s most often with stakeholders. They are not deliberately being difficult (I hope) and often have good intentions to push for their own business goals. But many stakeholders don’t know how to work with data people. In high-growth companies you often have stakeholders coming from all kinds of backgrounds." Mikkel elaborates in this post.

featured in #353


Stop Aggregating Away The Signal In Your Data

- Zan Armstrong tl;dr: "Aggregation is the standard best practice for analyzing time series data, but it can create problems by stripping away crucial context so that you’re not even aware of how much potential insight you’ve lost. In this article, I’ll start by discussing how aggregation can be problematic, before walking through three specific alternatives to aggregation with before / after examples."
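The failure mode is easy to demonstrate with a toy time series (this example is mine, not from the article): a short anomaly that the aggregate smooths away while a finer-grained view preserves it.

```python
# A day of hourly readings with one anomalous hour. The daily mean
# looks close to a normal day; the hourly max preserves the signal.
hourly = [100.0] * 24
hourly[13] = 400.0  # one-hour spike

daily_mean = sum(hourly) / len(hourly)
daily_max = max(hourly)

print(daily_mean)  # 112.5 -- barely distinguishable from a normal day
print(daily_max)   # 400.0 -- the signal the mean aggregated away
```

A dashboard showing only the daily mean would never surface the spike, which is the kind of lost context the article's alternatives are designed to recover.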

featured in #339


Data Teams Are Getting Larger, Faster

tl;dr: "But something happens when a data team grows past 10 people. You no longer know if the data you use is reliable, the lineage is too large to make sense of and end-users start complaining about data issues every other day." Mikkel discusses how to deal with scaling teams.

featured in #334


Emerging Architectures For Modern Data Infrastructure

- Matt Bornstein, Jennifer Li & Martin Casado tl;dr: "To help data teams stay on top of the changes happening in the industry, we’re publishing in this post an updated set of data infrastructure architectures. They show the current best-in-class stack across both analytic and operational systems, as gathered from numerous operators we spoke with over the last year."

featured in #304


Why We Switched Our Data Orchestration Service

- Guillaume Perchais tl;dr: "Within Spotify, we run 20,000 batch data pipelines defined in 1,000+ repositories, owned by 300+ teams — daily. The majority of our pipelines rely on two tools: Luigi (Python) and Flo (Java). The data orchestration team decided to move away from these tools, and in this post, the team details why the decision was made, and the journey they took to make the transition."

featured in #301