Data

Big Data Is Dead

- Jordan Tigani tl;dr: This post will make the case that the era of Big Data is over. It had a good run, but now we can stop worrying about data size and focus on how we’re going to use it to make better decisions. I’ll show a number of graphs; these are all hand-drawn based on memory. If I did have access to the exact numbers, I wouldn’t be able to share them. But the important part is the shape, rather than the exact values.

featured in #390


SQL Should Be Your Default Choice For Data Engineering Pipelines

- Robin Linacre tl;dr: "SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable. A new SQL engine - DuckDB - makes SQL competitive with other high performance dataframe libraries, making SQL a good candidate for data of all sizes."
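The article's point is that the SQL itself, not the engine, is the interface. As an illustrative sketch of a SQL-first pipeline step, here is a minimal example using Python's built-in sqlite3 module (the article recommends DuckDB; sqlite3 is used here only because it ships with Python, and the table and column names are invented for the example):

```python
import sqlite3

# A tiny "pipeline" step expressed as plain SQL. The transformation
# lives in the query, so it is portable and easy to test.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('alice', 10.0), ('bob', 5.5), ('alice', 4.5);
""")

# A testable transformation: total spend per customer.
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY customer
""").fetchall()

print(rows)  # [('alice', 14.5), ('bob', 5.5)]
```

The same query text would run largely unchanged on DuckDB or a warehouse engine, which is what makes SQL a robust default.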

featured in #387


How DoorDash Secures Data Transfer Between Cloud And On-Premise Data Centers

- Roger Zeng tl;dr: "In this post, we will discuss how we established a secure, stable, and resilient private network connection between DoorDash microservices and our vendor’s on-premise data centers by leveraging the network facilities from our cloud provider, AWS."

featured in #384


McDonald’s Event-Driven Architecture: The Data Journey And How It Works

- Vamshi Krishna Komuravalli & Damian Sullivan tl;dr: Here is a typical data flow of how events are reliably produced and consumed from the platform: (1) Initially, an event schema is defined and registered in the schema registry. (2) Applications that need to produce events leverage producer SDK to publish events. (3) When an application starts up, an event schema is cached in the producing application for high performance. The authors continue to discuss how data flows through the system.
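The three-step flow above can be sketched in a few lines. Note this is an illustrative toy, not McDonald's actual SDK: the class names, topic name, and fields are all invented for the example.

```python
# Sketch of the described flow: register a schema (step 1), publish
# through a producer SDK (step 2), with the schema cached in the
# producing application at startup (step 3).
class SchemaRegistry:
    def __init__(self):
        self._schemas = {}

    def register(self, topic, required_fields):
        self._schemas[topic] = set(required_fields)

    def fetch(self, topic):
        return self._schemas[topic]


class EventProducer:
    def __init__(self, registry, topic):
        # Cache the schema once at startup so publishes avoid a
        # registry round trip on the hot path.
        self._schema = registry.fetch(topic)
        self._topic = topic
        self.published = []

    def publish(self, event):
        # Validate against the cached schema before publishing.
        missing = self._schema - event.keys()
        if missing:
            raise ValueError(f"event missing fields: {missing}")
        self.published.append((self._topic, event))


registry = SchemaRegistry()
registry.register("order-events", ["order_id", "store_id"])

producer = EventProducer(registry, "order-events")
producer.publish({"order_id": "123", "store_id": "A7"})
```

Caching the schema at producer startup trades a small staleness window for per-event performance, which is the design choice the authors highlight.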

featured in #380


Reverse Engineering TikTok's VM Obfuscation (Part 1)

tl;dr: "The platform has implemented various methods to make it difficult for reverse engineers to understand exactly what data is being collected and how it is being used. Analyzing the call stack of a request made on TikTok can begin to paint the picture for us."

featured in #379


The Difficult Life Of The Data Lead

- Mikkel Dengsøe tl;dr: "My take on what’s the most common root cause for the strain on data managers is that it’s most often with stakeholders. They are not deliberately being difficult (I hope) and often have good intentions to push for their own business goals. But many stakeholders don’t know how to work with data people. In high-growth companies you often have stakeholders coming from all kinds of backgrounds." Mikkel elaborates in this post.

featured in #353


Stop Aggregating Away The Signal In Your Data

- Zan Armstrong tl;dr: "Aggregation is the standard best practice for analyzing time series data, but it can create problems by stripping away crucial context so that you’re not even aware of how much potential insight you’ve lost. In this article, I’ll start by discussing how aggregation can be problematic, before walking through three specific alternatives to aggregation with before / after examples."
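The failure mode is easy to demonstrate with a toy time series (this example is mine, not from the article): a short anomaly that the aggregate smooths away while a finer-grained view preserves it.

```python
# A day of hourly readings with one anomalous hour. The daily mean
# looks close to a normal day; the hourly max preserves the signal.
hourly = [100.0] * 24
hourly[13] = 400.0  # one-hour spike

daily_mean = sum(hourly) / len(hourly)
daily_max = max(hourly)

print(daily_mean)  # 112.5 -- barely distinguishable from a normal day
print(daily_max)   # 400.0 -- the signal the mean aggregated away
```

A dashboard showing only the daily mean would never surface the spike, which is the kind of lost context the article's alternatives are designed to recover.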

featured in #339


Data Teams Are Getting Larger, Faster

tl;dr: "But something happens when a data team grows past 10 people. You no longer know if the data you use is reliable, the lineage is too large to make sense of and end-users start complaining about data issues every other day." Mikkel discusses how to deal with scaling teams.

featured in #334


Emerging Architectures For Modern Data Infrastructure

- Matt Bornstein, Jennifer Li & Martin Casado tl;dr: "To help data teams stay on top of the changes happening in the industry, we’re publishing in this post an updated set of data infrastructure architectures. They show the current best-in-class stack across both analytic and operational systems, as gathered from numerous operators we spoke with over the last year."

featured in #304


Why We Switched Our Data Orchestration Service

- Guillaume Perchais tl;dr: "Within Spotify, we run 20,000 batch data pipelines defined in 1,000+ repositories, owned by 300+ teams — daily. The majority of our pipelines rely on two tools: Luigi (Python) and Flo (Java). The data orchestration team decided to move away from these tools, and in this post, the team details why the decision was made, and the journey they took to make the transition."

featured in #301