/Data

You Don’t Have To Sacrifice Streaming Data Performance To Cut Cloud Costs

tl;dr: Redpanda is faster and more efficient than Apache Kafka… but how much faster exactly? We ran 200+ hours of benchmarks to find out how both platforms perform for various workloads and hardware configurations. Here’s our breakdown of how Redpanda achieves 10x the performance while cutting cloud spend by over $500k.

featured in #405


Balancing Quality And Coverage With Our Data Validation Framework

- Alexey Sanko tl;dr: Dropbox had a data validation problem, and this post discusses how they implemented a new quality check system in their big data pipelines that achieves a “balance of simplicity and coverage - providing good quality data, without being needlessly difficult or expensive to maintain.”

featured in #397


Big Data Is Dead

- Jordan Tigani tl;dr: This post will make the case that the era of Big Data is over. It had a good run, but now we can stop worrying about data size and focus on how we’re going to use it to make better decisions. I’ll show a number of graphs; these are all hand-drawn based on memory. If I did have access to the exact numbers, I wouldn’t be able to share them. But the important part is the shape, rather than the exact values.

featured in #390


SQL Should Be Your Default Choice For Data Engineering Pipelines

- Robin Linacre tl;dr: "SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable. A new SQL engine - DuckDB - makes SQL competitive with other high performance dataframe libraries, making SQL a good candidate for data of all sizes."

featured in #387


How DoorDash Secures Data Transfer Between Cloud And On-Premise Data Centers

- Roger Zeng tl;dr: "In this post, we will discuss how we established a secure, stable, and resilient private network connection between DoorDash microservices and our vendor’s on-premise data centers by leveraging the network facilities from our cloud provider, AWS."

featured in #384


McDonald’s Event-Driven Architecture: The Data Journey And How It Works

- Vamshi Krishna Komuravalli, Damian Sullivan tl;dr: Here is a typical data flow of how events are reliably produced and consumed from the platform: (1) Initially, an event schema is defined and registered in the schema registry. (2) Applications that need to produce events leverage the producer SDK to publish events. (3) When an application starts up, an event schema is cached in the producing application for high performance. The authors continue to discuss how data flows through the system.

featured in #380


Reverse Engineering TikTok's VM Obfuscation (Part 1)

tl;dr: "The platform has implemented various methods to make it difficult for reverse-engineers to understand exactly what data is being collected and how it is being used. Analyzing the call stack of a request made on tiktok can begin to paint the picture for us."

featured in #379


The Difficult Life Of The Data Lead

- Mikkel Dengsøe tl;dr: "My take on what’s the most common root cause for the strain on data managers, is that it’s most often with stakeholders. They are not deliberately being difficult (I hope) and often have good intentions to push for their own business goals. But many stakeholders don’t know how to work with data people. In high-growth companies you often have stakeholders coming from all kinds of backgrounds." Mikkel elaborates in this post.

featured in #353


Stop Aggregating Away The Signal In Your Data

- Zan Armstrong tl;dr: "Aggregation is the standard best practice for analyzing time series data, but it can create problems by stripping away crucial context so that you’re not even aware of how much potential insight you’ve lost. In this article, I’ll start by discussing how aggregation can be problematic, before walking through three specific alternatives to aggregation with before / after examples."

featured in #339


Data Teams Are Getting Larger, Faster

tl;dr: "But something happens when a data team grows past 10 people. You no longer know if the data you use is reliable, the lineage is too large to make sense of and end-users start complaining about data issues every other day." Mikkel Dengsøe discusses how to deal with scaling teams.

featured in #334