Introduction To Streaming For Data Scientists
tl;dr: "With luck you shouldn’t have to build or maintain a streaming system yourself. Your company should have infrastructure to help you with this. However, understanding where streaming is useful and why streaming is hard could help you evaluate the right tools and allocate sufficient resources for your needs."
featured in #342
Data Mesh — A Data Movement and Processing Platform @ Netflix
tl;dr: "As the system evolves to solve more and more use cases, we have expanded its scope to handle not only the CDC use cases but also more general data movement and processing use cases:" (1) Events can be sourced from more generic applications. (2) Catalog of available DB connectors is growing. (3) More processing patterns such as filter, projection, union, join, etc.
featured in #341
Stop Aggregating Away The Signal In Your Data
tl;dr: "Aggregation is the standard best practice for analyzing time series data, but it can create problems by stripping away crucial context so that you’re not even aware of how much potential insight you’ve lost. In this article, I’ll start by discussing how aggregation can be problematic, before walking through three specific alternatives to aggregation with before / after examples."
featured in #339
Organizing And Scaling An Effective Data Team
tl;dr: The scope of a data team should include: (1) Ensuring focus on the right hierarchy of input & output metrics. (2) Steering the roadmap through insightful analysis & research. (3) Driving optimization through experimentation and ML. (4) Developing and maintaining data infrastructure. Rob outlines how the data team, and its function within a startup, should evolve as the startup grows.
featured in #302
Algorithms For Decision Making
tl;dr: "This book provides a broad introduction to algorithms for decision making under uncertainty. We cover a wide variety of topics related to decision making, introducing the underlying mathematical problem formulations and the algorithms for solving them."
featured in #299
On Owning A Software Problem
tl;dr: What is a low-friction small thing that most will not notice, but that when they do, is a sign of craftsmanship, expertise, and pride in one's work? Vicki has created a list relevant for ML and Data Science: (1) Python code has type annotations. (2) Accurate documentation of a repo and an easy, reproducible way to run the project. (3) Formatted and linted SQL statements. And more.
featured in #293
Data To Engineers Ratio: US vs Europe
tl;dr: "The median data to engineers ratio for the US companies I looked at is 1:7 compared to 1:4 for European companies. And the design to engineers ratio is 1:9 for both groups. This post gives some answers to why this is but also leaves some questions unanswered."
featured in #282
What Is The Right Level Of Specialization? For Data Teams And Anyone Else
tl;dr: The specialization of data teams into many different roles (e.g. data scientist, data engineer, analytics engineer, ML engineer) is "generally a bad thing driven by the fact that tools are bad and too hard to use." The author elaborates on this stance in the post.
featured in #255
The Unknown Features of Python’s Operator Module
tl;dr: Python's operator module "might not seem so useful, but with help of just a few of these functions you can make your code faster, more concise, more readable and more functional."
featured in #241
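As a rough illustration of the operator module item above, here is a minimal sketch of a few standard-library functions that can stand in for small lambdas. The sample data is hypothetical and made up for this example; it is not taken from the linked article.

```python
from functools import reduce
from operator import itemgetter, methodcaller, mul

# Hypothetical sample data, invented for illustration only.
rows = [("carol", 31), ("alice", 25), ("bob", 28)]

# itemgetter(1) replaces `lambda row: row[1]` as a sort key.
by_age = sorted(rows, key=itemgetter(1))

# itemgetter(0) pulls the first field out of each tuple.
names = list(map(itemgetter(0), by_age))

# methodcaller("upper") calls .upper() on each element.
shouted = list(map(methodcaller("upper"), names))

# mul stands in for `lambda a, b: a * b` when folding a list.
product = reduce(mul, [1, 2, 3, 4], 1)

print(by_age, names, shouted, product)
```

Whether these are faster or more readable than the equivalent lambdas depends on the codebase; the sketch only shows the substitution.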
How I Beat the Berlin Rental Market With A Python Script
tl;dr: Gian runs us through an analytics tool that evaluates units in the Berlin rental market. The output analyzes each unit's price relative to its neighbor, the likelihood of a price increase for that unit, and the likelihood of the unit selling out within 6 hours.
featured in #229