/Data Science

Introduction To Streaming For Data Scientists

- Chip Huyen tl;dr: "With luck you shouldn’t have to build or maintain a streaming system yourself. Your company should have infrastructure to help you with this. However, understanding where streaming is useful and why streaming is hard could help you evaluate the right tools and allocate sufficient resources for your needs."

featured in #342


Data Mesh — A Data Movement and Processing Platform @ Netflix

tl;dr: "As the system evolves to solve more and more use cases, we have expanded its scope to handle not only the CDC use cases but also more general data movement and processing use cases:" (1) Events can be sourced from more generic applications. (2) The catalog of available DB connectors is growing. (3) More processing patterns are supported, such as filter, projection, union, and join.

featured in #341


Stop Aggregating Away The Signal In Your Data

- Zan Armstrong tl;dr: "Aggregation is the standard best practice for analyzing time series data, but it can create problems by stripping away crucial context so that you’re not even aware of how much potential insight you’ve lost. In this article, I’ll start by discussing how aggregation can be problematic, before walking through three specific alternatives to aggregation with before / after examples."

featured in #339


Organizing And Scaling An Effective Data Team

- Rob Dearborn tl;dr: The scope of a data team should include: (1) Ensuring focus on the right hierarchy of input & output metrics. (2) Steering the roadmap through insightful analysis & research. (3) Driving optimization through experimentation and ML. (4) Developing and maintaining data infrastructure. Rob outlines how the data team, and its function within a startup, should evolve as the startup grows.

featured in #302


Algorithms For Decision Making

- Mykel Kochenderfer, Tim Wheeler, Kyle Wray tl;dr: "This book provides a broad introduction to algorithms for decision making under uncertainty. We cover a wide variety of topics related to decision making, introducing the underlying mathematical problem formulations and the algorithms for solving them."

featured in #299


On Owning A Software Problem

- Vicki Boykis tl;dr: What is a low-friction small thing that most will not notice, but that when they do, is a sign of craftsmanship, expertise, and pride in one's work? Vicki has created a list relevant for ML and Data Science: (1) Python code has type annotations. (2) Accurate documentation of a repo and an easy, reproducible way to run the project. (3) Formatted and linted SQL statements. And more.

featured in #293


Data To Engineers Ratio: US vs Europe

- Mikkel Dengsøe tl;dr: "The median data to engineers ratio for the US companies I looked at is 1:7 compared to 1:4 for European companies. And the design to engineers ratio is 1:9 for both groups. This post gives some answers to why this is but also leaves some questions unanswered."

featured in #282


What Is The Right Level Of Specialization? For Data Teams And Anyone Else

- Erik Bernhardsson tl;dr: The specialization of data teams into many different roles (data scientist, data engineer, analytics engineer, ML engineer, etc.) is "generally a bad thing driven by the fact that tools are bad and too hard to use." He elaborates on this stance in the post.

featured in #255


The Unknown Features of Python’s Operator Module

- Martin Heinz tl;dr: Python's operator module "might not seem so useful, but with help of just a few of these functions you can make your code faster, more concise, more readable and more functional."

featured in #241
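
As a rough illustration of the kind of helpers the operator module offers, here is a minimal sketch. The sample data and variable names are made up for this example and are not taken from the article.

```python
from functools import reduce
from operator import itemgetter, methodcaller, mul

# Hypothetical sample data: (name, age) pairs.
rows = [("carol", 35), ("alice", 41), ("bob", 29)]

# itemgetter(1) stands in for `lambda row: row[1]` as a sort key.
by_age = sorted(rows, key=itemgetter(1))

# methodcaller("upper") calls .upper() on each name.
names = [methodcaller("upper")(name) for name, _ in rows]

# mul is the function form of `*`, which pairs naturally with reduce.
product = reduce(mul, [1, 2, 3, 4], 1)

print(by_age)
print(names)
print(product)
```

Whether these are actually faster or more readable than the equivalent lambdas depends on the code in question, so treat this as a starting point for experimenting.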


How I Beat the Berlin Rental Market With A Python Script

- Gian Segato tl;dr: Gian walks us through an analytics tool that evaluates units in the Berlin rental market. For each unit, the output compares its price to its neighbor's and estimates the likelihood of a price increase and of the unit selling out within 6 hours.

featured in #229