Essential Reading For Engineering Leaders

- Marc Brooker

Database

tl;dr: “I believe that one of the things that’s holding back databases as an engineering discipline is a lack of good benchmarks, especially ones available at the design stage. The gold standard is designing for and benchmarking against real application workloads, but there are some significant challenges achieving this ideal.” Marc discusses an approach to develop benchmarks that shine light on a database’s design decisions.

featured in #493

How Uber Serves Over 40 Million Reads Per Second From Online Storage Using An Integrated Cache

Cache
Database

tl;dr: “Docstore is Uber’s in-house, distributed database built on top of MySQL. Storing tens of PBs of data and serving tens of millions of requests/second, it is one of the largest database engines at Uber used by microservices from all business verticals. Docstore users and use cases are growing, and so are the request volume and data footprint. This post discusses the challenges serving applications that require low-latency read access and high scalability.

featured in #491

The Billion Row Challenge (1BRC) - Step-By-Step From 71s To 1.7s

- Marko Topolnik

Database
Optimization

tl;dr: “The main thing I'd like to show you in this post is that a good part of that amazing speed comes from easy-to-grasp, reusable tricks that you could apply in your code as well. Towards the end, I'll also show you some of the magical parts that take it beyond that level.”

featured in #491

Let's Talk About Joins

- Crystal Lewis

Database

tl;dr: “In general, there are two ways to link our data, horizontally or vertically. When linking or joining data horizontally we are matching rows by one or more variables (i.e., keys), making a wider dataset. When joining vertically, column names are matched and datasets are stacked on top of each other, making a longer dataset. Joins can be done in many different programs (e.g., SQL, R, Stata, SAS). Most of this post will be applicable to any language, but examples in R will be provided.”

featured in #481

The One Billion Row Challenge

- Gunnar Morling

Database

tl;dr: "Your mission, should you decide to accept it, is deceptively simple: write a Java program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station. There’s just one caveat: the file has 1,000,000,000 rows!"

featured in #477

Building A Faster Hash Table For High Performance SQL Joins

- Andrei Pechkurov

Performance
Database

tl;dr: Andrei delves into QuestDB’s unique hash table, FastMap, designed to enhance SQL execution for JOIN and GROUP BY operations. FastMap employs open addressing and linear probing, optimized for high performance in database environments. It supports variable-size keys and fixed-size values, facilitating efficient data handling and updates. Notably, FastMap operates off-heap, reducing garbage collection pressure to improve performance.

featured in #475

Database Fundamentals

- Tony Solomonik

Database

tl;dr: "I tried thinking which database I should choose for my next project, and came to the realization that I don't really know the differences of databases enough. I went to different database websites and saw mostly marketing and words I don't understand. This is when I decided to read the excellent books Database Internals by Alex Petrov and Designing Data-Intensive Applications by Martin Kleppmann. The books piqued my curiosity enough to write my own little database I called dbeel. This post is basically a short summary of these books, with a focus on the fundamental problems a database engineer thinks about in the shower."

featured in #474

The Case Of A Curious SQL Query

- Justin Jaffray

tl;dr: Justin provides a deep dive into SQL's foundational aspects, highlighting the importance of a formalized approach to query behavior. Using the example of "predicate pushdown," Justin presents a SQL query that behaves differently across databases like DuckDB, SQLite, and CockroachDB. "I think it's a fun little mind bender that gives you some insight into the internals of these databases query engines without having to actually look at any code."

featured in #460

Storage Challenges In The Evolution Of Database Architecture

- Sujay Venaik

tl;dr: “Sync service has been running since 2014, and we started facing issues related to physical storage on the database layer. For context, sync service runs on an AWS RDS Aurora cluster that has a single primary writer node and 3-4 readers, all of which are r6g.8xlarge. AWS RDS has a physical storage size limit of 128TiB for each RDS cluster… We were hovering around ~95TB, and our rate of ingestion was ~2TB per month. At this rate, we realized we would see ingestion issues in another 6-8 months.” The team devised a three-pronged strategy: eliminating unused tables, revising their append-only tables approach, and methodically freeing up space from sizable tables. This strategy successfully reclaimed about 60TB of space.

featured in #452

Upsert In SQL

- Anton Zhiyanov

Database

tl;dr: Anton discusses the "Upsert" operation in SQL, which inserts new records and updates existing ones. The author provides interactive examples and demonstrates how different Database Management Systems handle upserts, including MySQL, SQLite and PostgreSQL.

featured in #452

/Database