/Database

Inside New Query Engine Of MongoDB

- Nikita Lapkov tl;dr: MongoDB has announced a significant overhaul of its Query Execution Engine, and this article looks in depth at the technical side of the change. The previous engine, termed "Classic," was built around JSON documents, which led to inefficiencies in complex queries. The new Slot Based Engine (SBE) introduces "slots" as the means of passing data between execution stages, optimizing the process. Nikita delves into the architecture, data flow, and challenges faced during the transition.

featured in #449


Tumblr Shares Database Migration Strategy With 60+ Billion Rows

tl;dr: The article delves into Tumblr's database migration strategy. With a massive MySQL database spanning 21 terabytes and 60+ billion rows, Tumblr sought a migration approach that minimized user impact. Initially considering a brute force method, they later adopted the CQRS pattern, which separates database read and write operations. To combat latency issues, Tumblr introduced a database proxy in the local data center, which maintained persistent connections to the remote leader and allowed for connection pooling. This strategy ensured minimal user disruption during migration.

featured in #447


Teréga Replaced Its Legacy Data Historian with InfluxDB, AWS, And IO-Base

- Jessica Wachtel tl;dr: Teréga, a French gas company, faced challenges with outdated IT systems. Recognizing a gap in available cloud-native data historians, they turned to InfluxDB. With InfluxDB, they developed Indabox for efficient data collection and IO-Base, hosted on AWS, for robust data storage. This InfluxDB-centric solution significantly modernized Teréga's IT landscape.

featured in #446


This Is How Quora Shards MySQL To Handle 13+ Terabytes

tl;dr: With data storage requirements in the tens of terabytes and 100,000 queries per second, Quora chose MySQL for its read performance. To manage rapid data growth and a high volume of write queries, Quora implemented both vertical and horizontal sharding. Vertical sharding moves different tables to different servers, which improves write scalability. Horizontal sharding, on the other hand, splits one large table into multiple smaller tables. Quora opted to build its own sharding solution instead of using a third-party service, for low latency and easy reuse of existing logic.
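The difference between the two sharding styles can be sketched in a few lines of Python. This is an illustration only, not Quora's implementation: the table names, host names, shard count, and the choice of hash-based routing are all assumptions made for the example.

```python
import hashlib

# Hypothetical layout: table names, host names, and shard count are
# illustrative assumptions, not taken from the article.
VERTICAL_MAP = {"users": "db-host-a", "answers": "db-host-b"}
NUM_SHARDS = 4


def host_for_table(table):
    """Vertical sharding: each whole table lives on one server."""
    return VERTICAL_MAP[table]


def shard_for_key(table, key, num_shards=NUM_SHARDS):
    """Horizontal sharding: one logical table is split into several
    smaller physical tables, chosen here by hashing the row key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % num_shards
    return f"{table}_shard_{index}"
```

Calling `host_for_table("users")` picks a server for the whole table, while `shard_for_key("answers", "42")` picks one of the smaller physical tables for a single row. Hashing is only one possible routing rule; range-based routing on the key is another common choice.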

featured in #445


Fuzz Testing Is the Best Thing To Happen To Our Application Tests

- Andrei Pechkurov tl;dr: The team at QuestDB faced challenges with segfaults, data corruption, and concurrency bugs. To address these, the team adopted fuzz testing, an automated software testing technique that feeds invalid or unexpected data to a program and monitors it for crashes and exceptions. This article details the process of introducing fuzz testing, which revealed critical issues and led to a more robust database. The team also collaborated with SQLancer, a tool for testing SQL Database Management Systems, to uncover issues in their SQL engine.
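As a rough illustration of the technique (not QuestDB's actual test harness), a minimal fuzz loop might look like the sketch below. The `parse_key_value` target, the input alphabet, and the rule that `ValueError` counts as an expected rejection are all hypothetical choices made for this example.

```python
import random


def parse_key_value(line):
    """Hypothetical function under test: parses 'name=123'."""
    name, _, value = line.partition("=")
    if not name or not value:
        raise ValueError("expected name=value")
    return name, int(value)


def fuzz(target, iterations=1000, seed=0):
    """Feed random strings to `target` and collect unexpected exceptions."""
    rng = random.Random(seed)  # fixed seed so any failure is reproducible
    alphabet = "abc=0123456789 -\n\x00"
    failures = []
    for _ in range(iterations):
        length = rng.randint(0, 12)
        data = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            target(data)
        except ValueError:
            pass  # assumed to be the documented way of rejecting bad input
        except Exception as exc:  # anything else is a finding
            failures.append((data, exc))
    return failures
```

`fuzz(parse_key_value)` returns a list of `(input, exception)` pairs for any input that failed in an undocumented way. Seeding the generator keeps each run repeatable, which matters when a failing input has to be replayed while debugging.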

featured in #441


NetApp Leverages Time Series Data for Real-Time Resource Trending and Alerting

tl;dr: NetApp, a leader in cloud data services, storage systems and software, uses time series data for real-time resource trending, SLO/SLI calculations, and alerting. Their SRE team identifies trends in resource consumption for critical Linux servers, DB monitoring, and custom resource monitoring. The team appreciates the high ingest, tool integration, and performance of InfluxDB, their chosen time series database. They also value its integration with Grafana for dashboards and Slack for global team communication.

featured in #440


A Strategic Approach To Replacing Data Historians

- Jason Myers tl;dr: Transitioning from legacy data historians to modern technologies in IoT / OT stacks can be achieved strategically. This involves automating manual processes, managing changes in legacy technology, and considering new tools during equipment upgrades and operational growth. Start small, scale sensibly.

featured in #436


Joins 13 Ways

- Justin Jaffray tl;dr: “Relational inner joins are really common in the world of databases, and one weird thing about them is that it seems like everyone has a different idea of what they are. In this post I’ve aggregated a bunch of different definitions, ways of thinking about them, and ways of implementing them that will hopefully be interesting. They’re not without redundancy, some of them are arguably the same, but I think they’re all interesting perspectives nonetheless.”
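Two of the best-known ways of implementing an inner join, a nested-loop join and a hash join, can be sketched as follows. This is a generic illustration rather than code from the post; the row format (lists of dicts) and the function names are assumptions.

```python
from collections import defaultdict


def nested_loop_join(left, right, key):
    """Compare every left row with every right row: O(len(left) * len(right))."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def hash_join(left, right, key):
    """Build a hash table on one side, then probe it with the other."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]
```

Both functions return the same rows for the same inputs; they differ only in how much work they do to find the matching pairs.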

featured in #429


What Is A Vector Database?

- Roie Schwaber-Cohen tl;dr: This post reviews key aspects of a vector database — how it works, algorithms it uses, and the additional features that make it operationally ready for production scenarios.

featured in #422


You Don't Always Need Indexes

- Jeff Kaufman tl;dr: “Sometimes you have a lot of data, and one approach to support quick searches is pre-processing it to build an index so a search can involve only looking at a small fraction of the total data. The threshold at which it's worth switching to indexing, though, might be higher than you'd guess.” Jeff illustrates cases where full scans were better engineering choices.

featured in #418