Applications of Tries in Apache Cassandra Berlin Buzzwords 2024

Applications of Tries in Apache Cassandra
.ical

2024-06-11 16:00–16:40, Kesselhaus

Log-structured merge-tree databases rely on a combination of in-memory buffers and hierarchies of immutable on-disk structures, which are two cases where a trie data structure is most efficient. Join us for a glimpse at the ways in Apache Cassandra applies and benefits from this combination.

Tries (also called Prefix or Radix Trees) are efficient data structures for representing maps. They offer a few advantages over more traditional comparison-based structures such as B-Trees, including compact representation and theoretically optimal lookup complexity. Despite their efficiency, tries have historically seen limited application in database technologies, in part due to their less favorable performance as mutable on-disk structures.

A log-structure merge-tree (LSM tree) is a data structure tailored for storing an indexed map within a database. It leverages immutable on-disk structures commonly referred to as SSTables (Sorted String Table), organized in a hierarchy, coupled with an in-memory buffer, the memtable, which sorts and indexes as much data as possible in memory until it fills up and is written out to disk. Reads from an LSM tree merge data from all components; to keep read complexity in control, SSTables are reorganized from time to time in a process called compaction. This design allows LSM trees to exploit the difference between serial and random disk writes, to offer tunable trade-offs between read and write performance, and crucially for this talk, to avoid mutable on-disk structures.

Apache Cassandra is a NoSQL database built on an LSM tree architecture. Recently we have started to take advantage of trie data structures in the implementations of the databases's memtables and SSTables to achieve significant improvements in many aspects of the database's performance. This talk will describe how we were able to replace comparison-based structures in Cassandra with tries, explore the reasons why tries are efficient at the task of organizing ordered data, and dive into the optimizations applied to suit the structure the specific applications, as well as the hardware and software environment that they target. We aim to demonstrate the practical applicability of tries in database systems, but also discuss the journey of turning a theoretical data structure into its useful practical implementation in a real-world computing environment.

Branimir Lambov

Core database engineer for Apache Cassandra

Applications of Tries in Apache Cassandra .ical 2024-06-11 16:00–16:40, Kesselhaus

Applications of Tries in Apache Cassandra
.ical

2024-06-11 16:00–16:40, Kesselhaus