2024-06-10 –, Frannz Salon
TopK Queries are SQL queries that contain both an order by and a limit clause. The optimization presented in this talk uses runtime information to skip scanning data that could not be a part of the final result, thereby significantly accelerating these types of queries.
This talk is about a recently added optimization to the Snowflake Cloud Database for speeding up TopK queries. In it the audience will learn about how we use runtime information to skip scanning data partitions that will not contribute to the result of the query.
TopK Queries are SQL queries that contain both an order by and a limit clause. This is a very common query pattern. Examples are queries looking for the last k events (order by time descending), the k cheapest products (order by price ascending) or simply any limit query that is expected to deterministically return the same k rows (order by any unique identifier).
In contrast to limit queries without an order by, the query engine cannot simply return the first k rows it has scanned. Instead it has to take all data into consideration to find the k smallest or largest rows.
We can, however, come up with an upper / lower bound for values that can make it into the result after having seen at least k rows. This boundary can then be used to filter out rows that will not be a part of the result.
Our optimization takes this a step further and additionally uses the boundary for pruning entire data partitions. This means that we compare it with metadata for these partitions and based on this comparison may decide to skip downloading and scanning a partition entirely if none of its contained rows can become a part of the result.
Juliane is a Software Engineer in the Snowflake Berlin office, where she works on accelerating query performance by extending Snowflake’s pruning capabilities as part of the Search team. She holds a M.Sc. from Hasso Plattner Institute, Germany.