Thank you for joining us for today’s 5 Minute Tech Challenge! We’re so glad you’re a part our community. Today, we learn from Dave Holmes-Kinsella, a full-stack player-coach data scientist at Synctera. Over to you, Dave!
A common problem I encounter amongst dev teams is that code which was performant during development and testing starts to fail at scale. One of the patterns that’s potentially problematic is
SELECT * FROM …
ORDER BY (...)
LIMIT *n*
I’ll give a brief explanation of why this can be problematic and share some examples that hopefully will show you how window functions, these sometimes intimidating features, can become another tool in your repertoire.
The general thinking around this pattern is along the lines of “find the most recent transaction, or highest value order or …”. Almost any pattern works in the development environment when our data volumes are typically small, but as we scale into production, performance issues are inevitable.
The potential performance issue with LIMIT *n*
is that, in most RDBMS database environments, we are calculating the entire result set, sorting it, and then returning the first n rows. That can be a remarkably expensive operation.
This is where we should consider window functions. One of my favorite blogs includes this definition:
A window function performs a calculation across a set of table rows that are somehow related to the current row… Behind the scenes, the window function is able to access more than just the current row of the query result.
So, what to do with it? Let’s work through an example that shows how to use a window function and how this pattern is extensible. I’ll use the Pagila dataset, referenced in the PostgreSQL wiki, which is all about film rentals.
Let’s assume that our goal is to find the following from the dataset:
The most popular film ever rented
The top 3 most popular films ever rented
The top 3 most popular films, by country
The top 3 most popular films, by country and year
Simple, no? A novice analyst might think:
The most popular film ever. That’s easy. It’s just select some_sort_of_max, right?
Top 3? That’s order by with limit
Top 3 by country? … uhhh
I’ve downloaded the dataset into a local copy of PostgreSQL (PG). PG is one of my favorite databases; but BigQuery, SnowFlake, MySQL, Redshift… they all have window functions. Use your favorite database to do this.
Step 1: Prepare the Data
The first thing I’m going to do is create a simple temporary table to make the tutorial a little simpler.
create temporary table summary_stats as
select f.film_id, f.title, cy.country,
date_trunc('month', r.rental_date) rental_month, count(*) rental_count from public.film f
join public.inventory i on f.film_id = i.film_id
join public.store s on i.store_id = s.store_id
join public.address a on s.address_id = a.address_id
join public.city c on a.city_id = c.city_id
join public.country cy on c.country_id = cy.country_id
join public.rental r on i.inventory_id = r.inventory_id
group by 1, 2, 3, 4;
In practice, I’d wrap this data into a Common Table Expression (CTE), but let’s keep things simple for now.
Step 2: Orient ourselves in the data
select * from summary_stats limit 5;
Step 3: Find the most commonly rented film
We’ll define “most commonly rented” as the film with the greatest number of rentals, irrespective of whether a person rented that title more than one time or not. We can do it quickly and simply, using limit.
select title, sum(rental_count)
from summary_stats group by 1 order by 2 desc limit 1;
Step 4: Find the top 3 most commonly rented films
select title, sum(rental_count)
from summary_stats group by 1 order by 2 desc limit 3;
So far so good. However, there’s no simple path forward using ORDER BY + LIMIT
. We are stuck at this time. Enter our friends window functions.
Step 5: Learn window functions
What we need is for a piece of SQL that will stack-rank films according to the way in which we want to group them — by country, to start with. We already have part of this: we count the number of rentals and we sort by that number, in descending order. Now we need a way to group them and to make this numbering.
First, we can use the window function row_number()
for the numbering — it’s one of a cadre of numbering functions.
select
title,
country,
sum(rental_count) rental_count,
row_number() over (order by sum(rental_count) desc)
from summary_stats
group by 1, 2
order by 3 desc;
There are two things to notice about this:
If you’re tied for first space (e.g. Love Suicides and Idols Snatchers), you get sequential number assignments. If you don’t want that, use
rank()
ordense_rank()
It’s not obvious here but the ranking is independent of the final sort order.
Also, since the sort is only on country, if you re-ran the query you could get different results!
The answers aren’t wrong — those top 2 are still correct, tied for first place with 20 rentals each. ProTip #1: never assume sorting in your queries, ever! You must always fully declare your sort in your final SELECT
statement.
Let’s sort the query by film title and see how the rankings change:
select
title,
country,
sum(rental_count) rental_count,
row_number() over (order by sum(rental_count) desc)
from summary_stats
group by 1, 2
order by 1 asc;
Counter-intuitive but really important to know. ProTip #2: you can have multiple rankings in your query — maybe you want one ranking for number of rentals, another for the number of different people who rent that movie. You can do that with separate select list items.
We can expand our query to make the rankings by country, using:
row_number() over (partition by country order by sum(rental_count) desc)
The partition by
tell us us how we want to do the grouping and the order by sum(rental_count) desc
tells us how we want to assign the ranks (1, 2, 3, etc.). That gives us the following query:
select
title,
country,
sum(rental_count) rental_count,
row_number() over (partition by country order by sum(rental_count) desc) ranking
from summary_stats
group by 1, 2
order by 2 desc;
This is showing us the stack ranked list, showing Canada first since we’re sorting by country in descending order. In order to find the top 3 by country, we need to filter on the ranking. I use a CTE (aka “with statement”) to create the rankings, and then I select from the results:
with input as (
select
title,
country,
sum(rental_count) rental_count,
row_number() over (partition by country order by sum(rental_count desc) ranking
from summary_stats
group by 1, 2)
select * from input
where ranking <= 3
order by country;
That’s an interesting result! Even though Bucket Brotherhood was the most commonly rented film overall, it doesn’t show up in either of Australia or Canada’s top 3. Let’s do a quick check, which incidentally shows why CTEs are awesome — you can re-use them:
with input as (
select
title,
country,
sum(rental_count) rental_count,
row_number() over (partition by country order by sum(rental_count desc) ranking
from summary_stats
group by 1, 2)
select * from input
where ranking <= 3
union all
select * from input where title = 'BUCKET BROTHERHOOD';
Summary
Although ORDER BY ... LIMIT *n*
is a convenient way to do simple “top N” sorting, it may prove to be non-performant as your data scales. Using window functions makes it easy to build extensible code which is clear in purpose. And remember: clear beats clever! Or, as Uncle SQL likes to say:
Clear > Clever
Want to discuss more? Reach out to me here for 30 minutes to talk about anything.
Coming up in two weeks, we will be learning from Aditya Gune, a tech writer and software engineer at Jetty where he enjoys solving tech problems that allow others to solve human problems. Aditya will be teaching us why Desktop Apps Aren’t Dead: A web engineer’s learnings from developing apps for Windows OS. I’m excited in a way that can only be captured by some WordArt!