A must-know for any data practitioner
Have you ever found yourself writing long SQL queries with many steps or subqueries containing aggregate functions to calculate seemingly simple metrics? Have you ever wondered if you “couldn’t just calculate along the table differently to make things easier”? Then it might be time to have a look at window functions.
Window functions are a handy and powerful feature in SQL that allows you to perform calculations across a number of rows similar to aggregate functions. But other than aggregate functions, which you might use with a GROUP BY, they don’t return a single value for a group of rows, but a value for each row in the set. Let’s look at an example:
PARTITION BY user_id
Why and when should you use window functions?
One advantage of window functions is that they allow you to work with aggregate and non-aggregate values combined because the rows are not collapsed together. This opens up a way to calculate many things in one step that would otherwise require multiple.
Window functions are also relatively simple to use and read compared to multiple subqueries and self-joins. They reduce the complexity and number of steps in your queries, which makes it easier to maintain down the road.
In this way, they can also help with performance issues. For example, you can use a window function instead of having to do a self-join or subqueries.
Important window functions
You can use functions like SUM() and COUNT() as window functions but there are some special functions that are only available as window functions.
Here are some of the most important window functions in PostgreSQL:
- RANK(): returns the rank of each row within a set of rows. The rank is determined by the order of the rows in the set. Rank 1 would be the first row. Tied rows receive the same rank, which can lead to gaps.
- DENSE_RANK(): similar to RANK(), but it does not skip any ranks if there are ties in the values of the rows. For example, if two rows have the same value and are ordered as the first and second rows, their ranks will be 1 and 1 and the next will receive 2, not 1 and 1 and 3 as it would be the case with RANK().
- ROW_NUMBER(): returns a unique, number for each row in the set. The first row is assigned a row number of 1, the second row is assigned a row number of 2, and so on.
- NTILE(): divides the rows into a specified number of groups, or “tiles” and assigns a tile number to each row. For example, if you specify NTILE(3) and have 9 rows in the set, the first 3 rows will be assigned a tile number of 1, the next 3 rows will be assigned a tile number of 2, and the final 3 rows will be assigned a tile number of 3. This function can be used to calculate quantiles.
- 5. LAG() and LEAD(): These functions are used to access values from previous or subsequent rows in the set. For example, if you have a set of values and you want to compare the value in each row to the value in the previous row, you could use the LAG() function to access the previous value.
Let’s look at some examples!
RANK(), DENSE_RANK, and ROW_NUMBER(): To use these functions, you would specify the name of the column that you want to rank, and the ORDER BY clause that determines the order of the rows in the set.
For example, to rank the rows in a table by the number of values in the “sales” column, you could use the following query:
ORDER BY COUNT(*)
ORDER BY COUNT(*)
ORDER BY COUNT(*)
GROUP BY user_id;
In this example, the respective ranking function assigns a rank of 1 to the row with the highest number of values in the “sales” column, a rank of 2 (or one if ties are allowed) to the row with the second-highest, and so on.
NTILE(): To use the NTILE() function, you need to specify the number of tiles that you want to create, as well as the ORDER BY clause that determines the order of the rows in the set. For example, to divide the rows in a table into 4 tiles (quartiles), you could use the following query:
ORDER BY sales)
As you can see we can now easily determine the top/bottom quartiles’ sales. I found this function to be very useful to find the values above or below a certain percentage/quantile.
LAG() and LEAD(): To use the LAG() or LEAD() functions, you would specify the name of the column that you want to access, the number of rows to move, and the default value to use if the specified row does not exist. For example, to access the value in the previous row in the “sales” column, you could use the following query:
PARTITION BY user_id
ORDER BY sale_year
) AS prev_year_sales
Here, the LAG() function will access the value in the previous row in the “sales” column, and return 0 if there is no previous row (for example, if the current row is the first row in the set).
OVER, PARTITION BY, and ORDER BY
The OVER clause is an optional part of the syntax for many window functions in PostgreSQL. It specifies the criteria that determine the set of rows that the function will operate on. For example, if you want to use a window function to calculate the average “sales” value for each year, you could use the OVER clause to define the set of rows as all rows with the same “year” value. The OVER clause typically includes the PARTITION BY and ORDER BY clauses, which are used to divide the rows into groups (or partitions) and determine the order of the rows within each partition.
The OVER clause is a powerful feature of window functions in PostgreSQL, as it allows you to control the criteria that determine the set of rows that the function will operate on. This allows you to perform calculations on groups of rows, or even on individual rows, depending on your needs.
If you need even more control over the included rows you can use row selection in your OVER clause. A row selection clause determines the rows that will be included in the set that the window function operates on. The row selection clause can be used to specify the number of rows to include, or the range of rows to include, based on the position of the current row in the set. This can be useful if you only want to include a certain number of rows from the beginning or end of the set, or if you want to exclude certain rows from the calculation. Here are some examples that you might encounter frequently:
- ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: This clause specifies that all rows in the set should be included in the calculation. This is the default behavior if no row selection clause is specified, so you won’t see it a lot but it’s the most used row selection.
- ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING: This clause specifies that all rows from the current row to the end of the set should be included in the calculation.
- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: This clause specifies that all rows from the beginning of the set to the current row should be included in the calculation. I found this to be very useful in cumulative calculations.
- ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING: This clause specifies that the current row and the two rows immediately preceding and following it should be included in the calculation. You can change the numerical values to adjust the size and shape of your calculation window.
Here are a few more things, which are good to know
- Window functions can be used in the SELECT, WHERE, and HAVING clauses of a query. This allows you to use the results of the window function in various parts of the query, and even to use them to filter the rows that are returned by the query.
- Window functions can be combined with aggregate functions. For example, you can use a window function to calculate the rank of each row, and then use an aggregate function like SUM() to calculate the total sum of the ranks. This can be very useful in the calculations of fractions and percentages.
- Window functions can be used with the GROUP BY clause. When used in combination with the GROUP BY clause, window functions will calculate the value for each group of rows, rather than for the entire set of rows.
- Window functions can be used with subqueries and CTEs. This allows you to perform calculations on the results of a subquery/CTE, rather than on the original data set.
- There is a WINDOW keyword is used to define the window frames for window functions in PostgreSQL. A window frame is the set of rows that a window function operates on, and is determined by the OVER clause and any row selection clauses that are included in the function’s syntax. By using the WINDOW keyword, you can specify the window frames for multiple window functions in a single query, rather than repeating the OVER and row selection clauses for each function. This can make your queries more readable and easier to maintain.
AVG(sales) OVER w1 ,
SUM(sales) OVER w1
WINDOW w1 AS (PARTITION BY year
ORDER BY month
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING
In conclusion, window functions are a powerful and versatile tool in PostgreSQL that allows you to perform complex calculations on sets of rows. Whether you need to calculate ranks, divide rows into groups, or access values from other rows, window functions have you covered.
I hope you learned something and can improve your calculations and analysis with window functions!