Window functions
Over
Window functions allows us to compute aggregate information without collapsing rows, let’s try it with an example. Say you want to calculate the average salary by department of the company.employees table but show the comparison between the employee, their salary and the department average salary, a first try could look like:
select
e.first_name,
e.last_name,
d.department_name,
department_salaries."Average department salary"
from company.employees e
join company.departments d using (department_id)
join (
select
d2.department_id,
avg(e2.salary) "Average department salary"
from company.employees e2
join company.departments d2 using (department_id)
group by 1
) as department_salaries using (department_id)
order by 4 descNow let’s compare it with using a window function:
select
e.first_name,
e.last_name,
e.salary,
d.department_name,
avg(e.salary) over (partition by e.department_id) department_average_salary
from company.employees e
join company.departments d using (department_id)
order by department_average_salary descWe can achieve the same without using subqueries. Now let’s examine the structure of the window function:
avg(e.salary)is the aggregate function that we want to compute.oversignals PostgreSQL that instead of usingGROUP BY, we’ll specify a window function.(partition by e.department_id)is the windows definition, it tells PostgreSQL to calculate the aggregate functionavg(e.salary)for all rows that have the samee.department_id, which is the same action that we do when we useGROUP BY.
The important concept of window functions is that it let’s you process groups of rows called a window, you can use these windows to calculate aggregate functions but there are other types of functions that can be used as well. For example, what if we want to rank the employees based on their salary among the same job title and department?:
select
e.first_name,
e.last_name,
d.department_name,
j.job_title,
e.salary,
rank() over (partition by e.department_id, j.job_id order by e.salary desc) "Salary rank between peers",
round(avg(e.salary) over (partition by e.department_id, j.job_id), 2) "Dep Avg Salary",
rank() over (order by e.salary desc) "Salary rank between the whole company",
sum(e.salary) over () "Total salary"
from company.employees e
join company.departments d using (department_id)
join company.jobs j using (job_id)
order by d.department_name, j.job_title, "Salary rank between peers" ascFor the rank() function we need to specify a way to rank the window, which we do with order by e.salary desc, and we can further specify the window by using PARTITION BY, but it’s not necessary, in the above example we have ranked peers salary but also among the whole company. Further more the sum(e.salary) over () "Total salary" specifies an empty window, which means that it takes all rows, it sums up all salaries across the company.
Window frame
Our window can be further customized, we may not want to process the whole windows for example, just the first rows up until the current row or from the current row until the last one. This concept is called a window frame and it can be very useful for certain operations. Let’s take the following example:
select
"BillingCountry",
"Total",
round("Total" / sum("Total") over () * 100 , 2) "% over countries",
round(sum("Total") over (order by "Total" desc) / sum("Total") over () * 100 , 2) "Accumulated %"
from (
select
"BillingCountry",
sum("Total") "Total"
from "Invoice"
group by 1
) as invoice
order by 2 desc- We calculated the % of sales each country represents over the total sales
sum("Total") over ()on"% over countries" - If we wanted to calculate an accumulated % that each country represents, we can define a window frame with the
(order by "Total" desc)windows definition, by default it will only consider rows from the beginning until the current row, so it only sums up sales until the current row.
Now window frames have multiple options so bear with me. They’re 3 options that you essentially want to control:
- Where to start (
frame_start) - When to stop (
frame_end) - If some row in the middle needs to be excluded (
frame_exclusion)
For the frame_start and frame_end we can use the following clauses:
UNBOUNDED PRECEDING --> from the first row
offset PRECEDING --> how many rows before the current row
CURRENT ROW --> Start or end until the current row
offset FOLLOWING --> how many rows after the current row
UNBOUNDED FOLLOWING --> until the last row
offsetis a number or expression that returns a number
If you don’t specify an ORDER BY clause inside a window definition, automatically a window frame get’s created with the following structure:
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
For now take for granted that ROWS appear at the beginning, we’ll see later that it’s a peer group
This means that the window frame that will be processed is contained between the the first row of the window until the last row.
When you define an ORDER BY, the frame takes this structure:
ORDER BY x ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Now rows from the beginning until the current row get’s processed. Let’s see some examples in action:
select
"BillingCountry",
"Total",
-- This is the default behavior without window frames
sum("Total") over
(order by "Total" desc rows between unbounded preceding and unbounded following)
as "Sum of all totals",
-- This is the default behavior by only specifying order by
-- You can skip the BETWEEN clause if you only specify an frame_start,
-- it will automatically determine that the frame_end is CURRENT ROW
sum("Total") over
(order by "Total" desc rows unbounded preceding)
as "Sum from beginning until current row",
-- This calculates the sum of sales of the current row until the last one
sum("Total") over
(order by "Total" desc rows between current row and unbounded following)
as "Sum from current row until last row",
sum("Total") over
(order by "Total" desc rows between 2 preceding and 2 following)
as "Sum from 2 rows from and since current row"
from (
select
"BillingCountry",
sum("Total") "Total"
from "Invoice"
group by 1
) as invoice
order by 2 descFor the frame_exclusion we can use the following clauses after the frame_start and frame_end:
EXCLUDE NO OTHERS --> The default, it does not exclude anything
EXCLUDE CURRENT ROW --> Don't take into account the current row
-- You'll understand the following options once we jump to peer groups
EXCLUDE GROUP --> Exclude the current peer group
EXCLUDE TIES --> Exclude the peer group but not the current row
select
"BillingCountry",
"Total",
sum("Total") over (
order by "Total" desc
rows --> Peer group
between unbounded preceding --> frame_start
and unbounded following --> frame_end
exclude current row --> frame_exclusion
)
as "Sum of all totals minus current total"
from (
select
"BillingCountry",
sum("Total") "Total"
from "Invoice"
group by 1
) as invoice
order by 2 descPeer group
select
"BillingCountry",
"Total",
round("Total" / sum("Total") over () * 100 , 2) "% over countries",
round(sum("Total") over (order by "Total" desc) / sum("Total") over () * 100 , 2) "Accumulated %"
from (
select
"BillingCountry",
sum("Total") "Total"
from "Invoice"
group by 1
) as invoice
order by 2 descFor the above query, one problem that you could find with the "Accumulated %" column is that for rows that have the same "Total" value it calculates the aggregate function sum("Total") as if it took into account all the rows that have the same value. This happens because when for PostgreSQL the CURRENT ROW expression means different things depending on how we define peer groups.
Peer groups exists in 3 types:
GROUPS: rows that are ordered to have the same value will be inside a peer group, this is the default peer mode. When we useCURRENT ROW,offset PRECEDINGoroffset FOLLOWING, we are referring to groups and not unique rows.ROWS: each row forms it’s unique peer group. When we useCURRENT ROW,offset PRECEDINGoroffset FOLLOWING, we are referring to unique rows.RANGE: this mode allows us to use different offsets types, for example instead of usingBETWEEN 2 preceding and 2 followingwe could useBETWEEN interval '1 day' preceding and interval '2 day' following. For this mode to work we need to only use one column on theORDER BYclause inside the window definition.CURRENT ROWworks the same as inGROUPSmode.
Let’s rewrite our previous example:
select
"BillingCountry",
"Total",
round("Total" / sum("Total") over () * 100 , 2) "% over countries",
round(
sum("Total") over (order by "Total" desc
groups between unbounded preceding and current row
) /
sum("Total") over () * 100 ,
2) "Accumulated % in group mode",
round(
sum("Total") over (order by "Total" desc
rows between unbounded preceding and current row
) /
sum("Total") over () * 100 ,
2) "Accumulated % in row mode"
from (
select
"BillingCountry",
sum("Total") "Total"
from "Invoice"
group by 1
) as invoice
order by 2 descThe column "Accumulated % in row mode" uses row mode so we get the behavior that we expect. Now let’s try a RANGE mode:
select
dates.day "Day",
f.flight_no,
f.departure_airport,
f.arrival_airport,
f.scheduled_departure,
f.scheduled_arrival,
count(*) over (
partition by f.departure_airport
order by f.scheduled_departure
range between
interval '4 hour' preceding
and
interval '4 hour' following
exclude current row
) "Flights departing from same airport within a 4 hour frame"
from generate_series(
'2017-09-01'::date,
'2017-09-02'::date,
interval '1 day'
) as dates (day)
left join bookings.flights f on date_trunc('day', f.scheduled_departure) = dates.dayFor each row (flight), we counts the number of flights depart from the same airport within 4 hours of the current flight departure. We exclude the current row to not count it, if we excluded by groups or ranges we could have potentially not counted other flights that depart at the same time as the current flight.
Filter
We can combine the FILTER clause if we want some rows to not be included in our window, for example for the past query we may not want to include cancelled flights:
select
dates.day "Day",
f.flight_no,
f.departure_airport,
f.arrival_airport,
f.scheduled_departure,
f.scheduled_arrival,
count(*) filter (where f.status != 'Cancelled')
over (
partition by f.departure_airport
order by f.scheduled_departure
range between
interval '4 hour' preceding
and
interval '4 hour' following
exclude current row
) "Flights overlapping departing from same airport"
from generate_series(
'2017-09-01'::date,
'2017-09-02'::date,
interval '1 day'
) as dates (day)
left join bookings.flights f on date_trunc('day', f.scheduled_departure) = dates.dayWindow clause
Finally, we may have a query that uses the same window expression for multiple columns, PostgreSQL allows us to define it once and use it multiple times with the WINDOW clause:
select
"BillingCountry",
"Total",
round("Total" / sum("Total") over all_rows * 100 , 2) "% over countries",
round(sum("Total") over start_to_current_row / sum("Total") over all_rows * 100 , 2) "Accumulated %"
from (
select
"BillingCountry",
sum("Total") "Total"
from "Invoice"
group by 1
) as invoice
window all_rows as (),
start_to_current_row as (order by "Total" desc
rows between
unbounded preceding and
current row
)
order by 2 descAn additional advantage is that we can give contextual names to the window expressions, so it’s easier to read in the future.
Common window functions
We have seen the use of aggregate functions like sum and avg used with windows but it’s not exclusive to them, any other aggregate function can be used as well. Additionally we saw the use of functions that only work with windows like rank, but they’re others that are useful too:
select
e.first_name,
e.last_name,
d.department_name,
e.salary,
-- Indexes every row without counting peer groups
row_number() over department_order_by_salary,
-- Rank leaves gaps, check the IT department
rank() over department_order_by_salary,
-- Dense_rank does not leave gaps
dense_rank() over department_order_by_salary,
-- Relative rank of the current row: (rank - 1) / (total rows - 1)
percent_rank() over department_order_by_salary,
-- Relative rank of the current row:
-- (number of rows preceding or peer with current row) / (total rows)
cume_dist() over department_order_by_salary,
-- Find the salary of the peer group before (offset of 1) the current peer group,
-- if none is found, default to the current salary
lag(e.salary, 1, e.salary) over department_order_by_salary,
-- Same as lag but with following peer groups
lead(e.salary, 1, e.salary) over department_order_by_salary,
-- This finds the first peer group of the window, in this case is like
-- asking for the max salary of the department
first_value(e.salary) over department_order_by_salary,
last_value(e.salary) over department_order_by_salary,
-- Find the second highest salary
nth_value(e.salary, 2) over department_order_by_salary,
-- Divide all employees in five buckets of salaries
ntile(5) over (order by e.salary desc)
from company.employees e
join company.departments d using (department_id)
window department_order_by_salary as (partition by d.department_id order by e.salary desc)
order by d.department_name, 5 ascExercises
Solve them in the query editor.
- Rank flights departing on the first and second of september 2017. The rank must be grouped by the departure airport and ordered by the duration of flights.
- For the
"Invoice"table, calculate weekly sales and compare them against last week sales, the average weekly sales of the last month and last three months.