Groups, aggregations and sorting
Aggregated metrics per group
Now were getting into the fun stuff! Part of the analyst journey is to make sense of big datasets of information and summarize it into a few metrics. Aggregate Function had helped us previously to achieve this summary metrics but they seemed a little bit inconvenient taking into account that they collapsed the result set to a single row. Take for example:
select count(*) from "Track"It’s great to know how many tracks do we have in the database, but it’s a little bit too summarized. Say that we want to know how many tracks do we have of each artists, the first approach would be to do something like this:
select count(*) from "Track" where "Composer" = 'AC/DC'We would have to repeat it for every artists, it would a little bit of time don’t you think? Another approach would be to use the FILTER clause like:
select
count(*) filter (where "Composer" = 'AC/DC') "AC/DC songs",
count(*) filter (where "Composer" = 'Jerry Cantrell') "Jerry Cantrell songs",
count(*) filter (where "Composer" = 'Audioslave') "Audioslave songs"
from "Track"Not great either. Luckily, SQL comes to the rescue with the GROUP BY clause, run the following query:
select
"Composer",
count(*)
from "Track"
group by "Composer"Now we are talking! What happened is that we defined a group using the GROUP BY clause, this instructs PostgreSQL to calculate every aggregate function for every unique "Composer". We could even not calculate any aggregated function and we would still receive the unique "Composer"s:
select "Composer" from "Track" group by "Composer"This is equivalent to using
DISTINCTwith the advantage of being able to calculate aggregate metrics
Sort results
It was great to have the number of tracks per composer, but you know what would even be greater? If we could have the result set ordered by the number of tracks produced by each composer. We can use the ORDER BY clause:
select
"Composer",
count(*) "# Tracks"
from "Track"
group by "Composer"
order by "# Tracks" desc nulls firstIf you define an alias column name like
"# Tracks"then you must use that name if you want to refer to them inGROUP BYorORDER BYclauses.
descmeans descending order, by default Postgres usesascwhich means ascending
nulls firstsends any NULL result to the top,nulls lastsends them to the bottom
Caution: combining aggregated and non aggregated columns
Remember that if we use aggregate functions with groups, we’re still collapsing results but to each unique element of the groups, so we cannot mix non aggregate columns with aggregated columns, to make it work we need to group the non aggregate columns. The following query does not work:
select
"Composer",
"UnitPrice",
count(*)
from "Track"
group by "Composer"That’s because "UnitPrice" is non aggregated while count(*) is, we would have to add "UnitPrice" to the GROUP BY clause to make it work:
select
"Composer",
"UnitPrice",
count(*)
from "Track"
group by "Composer", "UnitPrice"Although the query works, it does something different than previous examples, that’s because we’re creating multiple groups:
Multiple groups
Defining multiple columns to group instructs PostgreSQL to calculate aggregate metrics for any unique combination of values from those columns. For example, let’s count the number of flights grouped by the departing airport and also it’s fare condition:
select
ad.airport_name,
tf.fare_conditions,
count(*)
from bookings.bookings b
left join bookings.tickets t using (book_ref)
left join bookings.ticket_flights tf using (ticket_no)
left join bookings.flights f using (flight_id)
left join bookings.airports_data ad on ad.airport_code = f.departure_airport
where true
and b.book_date >= '2017-01-01'
and b.book_date <= '2017-02-01'
and f.status in ('Arrived', 'Scheduled')
group by 1, 2
order by 3 descInstead of using the column name, we can use the column number in the
GROUP BYandORDER BYclauses. This is especially useful when we use multiple columns group or sort columns (we’ve usedGROUP BYclauses with 19 columns in the past!)
We can also ask which artist have the most playing time on all playlists:
select
p."Name",
at."Name",
sum(t."Milliseconds" / 1000) "Seconds",
count(*) "Tracks"
from "Track" t
inner join "Album" al using ("AlbumId")
inner join "Artist" "at" using ("ArtistId")
inner join "PlaylistTrack" plt using ("TrackId")
inner join "Playlist" p using ("PlaylistId")
group by 1, 2
order by 3 descCalculated groups
Anything column can be grouped, even the result of a function, for example, if we wanted to calculate the flight average sales amount per month, we would first have to calculate the month using the booking date:
select
date_trunc('month', book_date) "Booking month",
avg(total_amount) "Avg Booking sale",
count(*) "# Bookings",
sum(total_amount) "Total sales"
from bookings.bookings
group by 1
order by 1 ascYou can also use the CASE WHEN clause to create your own categories and group them. Let’s create a category for tracks depending on their duration, say that if a song last for less then a minute we will consider it a short track, common track if it takes until 5 minutes, long track if it takes until ten minutes and long track if it takes more:
select
case
when "Milliseconds" < 1000 * 60 then '1. Short track'
when "Milliseconds" between 1000 * 60 and 1000 * 60 * 5 - 1 then '2. Common track'
when "Milliseconds" between 1000 * 60 * 5 and 1000 * 60 * 10 then '3. Long track'
else '4. Very long track'
end "Track type",
count(*) "# Tracks",
avg("Milliseconds" / 1000) "Avg seconds"
from "Track"
group by 1
order by 1 ascHaving
You may have noticed that the WHERE clause goes before the GROUP BY clause, that’s because the WHERE conditions are evaluated before any grouping operations. What happens if we want to filter results based on the aggregated metric? Then we’ll use the HAVING clause. Let’s do an example:
select
tf.ticket_no "Ticket Number",
sum(case when f.status in ('Cancelled', 'Delayed') then 1 else 0 end) "Cancelled or delayed flights",
count(*) "Flights"
from bookings.ticket_flights tf
inner join bookings.flights f using (flight_id)
group by 1
having true
and bool_or(f.status in ('Cancelled', 'Delayed'))
and not bool_and(f.status in ('Cancelled', 'Delayed'))This example searches for tickets (group of flights from a unique passenger) where at least 1 of their flights were cancelled or delayed, but not all of them. The first thing that you can note is that the HAVING clause expects a boolean expression as the WHERE clause, the difference is that we use aggregate functions.
Grouping sets, rollup and cube
As we saw previously, multiple column groups will be used to calculate aggregate metrics per unique combinations of the columns values, but PostgreSQL gives us options to control how these combinations are created. For example, say we want to calculate the sales of employees in different years and months, we could group it like this:
select
date_trunc('year', i."InvoiceDate") "Year",
date_trunc('month', i."InvoiceDate") "Month",
concat_ws(' ', e."FirstName", e."LastName") "Employee",
sum(i."Total") "Sales"
from "Invoice" i
inner join "Customer" c using ("CustomerId")
inner join "Employee" e on e."EmployeeId" = c."SupportRepId"
group by 1, 2, 3
order by 1, 2, 3The above query gives us exactly how much an employee sold in a specific month of a year, but what if we also want to calculate how much we sold on a given month, or given year in total? One way would be to create another query and reduce the number of grouped columns, but we have specific clauses to achieve what we want in a single query:
Grouping sets
The GROUPING SETS clause allows us to define exactly on what groups will we perform aggregate functions:
select
to_char(i."InvoiceDate", 'yyyy') "Year",
to_char(i."InvoiceDate", 'mm-Mon') "Month",
concat_ws(' ', e."FirstName", e."LastName") "Employee",
sum(i."Total") "Sales"
from "Invoice" i
inner join "Customer" c using ("CustomerId")
inner join "Employee" e on e."EmployeeId" = c."SupportRepId"
group by grouping sets ((), (1), (1, 2), (1, 2, 3))
order by 3 nulls first, 2 nulls first, 1 nulls firstTo create a GROUPING SET we first open parenthesis and we define inside each grouping set separated by commas. Each set is contained inside parenthesis so ((), (1), (1, 2), (1, 2, 3)) means 4 grouping sets:
(): means no grouping, so the aggregate function will be calculated for the entire dataset that we have retrieved (after theJOINandWHEREclause have been evaluated). In the above example we’ll observe a row that has NULL for"Year","Month"and"Employee"but a result for"Sales", this means that"Sales"is the total sales of the dataset.(1): means that we want to compute an aggregate function using the first column as a group (we can use the column name as well like in the first examples). We’ll observe rows where we only have the different years and a value for sales. Those are the total sales of those years.(1, 2): the same as before but we also group by the month.(1, 2, 3): this computes the sales per employee per month and year.
Rollup
Certain GROUPING SETs are so common that we have PostgreSQL shorthands. ROLLUP (1, 2, ...) is a shorthand for GROUPING SETS ((1, 2, ...), ..., (1, 2), (1), ()). This is very useful for hierarchical information. The first grouping set example could be rewritten as:
select
to_char(i."InvoiceDate", 'yyyy') "Year",
to_char(i."InvoiceDate", 'mm-Mon') "Month",
concat_ws(' ', e."FirstName", e."LastName") "Employee",
sum(i."Total") "Sales"
from "Invoice" i
inner join "Customer" c using ("CustomerId")
inner join "Employee" e on e."EmployeeId" = c."SupportRepId"
group by rollup (1, 2, 3)
order by 3 nulls first, 2 nulls first, 1 nulls firstCube
If we want to use all possible combinations of GROUPING SETs then we can use the CUBE clause in the form of:
CUBE ( 1, 2, 3 )Is equivalent to:
GROUPING SETS (
( a, b, c ),
( a, b ),
( a, c ),
( a ),
( b, c ),
( b ),
( c ),
( )
)If we use it on the same example then before we’ll calculate the total sales for any year, month, employee and all combinations from these.
select
to_char(i."InvoiceDate", 'yyyy') "Year",
to_char(i."InvoiceDate", 'mm-Mon') "Month",
concat_ws(' ', e."FirstName", e."LastName") "Employee",
sum(i."Total") "Sales"
from "Invoice" i
inner join "Customer" c using ("CustomerId")
inner join "Employee" e on e."EmployeeId" = c."SupportRepId"
group by cube (1, 2, 3)
order by 3 nulls first, 2 nulls first, 1 nulls firstWe may not want to calculate aggregate functions for "Month", instead if we use the month we want it always to be accompanied by the year:
select
to_char(i."InvoiceDate", 'yyyy') "Year",
to_char(i."InvoiceDate", 'mm-Mon') "Month",
concat_ws(' ', e."FirstName", e."LastName") "Employee",
sum(i."Total") "Sales"
from "Invoice" i
inner join "Customer" c using ("CustomerId")
inner join "Employee" e on e."EmployeeId" = c."SupportRepId"
group by cube ("Year", ("Year", "Month"), "Employee")
order by 3 nulls first, 2 nulls first, 1 nulls firstCombining groupings
GROUPING SETs can be nested to achieve higher control, for example, we may not want to calculate the total sales of the entire dataset, so we force it to always use the "Year" as a grouping:
select
to_char(i."InvoiceDate", 'yyyy') "Year",
to_char(i."InvoiceDate", 'mm-Mon') "Month",
concat_ws(' ', e."FirstName", e."LastName") "Employee",
sum(i."Total") "Sales"
from "Invoice" i
inner join "Customer" c using ("CustomerId")
inner join "Employee" e on e."EmployeeId" = c."SupportRepId"
group by "Year", grouping sets ((), ("Month"), "Employee", ("Month", "Employee"))
order by 3 nulls first, 2 nulls first, 1 nulls firstWe can also force "Employee" to be the last grouping set:
select
to_char(i."InvoiceDate", 'yyyy') "Year",
to_char(i."InvoiceDate", 'mm-Mon') "Month",
concat_ws(' ', e."FirstName", e."LastName") "Employee",
sum(i."Total") "Sales"
from "Invoice" i
inner join "Customer" c using ("CustomerId")
inner join "Employee" e on e."EmployeeId" = c."SupportRepId"
group by "Year", cube(("Month")), "Employee"
order by 3 nulls first, 2 nulls first, 1 nulls firstNote on communication
Using complex GROUPING SETs can be a great analytical tool to create quick statistics or aggregate metrics, but avoid using them to communicate insights as it’s very easy for stakeholders to get lost, especially because it gets hard to interpret the results. Remember that analysis and communication are 2 completely different challenges and uses distinct set of tools and skills.
Exercises
Solve them in the query editor.
- Create a table with the favorite genres per customer based on the tracks that they bought (1 genre per customer).
- Create a table that shows the airports with the accumulated delays sorted by descending order.
- Who are the artist that have more play time?
- Count the number of bookings, passengers and total sales per year and month (consider using grouping sets).
- Calculate the total sales of each artist each week mentioned in the cross join section.