Subqueries
Congratulations for passing the basic queries section! Even without knowing any advance methods you’ve learn very powerful tools that will empower you to leverage your database data. But why stop here? Let’s go from good to incredible.
Select from subquery
Until know we have learn to create a query to fetch, transform and aggregate data from tables. That’s great but they’re data problems where we would benefit from being able to fetch data from another query and that’s where subquery enters. Subqueries are queries inside another query that can return a scalar value (which is a 1x1 table) or another table.
Let’s create an example:
select
concat_ws(', ', e."LastName", e."FirstName") "Employee",
sum(i."Total") "Employee sold on 2010",
(select sum(i2."Total") from "Invoice" i2 where date_trunc('year', i2."InvoiceDate") = '2010-01-01') "Total sold on 2010",
round(
100 * (select sum(i."Total")::numeric / sum(i2."Total") from "Invoice" i2 where date_trunc('year', i2."InvoiceDate") = '2010-01-01'),
1
) || '%' "% Sold by employee"
from "Employee" e
inner join "Customer" c on c."SupportRepId" = e."EmployeeId"
inner join "Invoice" i using ("CustomerId")
where true
and date_trunc('year', i."InvoiceDate") = '2010-01-01'
group by 1, 3Take the time to read each statement of the query, it’s complicated but essential to develop the patience and skill to understand the intention of a query
The above query uses two subqueries, one to calculate the total sales of 2010 and another one to calculate the percentage of sales that each employee had on the total invoices of 2010. Without using subqueries it would not be able to mix two different aggregations (one per employee and another per year).
Both subqueries return a scalar value
Another important feature is the ability of subqueries of referencing data outside of it, let’s analyze our two subqueries:
The first subquery
(select sum(i2."Total") from "Invoice" i2 where date_trunc('year', i2."InvoiceDate") = '2010-01-01')is completely independent from the original queryThe second subquery
(select sum(i."Total")::numeric / sum(i2."Total") from "Invoice" i2 where date_trunc('year', i2."InvoiceDate") = '2010-01-01')uses the statementsum(i."Total")which is data that comes from the outer query
From subquery
We have seen how to call subqueries inside the SELECT, additionally we can also use it from FROM statement. In fact you’ll see that we can call subqueries in practically any clause. For example, we’ll calculate how many tracks have been created by each composer. Now the "Composer" column can contain multiple composers, so we can use the string_to_array function to create an array of composers separated by \``, and then use theunnest` function to return the array as a column of multiple values. Finally we create an outer query that count the amount of times a composer is mentioned in a track and we get this:
select
"Artist"."Artist",
count(*) "# Tracks composed"
from (
select
unnest(string_to_array("Composer", ',')) "Artist",
"TrackId"
from "Track"
) as "Artist"
group by 1
order by 2 descJoin subquery
We can join to a subquery result. In this case we’re using the first example, but instead of calculating two times the total sales on the SELECT statement, we do it once in the JOIN statement and reference it in the SELECT.
select
concat_ws(', ', e."LastName", e."FirstName") "Employee",
sum(i."Total") "Employee sold on 2010",
"Total"."Total" "Total sold on 2010",
round((sum(i."Total")::numeric / "Total"."Total") * 100, 1) || '%' "% Sold by employee"
from "Employee" e
inner join "Customer" c on c."SupportRepId" = e."EmployeeId"
inner join "Invoice" i using ("CustomerId")
inner join (
select sum(i2."Total") "Total"
from "Invoice" i2
where date_trunc('year', i2."InvoiceDate") = '2010-01-01'
) as "Total" on true -- We use a true condition because we want all rows to join against the total sales value
where true
and date_trunc('year', i."InvoiceDate") = '2010-01-01'
group by 1, 3Lateral subquery
The only limitation of using subquery in the FROM and JOIN clauses is that we cannot use data from the outer query as we could from the SELECT clause. To achieve it we need to use the JOIN LATERAL clause, the reason is quite simple, you first need to define your outer data before using it in a subquery. That means that FROM and JOIN clauses are run first, then JOIN LATERAL clauses are run using the previous information. Let’s do an example, say we want to calculate how many scheduled flights for next week did we have each day on august 2017, we would need to create 30 different queries to get that number but using JOIN LATERAL were able to loop the results of the generate_series result and create a subquery for each row:
select
dates::date "Day",
flights."All flights" as "Next week scheduled flights",
flights."Flights departing LED" as "Next week scheduled flights departing LED",
flights."Flights departing GOJ" as "Next week scheduled flights departing GOJ",
flights."Flights departing KHV" as "Next week scheduled flights departing KHV"
from generate_series(
'2017-07-01',
'2017-07-01'::date + interval '1 month',
interval '1 day'
) as dates
join lateral (
select
count(*) "All flights",
count(*) filter (where f.departure_airport = 'LED') "Flights departing LED",
count(*) filter (where f.departure_airport = 'GOJ') "Flights departing GOJ",
count(*) filter (where f.departure_airport = 'KHV') "Flights departing KHV"
from bookings.flights f
where true
and date_trunc('day', f.scheduled_departure) >= dates
and date_trunc('day', f.scheduled_departure) <= dates + interval '1 week'
) flights on true
JOIN LATERALis also a great tool to reuse calculations, that’s because if we define once a calculation the subquery, we can use it multiple times in theSELECTstatement
Filtering using subqueries
Another great use for subqueries is inside the WHERE clause, we can use it in multiple ways.
Scalar value
The following query finds all "Invoice"s that have a "Total" amount above the average "Total" amount.
select *
from "Invoice"
where true
and "Total" > (select avg("Total") from "Invoice")ANY & ALL
Another way of using subqueries in a WHERE statement is by checking if an element exists in a list. In this example we are finding all invoices that belong to the top 3 employee sellers:
select i.*
from "Invoice" i
join "Customer" c using ("CustomerId")
join "Employee" e on e."EmployeeId" = c."SupportRepId"
where true
and e."EmployeeId" = ANY (
select e2."EmployeeId"
from "Employee" e2
join "Customer" c2 on c2."SupportRepId" = e2."EmployeeId"
join "Invoice" i2 using ("CustomerId")
group by 1
order by sum(i2."Total") desc
limit 3
)You can use the
ALLclause if we want to check that all results returned by the subquery succeeds the condition
Exists
The EXISTS clause returns a boolean value which is true if the subquery that it evaluates returns at least 1 result, if NULL is returned then it returns false. The most basic example is:
select exists (select 1); -- Will return true because select 1 returns the value 1
select exists (select 1 where false); -- Will return false because the subquery returns no resultThis is very useful for when we need to return some information but we need to check for some condition that we do not want to use in the SELECT clause. The following example finds all tickets that have had at least one flight delayed for more then 15 minutes:
select *
from bookings.tickets t
where true
and exists (
select
from bookings.flights f
join bookings.ticket_flights tf using (flight_id)
where true
and tf.ticket_no = t.ticket_no
and f.actual_arrival is not null
and f.actual_departure is not null
and (f.scheduled_arrival + interval '15 minute' < f.actual_arrival
or f.actual_departure + interval '15 minute' < f.actual_departure)
limit 1
)
limit 1CTE
Subqueries can be a great tool for complex queries but they can be difficult to read. If our query is getting out of hand, we could use Common Table Expressions (or CTEs), which is used to create and organize temporal virtual tables that only exist while running the query. Another benefit is that CTEs are only evaluated once, so it’s a way of reducing computation (which for heavier queries can speed up query execution times). Let’s start with a simple example:
with country_sales as (
select
"BillingCountry" "Country", sum("Total") "Total"
from "Invoice"
group by 1
order by 2 desc
), total_sales as (
select sum("Total") "Total Sales" from country_sales
)
select
"Country",
"Total" "Country total sales",
round(100 * ("Total" / "Total Sales"), 1) || ' %' "% Sales over total"
from country_sales
join total_sales on trueTo create a CTE we need to use the
WITHclause only once, then specify the name that you want to give your CTE followed by theASclause and a subquery. If you want to create multiple CTEs then you need to separate each subquery with a,and repeat the same syntax
Once the CTE is created, we can use it below as it was a table that exists in the database. That means that it can be referenced from below
FROM,JOINand other clauses
The above query calculates the countries total sales and the percentage it represents over the total sales. This can be achieve using subqueries but there’re 2 advantages:
- We can give a specific name to each subquery which increases readability as it communicates directly the objective of that subquery
- We avoided calling the
"Invoice"table twice because we’re calculating the total sales using the result of the first CTE.
Recursive CTE
Something that conventional subqueries cannot do are recursive queries, but CTEs can! This means that you can use the results of an initial query to run another query, and again use the new results for another query and do it again until no results are returned. This is forms a loop that for certain situations may be necessary, like when you don’t know how many times should you repeat a query. Let’s see a basic example by creating a counter:
with recursive counter as (
select 1 as "Index"
union all
select counter."Index" + 1 from (select 0) as x
join counter on true
where counter."Index" < 10
)
select * from counterIf you run it you’ll find that it creates a numeric series from 1 until 9. The steps to do this are:
We first start by writing
WITH RECURSIVEwhich instructs PostgreSQL that thecounterCTE will use a recursive algorithmThen inside the CTE we define the first query, we’ll use the results of this query to start a loop. In this case we just created a query that returns the number 1
Then we’ll use
UNIONorUNION ALLto append the results of the first query with the results of the second query (as many times as the second query is executed)Now define the second query. This query can use the results of the first query, which in the first run is just the number 1. We’ll use the
JOINclause, to access this value and sum it by 1. The results of this second query will be stored in thecounterCTE.Because the second query returned a result (the number 2), we’ll run the second query again, but this time the
counterCTE will have the value 2, this means that the second iteration of the second query will return 3We’ll rerun the second query until it returns
NULL, which happens when the value reaches 10 (check theWHERE clause)Finally, the
counterCTE will unite all results (1 until 9) and make it available down the line
You cannot use the
FROMclause to reference the CTE at the second query, that’s why we used the subquery(select 0)which does nothing
Please don’t use this in production, it’s just an example to illustrate a point, instead use the
generate_seriesfunction as it’s order of magnitudes more efficient.
Recursive queries are especially useful when you have a table with a value that references itself. Let’s imagine we have a table with persons and their parent ids, we could fetch the entire family of one member with:
with recursive sample_data as (
SELECT * FROM (
VALUES
(1, 'George', NULL, NULL),
(2, 'Annie', NULL, NULL),
(3, 'Gaby', 1, 2),
(4, 'Jose', 1, 2),
(5, 'Agusto', 4, NULL),
(6, 'Sara', NULL, 3)
) AS t (id, name, father_id, mother_id)
), sara_family as (
select * from sample_data where name = 'Sara'
union all
select sd1.*
from sample_data sd1
join sara_family sf on sf.father_id = sd1.id or sf.mother_id = sd1.id
)
select * from sara_family
VALUESis a clause that is very useful to create data, is equivalent to multipleSELECTstatements united byUNION ALLs
To edit the column names of the
VALUESresult we useAS t (id, name, father_id, mother_id)
We can mix recursive CTE with non-recursive CTE, the
sample_dataCTE is not actually recursive, onlysara_familyis
Let’s take the company.employees table, for each employee we also have his manager using the column manager_id so we could find the hierarchy for a particular employee:
with recursive william_lineage as (
select
employee_id "Id",
manager_id "Manager Id",
first_name "Name",
1 "Level"
from company.employees where employee_id = 206
union
select
e.employee_id "Id",
e.manager_id "Manager Id",
e.first_name "Name",
l."Level" + 1 "Level"
from company.employees e
join william_lineage l on l."Manager Id" = e.employee_id
)
select * from william_lineageExercises
Solve them in the query editor.
- The following query does not separate correctly composers that have their names separated with & and / separators. Fix it to include all artists:
select
"Artist"."Artist",
count(*) "# Tracks composed"
from (
select
unnest(string_to_array("Composer", ',')) "Artist",
"TrackId"
from "Track"
) as "Artist"
group by 1
order by 2 desc- Find all tickets with at least 1 canceled flight in business class.
- Find the whole hierarchy of managers and underlings of employee 114 (use the
company.employeestable) - Recursively calculate the accumulated percentage sale of each billing country that represents at least 10% of total sales. This accumulated percentage should equal 100% at then end.