Andrew Dunstan: Only GROUP BY what you really need to

The old rule used to be that if a query contained aggregated columns, every other selected column had to appear in the GROUP BY clause. These days you are allowed to omit columns that are provably functionally dependent on one or more of the grouped-by columns. In practice, that means you can omit any column that is not in the table's primary key, provided all the primary key columns appear in the GROUP BY.
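
Here is a minimal sketch of that exemption (the tables are hypothetical, just for illustration):
CREATE TABLE customers (
    id integer PRIMARY KEY,
    name text,
    city text
);

CREATE TABLE orders (
    customer_id integer REFERENCES customers (id),
    amount numeric
);

-- name and city are functionally dependent on customers.id,
-- so they may appear in the select list without being grouped by:
SELECT c.id, c.name, c.city, sum(o.amount)
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.id;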

Sometimes people (myself included) just do this fairly mindlessly, but sometimes it bites you. Consider this simple query:
SELECT a.id as a_id, a.properties, sum(b.amount) as expenses
FROM people a
JOIN expenses b on a.id = b.person_id
GROUP BY a.id, a.properties
We don't really want the expenses grouped by the person's properties; we only put that in because the parser complains if we don't. And if people turns out to be a view joining a couple of tables, the functional-dependence exemption doesn't apply, so we probably can't leave it out either. Carrying the extra column can increase the amount of sorting that the GROUP BY requires, which can sometimes have dramatic effects on performance. But even worse, there are cases where it can actually make the query unrunnable. One such case is when properties is a JSON column.

That might surprise you. It has certainly surprised a couple of people I know. The reason is that there is no equality operator for JSON.
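
A quick illustration (hypothetical table; the json type, unlike jsonb, has no equality operator, and PostgreSQL reports an error along these lines):
CREATE TABLE people (
    id integer PRIMARY KEY,
    properties json
);

SELECT properties, count(*)
FROM people
GROUP BY properties;
-- ERROR:  could not identify an equality operator for type json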

So, how can we write this so we only GROUP BY what we really need to? One way is to pick up the extra column later in the query, after we have done the grouping, like this:
WITH exp as 
(
SELECT a.id as a_id, sum(b.amount) as expenses
FROM people a
JOIN expenses b on a.id = b.person_id
GROUP BY a.id
)
SELECT exp.*, p.properties
FROM exp
JOIN people p ON p.id = exp.a_id
This might look a bit silly: we're adding an extra join to people that we shouldn't need. But in my experience this often works pretty well. What you pay for the extra join is frequently repaid by the simpler GROUP BY, which processes smaller rows, uncluttered by the extra columns you want carried through. And in the case of a JSON column, it has the virtue that it will work at all.
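
The same shape can also be written as a plain subquery in FROM, which is equivalent (and in PostgreSQL releases before 12, where a CTE is always an optimization fence, it gives the planner more freedom):
SELECT exp.*, p.properties
FROM (
    SELECT a.id as a_id, sum(b.amount) as expenses
    FROM people a
    JOIN expenses b on a.id = b.person_id
    GROUP BY a.id
) exp
JOIN people p ON p.id = exp.a_id;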

I often get called in to look at queries that run slowly and have huge GROUP BY clauses (I have seen them with 50 or so columns). I almost always start by reducing the GROUP BY to the smallest set possible, and this almost always results in a performance gain.
