Andrew Dunstan: Only GROUP BY what you really need to

The old rule used to be that if a query contained aggregated columns, every other selected column had to appear in the GROUP BY clause. These days you are allowed to omit columns that are provably functionally dependent on one or more of the grouped-by columns. In practice, that means you can omit any column that is not in the table's primary key, provided all the primary key columns appear in the GROUP BY.
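
Here is a minimal sketch of that exemption (the tables are hypothetical, just for illustration):
CREATE TABLE customers (
    id integer PRIMARY KEY,
    name text,
    city text
);

CREATE TABLE orders (
    customer_id integer REFERENCES customers (id),
    amount numeric
);

-- name and city are functionally dependent on customers.id,
-- so they may appear in the select list without being grouped by:
SELECT c.id, c.name, c.city, sum(o.amount)
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.id;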

Sometimes people (myself included) just do this fairly mindlessly, but sometimes it bites you. Consider this simple query:
SELECT a.id as a_id, a.properties, sum(b.amount) as expenses
FROM people a
JOIN expenses b on a.id = b.person_id
GROUP BY a.id, a.properties
We don't really want the expenses grouped by the person's properties; we only put that in because the parser complains if we don't. And if people turns out to be a view joining a couple of tables, the functional-dependence exemption doesn't apply, so we probably can't leave it out either. Carrying the extra column can increase the amount of sorting that the GROUP BY requires, which can sometimes have dramatic effects on performance. But even worse, there are cases where it can actually make the query unrunnable. One such case is when properties is a JSON column.

That might surprise you. It has certainly surprised a couple of people I know. The reason is that there is no equality operator for JSON.
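
A quick illustration (hypothetical table; the json type, unlike jsonb, has no equality operator, and PostgreSQL reports an error along these lines):
CREATE TABLE people (
    id integer PRIMARY KEY,
    properties json
);

SELECT properties, count(*)
FROM people
GROUP BY properties;
-- ERROR:  could not identify an equality operator for type json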

So, how can we write this so we only GROUP BY what we really need to? One way is to pick up the extra column later in the query, after we have done the grouping, like this:
WITH exp as 
(
SELECT a.id as a_id, sum(b.amount) as expenses
FROM people a
JOIN expenses b on a.id = b.person_id
GROUP BY a.id
)
SELECT exp.*, p.properties
FROM exp
JOIN people p ON p.id = exp.a_id
This might look a bit silly: we're adding an extra join to people that we shouldn't need. But in my experience this often works pretty well. What you pay for the extra join is frequently repaid by the simpler GROUP BY, which processes smaller rows, uncluttered by the extra columns you want carried through. And in the case of a JSON column, it has the virtue that it will work at all.
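
The same shape can also be written as a plain subquery in FROM, which is equivalent (and in PostgreSQL releases before 12, where a CTE is always an optimization fence, it gives the planner more freedom):
SELECT exp.*, p.properties
FROM (
    SELECT a.id as a_id, sum(b.amount) as expenses
    FROM people a
    JOIN expenses b on a.id = b.person_id
    GROUP BY a.id
) exp
JOIN people p ON p.id = exp.a_id;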

I often get called in to look at queries that run slowly and have huge GROUP BY clauses (I have seen them with 50 or so columns). I almost always start by reducing the GROUP BY to the smallest set possible, and this almost always results in a performance gain.
