Shaun M. Thomas: PG Phriday: 10 Ways to Ruin Performance: MAXimized Value

I apologize for putting this series on a short hiatus last week for the 4th of July. But worry not, for this week is something special for all the developers out there! I’m going to try to make your life easier for a change. Screw the database!

As a PGDB (PostgreSQL) DBA, it’s easy to get tied up in performance hints, obscure syntax, and mangled queries, but it’s really all about the people. These men and women who hit our databases all day long in an attempt to hit insane deadlines often stare at the screen in defeat, trying to get something to work because they’re afraid to ask for help. I know, because I used to be one of them in my bygone developer days.

Sometimes performance isn’t just about the database. Queries can be insanely complex for seemingly simple requests. Even developers who are familiar with SQL syntax don’t speak it fluently, and will often come up with rather bizarre—yet functional—solutions to get a specific data set. Nobody wins when this happens, not the poor dev who fought a truculent query all day, nor the DBA who probably doesn’t understand all the business logic that drove it.

Let’s use an example from a company I worked with in the past. This data set should give us something fairly close to what happened back then.

CREATETABLE sys_order
(
    order_id     SERIAL       NOTNULL,
    account_id   INTNOTNULL,
    product_id   INTNOTNULL,
    item_count   INTNOTNULL,
    order_dt     TIMESTAMPTZ  NOTNULLDEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL);
 
INSERTINTO sys_order (account_id, product_id, item_count,
       order_dt, valid_dt)SELECT(a.id % 1000)+1,(a.id % 100000)+1,(a.id % 100)+1,
       now()-(id % 1000||'d')::INTERVAL+(id ||'ms')::INTERVAL,CASEWHEN a.id % 499=0THENNULLELSE now()-(id % 999||'d')::INTERVALENDFROM generate_series(1,1000000) a(id);
 
ALTERTABLE sys_order ADDCONSTRAINT pk_order_id
      PRIMARYKEY(order_id);
 
CREATEINDEX idx_order_account_id
    ON sys_order (account_id);
 
CREATEINDEX idx_order_order_dt
    ON sys_order (order_dt DESC);
 
ANALYZE sys_order;

Consider the scenario: retrieve all of the order details for the most recent order on every account in the system that ordered something in the last month. The developer tasked with solving this conundrum knew about GROUP BY syntax, and also had enough experience to realize it wouldn’t work. Using GROUP BY is mandatory for all non-aggregate columns in a query. Thus, if we wanted the maximum order date for each account, we’d lose all of the other order information.

Given those constraints, the developer came up with this:

SELECT o.*FROM sys_order o
  JOIN(SELECT account_id,MAX(order_dt)AS order_dt
          FROM sys_order
         WHERE order_dt >CURRENT_DATE-INTERVAL'1 month'GROUPBY account_id
       ) s USING(account_id, order_dt);

It may not be immediately obvious, but this approach only worked because the order system was built in such a way that an account could not order two different items simultaneously. Without that scenario, more than one row would come back for each account when joined back to the order table. But beyond that caveat, there’s an even more distressing implication. Forget the previous discussion if you can, then ask yourself this question: what does that query do?

I’ll admit I scrached my head for a couple minutes the first time I encountered it. The intent is almost completely obfuscated by the structure, which really defeats the purpose of SQL as a language. Generally the idea of SQL is to make a request, not tell the database how to fulfill it. So that’s why it’s a good thing SQL has syntax specifically to address this kind of request.

In PGDB, the DISTINCT ON clause lets us specify columns that should be used in determining whether or not a row is actually distinct, independent of any potential aggregation. It works like this: the first unique combination of any listed columns wins. That means we can naturally control the output simply by issuing a standard ORDER BY.

Here’s how that looks:

SELECTDISTINCTON(account_id)*FROM sys_order
 WHERE order_dt >CURRENT_DATE-INTERVAL'1 month'ORDERBY account_id, order_dt DESC

Now look at the query and ask the same question as before: what does this query do? Well, for each distinct account, return all columns for the most recent order in the last month. Isn’t that much more straight-forward? As a bonus, this doesn’t have the potential for breaking if the system changes and allows batch ordering of multiple products per transaction.

So we now have a query that’s easier to understand, is safer and more consistent, and uses far simpler SQL syntax. The only possible drawback is that DISTINCT ON is often a bit slower than a more convoluted approach. But that’s fine; developer time is valuable. As a DBA, I know which query I’d rather try and debug!

Hopefully, I’m not the only one.