Recently on the Postgres Slack, I encountered an interesting performance issue involving a SQL query that joins two tables with an ANY filter applied to one of the tables.
The problematic SQL was similar to the following:
SELECT
tbl1.col1
FROM
tbl1
INNER JOIN tbl2 ON tbl1.col1 = tbl2.col1
WHERE
tbl2.col1 IN (1,2,3);
Table tbl1 is joined to tbl2 on col1, and the filter is applied to tbl2.col1, the same column that is used in the join.
Let's look at the problematic execution plan using mock tables.
--Tested on PostgreSQL 16.3
create table tbl1 as
select col1, col1::text as col2 , col1*0.999 as col3
from generate_series(1,100) as col1;
create table tbl2 as
select col1, col1::text as col2 , col1*0.999 as col3
from generate_series(1,10) as col1;
explain (analyze, buffers)
SELECT
tbl1.col1
FROM
tbl1
INNER JOIN tbl2 ON tbl1.col1 = tbl2.col1
WHERE
tbl2.col1 = ANY (ARRAY[1,2,3]);
Execution plan:
                                   QUERY PLAN
----------------------------------------------
 Hash Join  (cost=1.18..3.58 rows=3 width=4) (actual time=1.354..1.404 rows=3 loops=1)
   Hash Cond: (tbl1.col1 = tbl2.col1)
   Buffers: shared hit=2
   ->  Seq Scan on tbl1  (cost=0.00..2.00 rows=100 width=4) (actual time=0.687..0.705 rows=100 loops=1)
         Buffers: shared hit=1
   ->  Hash  (cost=1.14..1.14 rows=3 width=4) (actual time=0.586..0.586 rows=3 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         Buffers: shared hit=1
         ->  Seq Scan on tbl2  (cost=0.00..1.14 rows=3 width=4) (actual time=0.034..0.041 rows=3 loops=1)
               Filter: (col1 = ANY ('{1,2,3}'::integer[]))
               Rows Removed by Filter: 7
               Buffers: shared hit=1
 Planning Time: 3.131 ms
 Execution Time: 2.049 ms
(14 rows)
Key Observations from Problematic Execution Plan
- The filter ANY ('{1,2,3}'::integer[]) is not pushed down to the access path for tbl1, even though tbl1 is joined on the same column the filter is applied to.
- As a result, the access method for tbl1 is not influenced by the filter on tbl2.col1: all 100 rows of tbl1 are scanned and fed into the join.
Let's test whether the filter is pushed down to tbl1 for other conditions in the same SQL.
SQL 1 – Equality Filter tbl2.col1 = 1;
explain (analyze, buffers)
SELECT
tbl1.col1
FROM
tbl1
INNER JOIN tbl2 ON tbl1.col1 = tbl2.col1
WHERE
tbl2.col1 = 1;
                                   QUERY PLAN
----------------------------------------------
 Nested Loop  (cost=0.00..3.38 rows=1 width=4) (actual time=0.144..0.168 rows=1 loops=1)
   Buffers: shared hit=2
   ->  Seq Scan on tbl1  (cost=0.00..2.25 rows=1 width=4) (actual time=0.115..0.134 rows=1 loops=1)
         Filter: (col1 = 1)
         Rows Removed by Filter: 99
         Buffers: shared hit=1
   ->  Seq Scan on tbl2  (cost=0.00..1.12 rows=1 width=4) (actual time=0.024..0.028 rows=1 loops=1)
         Filter: (col1 = 1)
         Rows Removed by Filter: 9
         Buffers: shared hit=1
 Planning Time: 1.343 ms
 Execution Time: 0.282 ms
(12 rows)
The filter in the WHERE clause (tbl2.col1 = 1) is implicitly applied to both tables, tbl1 and tbl2: the planner infers tbl1.col1 = 1 from the join condition tbl1.col1 = tbl2.col1.
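To see how far this implicit propagation of an equality filter goes, the same test can be extended to a chain of joins. The sketch below assumes a third table, tbl3, which is hypothetical and exists only for this illustration. If the planner treats all three join columns as equal to the constant, each scan in the plan should carry its own Filter: (col1 = 1) line; check the output on your own version rather than taking that for granted.

```sql
-- tbl3 is a hypothetical third table, added only for this illustration.
CREATE TABLE tbl3 AS
SELECT col1, col1::text AS col2
FROM generate_series(1, 10) AS col1;

-- Equality filter on tbl3 only; inspect the Filter lines
-- under the scans of tbl1 and tbl2 in the resulting plan.
EXPLAIN
SELECT
    tbl1.col1
FROM
    tbl1
    INNER JOIN tbl2 ON tbl1.col1 = tbl2.col1
    INNER JOIN tbl3 ON tbl2.col1 = tbl3.col1
WHERE
    tbl3.col1 = 1;
```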
SQL 2 – Filter tbl2.col1 in (1,2)
explain (analyze, buffers)
SELECT
tbl1.col1
FROM
tbl1
INNER JOIN tbl2 ON tbl1.col1 = tbl2.col1
WHERE
tbl2.col1 in (1,2);
                                   QUERY PLAN
----------------------------------------------
 Hash Join  (cost=1.15..3.54 rows=2 width=4) (actual time=0.183..0.213 rows=2 loops=1)
   Hash Cond: (tbl1.col1 = tbl2.col1)
   Buffers: shared hit=2
   ->  Seq Scan on tbl1  (cost=0.00..2.00 rows=100 width=4) (actual time=0.106..0.119 rows=100 loops=1)
         Buffers: shared hit=1
   ->  Hash  (cost=1.12..1.12 rows=2 width=4) (actual time=0.040..0.041 rows=2 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         Buffers: shared hit=1
         ->  Seq Scan on tbl2  (cost=0.00..1.12 rows=2 width=4) (actual time=0.016..0.019 rows=2 loops=1)
               Filter: (col1 = ANY ('{1,2}'::integer[]))
               Rows Removed by Filter: 8
               Buffers: shared hit=1
 Planning Time: 0.291 ms
 Execution Time: 0.285 ms
(14 rows)
The IN clause is transformed into ANY and is not pushed down as a filter on tbl1; it is applied only to tbl2.
In short, an ANY or IN filter is not propagated to other tables that are joined on the filtered column.
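The same check can be repeated for other non-equality filters. As far as I know, the planner only derives implied predicates from a plain equality with a constant, so a range filter would be expected to stay on tbl2 as well. That is an assumption, not something tested above, so it is worth confirming on your own version:

```sql
-- Range filter on the join column of tbl2 only.
-- In the plan, check whether the scan on tbl1 gets a Filter line.
EXPLAIN
SELECT
    tbl1.col1
FROM
    tbl1
    INNER JOIN tbl2 ON tbl1.col1 = tbl2.col1
WHERE
    tbl2.col1 BETWEEN 1 AND 3;
```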
Solution
Rewrite the SQL to apply the filter manually to the join column of each table.
SELECT
tbl1.col1
FROM
tbl1
INNER JOIN tbl2 ON tbl1.col1 = tbl2.col1
WHERE
tbl2.col1 = ANY (ARRAY[1,2,3])
and tbl1.col1 = ANY (ARRAY[1,2,3]); -- Newly added.
After the change, the execution plan shows the filter applied to each table:
Filter: (col1 = ANY ('{1,2,3}'::integer[]))
                                   QUERY PLAN
----------------------------------------------
 Hash Join  (cost=1.18..3.57 rows=1 width=4) (actual time=0.313..0.353 rows=3 loops=1)
   Hash Cond: (tbl1.col1 = tbl2.col1)
   Buffers: shared hit=2
   ->  Seq Scan on tbl1  (cost=0.00..2.38 rows=3 width=4) (actual time=0.158..0.193 rows=3 loops=1)
         Filter: (col1 = ANY ('{1,2,3}'::integer[]))
         Rows Removed by Filter: 97
         Buffers: shared hit=1
   ->  Hash  (cost=1.14..1.14 rows=3 width=4) (actual time=0.065..0.065 rows=3 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         Buffers: shared hit=1
         ->  Seq Scan on tbl2  (cost=0.00..1.14 rows=3 width=4) (actual time=0.035..0.040 rows=3 loops=1)
               Filter: (col1 = ANY ('{1,2,3}'::integer[]))
               Rows Removed by Filter: 7
               Buffers: shared hit=1
 Planning Time: 0.447 ms
 Execution Time: 0.453 ms
(16 rows)
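The mock tables fit in a single page each, so every plan here uses sequential scans and the effect of the filter shows up mainly in the row counts. A sketch for repeating the comparison on a larger, indexed table follows. The table name tbl1_big, the index name, and the row count are assumptions made for this illustration and were not part of the original test. Note that with an index in place the planner may reach tbl1_big through an index scan driven by the join in both cases, so compare the two plans rather than assuming a difference.

```sql
-- Hypothetical larger copy of tbl1 (name and size are assumptions).
CREATE TABLE tbl1_big AS
SELECT col1, col1::text AS col2, col1 * 0.999 AS col3
FROM generate_series(1, 1000000) AS col1;

CREATE INDEX tbl1_big_col1_idx ON tbl1_big (col1);

-- Refresh statistics so the planner sees the real table sizes.
ANALYZE tbl1_big;
ANALYZE tbl2;

-- Filter on tbl2 only: look at how tbl1_big is accessed.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    tbl1_big.col1
FROM
    tbl1_big
    INNER JOIN tbl2 ON tbl1_big.col1 = tbl2.col1
WHERE
    tbl2.col1 = ANY (ARRAY[1,2,3]);

-- Filter repeated on tbl1_big: compare access path and buffer counts.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    tbl1_big.col1
FROM
    tbl1_big
    INNER JOIN tbl2 ON tbl1_big.col1 = tbl2.col1
WHERE
    tbl2.col1 = ANY (ARRAY[1,2,3])
    AND tbl1_big.col1 = ANY (ARRAY[1,2,3]);
```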
Ideally, Postgres would propagate such a predicate to the other side of the join automatically when it is applied to a join column.
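Until then, the duplicated list has to be kept in sync by hand. One way to write the list only once is to pass it as a single array parameter, for example through a prepared statement. This is a minimal sketch, and the statement name filtered_join is a hypothetical name chosen for the example:

```sql
-- The array is supplied once as $1 and used in both filters.
PREPARE filtered_join (integer[]) AS
SELECT
    tbl1.col1
FROM
    tbl1
    INNER JOIN tbl2 ON tbl1.col1 = tbl2.col1
WHERE
    tbl2.col1 = ANY ($1)
    AND tbl1.col1 = ANY ($1);

-- Inspect the plan for a concrete list of values.
EXPLAIN (ANALYZE, BUFFERS)
EXECUTE filtered_join (ARRAY[1,2,3]);

-- Clean up the prepared statement for this session.
DEALLOCATE filtered_join;
```

Application code that uses bind parameters can do the same thing by binding one array value to both placeholders.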