Some databases—namely Microsoft SQL Server, IBM Db2, and also PostgreSQL since release 11—offer an include clause in the create index statement. The introduction of this feature to PostgreSQL is the trigger for this long overdue explanation of the include clause.
Before going into the details, let’s start with a short recap on how (non-clustered) B-tree indexes work and what the all-mighty index-only scan is.
Recap: B-tree Indexes
To understand the include clause, you must first understand that using an index affects up to three layers of data structures:

- The B-tree
- The doubly linked list
- The table

The first two structures together form an index, so they could be combined into a single item, i.e. the “B-tree index”. I prefer to keep them separate as they serve different needs and have a different impact on performance. Moreover, explaining the include clause requires making this distinction.
In the general case, the database software starts traversing the B-tree to find the first matching entry at the leaf node level (1). It then follows the doubly linked list until it has found all matching entries (2) and finally it fetches each of those matching entries from the table (3). Actually, the last two steps can be interleaved, but that is not relevant for understanding the general concept.
The following formulas give you a rough idea of how many read operations each of these steps needs. The sum of these three components is the total effort of an index access.

- The B-tree: log100(<rows in table>), often less than 5
- The doubly linked list: <rows read from index> / 100
- The table: <rows read from table>
When loading a few rows, the B-tree makes the greatest contribution to the overall effort. As soon as you need to fetch just a handful of rows from the table, this step takes the lead. In either case—few or many rows—the doubly linked list is usually a minor factor because it stores rows with similar values next to each other so that a single read operation can fetch 100 or even more rows. The formula reflects this by the respective divisor.
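The three components can be made concrete with a back-of-the-envelope calculation—here for a hypothetical table of 1,000,000 rows from which 1,000 index entries are read (PostgreSQL arithmetic; the numbers are illustrative only):

```sql
-- Illustrative only: estimated read operations for fetching
-- 1,000 index entries from a table of 1,000,000 rows.
SELECT ceil(log(100, 1000000.0)) AS btree_reads,      -- 3 levels
       ceil(1000 / 100.0)        AS leaf_node_reads,  -- 10 blocks
       1000                      AS table_reads       -- dominates
```

The table access dominates as soon as more than a handful of rows are fetched—exactly the effect described above.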
The most generic idea about optimization is to do less work to achieve the same goal. When it comes to index access, this means that the database software omits accessing a data structure if it doesn’t need any data from it.
You can read more about the inner workings of B-tree indexes in Chapter 1, “Anatomy of an SQL Index” of SQL Performance Explained.
Recap: Index-Only Scan
The index-only scan does exactly that: it omits the table access if the required data is available in the doubly linked list of the index.
Consider the following index and query I borrowed from “Index-Only Scan: Avoiding Table Access”.
CREATE INDEX idx
ON sales
( subsidiary_id, eur_value )
SELECT SUM(eur_value)
FROM sales
WHERE subsidiary_id = ?
At first glance, you may wonder why the column eur_value is in the index definition at all—it is not mentioned in the where clause.
B-tree Indexes Help Many Clauses
It is a common misconception that indexes only help the where clause.

B-tree indexes can also help the order by, group by, select and other clauses. It is just the B-tree part of an index—not the doubly linked list—that cannot be used by other clauses.
The crucial point in this example is that the B-tree index happens to have all required columns—the database software doesn’t need to access the table itself. This is what we refer to as an index-only scan.
Applying the formulas above, the performance benefit of this is very small if only a few rows satisfy the where clause. On the other hand, if the where clause accepts many rows, e.g. millions, the number of read operations is essentially reduced by a factor of 100.
Note
It is not uncommon that an index-only scan improves performance by one or two orders of magnitude.
The example above uses the fact that the doubly linked list—the leaf nodes of the B-tree—contains the eur_value column. Although the other nodes of the B-tree store that column too, this query has no use for the information in these nodes.
The Include Clause
The include clause allows us to make a distinction between columns we would like to have in the entire index (key columns) and columns we only need in the leaf nodes (include columns). That means it allows us to remove columns from the non-leaf nodes if we don’t need them there.
Using the include clause, we could refine the index for this query:
CREATE INDEX idx
ON sales ( subsidiary_id )
INCLUDE ( eur_value )
The query can still use this index for an index-only scan, thus yielding essentially the same performance.
Besides the obvious differences in the picture, there is also a more subtle difference: the order of the leaf node entries does not take the include columns into account. The index is solely ordered by its key columns. This has two consequences: include columns cannot be used to prevent sorting, nor are they considered for uniqueness (see below).
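A hypothetical example of the first consequence: even though eur_value is stored in the leaf nodes, an order by on it cannot use the index order.

```sql
-- With: CREATE INDEX idx ON sales ( subsidiary_id ) INCLUDE ( eur_value )
SELECT eur_value
  FROM sales
 WHERE subsidiary_id = ?
 ORDER BY eur_value
-- The leaf nodes are ordered by subsidiary_id only, so the
-- database must add an explicit sort step for eur_value.
```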
“Covering Index”
The term “covering index” is sometimes used in the context of index-only scans or include clauses. As this term is often used with a different meaning, I generally avoid it.
What matters is whether a given index can support a given query by means of an index-only scan. Whether or not that index has an include clause or contains all table columns is not relevant.
Compared to the original index definition, the new definition with the include clause has some advantages:
The tree might have fewer levels (<~40%)
As the tree nodes above the doubly linked list do not contain the include columns, the database can store more branches in each block so that the tree might have fewer levels.
The index is slightly smaller (<~3%)
As the non-leaf nodes of the tree don’t contain include columns, the overall size of that index is slightly less. However, the leaf node level of the index needs the most space anyway, so the potential savings in the remaining nodes are very small.
It documents its purpose
This is definitely the most underestimated benefit of the include clause: the reason why the column is in the index is documented in the index definition itself.
Let me elaborate on the last item.
When extending an existing index, it is very important to know exactly why the index is currently defined the way it happens to be defined. The freedom you have in changing the index without breaking any other queries is a direct result of this knowledge.
The following query demonstrates this:
SELECT *
FROM sales
WHERE subsidiary_id = ?
ORDER BY ts DESC
FETCH FIRST 1 ROW ONLY
As before, for a given subsidiary this query fetches the most recent sales entry (ts is for time stamp).
To optimize this query, it would be great to have an index that starts with the key columns (subsidiary_id, ts). With this index, the database software can directly navigate to the latest entry for that subsidiary and return it right away. There is no need to read and sort all of the entries for that subsidiary because the doubly linked list is sorted according to the index key, i.e. the last entry for any given subsidiary must have the greatest ts value for that subsidiary. With this approach, the query is essentially as fast as a primary key lookup. See “Indexing Order By” and “Querying Top-N Rows” for more details about this technique.
Before adding a new index for this query, we should check if there is an existing index that can be changed (extended) to support this trick. This is generally a good practice because extending an existing index has a smaller impact on the maintenance overhead than adding a new index. However, when changing an existing index, we need to make sure that we do not make that index less useful for other queries.
If we look at the original index definition, we encounter a problem:
CREATE INDEX idx
ON sales
( subsidiary_id, eur_value )
To make this index support the order by clause of the above query, we would need to insert the ts column between the two existing columns:
CREATE INDEX idx
ON sales
( subsidiary_id, ts, eur_value )
However, that might render this index less useful for queries that need the eur_value column in the second position, e.g. if it was in the where or order by clause. Changing this index thus involves a considerable risk of breaking other queries—unless we know that there are no such queries. If we don’t know, it is often best to keep the index as it is and create another one for the new query.
The picture changes completely if we look at the index with the include clause:
CREATE INDEX idx
ON sales ( subsidiary_id )
INCLUDE ( eur_value )
As the eur_value column is in the include clause, it is not in the non-leaf nodes and is thus useful neither for navigating the tree nor for ordering. Adding a new column to the end of the key part is relatively safe:
CREATE INDEX idx
ON sales ( subsidiary_id, ts )
INCLUDE ( eur_value )
Even though there is still a small risk of negative impacts for other queries, it is usually worth taking that risk.
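A sketch of how the extended index serves both queries (assuming the definitions above):

```sql
-- Index-only scan: subsidiary_id and eur_value are in the index.
SELECT SUM(eur_value)
  FROM sales
 WHERE subsidiary_id = ?

-- Top-N: the (subsidiary_id, ts) key order avoids the sort.
SELECT *
  FROM sales
 WHERE subsidiary_id = ?
 ORDER BY ts DESC
FETCH FIRST 1 ROW ONLY
```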
From the perspective of index evolution, it is thus very helpful to put columns into the include clause if this is all you need. Columns that are just added to enable an index-only scan are the prime candidates for this.
Filtering on Include Columns
Until now we have focused on how the include clause can enable index-only scans. Let’s also look at another case where it is beneficial to have an extra column in the index.
SELECT *
FROM sales
WHERE subsidiary_id = ?
AND notes LIKE '%search term%'
I’ve made the search term a literal value to show the leading and trailing wildcards—of course you would use a bind parameter in your code.
Now, let’s think about the right index for this query. Obviously, the subsidiary_id needs to be in the first position. If we take the previous index from above, it already satisfies this requirement:
CREATE INDEX idx
ON sales ( subsidiary_id, ts )
INCLUDE ( eur_value )
The database software can use that index with the three-step procedure as described at the beginning: (1) it will use the B-tree to find the first index entry for the given subsidiary; (2) it will follow the doubly linked list to find all sales for that subsidiary; (3) it will fetch all related sales from the table, remove those for which the like pattern on the notes column doesn’t match, and return the remaining rows.
The problem is the last step of this procedure: the table access loads rows without knowing if they will make it into the final result. Quite often, the table access is the biggest contributor to the total effort of running a query. Loading data that is not even selected is a huge performance no-no.
Important
Avoid loading data that doesn’t affect the result of the query.
The challenge with this particular query is that it uses an infix like pattern. Normal B-tree indexes don’t support searching for such patterns. However, B-tree indexes still support filtering on such patterns. Note the emphasis: searching vs. filtering.
In other words, if the notes column were present in the doubly linked list, the database software could apply the like pattern before fetching that row from the table (not in PostgreSQL, see below). This prevents the table access if the like pattern doesn’t match. If the table has more columns, there is still a table access to fetch those columns for the rows that satisfy the where clause—due to the select *.
CREATE INDEX idx
ON sales ( subsidiary_id, ts )
INCLUDE ( eur_value, notes )
If there are more columns in the table, the index does not enable an index-only scan. Nonetheless, it can bring the performance close to that of an index-only scan if the portion of rows that match the like pattern is very low. In the opposite case—if all rows match the pattern—the performance is a little worse due to the increased index size. However, the breakeven is easy to reach: for an overall performance improvement, it is often enough that the like filter removes a small percentage of the rows. Your mileage will vary depending on the size of the involved columns.
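To judge which side of the breakeven you are on, you could measure the filter’s selectivity first—a hypothetical sketch using PostgreSQL’s aggregate filter syntax:

```sql
-- What fraction of a subsidiary's rows survive the like filter?
SELECT count(*) FILTER (WHERE notes LIKE '%search term%')
       / count(*)::float AS match_ratio
  FROM sales
 WHERE subsidiary_id = 1
-- A low match_ratio means index-level filtering saves many
-- table fetches; a ratio near 1 means there is little to gain.
```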
Unique Indexes with Include Clause
Last but not least, there is an entirely different aspect of the include clause: unique indexes with an include clause only consider the key columns for the uniqueness.
That allows us to create unique indexes that have additional columns in the leaf nodes, e.g. for an index-only scan.
CREATE UNIQUE INDEX …
ON … ( id )
INCLUDE ( payload )
This index protects against duplicate values in the id column, yet it supports an index-only scan for the next query.
SELECT payload
FROM …
WHERE id = ?
Note that the include clause is not strictly required for this behavior: databases that make a proper distinction between unique constraints and unique indexes just need an index with the unique key columns as the leftmost columns—additional columns are fine.
For the Oracle Database, the corresponding syntax is this:
CREATE INDEX …
ON … ( id, payload )
ALTER TABLE … ADD UNIQUE ( id )
USING INDEX …
Compatibility
![Availability of INCLUDE]()
PostgreSQL: No Filtering Before Visibility Check
The PostgreSQL database has a limitation when it comes to applying filters on the index level. The short story is that it doesn’t do it, except in a few cases. Even worse, some of those cases only work when the respective data is stored in the key part of the index, not in the include clause. That means moving columns to the include clause may negatively affect performance, even if the logic described above still applies.
The long story starts with the fact that PostgreSQL keeps old row versions in the table until they become invisible to all transactions and the vacuum process removes them at some later point in time. To know whether a row version is visible (to a given transaction) or not, each row version carries two extra attributes that indicate when it was created and deleted: xmin and xmax. The row is only visible if the current transaction falls within the xmin/xmax range.
Unfortunately, the xmin/xmax values are not stored in indexes.
That means that whenever PostgreSQL is looking at an index entry, it cannot tell whether or not that entry is visible to the current transaction. It could be a deleted entry or an entry that has not yet been committed. The canonical way to find out is to look into the table and check the xmin/xmax values.
A consequence is that there is no such thing as an index-only scan in PostgreSQL. No matter how many columns you put into an index, PostgreSQL will always need to check the visibility, which is not available in the index.
Yet there is an Index Only Scan operation in PostgreSQL—but it still needs to check the visibility of each row version by accessing data outside the index. Instead of going to the table, the Index Only Scan first checks the so-called visibility map. The visibility map is very dense, so the number of read operations is (hopefully) lower than for fetching xmin/xmax from the table. However, the visibility map does not always give a definite answer: it either states that the row is known to be visible, or that the visibility is unknown. In the latter case, the Index Only Scan still needs to fetch xmin/xmax from the table (shown as “Heap Fetches” in explain analyze).
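You can observe this in PostgreSQL with explain (analyze, buffers)—a sketch (the plan lines in the comments are illustrative, not an actual measurement):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT SUM(eur_value)
  FROM sales
 WHERE subsidiary_id = 1
-- Index Only Scan using idx on sales (...)
--   Heap Fetches: 0   -- zero only if the visibility map marked
--                     -- all touched pages as all-visible
```

Running vacuum updates the visibility map, which is why the number of heap fetches often drops after a vacuum run.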
After this short visibility digression, we can return to filtering on the index level.
SQL allows arbitrarily complex expressions in the where clause. These expressions might also cause runtime errors such as “division by zero”. If PostgreSQL evaluated such an expression before confirming the visibility of the respective entry, even invisible rows could cause such errors. To prevent this, PostgreSQL generally checks the visibility before evaluating such expressions.
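A hypothetical illustration of such a dangerous expression (quantity is an assumed column):

```sql
SELECT *
  FROM sales
 WHERE 100 / quantity > 5
-- If quantity is 0 in a deleted, invisible row version,
-- evaluating this predicate before the visibility check
-- would raise a "division by zero" error for a row the
-- query should never see.
```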
There is one exception to this general rule. As the visibility cannot be checked while searching an index, operators that can be used for searching must always be safe to use. These are the operators that are defined in the respective operator class. If a simple comparison filter uses an operator from such an operator class, PostgreSQL can apply that filter before checking the visibility, because it knows that these operators are safe to use. The crux is that only key columns have an operator class associated with them. Columns in the include clause don’t—filters based on them are not applied before the visibility of the respective row is confirmed. This is my understanding from a thread on the PostgreSQL hackers mailing list.
For a demonstration, take the previous index and query:
CREATE INDEX idx
ON sales ( subsidiary_id, ts )
INCLUDE ( eur_value, notes )
SELECT *
FROM sales
WHERE subsidiary_id = ?
AND notes LIKE '%search term%'
The execution plan—edited for brevity—could look like this:
QUERY PLAN
----------------------------------------------
Index Scan using idx on sales (actual rows=16)
Index Cond: (subsidiary_id = 1)
Filter: (notes ~~ '%search term%')
Rows Removed by Filter: 240
Buffers: shared hit=54
The like filter is shown in Filter, not in Index Cond. That means it was applied at the table level. Also, the number of shared hits is rather high for fetching 16 rows.
In a Bitmap Index/Heap Scan, the phenomenon becomes more obvious.
QUERY PLAN
-----------------------------------------------
Bitmap Heap Scan on sales (actual rows=16)
  Recheck Cond: (subsidiary_id = 1)
Filter: (notes ~~ '%search term%')
Rows Removed by Filter: 240
Heap Blocks: exact=52
Buffers: shared hit=54
-> Bitmap Index Scan on idx (actual rows=256)
Index Cond: (subsidiary_id = 1)
Buffers: shared hit=2
The Bitmap Index Scan does not mention the like filter at all. Instead, it returns 256 rows—way more than the 16 that satisfy the where clause.
Note that this is not a particularity of the include column in this case. Moving the include columns into the index key gives the same result.
CREATE INDEX idx
    ON sales ( subsidiary_id, ts, eur_value, notes )
QUERY PLAN
-----------------------------------------------
Bitmap Heap Scan on sales (actual rows=16)
Recheck Cond: (subsidiary_id = 1)
Filter: (notes ~~ '%search term%')
Rows Removed by Filter: 240
Heap Blocks: exact=52
Buffers: shared hit=54
-> Bitmap Index Scan on idx (actual rows=256)
Index Cond: (subsidiary_id = 1)
Buffers: shared hit=2
This is because the like operator is not part of the operator class, so it is not considered to be safe.

If you use an operator from the operator class, e.g. equality, the execution plan changes.
SELECT *
FROM sales
WHERE subsidiary_id = ?
AND notes = 'search term'
The Bitmap Index Scan now applies all conditions from the where clause and only passes the remaining 16 rows on to the Bitmap Heap Scan.
QUERY PLAN
----------------------------------------------
Bitmap Heap Scan on sales (actual rows=16)
Recheck Cond: (subsidiary_id = 1
AND notes = 'search term')
Heap Blocks: exact=16
Buffers: shared hit=18
-> Bitmap Index Scan on idx (actual rows=16)
Index Cond: (subsidiary_id = 1
AND notes = 'search term')
Buffers: shared hit=2
Note that this requires the respective column to be a key column. If you move the notes column back to the include clause, it has no associated operator class, so the equals operator is not considered safe anymore. Consequently, PostgreSQL postpones applying this filter to the table access, after the visibility is checked.
QUERY PLAN
-----------------------------------------------
Bitmap Heap Scan on sales (actual rows=16)
  Recheck Cond: (subsidiary_id = 1)
Filter: (notes = 'search term')
Rows Removed by Filter: 240
Heap Blocks: exact=52
Buffers: shared hit=54
-> Bitmap Index Scan on idx (actual rows=256)
     Index Cond: (subsidiary_id = 1)
Buffers: shared hit=2
“A Close Look at the Index Include Clause” by Markus Winand was originally published at Use The Index, Luke!.