For PostgreSQL power users, automating repeated steps is becoming more and more necessary, and \gexec can help. This blog will show you how to use the || operator and the \gexec command to avoid unnecessary repetition in your workflow.
The CLI client that ships with PostgreSQL is called psql. Like many CLI clients, it is often overlooked and replaced with something with a GUI, or it is only used for the most basic tasks, while more complex operations are carried out elsewhere. However, psql is a very capable tool with lots of useful features.
One common pattern is the need to run the same command with different arguments. Often, users simply rewrite the command over and over, or sometimes they may opt to use a text editor to write the command once, then copy and paste and edit it to accommodate different arguments.
Sometimes it can be useful to automate such steps, not only in the interest of saving time, but also in the interest of avoiding errors due to typos or copy-pasting. PostgreSQL can take the results of queries and add text to create commands with those results as arguments.
For this purpose, we can prepend or append text to any query result using the || operator.
Exercise 1: using the || operator
Let’s assume a new user needs access to some tables in a schema, e.g. all those tables that match a certain prefix.
Now, we could do this manually, or ask the database to automate the boring stuff.
1. Let’s retrieve the relevant tables with names starting with pgbench
postgres=# SELECT tablename FROM pg_tables WHERE tablename~'^pgbench';
tablename
------------------
pgbench_accounts
pgbench_branches
pgbench_history
pgbench_tellers
(4 rows)
2. Let’s use || to prepend and append command fragments to create a valid command with the tablename as a parameter.
postgres=# SELECT 'GRANT SELECT ON TABLE ' || tablename || ' TO someuser;' FROM pg_tables WHERE tablename~'^pgbench';
?column?
-----------------------------------------------------
GRANT SELECT ON TABLE pgbench_accounts TO someuser;
GRANT SELECT ON TABLE pgbench_branches TO someuser;
GRANT SELECT ON TABLE pgbench_history TO someuser;
GRANT SELECT ON TABLE pgbench_tellers TO someuser;
(4 rows)
Note that the prepended and appended strings begin or end with extra spaces, since the table name itself does not contain the spaces needed for argument separation. The semicolon ; was also added, so these commands can be run straight away.
Now, these commands could be copied and then pasted straight into the prompt.
I’ve even seen people take such lines, store them into a file and then have psql execute all commands from the file.
But thankfully, a much easier way exists.
\gexec
In psql, there are many shortcuts and helpers to quickly gather info about the database, schemas, tables, privileges and much more.
The psql shell allows for working on the input and output buffers, and this can be used together with \gexec to have psql execute each command from the output buffer.
Exercise 2: calling \gexec
Reusing the query to generate the necessary commands, we can call \gexec to execute each line from the previous output.
postgres=# SELECT 'GRANT SELECT ON TABLE ' || tablename || ' TO someuser;' FROM pg_tables WHERE tablename~'^pgbench';
?column?
-----------------------------------------------------
GRANT SELECT ON TABLE pgbench_accounts TO someuser;
GRANT SELECT ON TABLE pgbench_branches TO someuser;
GRANT SELECT ON TABLE pgbench_history TO someuser;
GRANT SELECT ON TABLE pgbench_tellers TO someuser;
(4 rows)
postgres=# \gexec
GRANT
GRANT
GRANT
GRANT
Exercise 3: a cross join with \gexec
Assuming that you want to do something involving more arguments, you can always add more || to add more command fragments around the results from a query.
Suppose you need to grant privileges to insert, update, and delete from those tables as well.
A simple cross join gives us the desired action (constructed as a relation using the VALUES constructor) for each of the table names.
Note that we explicitly assign the action column name using AS t(action) to the table generated using VALUES.
postgres=# SELECT 'GRANT ' || action || ' ON TABLE ' || tablename || ' TO someuser;' FROM pg_tables CROSS JOIN (VALUES ('INSERT'),('UPDATE'),('DELETE')) AS t(action) WHERE tablename~'^pgbench';
?column?
-----------------------------------------------------
GRANT INSERT ON TABLE pgbench_accounts TO someuser;
GRANT UPDATE ON TABLE pgbench_accounts TO someuser;
GRANT DELETE ON TABLE pgbench_accounts TO someuser;
GRANT INSERT ON TABLE pgbench_branches TO someuser;
GRANT UPDATE ON TABLE pgbench_branches TO someuser;
GRANT DELETE ON TABLE pgbench_branches TO someuser;
GRANT INSERT ON TABLE pgbench_history TO someuser;
GRANT UPDATE ON TABLE pgbench_history TO someuser;
GRANT DELETE ON TABLE pgbench_history TO someuser;
GRANT INSERT ON TABLE pgbench_tellers TO someuser;
GRANT UPDATE ON TABLE pgbench_tellers TO someuser;
GRANT DELETE ON TABLE pgbench_tellers TO someuser;
(12 rows)
This output can then again be executed using \gexec.
postgres=# \gexec
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
Exercise 4: adding quotes
Depending on the circumstances, it may be necessary to add extra quotes to the output, for example when table names contain capital letters or spaces. In such cases, matching double quotes " can be added to the strings prepended and appended to the arguments.
postgres=# SELECT 'GRANT SELECT ON TABLE "' || tablename || '" TO someuser;' FROM pg_tables WHERE schemaname='public';
?column?
-----------------------------------------------------
GRANT SELECT ON TABLE "with spaces" TO someuser;
GRANT SELECT ON TABLE "Capitalization" TO someuser;
GRANT SELECT ON TABLE "capitalization" TO someuser;
(3 rows)
postgres=# \gexec
GRANT
GRANT
GRANT
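If adding the quotes by hand feels error-prone, the same result can be produced with Postgres' format() function, whose %I specifier double-quotes an identifier only when necessary; the output works with \gexec just the same. A minimal sketch, reusing someuser from the examples above:

```sql
SELECT format('GRANT SELECT ON TABLE %I TO someuser;', tablename)
FROM pg_tables
WHERE schemaname = 'public';
```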
Now that you know how to use \gexec, why not take the next step? Take a look at our blog on column order in PostgreSQL to see it used in another practical example.
We released Citus 11 a few weeks ago, and it is packed. Citus went fully open source, so previously enterprise-only features, like the non-blocking aspect of the shard rebalancer and multi-user support, are now open source for everyone to enjoy. Another huge change in Citus 11 is that you can now query your distributed Postgres tables from any Citus node, by default.
When using Citus to distribute Postgres before Citus 11, the coordinator node was your application's only point of contact. Your application needed to connect to the coordinator to query your distributed Postgres tables. The coordinator node can handle high query throughput, about 100K queries per second, but your application might need even more processing power. Thanks to our work in Citus 11, you can now query from any node in the Citus database cluster. In Citus 11 we sync the metadata to all nodes by default, so you can connect to any node and run queries on your tables.
Running queries from any node is great, but you also need to be able to monitor and manage your queries from any node. Before, when you only connected to the coordinator, using Postgres' monitoring tools was enough, but that is no longer the case. So in Citus 11 we added some ways to observe your queries, similar to what you would do in a single Postgres instance.
In this blogpost you'll learn about some new monitoring tools introduced in Citus 11 that'll help you track and take control of your distributed queries, including:
the Citus global process id (global PID)
the citus_stat_activity and citus_dist_stat_activity views
the citus_lock_waits view
pg_cancel_backend and pg_terminate_backend with global PIDs
and some more helper functions
You can now use the Citus global process id, new to Citus 11. The global PID is just like Postgres' process id, but it is unique across a Citus cluster. We call this new value the global process identifier, or global PID or GPID for short.
To find the global PID of your current backend, you can use our new citus_backend_gpid function:
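The call itself is a plain SELECT (the value you get back will of course differ per backend):

```sql
SELECT citus_backend_gpid();
```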
We tried to make GPIDs human readable, so they consist of the node id of your current node followed by the Postgres PID of your current backend. The lowest 10 digits of a global PID are the Postgres process id, which you can find with the pg_backend_pid function, and the remaining digits are the node id, which you can find in the pg_dist_node table.
Figure 1: Global PID, the new identifier number for queries in Citus database clusters, consists of the Citus node id (of the node the query started in) followed by the Postgres PID of the backend of the query.
Global PIDs are unique in a Citus cluster. Also, remember that a distributed query might need to run some queries on the shards. Those shard query executions also get the same GPID. In other words, all the activity of a distributed query can be traced via the same GPID. For example, if you run a SELECT query that has the GPID 110000000123, the queries that will SELECT from the shards will also have 110000000123 as global PID.
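Splitting the example GPID 110000000123 back into its parts is plain integer arithmetic:

```sql
SELECT 110000000123 / 10000000000 AS node_id,      -- 11
       110000000123 % 10000000000 AS postgres_pid; -- 123
```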
Note that global PIDs are big integers, whereas Postgres PIDs are 4-byte integers.
New citus_stat_activity view to give you pg_stat_activity views across a Citus cluster
To find the Citus global PIDs, and more information about your Postgres queries in a Citus cluster, you can use our new citus_stat_activity view. citus_stat_activity is a collection of the pg_stat_activity views from all nodes in the Citus cluster: when you query citus_stat_activity, it goes to every node and gathers their pg_stat_activity rows. citus_stat_activity includes all the columns from pg_stat_activity, plus three extra columns:
global_pid: the Citus global process id associated with the query.
nodeid: the Citus node id of the node the citus_stat_activity row comes from.
is_worker_query: a boolean value showing whether the row comes from one of the queries that run on the shards.
Let’s say you have a distributed table tbl with 4 shards, and you run an update query on it from one node:
BEGIN;
UPDATE tbl SET b = 100 WHERE a = 0;
You can connect to any node and use citus_stat_activity to find info about the UPDATE query.
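For example, a query along these lines (filtering on the query text is just one way to narrow the result down) should return both the original query and the shard query:

```sql
SELECT global_pid, nodeid, is_worker_query, query
FROM citus_stat_activity
WHERE query LIKE '%UPDATE%tbl%';
```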
record 1 is the original query that we ran on node with nodeid 11
record 2 is the query that runs on the shard from node with nodeid 2
both records have the same global_pid
the is_worker_query column is true for record 2 and false for the original query, record 1.
Don’t forget, citus_stat_activity includes all the columns of pg_stat_activity, not just the ones we filtered for demonstration here. So you can find much more information in the citus_stat_activity view.
Use citus_dist_stat_activity view to get summarized info on your queries
If you are not interested in every single query from all the nodes in the Citus cluster, and only care about the original distributed queries, you can use the citus_dist_stat_activity view.
citus_dist_stat_activity hides the queries that run on the shards from the citus_stat_activity view, so you can find some high level information about your Postgres queries.
As you might have guessed, citus_dist_stat_activity is citus_stat_activity filtered with is_worker_query = false. We created citus_dist_stat_activity because, when you are interested in the query as a whole and not in each individual process on the shards, the general information it provides about the initial queries should be enough.
Find blocking processes with citus_lock_waits view
When something in your Postgres database is blocked is when you need monitoring the most. Citus 11 has you covered when your cluster is blocked, too. The newly updated citus_lock_waits view shows the queries in your cluster that are waiting for a lock held by another query.
Let’s say you run a DELETE query on tbl that will be blocked by the previous UPDATE query:
DELETE FROM tbl WHERE a = 0;
You can connect to any node and use citus_lock_waits to find out which query is blocking your new query:
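In the simplest case, you just select everything from the view (the exact column set may vary between Citus versions):

```sql
SELECT * FROM citus_lock_waits;
```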
The result above shows the UPDATE statement is blocking the DELETE statement.
Once you find the blocking queries you can use citus_stat_activity and citus_dist_stat_activity with the global PIDs from citus_lock_waits to gather more insight.
Cancel a Postgres query from any Citus node with pg_cancel_backend
After you find the blocking and blocked queries and gather more information about them, you might decide you need to cancel one of them. Before Citus 11, you needed to go to the node the query was running on, and then use pg_cancel_backend with the process id to cancel it.
Now in Citus 11 we override the pg_cancel_backend function to accept global PIDs too.
So, good news: things are now easier, and you can cancel queries on your Citus cluster from any node:
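For example, to cancel a query using its global PID (the value here is illustrative):

```sql
SELECT pg_cancel_backend(110000000123); -- a global PID, not a plain Postgres PID
```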
Remember that global PIDs are always big integers, while Postgres PIDs are 4-byte integers. This difference in size is how pg_cancel_backend differentiates between a PID and a GPID.
Also, like pg_cancel_backend, Citus 11 overrides pg_terminate_backend to accept global PIDs too. So, you can also terminate queries from different nodes using global PIDs.
More helper functions
In addition to all the Citus activity and lock views mentioned above, we added some smaller helper functions for monitoring your database cluster. The new functions make it easier to get information that is useful when writing monitoring queries, including:
Get nodename and nodeport information
You can use citus_nodename_for_nodeid and citus_nodeport_for_nodeid to get info about the node with a node id:
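For example, for the node with node id 2 (the output naturally depends on your cluster):

```sql
SELECT citus_nodename_for_nodeid(2) AS nodename,
       citus_nodeport_for_nodeid(2) AS nodeport;
```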
With Citus 11 you can monitor from any node in a Citus database cluster
With Citus 11 you can query your distributed Postgres tables from any node by default. And with the tools you learned about in this blog post, you can monitor and manage your Citus cluster from any node, just like you would a single Postgres instance.
If you’re interested in all that changed in Citus 11 check out:
And if you want to download Citus you can always find the latest download instructions on the website; we’ve put together lots of useful resources on the Getting Started page; you can file issues in the GitHub repo; and if you have questions please join us (and other users in the community) in the Citus Public Slack.
This article was originally published on citusdata.com.
Like PostgreSQL, the Lustre file system is an open source project; it started about 20 years ago. According to Wikipedia, Lustre is a parallel distributed file system designed for large-scale cluster computing, with native Remote Direct Memory Access (RDMA) support. Lustre file systems are scalable and can be part of multiple computer clusters with tens of thousands of client nodes, tens of petabytes (PB) of storage on hundreds of servers, and more than a terabyte per second (TB/s) of aggregate I/O throughput. This blog will explain how to set up a simple Lustre file system on CentOS 7 and run PostgreSQL on it.
2. Lustre file system
To deliver parallel file access and improve I/O performance, the Lustre file system separates metadata services from data services. From a high-level architecture point of view, a Lustre file system contains the following basic components:
Management Server (MGS), provides configuration information about how the file system is configured, notifies clients about changes in the file system configuration and plays a role in the Lustre recovery process.
Metadata Server (MDS), manages the file system namespace and provides metadata services to clients such as filename lookup, directory information, file layouts, and access permissions.
Metadata Target (MDT), stores metadata information, and holds the root information of the file system.
Object Storage Server (OSS), stores file data objects and makes the file contents available to Lustre clients.
Object Storage Target (OST), stores the contents of user files.
Lustre Client, mounts the Lustre file system and makes the contents of the namespace visible to the users.
Lustre Networking (LNet), a network protocol used for communication between Lustre clients and servers, with native RDMA support.
To set up a simple Lustre file system for PostgreSQL, we need 4 machines: an MGS-MDS-MDT server, an OSS-OST server, and Lustre client1 and client2 (the Postgres servers). In this blog, I used three CentOS 7 virtual machines with the network settings below:
To avoid dealing with firewall and SELinux policy issues, I simply disabled both: set SELINUX=disabled in /etc/selinux/config, and run the commands below.
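On CentOS 7 that amounts to the following, run as root (only do this on an isolated test environment; a production system should get proper firewall and SELinux rules instead):

```shell
setenforce 0                 # switch SELinux off for the running system
systemctl stop firewalld     # stop the firewall now
systemctl disable firewalld  # keep it from starting again on reboot
```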
Then update yum and install the file system utilities (e2fsprogs) needed to deal with ext4:
yum update && yum upgrade -y e2fsprogs
If there are no errors, install the Lustre server packages and tools with yum install -y lustre-tests
3.2. Setup lnet network
Depending on your network interface setup, add the corresponding lnet configuration. For example, all three of my CentOS 7 machines have a network interface enp0s8, so I added the configuration option lnet networks="tcp0(enp0s8)" to /etc/modprobe.d/lnet.conf as my Lustre LNet network configuration.
Then we need to load the lnet driver into the kernel and start the LNet network by running the commands below.
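The commands are roughly as follows (run as root on each node):

```shell
modprobe lnet     # load the LNet kernel module
lctl network up   # start the LNet network
```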
You can check whether the LNet network is running on your Ethernet interface with the command lctl list_nids; you should see something like the following,
10.10.1.1@tcp
You can try to ping other Lustre servers over the LNet network by running the command lctl ping 10.10.1.2@tcp1. If the LNet network is working, you should see the output below,
12345-0@lo
12345-10.10.1.2@tcp
3.3. Setup MGS/MDS/MDT and OSS/OST servers
To set up the storage for the MGS/MDS/MDT server, I added one dedicated virtual disk (/dev/sdb), created one partition (/dev/sdb1), and formatted it as ext4.
fdisk /dev/sdb
...
mkfs -t ext4 /dev/sdb1
You need to repeat the same process on the OSS/OST server to add the actual file storage disk.
If everything goes fine, it is time to mount the disks on the Lustre servers. First, we need to format and mount the disk on the MGS/MDS/MDT server by running the commands below.
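A sketch of the formatting and mounting steps, assuming the file system name lustrefs used for the client mount later, index 0 for both targets, and illustrative mount points:

```shell
# On the MGS/MDS/MDT server: format /dev/sdb1 as the combined MGS/MDT and mount it
mkfs.lustre --fsname=lustrefs --mgs --mdt --index=0 /dev/sdb1
mkdir -p /mnt/mdt
mount -t lustre /dev/sdb1 /mnt/mdt

# On the OSS/OST server: format its disk as an OST, pointing it at the MGS, and mount it
mkfs.lustre --fsname=lustrefs --ost --mgsnode=10.10.1.1@tcp0 --index=0 /dev/sdb1
mkdir -p /mnt/ost
mount -t lustre /dev/sdb1 /mnt/ost
```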
After the Lustre servers’ setup is done, we can simply mount the Lustre file system on a client by running the commands below,
mkdir /mnt/lustre
mount -t lustre 10.10.1.1@tcp0:/lustrefs /mnt/lustre
If there are no errors, you can verify the setup by creating a text file and entering some text from one client, then checking it from the other client.
3.5. Setup Postgres on Lustre file system
As there are so many tutorials about how to set up Postgres on CentOS, I will skip this part. Assuming you have installed Postgres, either from an official release or compiled from source yourself, run the tests below from client1,
initdb -D /mnt/lustre/pgdata
pg_ctl -D /mnt/lustre/pgdata -l /tmp/logfile start
create table test(a int, b text);
insert into test values(generate_series(1, 1000), 'helloworld');
select count(*) from test;
pg_ctl -D /mnt/lustre/pgdata -l /tmp/logfile stop
From the simple tests above, you can confirm that the table created and the records inserted by client1 are stored on the remote Lustre file system; if the Postgres server stops on client1, you can start a Postgres server on client2 and query all the records inserted by client1.
4. Summary
In this blog, I explained how to set up a parallel distributed file system, Lustre, in a local environment, and how to verify it with PostgreSQL servers. I hope this blog helps you when you want to evaluate distributed file systems.
A software developer specializing in C/C++ programming, with experience in hardware, firmware, software, databases, networks, and system architecture. Now working at HighGo Software Inc. as a senior PostgreSQL architect.
When doing query optimization work, it is natural to focus on timing data. If we want to speed up a query, we need to understand which parts are slow.
But timings have a few weaknesses:
They vary from run to run
They are dependent on the cache
Timings alone can hide efficiency issues — e.g. through parallelism
Don’t get me wrong, I still think timings are very important for query tuning, but in this article we’re going to explore how you can complement them with the amount of data read and/or written by a query, by using BUFFERS.
When people share stories of 1000x query speed-ups, they are usually a result of reading far less data overall (usually as a result of adding an index). Much like the world of design, less is more.
To see the amount of data read or written by a query, you can use the BUFFERS parameter, for example by prefixing your query with:
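That is, instead of a bare explain analyze, you run something like:

```sql
EXPLAIN (ANALYZE, BUFFERS) SELECT id, email FROM people WHERE email = '45678@gmail.com';
```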
One downside of including this extra data is that many folks already find Postgres explain output difficult to interpret. As such, I thought it’d be helpful to recap what exactly the buffers numbers mean, and describe some simple ways you can use them.
What are the buffer statistics again?
We’ve written about what the buffers statistics mean before, but the short version is that each of them consists of two parts, a prefix and a suffix.
There are three prefixes:
Shared blocks contain data from normal tables and indexes.
Temp blocks contain short-term data used to calculate hashes, sorts, materialize operations, and similar.
Local blocks contain data from temporary tables and indexes (yes, this is quite confusing, given that there is also a prefix “Temp”).
And there are four suffixes:
Hit means that the block was found in the Postgres buffer cache.
Read blocks were missed in the Postgres buffer cache and had to be read from disk or the operating system cache.
Dirtied blocks have been modified by the query.
Written blocks have been evicted from the cache.
These “blocks” are by default 8kB pages, and almost nobody changes this. If you want to check yours, you can do so by running:
show block_size;
Right, let’s take a look at a simple example query plan with buffers:
explain (analyze, buffers, costs off) select id, email from people where email = '45678@gmail.com';
Index Scan using people_email on people (actual time=1.066..1.071 rows=1 loops=1)
Index Cond: (email = '45678@gmail.com'::text)
Buffers: shared hit=3 read=1
Planning Time: 0.179 ms
Execution Time: 1.108 ms
From the line starting with “Buffers:” we can tell that, for this index scan, 3 blocks were read from the Postgres buffer cache (shared hit) and 1 block was read from the operating system cache or disk (shared read).
So when can the buffers be useful?
The buffers statistics can be useful in a number of ways, here are the most common that I see and hear about:
Spotting operations doing way more I/O than you expected
Getting a sense of the total I/O of the query
Spotting operations spilling to disk
Signs of cache performance problems
Let’s have a look at each of these cases.
1. Spotting operations doing way more I/O than you expected
If an index scan is returning a few thousand rows, you might be surprised to see it reading tens or even hundreds of thousands of blocks. Assuming our blocks are 8kB, reading 100,000 blocks is almost 1 GB of data!
Seeing that a 2-second scan read 5 GB of data definitely helps us understand why it took so long. Sure, without buffers you might have spotted that your index scan is filtering a lot of rows, but those filters are sometimes tricky to spot, since filter numbers are reported as a per-loop average.
In pgMustard, we try to do most of the arithmetic for people, and still focus more on the root cause of the issue (like index efficiency), but we also now calculate and more prominently display the total buffers on a per-operation basis.
Ryan Lambert included a nice example in his recent blog post on spatial indexing, where we can see an inefficient index scan reading about 39 GB to return 1m rows.
When viewing an operation in pgMustard, you can find its buffer stats in the pane on the left.
In the case above, Ryan was able to use a different indexing strategy to speed the query up by about 100x by (you guessed it) reading about 100x less data. 😄
2. Getting a sense of the total I/O of the query
Since parent operations include the buffers data of their children, you can usually get a quick sense of the total buffers by looking at the top node in a query plan.
We can see this by adding a limit to our simple example from earlier:
explain (analyze, buffers, costs off) select id, email from people where email = '45678@gmail.com' limit 1;
Limit (actual time=0.146..0.146 rows=1 loops=1)
Buffers: shared hit=4
-> Index Scan using people_email on people (actual time=0.145..0.145 rows=1 loops=1)
Index Cond: (email = '45678@gmail.com'::text)
Buffers: shared hit=4
Planning Time: 0.230 ms
Execution Time: 0.305 ms
Note that the Limit node is reporting 4 blocks read (all from the Postgres cache). This is a total, inclusive of the index scan, which is of course responsible for all the reads. For more complex queries this can give you a quick sense check for total I/O.
In pgMustard, we sum the buffer statistics together and convert them to kB/MB/GB in a Summary section in the top left. The same blocks can be read more than once for the same query, and reads and writes to disk are also double counted, but since we see it as a very rough measure of “work done”, we think the sum works surprisingly well.
The same spatial query example from earlier, showing a sum of 51 GB buffers.
3. Spotting operations spilling to disk
Another common performance issue is operations spilling to disk.
Postgres does a nice job of telling us that a Sort has spilled to disk (via the Sort Method and Disk statistics) but for other types of operations, it can be trickier or impossible to spot without buffers.
Here’s a simple example of a Sort spilling to disk:
explain (analyze, buffers, costs off) select id, email from people order by email;
Sort (actual time=2501.340..3314.259 rows=1000000 loops=1)
Sort Key: email
Sort Method: external merge Disk: 34144kB
Buffers: shared hit=7255, temp read=4268 written=4274
-> Seq Scan on people (actual time=0.016..35.445 rows=1000000 loops=1)
Buffers: shared hit=7255
Planning Time: 0.101 ms
Execution Time: 3335.237 ms
Since we asked for buffers, we can also see temp read/written numbers in the sort’s buffer statistics — which confirm that it wrote to (and read from) disk.
Looking out for temp buffers is a great way of spotting other types of operations spilling to disk that don’t provide as good reporting as sorts!
In pgMustard we report these as “Operation on Disk” tips. In this example, we should probably look into limiting the number of rows returned as a starting point 😇
4. Signs of cache performance problems
By comparing the number of shared hit and shared read blocks, we can get a sense of the cache hit rate. Sadly this isn’t perfect, as the latter includes operating system cache hits, but it is at least a clue during an investigation.
In the default text format of explain, it’s worth noting that you’ll only see non-zero buffers statistics. So, if you only see shared hits, then congratulations, all blocks were served from the Postgres buffer cache!
In pgMustard we report this ratio in “Cache Performance” tips. In this example, all data was read from the cache, so the tip scored 0.0 (out of 5) to indicate that it is not an issue here.
Other areas and further reading
pgMustard also uses the buffers statistics in its “Read Efficiency” and “Read Speed” tips, and there are likely other things we could be using them for too.
We’ve looked at several ways we can use the buffers data during query optimization:
Spotting operations doing way more I/O than expected — per-operation buffers
Getting a sense of the total I/O of the query — total buffers
Spotting operations spilling to disk — temp buffers
Clues to potential cache performance problems — shared hit vs shared read
And as a quick reminder, if you agree that buffers should be on by default and are willing and able to review the patch to do so, it would be extremely welcome. 🙌
<blockquote>
<p>Vanilla Postgres has native partitioning?</p>
<blockquote>
<p>Yes! And it's really good!</p>
</blockquote>
</blockquote>
<p>We frequently get questions like: Can Postgres handle JSON? Can Postgres handle
time series data? How scalable is Postgres? It turns out the answer is usually
yes! Postgres, vanilla Postgres, can handle whatever your need is without having
to move to a locked-in proprietary database. Unless you're really close to the
Postgres internals and code releases you might have missed that Postgres
natively has partitioning. Our head of product,
<a href="https://twitter.com/craigkerstiens">Craig</a>, recently talked about the
advantages of working with
<a href="https://www.infoworld.com/article/3654271/cloud-convenience-and-open-source.html">vanilla Postgres</a>
versus niche products. With that in mind, I wanted to take a step back and go
through the basics of vanilla Postgres partitioning.</p>
<p>Crunchy customers on
<a href="https://www.crunchydata.com/products/crunchy-bridge">cloud</a> and
<a href="https://www.crunchydata.com/products/crunchy-postgresql-for-kubernetes">Kubernetes</a>
are often asking about partitioning features and for our general
recommendations. For existing data sets, our architects will take a deep dive
into their specific use case. Generally speaking though, partitioning is going
to be beneficial where data needs to scale - either to keep up with performance
demands or to manage the lifecycle of data. Do you need to partition a 200GB
database? Probably not. If you're building something new that's likely to scale
into the TB or dealing with something in the multi-TB size of data, it might be
time to take a peek at native partitioning and see if that can help.</p>
<h2>Partitioning use cases</h2>
<ul>
<li>
<h3>Data lifecycle &amp; cost management</h3>
</li>
</ul>
<p>The main benefit of working with partitions is helping with lifecycle management
of data. Big data gets really expensive so archiving off data you no longer need
can be really important to managing costs. Using partitioning to manage your
data lifecycle means you'll be rolling off data to an archive frequently,
allowing you to drop/archive tables easily as that data no longer needs to exist
in a certain database.</p>
<p><img src="https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/78548363-ccfa-46ce-a6e4-62e89dbe6d00/public" alt="Partitioning Storage" loading="lazy"></p>
<ul>
<li>
<h3>Performance</h3>
</li>
</ul>
<p>Another benefit that people often look for in partitioning is query
performance, especially when queries filter on the partition key or its indexes.
You can really speed up query times by having Postgres go straight to individual
date ranges or sets of data instead of scanning the entire dataset.</p>
<p><img src="https://imagedelivery.net/lPM0ntuwQfh8VQgJRu0mFg/08a5dcd5-4648-4378-95a8-8212a89c8100/public" alt="Partitioning Performance" loading="lazy"></p>
<h3>Types of Partitioning</h3>
<p>There are quite a few different kinds of partitioning based on how you want to
subdivide your data.</p>
<ul>
<li><strong>Range partitioning</strong> is probably the most common and typically used with
time or integer series data.</li>
<li><strong>List partitioning</strong> is also popular, especially if you have a database that
is easily separated by some kind of common field - like location or a specific
piece of data across your entire set.</li>
<li><strong>Hash partitioning</strong> is also available, but should only be used when a
clearly defined partition pattern cannot be obtained.</li>
<li><strong>Composite partitioning</strong> would be combining one or more of these, like time
based and list partitioning in the same dataset.</li>
</ul>
<h2>Sample Partitioning Setup with Native Postgres</h2>
<p>I'm going to fake out a sample data set for an IoT thermostat. My sample for
this is a table containing these fields: <code>thetime</code>, <code>sensor_id</code>,
<code>current_temperature</code>, and <code>thermostat_status</code>.</p>
<p>This is kind of a fun little query that will generate quite a bit of data across
a 10 day time period.</p>
<pre><code class="language-sql">CREATE TABLE thermostat AS
WITH time AS (
SELECT generate_series(now() - interval '10 days', now(), '10 minutes') thetime
),
sensors AS (
SELECT generate_series(1,5) as sensor_id
),
temp AS (
SELECT
thetime,
sensor_id,
72 - 10 * cos(2 * pi() * EXTRACT ('hours' from thetime)/24) + random()*10 - 5 AS current_temperature
FROM time,sensors
)
SELECT
thetime,
sensor_id,
current_temperature::numeric(3,1),
CASE
WHEN current_temperature < 70 THEN 'heat'
WHEN current_temperature > 80 THEN 'cool'
ELSE 'off'
END AS thermostat_status
FROM temp;
</code></pre>
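<p>If you'd rather prototype the generator outside the database, here's a rough Python equivalent of the same idea (the constants mirror the SQL above, though the random jitter won't match exactly):</p>
<pre><code class="language-python">import math
import random
from datetime import datetime, timedelta

def generate_thermostat_rows(days=10, step_minutes=10, sensors=5, now=None):
    """Yield (thetime, sensor_id, current_temperature, thermostat_status)
    rows, mirroring the generate_series query above."""
    now = now or datetime.now()
    t = now - timedelta(days=days)
    while t <= now:
        for sensor_id in range(1, sensors + 1):
            # Daily sinusoid centered at 72F with a +/-10F swing, plus noise.
            temp = round(72 - 10 * math.cos(2 * math.pi * t.hour / 24)
                         + random.uniform(-5, 5), 1)
            status = "heat" if temp < 70 else "cool" if temp > 80 else "off"
            yield (t, sensor_id, temp, status)
        t += timedelta(minutes=step_minutes)

rows = list(generate_thermostat_rows(days=1))
print(len(rows))  # 725: five sensors x 145 ten-minute timestamps
</code></pre>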
<p>Now let's create a new table that will be partitioned:</p>
<pre><code class="language-sql">CREATE TABLE iot_thermostat (
thetime timestamptz,
sensor_id int,
current_temperature numeric (3,1),
thermostat_status text
)
PARTITION BY RANGE (thetime);
</code></pre>
<p>Next, create an index on the <code>thetime</code> field:</p>
<pre><code class="language-sql">CREATE INDEX ON iot_thermostat(thetime);
</code></pre>
<p>Now create the individual partitions:</p>
<pre><code class="language-sql">CREATE TABLE iot_thermostat07242022 PARTITION OF iot_thermostat
FOR VALUES FROM ('2022-07-24 00:00:00') TO ('2022-07-25 00:00:00');
CREATE TABLE iot_thermostat07232022 PARTITION OF iot_thermostat
FOR VALUES FROM ('2022-07-23 00:00:00') TO ('2022-07-24 00:00:00');
-- and so on
</code></pre>
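<p>Hand-writing one CREATE TABLE per day is an easy place to introduce typos in names or bounds, so it can be worth generating the statements instead. A small sketch, assuming the MMDDYYYY naming convention used above:</p>
<pre><code class="language-python">from datetime import date, timedelta

def daily_partition_ddl(parent, day):
    """Build a CREATE TABLE ... PARTITION OF statement for one day,
    naming the child with the parent name plus an MMDDYYYY suffix."""
    nxt = day + timedelta(days=1)
    child = f"{parent}{day.strftime('%m%d%Y')}"
    return (f"CREATE TABLE {child} PARTITION OF {parent}\n"
            f"FOR VALUES FROM ('{day}') TO ('{nxt}');")

# Emit a partition per day for a small date range:
for offset in range(3):
    print(daily_partition_ddl("iot_thermostat",
                              date(2022, 7, 23) + timedelta(days=offset)))
</code></pre>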
<h3>Insert data into your partition</h3>
<p>Now we'll move the data from our original <code>thermostat</code> dataset into
<code>iot_thermostat</code>, and the rows will automatically go into the correct partitions:</p>
<pre><code class="language-sql">INSERT INTO iot_thermostat SELECT * FROM thermostat;
</code></pre>
<p>You only need to insert the data once; Postgres takes care of routing it to
the correct partitions for you.</p>
<p>Quick check on one of the partitions to make sure you got it all right:</p>
<pre><code class="language-sql">SELECT * FROM iot_thermostat07242022;
</code></pre>
<h3>Rotate partitions</h3>
<p>Ok, let's say that we only care about data from the last 10 days. Tomorrow
we'll want to move the data in <code>iot_thermostat07142022</code> to a different table and
archive it off. This is done with a detach:</p>
<pre><code class="language-sql">ALTER TABLE iot_thermostat DETACH PARTITION iot_thermostat07142022;
</code></pre>
<p>And that's now a standalone table.</p>
<p>We need to make a new one for tomorrow as well:</p>
<pre><code class="language-sql">CREATE TABLE iot_thermostat07262022 PARTITION OF iot_thermostat
FOR VALUES FROM ('2022-07-26 00:00:00') TO ('2022-07-27 00:00:00');
</code></pre>
<p>Obviously, if you're doing this daily, you'll run these commands from a cron
job somewhere so they happen automatically.</p>
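<p>The rotation itself is just date arithmetic plus string assembly, so that cron job can be a short script that emits the DETACH for the day falling out of the window and the CREATE for tomorrow. A sketch, assuming the same naming scheme and a 10-day window:</p>
<pre><code class="language-python">from datetime import date, timedelta

def rotation_statements(parent, today, keep_days=10):
    """Return (detach_sql, create_sql): detach the partition that just
    left the retention window and create tomorrow's partition."""
    expired = today - timedelta(days=keep_days)
    tomorrow = today + timedelta(days=1)
    day_after = today + timedelta(days=2)
    detach = (f"ALTER TABLE {parent} DETACH PARTITION "
              f"{parent}{expired.strftime('%m%d%Y')};")
    create = (f"CREATE TABLE {parent}{tomorrow.strftime('%m%d%Y')} "
              f"PARTITION OF {parent}\n"
              f"FOR VALUES FROM ('{tomorrow}') TO ('{day_after}');")
    return detach, create

detach, create = rotation_statements("iot_thermostat", date(2022, 7, 25))
print(detach)  # detaches iot_thermostat07152022
</code></pre>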
<h2>Creating Partitions with pg_partman</h2>
<p>If you've been around the Postgres world for very long, you're likely to have
come across the <a href="https://github.com/pgpartman/">pg_partman</a> extension, written
by my colleague Keith Fiske. Pg_partman existed before native partitioning was
introduced in Postgres 10 and was originally based on the concept of triggers
for managing data flow to the correct partitions. Native partitioning doesn't
use triggers and this is generally thought to be much more performant. Today
pg_partman is mostly used for the management and creation of partitions or for
users on older versions of Postgres. It can also be used on newer versions of
Postgres for easier setup of the tables and automatic managing of the
partitions. I'm going to show a quick overview of how to set up tables and
partitioning like I did in the demo above using partman.</p>
<p>If you're working with a cloud system, like
<a href="https://docs.crunchybridge.com/extensions-and-languages/extensions/">Bridge</a>,
you'll <code>CREATE SCHEMA partman;</code> and
<code>CREATE EXTENSION pg_partman SCHEMA partman;</code> and update the
<code>shared_preload_libraries</code> settings. If you're working on a self hosted system,
you'll download from <a href="https://github.com/pgpartman/pg_partman">github</a> and
install.</p>
<p>Create parent partitioned table (this is exactly the same as the first example):</p>
<pre><code class="language-sql">CREATE TABLE iot_thermostat_partman (
thetime timestamptz,
sensor_id int,
current_temperature numeric (3,1),
thermostat_status text
)
PARTITION BY RANGE (thetime);
</code></pre>
<h3>Create the partitions</h3>
<p>This is done by calling the <code>create_parent</code> function via partman. By default
this creates 4 partitions ahead; we'll also create 10 days of history, giving a
total of 14 partitions.</p>
<p>If you don't create a template table, pg_partman will create one for you at this
point. For more information on template tables, see the documentation about
<a href="https://github.com/pgpartman/pg_partman/blob/master/doc/pg_partman.md#child-table-property-inheritance">child property inheritance</a>.</p>
<p>The cool thing about <code>create_parent</code> is that this is a one time call of a
function, rather than calling a function every day like my first example. After
you've defined the policies, the background worker you set up during
installation just takes care of this as part of the underlying Postgres
processes. For those in the back, let me say that again, <strong>you just call the
function once</strong> and partitions are automatically created continuously based on
your policies.</p>
<pre><code class="language-sql">SELECT partman.create_parent('public.iot_thermostat_partman', 'thetime', 'native', 'daily',
p_start_partition := (now() - interval '10 days')::date::text );
</code></pre>
<p>Since we only want the most recent 10 days of partitions, we can set the
retention in this table's configuration like so:</p>
<pre><code class="language-sql">UPDATE partman.part_config SET retention = '10 days' WHERE parent_table = 'public.iot_thermostat_partman';
</code></pre>
<p>View the created partitions:</p>
<pre><code class="language-sql">postgres=# \d+ iot_thermostat_partman
Partitioned table "public.iot_thermostat_partman"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
---------------------+--------------------------+-----------+----------+---------+----------+--------------+-------------
thetime | timestamp with time zone | | | | plain | |
sensor_id | integer | | | | plain | |
current_temperature | numeric(3,1) | | | | main | |
thermostat_status | text | | | | extended | |
Partition key: RANGE (thetime)
Partitions: iot_thermostat_partman_p2022_07_21 FOR VALUES FROM ('2022-07-21 00:00:00+00') TO ('2022-07-22 00:00:00+00'),
iot_thermostat_partman_p2022_07_22 FOR VALUES FROM ('2022-07-22 00:00:00+00') TO ('2022-07-23 00:00:00+00'),
iot_thermostat_partman_p2022_07_23 FOR VALUES FROM ('2022-07-23 00:00:00+00') TO ('2022-07-24 00:00:00+00'),
iot_thermostat_partman_p2022_07_24 FOR VALUES FROM ('2022-07-24 00:00:00+00') TO ('2022-07-25 00:00:00+00'),
iot_thermostat_partman_p2022_07_25 FOR VALUES FROM ('2022-07-25 00:00:00+00') TO ('2022-07-26 00:00:00+00'),
iot_thermostat_partman_p2022_07_26 FOR VALUES FROM ('2022-07-26 00:00:00+00') TO ('2022-07-27 00:00:00+00'),
iot_thermostat_partman_p2022_07_27 FOR VALUES FROM ('2022-07-27 00:00:00+00') TO ('2022-07-28 00:00:00+00'),
iot_thermostat_partman_p2022_07_28 FOR VALUES FROM ('2022-07-28 00:00:00+00') TO ('2022-07-29 00:00:00+00'),
iot_thermostat_partman_p2022_07_29 FOR VALUES FROM ('2022-07-29 00:00:00+00') TO ('2022-07-30 00:00:00+00'),
iot_thermostat_partman_p2022_07_30 FOR VALUES FROM ('2022-07-30 00:00:00+00') TO ('2022-07-31 00:00:00+00'),
iot_thermostat_partman_p2022_07_31 FOR VALUES FROM ('2022-07-31 00:00:00+00') TO ('2022-08-01 00:00:00+00'),
iot_thermostat_partman_p2022_08_01 FOR VALUES FROM ('2022-08-01 00:00:00+00') TO ('2022-08-02 00:00:00+00'),
iot_thermostat_partman_p2022_08_02 FOR VALUES FROM ('2022-08-02 00:00:00+00') TO ('2022-08-03 00:00:00+00'),
iot_thermostat_partman_p2022_08_03 FOR VALUES FROM ('2022-08-03 00:00:00+00') TO ('2022-08-04 00:00:00+00'),
iot_thermostat_partman_p2022_08_04 FOR VALUES FROM ('2022-08-04 00:00:00+00') TO ('2022-08-05 00:00:00+00'),
iot_thermostat_partman_default DEFAULT
</code></pre>
<h2>Beyond the Basics</h2>
<p>Partitioning has a lot of caveats beyond this basic set up I've just shown, so
here's some topics and food for thought on future research for your own specific
use case:</p>
<ul>
<li>
<p><strong>Subpartitions</strong>: For very large data sets, you can actually do nested levels
of partitioning with
<a href="https://github.com/pgpartman/pg_partman/blob/master/doc/pg_partman.md#sub-partitioning">sub-partitioning</a>.
This is generally not needed in most cases.</p>
</li>
<li>
<p><strong>Primary Keys/Unique Indexes</strong>: Postgres does not have the concept of an
index that covers multiple tables so there can be limitations on primary
key/unique index usage in partitioning. In general, the only unique keys you
can use are ones that include the partition keys (which could leave out time
series and other data types). You may want to look at the
<a href="https://github.com/pgpartman/pg_partman/blob/master/doc/pg_partman.md#sub-partitioning">template table creation</a>
in pg_partman that has some features for handling this.</p>
</li>
<li>
<p><strong>Null values</strong>: Generally, if you're using partitioning, you can't have null
values in the field that you're partitioning on, so some thought might need to
go into data management and application logic.</p>
</li>
<li>
<p><strong>Constraints</strong>: Pg_partman has some expanded features for
<a href="https://github.com/pgpartman/pg_partman/blob/master/doc/pg_partman.md#constraint-exclusion">handling additional constraints</a>
outside the partitioning key.</p>
</li>
<li>
<p><strong>ORMs</strong>: The amazing thing about using native partitioning is that it works
out of the box with quite a few ORMs, gems, and other tools. Do some testing
and research for your specific application stack.</p>
</li>
</ul>
<p><em>Co-authored with
<a href="https://www.crunchydata.com/blog/author/keith-fiske">Keith Fiske</a></em></p>
PostgreSQL Person of the Week Interview with Elizabeth Garrett Christensen: I’m not your average Postgres person of the week - I’m a non-developer as well as a Postgres fan and marketing/sales/customer facing person. I am part of a Postgres co-working couple and my husband David Christensen and I both work for Crunchy Data from our home in Lawrence, Kansas.
We all know and value SQL functions as a handy shortcut. PostgreSQL v14 has introduced a new, better way to write SQL functions. This article will show the advantages of the new syntax.
An example of an SQL function
Let’s create a simple example of an SQL function with the “classical” syntax so that we have some material for demonstrations:
CREATE EXTENSION unaccent;
CREATE FUNCTION mangle(t text) RETURNS text
LANGUAGE sql
AS 'SELECT lower(unaccent(t))';
You can use the new function like other database functions:
SELECT mangle('Schön dumm');
mangle
════════════
schon dumm
(1 row)
Why SQL functions?
You may ask what good an SQL function is. After all, the main purpose of a database function is to be able to run procedural code inside the database, something you cannot do with SQL. But SQL functions have their use:
code reuse for expressions frequently used in different SQL statements
to make SQL statements more readable by factoring out part of the code into a function with a meaningful name
Moreover, simple SQL functions can be inlined, that is, the optimizer can replace the function call with the function definition at query planning time. This can make SQL functions singularly efficient:
it removes the overhead of an actual function call
since functions are (mostly) black boxes to the optimizer, replacing the function with its definition usually gives you better estimates
We can see function inlining if we use EXPLAIN (VERBOSE) on our example function:
EXPLAIN (VERBOSE, COSTS OFF) SELECT mangle('Schön dumm');
QUERY PLAN
═══════════════════════════════════════════════
Result
Output: lower(unaccent('Schön dumm'::text))
(2 rows)
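Conceptually, inlining is substitution done safely on the parse tree: the planner replaces the call with the function's body, with the argument plugged in for the parameter. A toy Python model of the idea (real inlining has many preconditions; this is only an illustration of the substitution step):

```python
import re

def inline(expr, functions):
    """Toy inliner: repeatedly replace calls to known single-expression
    functions with their bodies, substituting arguments for parameters."""
    def repl(match):
        fname, arg = match.group(1), match.group(2)
        if fname not in functions:
            return match.group(0)  # unknown function: leave the call alone
        param, body = functions[fname]
        return re.sub(rf"\b{re.escape(param)}\b", lambda _: arg, body)

    prev = None
    while expr != prev:  # keep substituting until nothing changes
        prev = expr
        expr = re.sub(r"(\w+)\(([^()]*)\)", repl, expr)
    return expr

funcs = {"mangle": ("t", "lower(unaccent(t))")}
print(inline("mangle('Schön dumm')", funcs))  # lower(unaccent('Schön dumm'))
```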
Shortcomings of PostgreSQL functions
PostgreSQL functions are great. One of the nice aspects is that you are not restricted to a single programming language. Out of the box, PostgreSQL supports functions written in SQL, C, PL/pgSQL (a clone of Oracle’s PL/SQL), Perl, Python and Tcl. But that is not all: in PostgreSQL, you can write a plugin that allows you to use any language of your choice inside the database. To allow that flexibility, the function body of a PostgreSQL function is simply a string constant that the call handler of the procedural language interprets when PostgreSQL executes the function. This has some undesirable side effects:
Lack of dependency tracking
Usually, PostgreSQL tracks dependencies between database objects in the pg_depend and pg_shdepend catalog tables. That way, the database knows the relationships between objects: it will either prevent you from dropping objects on which other objects depend (like a table with a foreign key reference) or drop dependent objects automatically (like dropping a table drops all indexes on the table).
Since the body of a function is just a string constant that PostgreSQL cannot interpret, it won’t track dependencies between a function and objects used in the function. A procedural language can provide a validator that checks the function body for syntactic correctness (if check_function_bodies = on). The validator can also test if the objects referenced in the function exist, but it cannot keep you from later dropping an object used by the function.
Let’s demonstrate that with our example:
DROP EXTENSION unaccent;
SELECT mangle('boom');
ERROR: function unaccent(text) does not exist
LINE 1: SELECT lower(unaccent(t))
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
QUERY: SELECT lower(unaccent(t))
CONTEXT: SQL function "mangle" during inlining
We will fix the problem by creating the extension again. However, it would be better to get an error message when we run DROP EXTENSION without using the CASCADE option.
search_path as a security problem
Since PostgreSQL parses the function body at query execution time, it uses the current setting of search_path to resolve all references to database objects that are not qualified with the schema name. That is not limited to tables and views, but also extends to functions and operators. We can use our example function to demonstrate the problem:
SET search_path = pg_catalog;
SELECT public.mangle('boom');
ERROR: function unaccent(text) does not exist
LINE 1: SELECT lower(unaccent(t))
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
QUERY: SELECT lower(unaccent(t))
CONTEXT: SQL function "mangle" during inlining
In our example, it is a mere annoyance that we can avoid by using public.unaccent() in the function call. But it can be worse than that, particularly with SECURITY DEFINER functions. Since it is cumbersome to schema-qualify each function and operator, the recommended solution is to force a search_path on the function:
ALTER FUNCTION mangle(text) SET search_path = public;
Note that the schemas on the search_path should allow CREATE only to privileged users; since the public schema allowed CREATE to everybody before v15, the above is not a good idea on older versions!
An unpleasant downside of setting a search_path is that it prevents the inlining of the SQL function.
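The resolution rule behind all this is simple: an unqualified name is looked up in each schema on search_path, in order, and the first match wins. A toy model of that lookup (the catalog contents here are invented for illustration):

```python
def resolve(name, search_path, catalog):
    """Return 'schema.name' for the first schema on search_path that
    contains the object, or None if it is not visible at all."""
    for schema in search_path:
        if name in catalog.get(schema, set()):
            return f"{schema}.{name}"
    return None

catalog = {"pg_catalog": {"lower"}, "public": {"unaccent", "mangle"}}

# With public on the path, unaccent() resolves:
print(resolve("unaccent", ["pg_catalog", "public"], catalog))  # public.unaccent

# After SET search_path = pg_catalog, the same lookup fails:
print(resolve("unaccent", ["pg_catalog"], catalog))  # None
```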
The new SQL function syntax in PostgreSQL v14
From PostgreSQL v14 on, the body of SQL functions and procedures need no longer be a string constant. You can now use one of the following forms for the function body:
CREATE FUNCTION function_name(...) RETURNS ...
RETURN expression;
CREATE FUNCTION function_name(...) RETURNS ...
BEGIN ATOMIC
statement;
...
END;
The first form requires the function body to be an expression. So if you want to perform a query, you have to wrap it in parentheses (turning it into a subquery, which is a valid expression). For example:
CREATE FUNCTION get_data(v_id bigint) RETURNS text
RETURN (SELECT value FROM data WHERE id = v_id);
The second form allows you to write a function with more than one SQL statement. As it used to be with multi-statement SQL functions, the result of the function will be the result of the final SQL statement. You can also use the second form of the new syntax to create SQL procedures. The first form is obviously not suitable for a procedure, since procedures don’t have a return value.
We can easily rewrite our example function to use the new syntax:
CREATE OR REPLACE FUNCTION mangle(t text) RETURNS text
RETURN lower(unaccent(t));
Note that these new SQL functions can be inlined into SQL statements just like the old ones!
Advantages of the new SQL function syntax
The main difference is that the new-style SQL functions and procedures are parsed at function definition time and stored in parsed form in the prosqlbody column of the pg_proc system catalog. As a consequence, the two shortcomings noted above are gone:
Dependency tracking with new-style SQL functions
Because the function body is available in parsed form, PostgreSQL can track dependencies. Let’s try that with our redefined example function:
DROP EXTENSION unaccent;
ERROR: cannot drop extension unaccent because other objects depend on it
DETAIL: function mangle(text) depends on function unaccent(text)
HINT: Use DROP ... CASCADE to drop the dependent objects too.
Fixed search_path with new-style SQL functions
search_path is only relevant when SQL is parsed. Since this now happens when CREATE FUNCTION runs, we don’t have to worry about the current setting of that parameter at function execution time:
SET search_path = pg_catalog;
SELECT public.mangle('Schön besser');
mangle
══════════════
schon besser
(1 row)
Problems with interactive clients
You may notice that the multi-statement form for defining SQL functions contains semicolons to terminate the SQL statements. That will not only confuse the usual suspects like HeidiSQL (which never learned dollar quoting), but it will be a problem for any client that recognizes semicolons as separator between SQL statements. Even older versions of psql have a problem with that syntax:
psql (13.7, server 15beta2)
WARNING: psql major version 13, server major version 15.
Some psql features might not work.
Type "help" for help.
test=> CREATE FUNCTION tryme() RETURNS integer
BEGIN ATOMIC
SELECT 42;
END;
ERROR: syntax error at end of input
LINE 3: SELECT 42;
^
WARNING: there is no transaction in progress
COMMIT
psql thinks that the semicolon after “SELECT 42” terminates the CREATE FUNCTION statement. The truncated statement causes an error. The final END is treated as its own statement, which is a synonym for COMMIT and causes a warning.
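The client-side problem is easy to reproduce outside psql: any tool that naively splits its input on semicolons will cut a BEGIN ATOMIC body apart. Here's a toy comparison of a naive splitter and one that tracks BEGIN ATOMIC ... END (real clients must also handle string literals, comments, and dollar quoting):

```python
import re

def naive_split(sql):
    # Pre-v14 client behavior: every semicolon ends a statement.
    return [s.strip() for s in sql.split(";") if s.strip()]

def atomic_aware_split(sql):
    # Between BEGIN ATOMIC and its END, semicolons belong to the
    # function body rather than terminating the outer statement.
    statements, buf, depth = [], [], 0
    for token in re.split(r"(;)", sql):
        if token == ";" and depth == 0:
            stmt = "".join(buf).strip()
            if stmt:
                statements.append(stmt)
            buf = []
            continue
        buf.append(token)
        if re.search(r"\bBEGIN\s+ATOMIC\b", token, re.IGNORECASE):
            depth += 1
        elif depth and re.search(r"\bEND\s*$", token.strip(), re.IGNORECASE):
            depth -= 1
    tail = "".join(buf).strip()
    if tail:
        statements.append(tail)
    return statements

func = "CREATE FUNCTION tryme() RETURNS integer\nBEGIN ATOMIC\nSELECT 42;\nEND;"
print(len(naive_split(func)))         # 2 -- the body was cut at the semicolon
print(len(atomic_aware_split(func)))  # 1 -- one complete CREATE FUNCTION
```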
In v14 and above, psql handles such statements correctly. pgAdmin 4 has learned the new syntax with version 6.3. But I am sure that there are many clients out there that have not got the message yet.
Conclusion
The new syntax for SQL functions introduced in PostgreSQL v14 has great advantages for usability and security. Get a client that supports the new syntax and start using it for your SQL functions. You should also consider rewriting your existing functions to take advantage of these benefits.
The State of PostgreSQL 2022 survey closed a few weeks ago, and we're hard at work cleaning and analyzing the data to provide the best insights we can for the PostgreSQL community.
In the database community, however, there are usually two things that drive lots of discussion year after year: performance and tooling. During this year's survey, we modified the questions slightly so that we could focus on three specific use cases and the PostgreSQL tools that the community finds most helpful for each: querying and administration, development, and data visualization.
PostgreSQL Tools: What Do We Have Against psql?
Absolutely nothing! As evidenced by the majority of respondents (69.4%) who mentioned using psql for querying and administration, it's the ubiquitous choice for so many PostgreSQL users, and there are already good documentation and community-contributed resources (https://psql-tips.org/ by Lætitia Avrot is a great example) to learn more about it.
So that got us thinking. What other tools did folks bring up often for interacting with PostgreSQL along the three use cases mentioned above?
I'm glad we asked. 😉
PostgreSQL Querying and Administration
As we just said, psql is by far the most popular tool for interacting with PostgreSQL. 🎉
It's clear, however, that many users with all levels of experience do trust other tools as well.
Query and administration tools
pgAdmin (35%), DBeaver (26%), DataGrip (13%), and IntelliJ (10%) received the most mentions. Most of these aren't surprising if you've been working with databases, PostgreSQL or not. The most popular GUIs (pgAdmin and DBeaver) are open source and freely available. The next most popular (DataGrip and IntelliJ) are licensed per seat; however, if your company or team already uses JetBrains' tools, you might have access to them.
What I was more interested in were the mentions that happened just after the more popular tools I expected to see. Often, it's this next set of PostgreSQL tools that has gained enough attention from community members that there's obviously a value proposition to investigate further. If they can be helpful to my (or your) development workflow in certain situations, I think it's worth digging a little deeper.
pgcli
First on the list is pgcli, a Python-based command-line tool and one of many dbcli tools created for various databases. Although this is not a replacement for psql, it provides an interactive, auto-complete interface for writing SQL and getting results. Syntax highlighting and some basic support for psql backslash commands are included. If you love to stay in the terminal but want a little more interactivity, the dbcli tools have been around for quite some time, have a nice community of support, and might make database exploration just a little bit easier sometimes.
Azure Data Studio
Introduced as a beta in December 2017 by the Microsoft database tooling team, Azure Data Studio has been built on top of the same Electron platform as Visual Studio Code. Although the primary feature set is currently geared towards SQL Server (for obvious reasons), the ability to connect to PostgreSQL has been available since 2019.
There are a couple of unique features in Azure Data Studio (ADS) that work with both SQL Server and PostgreSQL connections that I think are worth mentioning.
First, ADS includes the ability to create and run SQL-based Jupyter Notebooks. Typically you'd have to wrap your SQL inside of another runtime like Python, but ADS provides the option to select the "SQL" kernel and deals with the connection and SQL wrapping behind the scenes.
Second, ADS provides the ability to export query results to Excel without any plugins needed. While there are (seemingly) a thousand ways to quickly get a result set into CSV, producing a correctly formatted Excel file requires a plugin with almost any other tool. Regardless of how you feel about Excel, it is still the tool of choice for many data analysts, and being able to provide an Excel file easily does help sometimes.
Finally, ADS also provides some basic charting capabilities using query results. There's no need to set up a notebook and use a charting library like plotly if you just need to get some quick visualizations on the data. I've had a few hiccups with the capabilities (it's certainly not intended as a serious data analytics tool), but it can be helpful to get some quick chart images to share while exploring query data.
Postico
For anyone using macOS, Postico is a GUI application that's been recommended in many of my circles. Many folks prefer its native macOS feel and the unique usability and editing features that make working with PostgreSQL simple and intuitive.
Up and coming
We'll leave it to you to look through the data and see what other GUI/query tools fellow PostgreSQL users are also using that might be of interest to you, but there are a few that were mentioned multiple times and even caused me to hit Google a few times to find out more. Some are free and open source, while others require licenses but provide interesting features like built-in data analytics capabilities. Whether you end up using any of these or not, it's good to see continued innovation within the tooling market, something that doesn't seem to be slowing down decades into our SQL journey.
Helpful Third-Party PostgreSQL Tools for Application Development
Although the GUI/administration landscape is certainly as active as ever, one of the most impactful features of PostgreSQL is how extensible it is. If the core application doesn't provide exactly what your application needs, there's a good chance someone (or some company) is working to provide that functionality.
The total distinct number of tools mentioned was similar to GUI/administration tools and generally fell into four categories: management features, cluster monitoring, query plan insights, and database DevOps tooling. For this blog post, we're going to focus on the first three areas.
Management features
It's not surprising that the most popular third-party PostgreSQL tools tend to be focused on daily management tasks of some sort. Two of the most popular tools in this area are mainstays in most self-hosted PostgreSQL circles.
pgBouncer
PostgreSQL creates one new process (not thread) per connection. Without proper tuning and a right-sized server, a database can quickly become overwhelmed with unplanned spikes in usage. pgBouncer is an open-source connection pooling application that helps manage connection usage for high-traffic applications.
If your database is self-hosted or your DBaaS doesn't provide some kind of connection pooling management for you, pgBouncer can be installed anywhere that makes sense with respect to your application to provide better connection management.
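The core idea behind any connection pooler is small: keep a fixed set of expensive server connections open and share them among many short-lived clients. A stripped-down sketch of that pattern (pgBouncer itself is far more involved, handling authentication, pooling modes, and the wire protocol):

```python
import queue

class ConnectionPool:
    """Minimal pool: at most max_size live connections; acquire() hands
    out an idle one, opening new connections only up to the limit."""

    def __init__(self, connect, max_size=5):
        self._connect = connect      # factory that opens a real connection
        self._idle = queue.Queue()   # released connections awaiting reuse
        self._created = 0
        self._max = max_size

    def acquire(self, timeout=None):
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            if self._created < self._max:
                self._created += 1
                return self._connect()
            # At the cap: block until another client releases a connection.
            return self._idle.get(timeout=timeout)

    def release(self, conn):
        self._idle.put(conn)

# Stand-in "connections" so the sketch runs without a database:
pool = ConnectionPool(connect=object, max_size=2)
a = pool.acquire()
b = pool.acquire()
pool.release(a)
c = pool.acquire()  # reuses a's connection instead of opening a third
assert c is a
```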
pgBackRest
Database backups are essential, obviously, and PostgreSQL has always shipped standard tooling for backup and restore. But as databases have grown in size and application architectures have become more complex, pg_dump and pg_restore alone can make it difficult to perform these tasks well.
The Crunchy Data team created pgBackRest to help provide a full-fledged backups and restore system with many necessary features for enterprise workloads. Multi-threaded backup and compression, multiple repository locations, and backup resume are just a few features that make this a common and valuable tool for any PostgreSQL administrator.
Cluster monitoring
The second area of third-party PostgreSQL tools that show up often focuses on improved database monitoring, which includes query monitoring in most cases. There are a lot of folks tackling this problem area from many different angles, which demonstrates the continued need that many developers and administrators have when managing PostgreSQL.
pgBadger
PostgreSQL has a lot of settings that can be tuned and details that can be logged into server logs, but there is no built-in functionality for holistically analyzing that data cohesively. This is where pgBadger steps in to help generate useful reports from all of the data your server is logging.
pgBadger is one of a few popular PostgreSQL tools written in Perl (which surprises me for some reason), but the developer has gone to great lengths to not require lots of Perl-specific modules for drawing charts and graphs, instead relying on common JavaScript libraries in the rendered reports.
There's a lot to look at with pgBadger, and the larger PostgreSQL community often recommends it as a helpful, long-term debugging tool for server performance issues.
pganalyze
pganalyze has grown in popularity quite a lot over the last few years. Lukas Fittl has done a great job adding new features and capabilities while also providing a number of great PostgreSQL community resources across various platforms.
pganalyze is a fee-based product that uses data provided by standard plugins (pg_stat_statements for example) which is then consumed through a collector that sends the data to a cloud service. If you use pganalyze to query log information as well (e.g., long-running queries), then features like example problem queries and index advisor could be really helpful for your development workflow and user experience.
Query plan analysis
No discussion about PostgreSQL would be complete without mentioning tools that help you understand EXPLAIN output better. This is one area many people struggle with, particularly depending on their previous experience with other databases, and a small set of common, helpful tools has been growing in popularity to help with this essential task.
Depesz EXPLAIN and Dalibo EXPLAIN
Both Depesz and Dalibo EXPLAIN provide a quick, free platform for taking a PostgreSQL explain plan and providing helpful insights into which operations are causing a slow query and, in some cases, providing helpful hints to help speed things up. Also, if you let them, both tools provide a permalink to the output for you to share with others if necessary.
pgMustard
One of my favorite EXPLAIN tools is pgMustard, created and maintained by Michael Christofides. This is a for-fee tool, but there are a lot of unique insights and features that pgMustard provides that others currently don't. Michael is also doing great work within the community, even recently starting a PostgreSQL podcast with Nikolay Samokhvalov, with whom we recently talked about all things SQL.
Which Visualization Tools Do You Use?
The final tooling question on the State of PostgreSQL survey asked about visualization tools that folks used. Without a doubt, Grafana was the top vote-getter, but that's something we could have probably guessed pretty easily.
I was surprised that the next two top vote-getters were for pgAdmin and DBeaver, both popular database GUI tools we mentioned earlier. In both cases, visualization capabilities are somewhat limited, so it's hard to tell exactly what kind of features are being used that would categorize them as visualization tools.
The next group of tools is more interesting to me and I wanted to highlight a few that might pique your interest to investigate further.
QGIS
QGIS is a desktop application that's used to visualize spatial data, whether from PostGIS queries or other data sources. As I've had the pleasure of learning about GIS data and queries from Ryan Lambert over the past few years, I've seen him use this tool for lots of valuable and interesting spatial queries. If you rely on PostGIS for application features and you store spatial data, take a look at how QGIS might be able to help your analysis workflow.
Superset
There are a number of data visualization and dashboarding alternatives in the market, and PostgreSQL support is universally expected regardless of the tool. Superset is an open-source option that also has commercial support and hosting options available through Preset.io. With more than 40 chart types and a vibrant community, there's a lot to explore in the Superset ecosystem.
Streamlit
For those developers that use Python for most of their data analysis and visualizations, Streamlit is another popular choice that can easily fit into your existing workflow. Streamlit isn't a drag-and-drop UI for creating dashboards, but rather a programmatic interface for building and deploying data analysis applications using Python. And as of July 2022, you can deploy public data apps using Streamlit.io.
What About You?
There were so many interesting answers and suggestions provided by the community to these three questions. It's clear that there are a lot of people around the world working to help developers and database professionals be more productive across many common tasks.
Are there any surprises in this list or tools that you think didn't make the list? Hit us up on Slack, our Forum, or Twitter (@timescaleDB) to share other tools that are important to your daily PostgreSQL workflow!
Read the Report
Now that we’ve given you a taste of our survey results, are you curious to learn more about the PostgreSQL community? If you’d like to know more insights about the State of PostgreSQL 2022, including why respondents chose PostgreSQL, their opinion on industry events, and what information sources they would recommend to friends and colleagues, don’t miss our complete report. Click here to get notified and learn firsthand what the State of PostgreSQL is in 2022.
Around this time of year, I am reading through the upcoming PostgreSQL release notes (hello PostgreSQL 15), following the mailing lists, and talking to PostgreSQL users to understand which features will be impactful. Many of these features will be highlighted in the feature matrix and release announcement (hello PostgreSQL 14!).
However, sometimes I miss an impactful feature. It could be that we need to see how the feature is actually used post-release before its impact becomes clear, or it could simply be an oversight. I believe one such change is the introduction of the SQL-standard BEGIN ATOMIC syntax in PostgreSQL 14 that is used to create function and stored procedure bodies.
Let’s see how we could create functions before PostgreSQL 14, the drawbacks to this method, and how, going forward, BEGIN ATOMIC makes it easier and safer to manage functions!
Before PostgreSQL 14: Creating Functions as Strings
PostgreSQL has supported the ability to create stored functions (or “user-defined functions”) since POSTGRES 4.2 in 1994 (thanks [Bruce Momjian](https://momjian.us/) for the answer on this). When declaring the function body, you would write the code as a string (which is why you see the $$ marks in functions). For example, here is a simple function to add two numbers:
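The original definition isn’t reproduced here, but a minimal pre-14 style definition consistent with the signatures in the error messages below would look like this (note the function body is just a $$-quoted string):

```sql
-- Pre-PostgreSQL 14 style: the body is an opaque string
CREATE FUNCTION add(int, int)
RETURNS int
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE
AS $$
    SELECT $1 + $2;
$$;
```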
If I want a function that calls another user-defined function, I can do so similarly to the above. For example, here is a function that uses the add function to add up a whole bunch of numbers:
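A sketch of that function, with its body taken from the QUERY line of the error output below; presumably the example also dropped add before the failing call, since with string bodies nothing stops us from doing so:

```sql
CREATE FUNCTION test1_add_stuff(int, int, int, int)
RETURNS int
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE
AS $$
    SELECT add(add($1, $2), add($3, $4));
$$;

-- This succeeds without complaint, silently breaking test1_add_stuff:
DROP FUNCTION add(int, int);
```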
SELECT test1_add_stuff(1,2,3,4);
ERROR: function add(integer, integer) does not exist
LINE 2: SELECT add(add($1, $2), add($3, $4));
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
QUERY:
SELECT add(add($1, $2), add($3, $4));
CONTEXT: SQL function "test1_add_stuff" during startup
Well, this stinks. Using the pre-v14 style of creating custom functions in PostgreSQL lacks dependency tracking. So dropping one function could end up breaking other functions.
PostgreSQL does support dependency tracking. Prior to PostgreSQL 14, PostgreSQL could track dependencies in functions that involve attributes such as arguments or return types, but it could not track dependencies within a function body.
This is where BEGIN ATOMIC function bodies change things.
BEGIN ATOMIC: a better way to create and manage PostgreSQL user-defined functions
I started looking at the BEGIN ATOMIC method of creating functions thanks to a note from Morris de Oryx on the usefulness of this feature (and the fact it was previously missing from the feature matrix). Morris succinctly pointed out two of the most impactful attributes of creating a PostgreSQL function with BEGIN ATOMIC:
Because PostgreSQL parses the functions during creation, it can catch additional errors.
PostgreSQL has improved dependency tracking of functions.
Let’s use the above example again to see how BEGIN ATOMIC changes things.
Dependency Tracking
Let’s create the add function using the BEGIN ATOMIC syntax:
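A version of add using the new syntax might look like this (a sketch matching the signatures in the error messages below; note there is no $$-quoted string):

```sql
CREATE FUNCTION add(int, int)
RETURNS int
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE
BEGIN ATOMIC
    SELECT $1 + $2;
END;
```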
Notice the difference: the function body is no longer represented as a string, but actual code statements. Let’s now create a new function called test2_add_stuff that will use the add function:
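For example, a sketch mirroring test1_add_stuff above:

```sql
CREATE FUNCTION test2_add_stuff(int, int, int, int)
RETURNS int
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE
BEGIN ATOMIC
    SELECT add(add($1, $2), add($3, $4));
END;
```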
What happens if we try to drop the add function now?
DROP FUNCTION add(int,int);
ERROR: cannot drop function add(integer,integer) because other objects depend on it
DETAIL: function test2_add_stuff(integer,integer,integer,integer) depends on function add(integer,integer)
HINT: Use DROP ... CASCADE to drop the dependent objects too.
This is awesome: using BEGIN ATOMIC for functions guards against dropping a function that is called by one or more other functions!
Creation Time Parsing Checks
Let’s look at one more example of how BEGIN ATOMIC helps us avoid errors.
Even prior to PostgreSQL 14, some checks do occur at function creation time. For example, let’s create a function that calls a function that calls a function.
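The setup for a and b isn’t shown above; based on the error output further down, it was presumably along these lines (a’s body is illustrative, b’s body is taken from the error output):

```sql
CREATE FUNCTION a(int)
RETURNS int
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE
AS $$
    SELECT $1 + 1;  -- illustrative body; the original is not shown
$$;

CREATE FUNCTION b(int)
RETURNS int
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE
AS $$
    SELECT a($1) * 2;  -- body as it appears in the error output below
$$;

-- No dependency tracking with string bodies, so this succeeds:
DROP FUNCTION a(int);
```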
CREATE FUNCTION c(int)
RETURNS int
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE
AS $$
SELECT b($1);
$$;
CREATE FUNCTION
The function creation works, even though a is still missing. Functions created without BEGIN ATOMIC are checked only against the objects referenced directly in their own bodies, not throughout the entire parse tree. Sure enough, calling c fails because a does not exist:
SELECT c(3);
ERROR: function a(integer) does not exist
LINE 2: SELECT a($1) * 2;
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
QUERY:
SELECT a($1) * 2;
CONTEXT: SQL function "b" during inlining
SQL function "c" during startup
Using BEGIN ATOMIC for creating all of these functions will prevent us from getting into this situation, as PostgreSQL will scan the parse tree to ensure all of these functions exist. Let’s now recreate the functions to use BEGIN ATOMIC for their function bodies:
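For example (a’s body is again illustrative; drop any leftover versions of these functions first):

```sql
CREATE FUNCTION a(int)
RETURNS int
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE
BEGIN ATOMIC
    SELECT $1 + 1;
END;

CREATE FUNCTION b(int)
RETURNS int
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE
BEGIN ATOMIC
    SELECT a($1) * 2;
END;

CREATE FUNCTION c(int)
RETURNS int
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE
BEGIN ATOMIC
    SELECT b($1);
END;
```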
Now, try dropping a - PostgreSQL will prevent this from happening.
DROP FUNCTION a(int);
ERROR: cannot drop function a(integer) because other objects depend on it
DETAIL: function b(integer) depends on function a(integer)
function c(integer) depends on function b(integer)
HINT: Use DROP ... CASCADE to drop the dependent objects too.
Great. But what happens if I do a DROP FUNCTION ... CASCADE on function a?
DROP FUNCTION a(int) CASCADE;
NOTICE: drop cascades to 2 other objects
DETAIL: drop cascades to function b(integer)
drop cascades to function c(integer)
DROP FUNCTION
PostgreSQL drops all functions that depend on a, including both b and c. While this is handy for cleaning up a test example, be careful when cascading drops on your production systems so you do not accidentally remove an important object!
Conclusion
The BEGIN ATOMIC syntax for creating PostgreSQL functions should make managing user-defined functions less error-prone, particularly for the “accidentally dropped function” case. Going forward, I will definitely use this feature to manage my PostgreSQL stored functions.
PostgreSQL releases are packed with new features. In fact, PostgreSQL 14 has over 190 features listed in the release notes! This makes it possible to miss a feature that may make it much easier for you to build applications with PostgreSQL. I encourage you to read the release notes to see if there is a feature in them that will help your PostgreSQL experience!
If you’ve been running PostgreSQL for a while, you’ve heard about autovacuum. Yes, autovacuum, the thing which everybody asks you not to turn off, which is supposed to keep your database clean and reduce bloat automatically.
And yet—imagine this: one fine day, you see that your database size is larger than you expect, the I/O load on your database has increased, and things have slowed down without much change in workload. You begin looking into what might have happened. You run the excellent Postgres bloat query and you notice you have a lot of bloat. So you run the VACUUM command manually to clear the bloat in your Postgres database. Good!
But then you have to address the elephant in the room: why didn’t Postgres autovacuum clean up the bloat in the first place…? Does the above story sound familiar? Well, you are not alone. 😊
Autovacuum and VACUUM provide a number of configuration parameters to adapt them to your workload, but the challenge is figuring out which ones to tune. In this post—based on my optimizing autovacuum talk at Citus Con: An Event for Postgres—you’ll learn to figure out where the problem lies and what to tune to make it better.
More specifically, you’ll learn how to investigate and fix these 3 common types of autovacuum problems:
1. Autovacuum doesn’t trigger vacuum often enough
2. Vacuum is too slow
3. Vacuum isn’t cleaning up dead rows
Another common type of autovacuum problem is related to transaction id wraparound, which is a meaty topic all on its own. In the future I plan to write a separate, follow-on blog post to focus on that topic.
Overview of all 13 autovacuum tips in this blog post
This cheat sheet diagram of “autovacuum tips” gives you an overview of all the Postgres autovacuum fixes you’ll learn about in this blog post:
Figure 1: Diagram of the 13 different types of possible autovacuum fixes for the 3 most common types of autovacuum problems in Postgres.
Intro to Autovacuum
If you’re not yet familiar, Postgres uses Multiversion Concurrency Control (MVCC) to guarantee isolation while providing concurrent access to data. This means multiple versions of a row can exist in the database simultaneously. So, when rows are deleted, older versions are still kept around, since older transactions may still be accessing those versions.
Once all transactions which require a row version are complete, those row versions can be removed. This can be done by the VACUUM command. Now, VACUUM can be run manually but that requires you to monitor and make decisions about various things like: when to run vacuum, which tables to vacuum, how frequently to vacuum etc.
To make life easier for you, PostgreSQL has an autovacuum utility that:
wakes up every autovacuum_naptime seconds
checks for tables that have been “significantly modified”
starts more workers to run VACUUM and ANALYZE jobs on those tables in parallel.
Now, the definition of “significantly modified” in bullet #2 above—and how much to vacuum in parallel—depends heavily on your workload, transaction rate, and hardware. Let’s start looking into debugging autovacuum with one of the most common autovacuum issues—autovacuum not vacuuming a “significantly modified” table.
Problem #1: Autovacuum doesn’t trigger vacuum often enough
Vacuuming (outside of transaction id wraparound) is typically triggered for a table if:
obsoleted tuples > autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * number of tuples OR
the number of inserted tuples > autovacuum_vacuum_insert_threshold + autovacuum_vacuum_insert_scale_factor * number of tuples.
If you see bloat growing more than expected and find yourself needing to manually run VACUUM to clear up bloat, it’s an indication that autovacuum is not vacuuming tables often enough.
You can check when and how frequently tables were vacuumed by checking pg_stat_user_tables. If your large tables show up here with low autovacuum counts and last_autovacuum well in the past, it’s another sign that autovacuum isn’t vacuuming your tables at the right time.
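For example, a query along these lines surfaces the relevant columns:

```sql
SELECT relname,
       last_vacuum,
       last_autovacuum,
       vacuum_count,
       autovacuum_count,
       n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```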
To vacuum tables at the right frequency, you should adjust the autovacuum_vacuum_scale_factor and autovacuum_vacuum_insert_scale_factor based on the size and growth rate of the tables.
As an example, for a table which has 1B rows, the default scale factor will lead to a vacuum being triggered when 200M rows change, which is quite a lot of bloat. To bring that to a more reasonable value, it might be wiser to set it to 0.02 or even 0.002 depending on the rate of change and size.
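Scale factors can be set per table, which is usually preferable to lowering the global default; the table name and value here are illustrative:

```sql
-- Trigger autovacuum after ~2% of rows change instead of the default 20%
ALTER TABLE pgbench_accounts
    SET (autovacuum_vacuum_scale_factor = 0.02);
```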
Problem #2: Vacuum is too slow
The 2nd problem you might encounter is that your tables are being vacuumed too slowly. This may manifest as bloat growing because your rate of cleaning up bloat is slower than your transaction rate. Or, you will see vacuum processes running constantly on your system when you check pg_stat_activity.
There are a few ways you can speed up vacuuming: these recommendations apply both to autovacuum and to manually triggered VACUUM.
Reducing the impact of cost limiting
The first thing you should check is if you have cost limiting enabled. When vacuum is running, the system maintains a counter that tracks estimated cost of different I/O operations. When that cost exceeds autovacuum_vacuum_cost_limit (or vacuum_cost_limit), the process sleeps for autovacuum_vacuum_cost_delay (or vacuum_cost_delay) ms. This is called cost limiting and is done to reduce the impact of vacuuming on other processes.
If you notice that vacuum is falling behind, you could disable cost limiting (by setting autovacuum_vacuum_cost_delay to 0) or reduce its impact by either decreasing autovacuum_vacuum_cost_delay or increasing autovacuum_vacuum_cost_limit to a high value (like 10000).
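For example, either of these server-wide changes reduces the impact of cost limiting (values are illustrative; a configuration reload is enough, no restart required):

```sql
-- Option 1: disable cost-based sleeping for autovacuum entirely
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = 0;

-- Option 2: keep the delay but raise the I/O budget accumulated before sleeping
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 10000;

SELECT pg_reload_conf();
```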
Increasing the number of parallel workers
Autovacuum can only vacuum autovacuum_max_workers tables in parallel. So, if you have hundreds of tables being actively written to (and needing to be vacuumed), doing them 3 at a time might take a while (3 is the default value for autovacuum_max_workers).
Therefore, in scenarios with a large number of active tables, it might be worth increasing autovacuum_max_workers to a higher value—assuming you have enough compute to support running more autovacuum workers.
Before increasing the number of autovacuum workers, make sure that you are not being limited by cost limiting. Cost limits are shared among all active autovacuum workers, so just increasing the number of parallel workers may not help, as each of them will then end up doing less work.
To find more ideas on what to tune, it might be worth looking into pg_stat_progress_vacuum to understand what phase your ongoing vacuums are in and how you can improve their performance. Let’s look at a few examples where it might give useful insights.
Improving scan speed by prefetching and caching
To see how fast vacuum is progressing, you could compare heap_blks_scanned with heap_blks_total in pg_stat_progress_vacuum over time. If you see progress is slow and the phase is scanning heap, that means vacuum needs to scan a lot of heap blocks to complete.
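A query along these lines shows how far each running vacuum has gotten:

```sql
SELECT pid,
       phase,
       heap_blks_scanned,
       heap_blks_total,
       round(100.0 * heap_blks_scanned / nullif(heap_blks_total, 0), 1)
           AS pct_scanned,
       index_vacuum_count
FROM pg_stat_progress_vacuum;
```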
In this case, you can scan the heap faster by prefetching larger relations in memory by using something like pg_prewarm or by increasing shared_buffers.
Increasing memory to store more dead tuples
When scanning the heap, vacuum collects dead tuples in memory. The number of dead tuples it can store is determined by maintenance_work_mem (or autovacuum_work_mem, if set). Once the maximum number of tuples have been collected, vacuum must switch to vacuuming indexes and then return to scanning heap again after the indexes and heap are vacuumed (i.e. after an index vacuuming cycle).
So, if you notice that index_vacuum_count in pg_stat_progress_vacuum is high—well, it means that vacuum is having to go through many such index vacuum cycles.
To reduce the number of cycles vacuum needs and to make it faster, you can increase autovacuum_work_mem so that vacuum can store more dead tuples per cycle.
Vacuum indexes in parallel
If you see that the phase in pg_stat_progress_vacuum is vacuuming indexes for a long time, you should check if you have a lot of indexes on the table being vacuumed.
If you have many indexes, you could make vacuuming faster by increasing max_parallel_maintenance_workers to process indexes in parallel. Note that this configuration change will help only if you manually run VACUUM commands. (Unfortunately, parallel vacuum is currently not supported for autovacuum.)
With all these recommendations, you should be able to speed up vacuuming significantly. But, what if your vacuum completes in time and you still notice that dead tuples have not come down? In the upcoming paragraphs, we will try to find causes and solutions for this new type of problem: vacuum finishes but is unable to clean dead rows.
Problem #3: Vacuum isn’t cleaning up dead rows
Vacuum can only clean row versions which no other transaction needs. But, if Postgres feels certain rows are “needed”, they won’t be cleaned up.
Let’s explore 4 common scenarios in which vacuum cannot clean up rows (and what to do about these scenarios!)
Long-running transactions
If you have a transaction that’s been running for several hours or days, the transaction might be holding onto rows, not allowing vacuum to clean the rows. You can find long-running transactions by running:
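For example (the one-hour threshold is illustrative):

```sql
SELECT pid,
       state,
       now() - xact_start AS xact_age,
       query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '1 hour'
ORDER BY xact_start;
```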
To prevent long running transactions from blocking vacuuming, you can terminate them by running pg_terminate_backend() on their PIDs.
To deal with long running transactions in a proactive way, you could:
Set a large statement_timeout to automatically time out long queries, or
Set idle_in_transaction_session_timeout to time out sessions which are idle within an open transaction, or
Set log_min_duration_statement to at least log long running queries so that you can set an alert on them and kill them manually.
Long-running queries on standby with hot_standby_feedback = on
Typically, Postgres can clean up a row version as soon as it isn’t visible to any transaction. If you’re running Postgres on a primary with a standby node, it’s possible for a vacuum to clean up a row version on the primary which is needed by a query on the standby. This situation is called a “replication conflict”—and when it’s detected, the query on the standby node will be cancelled.
To prevent queries on the standby node from being cancelled due to replication conflicts, you can set hot_standby_feedback = on, which will make the standby inform the primary about the oldest transaction running on it. As a result, the primary can avoid cleaning up rows which are still being used by transactions on the standby.
However, setting hot_standby_feedback = on also means that long running queries on the standby have the capability to block rows from getting cleaned up on the primary.
To get the xmin horizon of all your standbys, you can run:
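One way to do that is via pg_stat_replication, whose backend_xmin column reflects each standby’s feedback:

```sql
SELECT application_name,
       client_addr,
       backend_xmin,
       age(backend_xmin) AS xmin_age
FROM pg_stat_replication
ORDER BY age(backend_xmin) DESC NULLS LAST;
```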
To avoid having excessive bloat on the primary due to long-running transactions on the standby, you can take one of the following approaches:
Continue dealing with replication conflicts and set hot_standby_feedback = off.
Set vacuum_defer_cleanup_age to a higher value—in order to defer cleaning up rows on the primary until vacuum_defer_cleanup_age transactions have passed, giving more time to standby queries to complete without running into replication conflicts.
Lastly, you can also track and terminate long running queries on the standby like we discussed for the primary in the long running transactions section above.
Unused replication slots
A replication slot in Postgres stores information required by a replica to catch up with the primary. If the replica is down, or severely behind, the rows in the replication slot can’t be vacuumed on the primary.
This additional bloat can happen for physical replication only when you have hot_standby_feedback = on. For logical replication, you would be seeing bloat only for catalog tables.
You can run the query below to find replication slots with old transactions to retain.
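A query along these lines against pg_replication_slots shows which slots are pinning old transaction ids:

```sql
SELECT slot_name,
       slot_type,
       active,
       xmin,
       catalog_xmin
FROM pg_replication_slots
ORDER BY age(xmin) DESC NULLS LAST;
```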
Once you find them, you can drop inactive or unneeded replication slots by running pg_drop_replication_slot(). You can also apply learnings from the section on how to manage hot_standby_feedback for physical replication.
Uncommitted PREPARED transactions
Postgres supports two-phase commit (2PC), which has 2 distinct steps. First, the transaction is prepared with PREPARE TRANSACTION and second, the transaction is committed with COMMIT PREPARED.
2PCs are resilient transactions meant to tolerate server restarts. So, if you have any PREPARED transactions hanging around for some reason, they might be holding onto rows. You can find old prepared transactions by running:
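For example, via the pg_prepared_xacts view:

```sql
SELECT gid,
       prepared,
       owner,
       database
FROM pg_prepared_xacts
ORDER BY prepared;
```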
You can remove hanging 2PC transactions by running ROLLBACK PREPARED on them manually.
Another possibility: Vacuuming gets terminated repeatedly
Autovacuum knows that it’s a system process and prioritizes itself lower than user queries. So, if a process triggered by autovacuum is unable to acquire the locks it needs to vacuum, the process ends itself. That means if a particular table has DDL running on it almost all the time, a vacuum might not be able to acquire the needed locks and hence dead rows won’t be cleaned up.
If you notice that not being able to get the right locks is causing bloat to rise, you might have to do one of 2 things:
manually VACUUM the table (the good news is that manual VACUUM won’t terminate itself) or
manage the DDL activity on that table to give autovacuum time to clean dead rows
Now that you have walked through the causes and the 13 tips for debugging Postgres autovacuum issues, you should be able to handle problems like: (1) autovacuum is not triggering vacuum often enough; or (2) vacuum is too slow; or (3) vacuum isn’t cleaning up dead rows.
If you’ve addressed all of these and autovacuum still can’t keep up with your transaction rate, it might be time to upgrade your Postgres server to bigger hardware—or to scale out your database using multiple nodes, with Citus.
Below, I’m including a reference table which summarizes all the different Postgres configs we’ve mentioned in this post to optimize autovacuum.
Configuration Parameters for Debugging Postgres Autovacuum
Postgres Configs (in order of appearance)
Recommendation
#1 - Vacuum not triggered enough
autovacuum_vacuum_scale_factor
Lower the value to trigger vacuuming more frequently, useful for larger tables with more updates / deletes.
autovacuum_vacuum_insert_scale_factor
Lower the values to trigger vacuuming more frequently for large, insert-heavy tables.
#2 - Vacuum too slow
autovacuum_vacuum_cost_delay
Decrease to reduce cost limiting sleep time and make vacuuming faster.
autovacuum_vacuum_cost_limit
Increase the cost to be accumulated before vacuum will sleep, thereby reducing sleep frequency and making vacuum go faster.
autovacuum_max_workers
Increase to allow more parallel workers to be triggered by autovacuum.
shared_buffers
Consider increasing memory for shared memory buffers, enabling better caching of blocks which allows vacuum to scan faster.*
autovacuum_work_mem
Increase to allow each autovacuum worker process to store more dead tuples while scanning a table. Set to -1 to fall back to maintenance_work_mem.
maintenance_work_mem
Increase to allow each autovacuum worker process to store more dead tuples while scanning a table.*
max_parallel_maintenance_workers
Increase to allow `VACUUM` to vacuum more indexes in parallel.*
#3 - Vacuum isn’t cleaning up dead rows
statement_timeout
Set to automatically terminate long-running queries after a specified time.**
idle_in_transaction_session_timeout
Set to terminate any session that has been idle with an open transaction for longer than specified time.**
log_min_duration_statement
Set to log each completed statement which takes longer than the specified timeout.**
hot_standby_feedback
Set to “on” so the standby sends feedback to the primary about running queries. Decreases query cancellation, but can increase bloat. Consider switching “off” if bloat is too high.
vacuum_defer_cleanup_age
Set to defer cleaning up row versions until specified transactions have passed. Allows more time for standby queries to complete without running into conflicts due to early cleanup.
* Changing this config can impact queries other than autovacuum. To learn more about the implications, refer to the Postgres documentation.
** These changes to timeouts will apply to all transactions—not just transactions which are holding up dead rows.
Side-note: The other class of autovacuum issues you might run into are related to transaction id wraparound vacuums. These are triggered based on a different criterion and behave differently from your regular vacuum. Hence, they deserve a blog post of their own. I’ll be writing a part 2 of this blog post soon focused on what transaction ID wraparound autovacuums are, what makes them different, and how to deal with common issues encountered while they are running. Stay tuned!
Figure 2: Watch the autovacuum talk from Citus Con: An Event for Postgres, titled Optimizing Autovacuum: PostgreSQL’s vacuum cleaner, available on YouTube.
This article was originally published on citusdata.com.