The pgpool II community is gearing up to release the alpha version of its next major release, pgpool II 4.2. It is going to be another exciting release of pgpool II, a middleware product that provides mission-critical functionality such as load balancing, high availability, and connection pooling for PostgreSQL servers. We have already written in detail about some of the major features of pgpool II 4.2, such as LDAP authentication support and snapshot isolation mode; the purpose of this blog is to provide a brief description of all the major features in the 4.2 release.
The focus of the last couple of major pgpool II releases was primarily performance and high availability. The focus of this release is security and improving the user experience by extending pgpool II's functionality. One major feature that did not make it into this release due to resource constraints was a GUI interface for configuring, managing, and monitoring a pgpool II cluster. This is a much-needed feature for improving the user experience of pgpool II and making it easy to configure and deploy a pgpool II cluster. Some of the infrastructure needed to support it, such as improved statistics, was added (discussed later in this blog), but the GUI interface itself had to wait.
Below is a summary of most of the major and minor features added in the pgpool II 4.2 release:
Logging Collector
Similar to community PostgreSQL, a logging_collector parameter that accepts a boolean value has been added to pgpool II. The logging collector is a background process that captures log messages sent to stderr and redirects them into log files. Please note that this parameter can only be set at pgpool II start.
log_disconnections (boolean)
This is another parameter added to pgpool II that is analogous to its PostgreSQL counterpart; log_disconnections takes a boolean value. The purpose of this parameter is to log all client terminations with pgpool II to the log destination.
Please note that this parameter can be changed by reloading the pgpool II configuration file.
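A minimal pgpool.conf sketch enabling both logging parameters might look like the following; the log_directory and log_filename values are illustrative, not prescribed by the release:

```
# Capture stderr log messages into rotating log files
logging_collector = on
log_directory = '/var/log/pgpool'
log_filename = 'pgpool-%Y-%m-%d_%H%M%S.log'

# Log every client disconnection
log_disconnections = on
```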
Health Check Improvements
Pgpool-II periodically connects to the configured PostgreSQL servers in order to detect errors or faults on the server or the network. This procedure of periodically checking the state of the server or network is called health check. Please note that health check requires an extra connection, so the user needs to adjust the max_connections parameter of PostgreSQL accordingly.
SHOW POOL_HEALTH_CHECK_STATS
Pgpool-II 4.2 provides the "SHOW POOL_HEALTH_CHECK_STATS" command to show health check statistics; the statistics shown by this command are collected by the process that performs the health checking.
This command is really helpful for a system administrator when diagnosing faults and failures. For example, the admin can easily locate a failover event in the log file by looking at the "last_failed_health_check" column. Another example is finding an unstable connection to a backend by evaluating the "average_retry_count" column; if a particular node shows a higher retry count than the other nodes, there may be a problem with the connection to that node.
Please refer to the link for details on the statistical information shown by the "SHOW POOL_HEALTH_CHECK_STATS" command.
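A sketch of what a session might look like; the values below are illustrative, and the real command returns more columns than shown here:

```
postgres=# SHOW POOL_HEALTH_CHECK_STATS;
 node_id | hostname | port | status | role    | fail_count | average_retry_count | last_failed_health_check
---------+----------+------+--------+---------+------------+---------------------+--------------------------
 0       | pg1      | 5432 | up     | primary | 0          | 0.00                |
 1       | pg2      | 5432 | up     | standby | 2          | 1.20                | 2020-10-01 11:20:34
```

A non-empty last_failed_health_check on node 1 would point the admin at that node's log around that timestamp.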
SHOW POOL_BACKEND_STATS
This is another very useful command for displaying pgpool II backend statistics. The command displays the node id, hostname, port, status, role, and the counts of the following types of queries issued to each backend:
SELECT
INSERT
UPDATE
DELETE
DDL
Other queries
This command is really useful for understanding the type of traffic sent to each backend server; please visit the link below for details on this command.
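A sketch of the output; the column names follow the query types listed above, and the counts are illustrative:

```
postgres=# SHOW POOL_BACKEND_STATS;
 node_id | hostname | port | status | role    | select_cnt | insert_cnt | update_cnt | delete_cnt | ddl_cnt | other_cnt
---------+----------+------+--------+---------+------------+------------+------------+------------+---------+-----------
 0       | pg1      | 5432 | up     | primary | 120        | 10         | 6          | 1          | 0       | 20
 1       | pg2      | 5432 | up     | standby | 340        | 0          | 0          | 0          | 0       | 15
```

With load balancing enabled, read traffic (select_cnt) spreads across nodes while writes land only on the primary, which this output makes immediately visible.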
LDAP Authentication Support
This is a major feature added to pgpool II 4.2: it provides LDAP authentication between the client and the pgpool II server. This was a much-awaited feature, since support for LDAP connectivity between pgpool II and the backend server was already there. With the addition of this feature in pgpool II 4.2, the user can get end-to-end LDAP connectivity from the client through pgpool II to the backend PostgreSQL server, using the same LDAP server throughout.
I have written a detailed blog on how to get LDAP connectivity working with pgpool II; it delves into how to get the LDAP server set up and configured and how to get it working with a pgpool II setup. The blog is really helpful for users trying to get LDAP connectivity working with pgpool II.
pcp_reload_config
This is a minor but very useful feature to reload the configuration file on the local pgpool II node or on all pgpool II nodes. The pcp_reload_config command takes a --scope (or -s) command line switch; passing c to --scope reloads the configuration files on all pgpool II cluster nodes, while passing l only reloads the configuration file on the local pgpool II node.
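A sketch of the two invocations; the host, port, and PCP user below are illustrative and should be adjusted to your own pcp setup:

```
# Reload the configuration on every node in the cluster
pcp_reload_config -h localhost -p 9898 -U pgpool -s c

# Reload only the local node's configuration
pcp_reload_config -h localhost -p 9898 -U pgpool -s l
```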
Snapshot Isolation Mode
This is a major and complex feature added in pgpool II 4.2; it is really critical for a distributed system where transactions span multiple servers. The scale-out solutions being implemented in community PostgreSQL can also learn from this feature implemented in the pgpool middleware.
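Assuming the clustering-mode parameter introduced in the 4.2 configuration (backend_clustering_mode), enabling snapshot isolation mode in pgpool.conf might look like this:

```
# Run pgpool II in native replication mode with snapshot isolation,
# so concurrent transactions see a consistent snapshot across all backends.
backend_clustering_mode = 'snapshot_isolation'
```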
PostgreSQL 13 Parser Support
Every major pgpool II release is made compatible with the parser of the latest PostgreSQL release, and pgpool II 4.2 is compatible with the PostgreSQL 13 parser. Any new grammar rules or features added in the PostgreSQL 13 parser will be recognized by pgpool II 4.2; this means that 4.2 will recognize the new keywords added in PG-13 and deal with them accordingly.
Conclusion
I have given a brief introduction to most of the major and minor features added in the pgpool II 4.2 release; links are provided where necessary to give more details about some of the features.
It is clearly evident that pgpool II has come a long way in terms of functionality, security, and stability in the last few years. It is becoming the middleware of choice for PostgreSQL.
The next major release of pgpool II after 4.2 will focus on ease of use and on providing a graphical user interface that makes the configuration, management, and monitoring of Pgpool-II easy.
Ahsan Hadi is a VP of Development with HighGo Software Inc. Prior to coming to HighGo Software, Ahsan worked at EnterpriseDB for 15 years, most recently as a Senior Director of Product Development. The flagship product of EnterpriseDB is Postgres Plus Advanced Server, which is based on open-source PostgreSQL. Ahsan has vast experience with Postgres and led the development team at EnterpriseDB that built the Oracle-compatible layer of EDB's Postgres Plus Advanced Server. Ahsan has also spent a number of years working with the development team adding horizontal scalability and sharding to Postgres. Initially, he worked with Postgres-XC, a multi-master sharded cluster, and later managed the development of adding horizontal scalability/sharding to Postgres. Ahsan has also worked a great deal with Postgres foreign data wrapper technology and has worked on developing and maintaining FDWs for several SQL and NoSQL databases such as MongoDB, Hadoop, and MySQL.
Prior to EnterpriseDB, Ahsan worked for Fusion Technologies as a Senior Project Manager. Fusion Tech was a US-based consultancy company; Ahsan led the team that developed a Java-based job factory responsible for placing items on shelves at big stores like Walmart. Prior to Fusion Technologies, Ahsan worked at British Telecom as an Analyst/Programmer and developed web-based database applications for network fault monitoring.
Ahsan joined HighGo Software Inc (Canada) in April 2019 and is leading development teams based in multiple geos; the primary responsibility is community-based Postgres development as well as developing the HighGo Postgres server.
PostgreSQL 13 is released with some cool features, such as index enhancements, partition enhancements, and many others. Along with these enhancements, there are some security-related enhancements that require some explanation. There are two major ones: one is related to libpq and the other is related to postgres_fdw. As postgres_fdw is considered a "reference implementation" for other foreign data wrappers, all other foreign data wrappers follow in its footsteps in development. It is a community-supported foreign data wrapper. This blog will explain the security changes in postgres_fdw.
1 – Superusers can permit non-superusers to establish a password-less connection with postgres_fdw
Previously, only a superuser could establish a password-less connection with PostgreSQL using postgres_fdw; no other password-less authentication was allowed. It had been observed that in some cases no password is required, so that limitation does not always make sense. Therefore, PostgreSQL 13 introduces a new option (password_required) through which superusers can permit non-superusers to use a password-less connection with postgres_fdw.
postgres=# CREATE EXTENSION postgres_fdw;
CREATE EXTENSION
postgres=# CREATE SERVER postgres_svr FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname 'postgres');
CREATE SERVER
postgres=# CREATE FOREIGN TABLE foo_for(a INT) SERVER postgres_svr OPTIONS(table_name 'foo');
CREATE FOREIGN TABLE
postgres=# create user MAPPING FOR vagrant SERVER postgres_svr;
CREATE USER MAPPING
postgres=# SELECT * FROM foo_for;
a
---
1
2
3
(3 rows)
When we perform the same query as a non-superuser, we will get this error message:
ERROR: password is required
DETAIL: Non-superusers must provide a password in the user mapping.
postgres=# CREATE USER nonsup;
CREATE ROLE
postgres=# create user MAPPING FOR nonsup SERVER postgres_svr;
CREATE USER MAPPING
postgres=# grant ALL ON foo_for TO nonsup;
GRANT
vagrant@vagrant:/work/data$ psql postgres -U nonsup;
psql (13.0)
Type "help" for help.
postgres=> SELECT * FROM foo_for;
2020-09-28 13:00:02.798 UTC [16702] ERROR: password is required
2020-09-28 13:00:02.798 UTC [16702] DETAIL: Non-superusers must provide a password in the user mapping.
2020-09-28 13:00:02.798 UTC [16702] STATEMENT: SELECT * FROM foo_for;
ERROR: password is required
DETAIL: Non-superusers must provide a password in the user mapping.
Now perform the same query as the non-superuser after setting the new parameter password_required to 'false' while creating the user mapping.
vagrant@vagrant:/work/data$ psql postgres
psql (13.0)
Type "help" for help.
postgres=# DROP USER MAPPING FOR nonsup SERVER postgres_svr;
DROP USER MAPPING
postgres=# CREATE USER MAPPING FOR nonsup SERVER postgres_svr OPTIONS(password_required 'false');
CREATE USER MAPPING
vagrant@vagrant:/work/data$ psql postgres -U nonsup;
psql (13.0)
Type "help" for help.
postgres=> SELECT * FROM foo_for;
a
---
1
2
3
(3 rows)
2 – Authentication via an SSL certificate
A new option is provided to use an SSL certificate for authentication in postgres_fdw. Two new options, sslkey and sslcert, have been added for this purpose.
Step 1: Generate the server key
vagrant@vagrant$ openssl genrsa -des3 -out server.key 1024
Generating RSA private key, 1024 bit long modulus (2 primes)
.+++++
..................+++++
e is 65537 (0x010001)
Enter pass phrase for server.key:
Verifying - Enter pass phrase for server.key:
vagrant@vagrant$ openssl rsa -in server.key -out server.key
Enter pass phrase for server.key:
writing RSA key
Step 2: Change the mode of the server.key
vagrant@vagrant$ chmod og-rwx server.key
Step 3: Generate the certificate
vagrant@vagrant$ openssl req -new -key server.key -days 3650 -out server.crt -x509
-----
Country Name (2 letter code) [AU]:PK
State or Province Name (full name) [Some-State]:ISB
Locality Name (eg, city) []:Islamabad
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Percona
Organizational Unit Name (eg, section) []:Dev
Common Name (e.g. server FQDN or YOUR name) []:localhost
Email Address []:ibrar.ahmad@gmail.com
vagrant@vagrant$ cp server.crt root.crt
Now we need to generate the client certificate.
Step 4: Generate a Client key
vagrant@vagrant$ openssl genrsa -des3 -out /tmp/postgresql.key 1024
Generating RSA private key, 1024 bit long modulus (2 primes)
..........................+++++
.....................................................+++++
e is 65537 (0x010001)
Enter pass phrase for /tmp/postgresql.key:
Verifying - Enter pass phrase for /tmp/postgresql.key:
vagrant@vagrant$ openssl rsa -in /tmp/postgresql.key -out /tmp/postgresql.key
Enter pass phrase for /tmp/postgresql.key:
writing RSA key
vagrant@vagrant$ openssl req -new -key /tmp/postgresql.key -out
-----
Country Name (2 letter code) [AU]:PK
State or Province Name (full name) [Some-State]:ISB
Locality Name (eg, city) []:Islamabad
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Percona
Organizational Unit Name (eg, section) []:Dev
Common Name (e.g. server FQDN or YOUR name) []:127.0.0.1
Email Address []:ibrar.ahmad@gmail.com
Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:pakistan
An optional company name []:Percona
Now we are ready, and we can create a foreign server in PostgreSQL with certificates.
postgres=# CREATE server postgres_ssl_svr foreign data wrapper postgres_fdw options (dbname 'postgres', host 'localhost', port '5555', sslcert '/tmp/postgresql.crt', sslkey '/tmp/postgresql.key', sslrootcert '/tmp/root.crt');
CREATE SERVER
postgres=# create user MAPPING FOR vagrant SERVER postgres_ssl_svr;
CREATE USER MAPPING
postgres=# create foreign table foo_ssl_for(a int) server postgres_ssl_svr options(table_name 'foo');
CREATE FOREIGN TABLE
Now we are ready and set to query a foreign table by postgres_fdw using certificate authentication.
postgres=# select * from foo_ssl_for;
a
---
1
2
3
(3 rows)
Note: Only superusers can modify the sslcert and sslkey user mapping options.
Unlike the indexes we have already become acquainted with, the idea of BRIN is to avoid looking through definitely unsuitable rows rather than to quickly find the matching ones. It is always an inexact index: it does not contain TIDs of table rows at all.
Put simply, BRIN works fine for columns whose values correlate with their physical location in the table. In other words, it fits when a query without an ORDER BY clause returns the column values virtually in increasing or decreasing order (and there are no indexes on that column).
This access method was created in the scope of Axle, the European project for extremely large analytical databases, with an eye on tables that are several terabytes or dozens of terabytes in size. An important feature of BRIN that enables us to create indexes on such tables is its small size and the minimal overhead cost of maintenance.
This works as follows. The table is split into ranges that are several pages (or several blocks, which is the same thing) large - hence the name: Block Range Index, BRIN. The index stores summary information on the data in each range. As a rule, this is the minimal and maximal values, but it can be different, as shown further on. Assume a query is performed that contains a condition on a column: if the sought values do not fall into the stored interval, the whole range can be skipped; but if they do, all rows in all blocks of the range will have to be looked through to choose the matching ones.
It would not be a mistake to treat BRIN not as an index, but as an accelerator of sequential scans. We can regard BRIN as an alternative to partitioning if we consider each range as a "virtual" partition.
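As a minimal sketch of the idea above (table and index names are illustrative), a BRIN index on an append-only, timestamp-ordered table would be created like this:

```sql
-- Append-only log table: created_at grows with the physical row order,
-- which is exactly the correlation BRIN relies on.
CREATE TABLE events (
    id         bigserial PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now(),
    payload    text
);

-- The default summary granularity is 128 pages per range; a smaller
-- pages_per_range gives finer range pruning at the cost of a larger index.
CREATE INDEX events_created_at_brin
    ON events USING brin (created_at)
    WITH (pages_per_range = 32);
```

A range query such as `WHERE created_at BETWEEN ... AND ...` can then skip every 32-page block range whose stored min/max summary does not overlap the interval.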
Now let's discuss the structure of the index in more detail.
This article sets out to compare PostGIS in Rails with Geocoder and to highlight a couple of the areas where you'll want to (or need to) reach for one over the other. I also present some of the terminology and libraries that I found along the way of working on this project and article as I set out to better understand PostGIS and how it is integrated with Rails.
There have been many big features added to PostgreSQL 13, like parallel vacuum, B-tree index deduplication, etc.; a complete list can be found in the PostgreSQL 13 release notes. Along with the big features, there are also small ones, including dropdb --force.
dropdb --force
A new command-line option has been added to the dropdb command, and a similar SQL option, FORCE, has been added to DROP DATABASE. When the option -f or --force is used with the dropdb command, or FORCE with DROP DATABASE, all existing connections to the database are terminated before it is dropped.
In the first terminal, create a database named test and connect to it.
vagrant@vagrant:~$ createdb test;
vagrant@vagrant:~$ psql test
psql (13.0)
Type "help" for help.
In the second terminal, try to drop the test database and you will get the error message that the test database is being used by another user.
vagrant@vagrant:/usr/local/pgsql.13/bin$ psql postgres
psql (13.0)
Type "help" for help.
postgres=# drop database test;
ERROR: database "test" is being accessed by other users
DETAIL: There is 1 other session using the database.
Now try the same command with the FORCE option. You will see that the database is dropped successfully.
postgres=# drop database test WITH ( FORCE );
DROP DATABASE
Note: you can also use the command line dropdb test -f.
The session on the first terminal will be terminated.
test=# \d
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>
WAL is short for Write-Ahead Log. Any change to the data is first recorded in a WAL file. WAL files are mainly used by an RDBMS as a way to achieve durability and consistency while writing data to storage systems.
Before we move forward, let's first see why we need WAL archiving and Point-in-Time Recovery (PITR). What if you have accidentally dropped some tables or deleted some data? How do you recover from such mistakes? WAL archiving and PITR are the answer. A WAL file can be replayed on a server to recreate the recorded changes on that server, hence we can use the WALs to recover from such dangerous situations. PITR, in turn, is a way to stop the replay of WALs at a specified point and have a consistent snapshot of the data at that time, i.e. just before the table was dropped or the data was removed.
How to Perform WAL Archiving
Normally, PostgreSQL keeps the WAL files in the pg_wal directory of $PGDATA. However, these WAL files may get recycled and can be deleted/overwritten by the server. To avoid such scenarios, we keep a copy of the WAL files in a separate directory outside $PGDATA. For that purpose, the PostgreSQL server provides a way to copy each WAL file to a different location as soon as it is generated. This mechanism depends on three configuration parameters, namely archive_mode, archive_command, and wal_level. These options can be set in the $PGDATA/postgresql.conf configuration file.
Archiving Options
PostgreSQL server provides us with some options through which we can control the WAL archiving. Let’s see what these options are and how to use them.
archive_mode signifies whether we want to enable the WAL archiving. It can accept the following values:
on – to enable the archiving
off – disable the archiving
always – normally this option is the same as ‘on’. This enables archiving for a standby server as well. If the standby shares the same path with another server, it may lead to WAL file corruption. So care must be taken in this case.
archive_command specifies how and where to archive (copy) the WAL files. This option accepts a shell command or a shell script, which is executed whenever the server generates a WAL file to archive. The command accepts the following placeholders:
%f – if present it’s replaced with the filename of the WAL file.
%p – if present it is replaced with the pathname of the WAL file.
%% – is replaced with ‘%’
wal_level is another important option. In PostgreSQL version 10+, it defaults to 'replica'; prior to that version it was set to 'minimal' by default. wal_level accepts the following values:
minimal – records only the information required to recover from a crash or immediate shutdown. It is not usable for replication or archiving purposes.
replica – signifies that the WAL will have enough information for WAL archiving and replication.
logical – adds the information required for logical replication.
An example of a WAL archiving setup:
vim $PGDATA/postgresql.conf
archive_mode = on
archive_command = 'cp %p /path/to/archive_dir/%f'
wal_level = replica
Point In Time Recovery
In PostgreSQL, PITR is a way to stop the replay of WAL files at an appropriate point in time. There can be many WAL files in the archive, but we may not want to replay all of them; replaying all WALs would land us back in the same state in which we had made the mistake. There are two important prerequisites for PITR to work.
Availability of a full base backup (usually taken with pg_basebackup)
WAL files (WAL archive)
To achieve PITR, the first step is to restore a previously taken base backup and then create a recovery setup. The setup requires configuring the restore_command and recovery_target options.
restore_command specifies from where to look up the WAL files to replay on this server. This command accepts the same placeholders as archive_command.
recovery_target_time tells the server when to stop the recovery or replay process. The process will stop as soon as the given timestamp is reached.
recovery_target_inclusive controls whether to stop the replay of WALs just after the recovery_target_time is reached (if set to true) or just before it (if set to false).
An example of PITR recovery options:
vim $BACKUP/postgresql.conf
restore_command = 'cp /path/to/archive_dir/%f %p'
recovery_target_time = ''
recovery_target_inclusive = false
Demo
Let’s combine all of the above in a practical demonstration and see how this all works.
# Start server
./pg_ctl -D $PGDATA start
# take a base backup
./pg_basebackup -D $BACKUP -Fp
# connect and put some data
./psql postgres
postgres=# create table foo(c1 int, c2 timestamp default current_timestamp);
CREATE TABLE
postgres=# insert into foo select generate_series(1, 1000000), clock_timestamp();
INSERT 0 1000000
postgres=# select current_timestamp;
current_timestamp
-------------------------------
2020-10-01 18:01:18.157764+05
(1 row)
postgres=# delete from foo;
DELETE 1000000
postgres=# select current_timestamp;
current_timestamp
-------------------------------
2020-10-01 18:01:36.272033+05
(1 row)
Let’s stop the server and create a recovery setup on the backup to stop before the deletion of data occurred.
./pg_ctl -D $PGDATA stop
# tell this cluster to start on recovery mode.
touch $BACKUP/recovery.signal
# edit configuration file to setup recovery options on the backup cluster.
vim $BACKUP/postgresql.conf
# Recovery Options
restore_command = 'cp $HOME/wal_archive/%f %p'
recovery_target_time = '2020-10-01 18:01:18.157764+05'
recovery_target_inclusive = false
# let’s start the backup cluster and start the recovery process.
./pg_ctl -D $BACKUP start
2020-10-01 18:03:17.365 PKT [71219] LOG: listening on IPv4 address "127.0.0.1", port 5432
2020-10-01 18:03:17.366 PKT [71219] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-10-01 18:03:17.371 PKT [71220] LOG: database system was interrupted; last known up at 2020-10-01 18:00:45 PKT
2020-10-01 18:03:17.413 PKT [71220] LOG: starting point-in-time recovery to 2020-10-01 18:01:18.157764+05
2020-10-01 18:03:17.441 PKT [71220] LOG: restored log file "000000010000000000000002" from archive
2020-10-01 18:03:17.463 PKT [71220] LOG: redo starts at 0/2000028
2020-10-01 18:03:17.463 PKT [71220] LOG: consistent recovery state reached at 0/2000100
2020-10-01 18:03:17.463 PKT [71219] LOG: database system is ready to accept read only connections
2020-10-01 18:03:17.501 PKT [71220] LOG: restored log file "000000010000000000000003" from archive
2020-10-01 18:03:18.182 PKT [71220] LOG: restored log file "000000010000000000000004" from archive
2020-10-01 18:03:18.851 PKT [71220] LOG: restored log file "000000010000000000000005" from archive
2020-10-01 18:03:19.539 PKT [71220] LOG: restored log file "000000010000000000000006" from archive
2020-10-01 18:03:20.195 PKT [71220] LOG: restored log file "000000010000000000000007" from archive
2020-10-01 18:03:20.901 PKT [71220] LOG: restored log file "000000010000000000000008" from archive
2020-10-01 18:03:21.574 PKT [71220] LOG: restored log file "000000010000000000000009" from archive
2020-10-01 18:03:22.286 PKT [71220] LOG: restored log file "00000001000000000000000A" from archive
2020-10-01 18:03:22.697 PKT [71220] LOG: recovery stopping before commit of transaction 508, time 2020-10-01 18:01:32.304189+05
2020-10-01 18:03:22.697 PKT [71220] LOG: pausing at the end of recovery
2020-10-01 18:03:22.697 PKT [71220] HINT: Execute pg_wal_replay_resume() to promote.
Connect to this cluster and see whether we still have the data in the foo table:
postgres=# select count(*) from foo;
count
---------
1000000
(1 row)
Since the recovery stopped just before the delete, we still have the data!
Conclusion
A server crash may result in an invalid or corrupt data directory, leading to downtime; that cannot always be avoided, but the resulting data loss can be. PITR is a critical process for recovering important data in such cases.
Keeping an up-to-date full backup along with the WAL files makes for a simple and convenient restoration process. The recovery time will depend on how recent the full backup is and how many WAL files were generated after it.
Asif Rehman is a Senior Software Engineer at HighGo Software. He joined EnterpriseDB, an enterprise PostgreSQL company, in 2005 and started his career in open source development, particularly in PostgreSQL. Asif's contributions range from developing in-house features relating to Oracle compatibility to developing tools around PostgreSQL. He joined HighGo Software in September 2018.
This long article describes the many challenges of managing open
source projects and the mismatch between resource allocation, e.g., money, and the importance of the software to economic activity. It highlights OpenSSL as an
example where limited funding led to developer burnout and security vulnerabilities, even though so much of the Internet's infrastructure relies on it.
With proprietary software, there is usually a connection between software cost and its economic value, though the linkage varies widely. (How much of software's cost goes into software development,
testing, bug fixing, and security analysis has even greater variability.) With open source, there is even less linkage.
The article explores various methods to increase the linkage. It is a complex problem, both to get money, and to distribute money in a way that helps and does not harm open source communities.
Query optimization can take different forms depending on the data represented and the required
needs. In a recent case, we had a large table that we had to query for some non-indexed criteria.
This table was on an appliance that we were unable to modify, so we had to find a way to query
efficiently without indexes that would have made it easier.
The straightforward approach for this query was something along these lines:
SELECT accounts.*
FROM accounts
JOIN logs
ON accounts.id = logs.account_id
WHERE
logs.created_at BETWEEN $1 - interval '1 minute' AND $1 + interval '1 minute' AND
logs.field1 = $2 AND
logs.field2 = $3
FETCH FIRST ROW ONLY
Unfortunately, none of the fields involved in this query were indexed, nor could they be, due to
our access level on this database system. This lack of indexes meant that our query against those
fields would end up doing a sequential scan of the whole table, which made things unacceptably slow.
This specific table held time-series data with ~100k records per 1-minute period over several weeks,
which meant we were dealing with a lot of data.
While we could not create any additional indexes to help us with this query, we could use some
specific properties to help us:
There was a primary key field, id, which was unique and monotonic, i.e., always
increasing in value.
This table was append only; no updates or deletes, so once data existed in the table it was
always the same.
The field we actually care about (created_at) also ends up being monotonic: subsequent records would always have the same or later values.
Since records were created sequentially and the id field was always increasing, the id and
created_at fields would together be generally monotonic; this means there is an indexed field
which we can use as a surrogate stand-in for the target field that we want to treat as indexed.
While due to the nature of logging ingest it is possible that the created_at and id values are not
strictly monotonic (for instance, if there are multiple logging records being created by separate
ingest processes whose ids get assigned in chunks), for our purposes this was close enough,
since we were searching a time window wider than the one in which we actually expected the
message to appear.
Since we are looking for fields matching a specific window of time, we can substitute the
non-indexed clause created_at BETWEEN <timestamp_min> AND <timestamp_max> with an expression
matching the indexed statement id BETWEEN <id of first id gt timestamp_min> AND <id of first id gt
timestamp_max> to get the same effective approach.
In order to find the specific id fields which match the created_at time ranges we are
interested in, we would need to find the first id value which matched the criteria created_at >
'timestamp'::timestamp, as all subsequent id values would match that condition as well. This
would effectively require a binary search of the table to check which records match the criteria,
and return the smallest id value for which this criteria held.
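The binary search described above can be sketched outside the database as well. The following Python sketch models the table as a sorted list of (id, created_at) pairs and finds the smallest id whose created_at exceeds a threshold; the names are illustrative, not from the original system:

```python
def first_id_after(rows, threshold):
    """Return the smallest id whose timestamp is > threshold.

    rows: list of (id, created_at) tuples, monotonic in both fields.
    Returns None if no row qualifies.
    """
    lo, hi = 0, len(rows)
    while lo < hi:
        mid = (lo + hi) // 2
        if rows[mid][1] > threshold:
            hi = mid          # candidate found; keep searching to the left
        else:
            lo = mid + 1      # too early; search to the right
    return rows[lo][0] if lo < len(rows) else None

# Example: ids 1..10 with timestamps 100..1000 in steps of 100.
table = [(i, i * 100) for i in range(1, 11)]
print(first_id_after(table, 450))  # -> 5 (first row with created_at > 450)
```

Running this procedure twice, once for timestamp_min and once for timestamp_max, yields the id range that the SQL version substitutes into the indexed BETWEEN clause.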
So now that we have identified how we can use an indexed surrogate key to substitute for the
non-indexed expression, we need to figure out how to calculate the ranges in question.
Based on some recent discoveries about optimizing simple-looking but poorly-performing queries using
more complicated queries[1], I had a hunch that this could be solved with a WITH RECURSIVE
Common Table Expression. After toying around for a while without coming up with the exact solution,
I ended up visiting the #postgresql channel on the FreeNode IRC network. There, I presented the
problem and got some interested responses, as this is the exact kind of question that database
experts love[2]. The solution that user xocolatl (Vik Fearing) came up with for a basic
binary search is the basis for the rest of my solution:
CREATE TABLE test_table (id integer PRIMARY KEY, ts timestamptz);
INSERT INTO test_table
SELECT g, date 'today' + interval '1s' * g
FROM generate_series(1, 1000000) AS g;
WITH RECURSIVE
search (min, max, middle, level) AS (
SELECT min(id), max(id), (min(id)+max(id))/2, 0
FROM test_table
UNION ALL
SELECT v.min, v.max, (v.min + v.max)/2, s.level+1
FROM search AS s
CROSS JOIN LATERAL (
SELECT *
FROM test_table AS e
WHERE e.id >= s.middle
ORDER BY e.id
FETCH FIRST ROW ONLY
) AS e
CROSS JOIN LATERAL (VALUES (
CASE WHEN e.ts < now() THEN e.id ELSE s.min END,
CASE WHEN e.ts < now() THEN s.max ELSE e.id END
)) AS v (min, max)
WHERE (v.min + v.max)/2 NOT IN (v.min, v.max)
)
SELECT *
FROM search AS s
JOIN test_table AS e ON e.id = s.middle
ORDER BY s.level DESC
FETCH FIRST ROW ONLY;
As expected, the solution involved a WITH RECURSIVE CTE.
The basic explanation here is that the search expression first starts with the min, max, and
middle (mean) values of id for the table (the initialization expression), then iteratively adds
additional rows to the results depending on whether the table row with id >= middle matches our
specific test criteria, and continues until one of the boundaries of the region is hit. (Since we
are using integer division in our terminal expression (v.min + v.max)/2 NOT IN (v.min, v.max), we
are guaranteed to eventually hit one of the boundary conditions in our search.)
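The same convergence logic can be sketched outside SQL. The following Python model is purely illustrative (the `boundary_search` helper and its data layout are invented here, not part of the original post); it mirrors the CTE's halving of the [min, max] id range until the midpoint collides with a boundary:

```python
import bisect

def boundary_search(ids, ts, cutoff):
    """Mimic the CTE: ids is a sorted list of ids (gaps allowed), ts[i] is the
    timestamp of ids[i], and the predicate ts < cutoff is monotone in id.
    Halve the [lo, hi] id range until the midpoint hits a boundary."""
    lo, hi = ids[0], ids[-1]
    middle = (lo + hi) // 2
    while middle not in (lo, hi):            # terminal expression from the CTE
        i = bisect.bisect_left(ids, middle)  # first row with id >= middle
        if ts[i] < cutoff:
            lo = ids[i]   # condition holds: search the upper half
        else:
            hi = ids[i]   # condition fails: search the lower half
        middle = (lo + hi) // 2
    return middle

# With ids 1..1000 and ts equal to id, cutoff 500 converges on id 499: the
# largest id whose ts is strictly below the cutoff.
ids = list(range(1, 1001))
print(boundary_search(ids, ids, 500))  # 499
```

Because integer division shrinks the interval on every step, the loop terminates even when ids have gaps, just as in the SQL version.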
A few other things worthy of note:
This approach uses the check e.ts < now() as the test condition, which means that in this
specific example the answer to this "closest id" query would actually change depending on when you
run the query relative to when the initial test data was populated. However, we can replace that
condition with whatever condition we want to use to test our surrogate non-indexed data.
This approach will work whether or not there are gaps in the sequence. In order to properly
handle gaps, we are selecting the first row with id >= middle ... FETCH FIRST ROW ONLY rather than
just selecting id = middle, which you could do in a gapless sequence.
In addition to not caring about gaps, we are also not trying to ensure this is a
balanced binary search; it would not be worth the computing effort to find the middlest existing row
in an index, as we'd need to know the number of rows in the segment we're searching. Since in
PostgreSQL this would entail a COUNT(*) over a subselect, it would be quite slow and not worth
the trouble.
Since my specific use case was limiting the records considered based on two
created_at values, I needed to calculate a search_min and a search_max to find the start/end
ids for each side of the interval.
Given this, I just modified the CTE to calculate both boundaries and added the additional
conditions we wanted to consider. I also had to turn the result query from a join against the
found id value into a range; the final query is as follows:
WITH RECURSIVE
search_min (min, max, middle, level) AS (
SELECT min(id), max(id), (min(id)+max(id))/2, 0
FROM logs
UNION ALL
SELECT v.min, v.max, (v.min + v.max)/2, s.level+1
FROM search_min AS s
CROSS JOIN LATERAL (
SELECT *
FROM logs AS e
WHERE e.id >= s.middle
ORDER BY e.id
FETCH FIRST ROW ONLY
) AS e
CROSS JOIN LATERAL (VALUES (
CASE WHEN extract(epoch FROM e.created_at)::integer < $1 THEN e.id ELSE s.min END,
CASE WHEN extract(epoch FROM e.created_at)::integer < $1 THEN s.max ELSE e.id END
)) AS v (min, max)
WHERE (v.min + v.max)/2 NOT IN (v.min, v.max)
),
search_max (min, max, middle, level) AS (
SELECT min(id), max(id), (min(id)+max(id))/2, 0
FROM logs
UNION ALL
SELECT v.min, v.max, (v.min + v.max)/2, s.level+1
FROM search_max AS s
CROSS JOIN LATERAL (
SELECT *
FROM logs AS e
WHERE e.id >= s.middle
ORDER BY e.id
FETCH FIRST ROW ONLY
) AS e
CROSS JOIN LATERAL (VALUES (
CASE WHEN extract(epoch FROM e.created_at)::integer < $2 THEN e.id ELSE s.min END,
CASE WHEN extract(epoch FROM e.created_at)::integer < $2 THEN s.max ELSE e.id END
)) AS v (min, max)
WHERE (v.min + v.max)/2 NOT IN (v.min, v.max)
)
SELECT accounts.*
FROM accounts
JOIN logs
ON logs.account_id = accounts.id
WHERE
logs.field1 = $3 AND
logs.field2 = $4 AND
logs.id >= (SELECT middle FROM search_min ORDER BY level DESC FETCH FIRST ROW ONLY) AND
logs.id <= (SELECT middle FROM search_max ORDER BY level DESC FETCH FIRST ROW ONLY)
ORDER BY logs.id
FETCH FIRST ROW ONLY;
The final results were drastically improved. The initial query went from timing out in the
webservice in question to returning results in a fraction of a second. Clearly this technique,
while not as useful as directly indexing data we care about, can come in handy in some
circumstances.
On 11th of September 2020, Alvaro Herrera committed patch: psql: Display stats target of extended statistics The stats target can be set since commit d06215d03, but wasn't shown by psql. Author: Justin Pryzby <justin@telsasoft.com> Discussion: https://postgr.es/m/20200831050047.GG5450@telsasoft.com Reviewed-by: Georgios Kokolatos <gkokolatos@protonmail.com> Reviewed-by: Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> Since PostgreSQL 10 we have so called extended statistics. … Continue reading "Waiting for PostgreSQL 13 – psql: Display stats target of extended statistics"
On 5th of October 2020, Peter Eisentraut committed patch: Support for OUT parameters in procedures Unlike for functions, OUT parameters for procedures are part of the signature. Therefore, they have to be listed in pg_proc.proargtypes as well as mentioned in ALTER PROCEDURE and DROP PROCEDURE. Reviewed-by: Andrew Dunstan <andrew.dunstan@2ndquadrant.com> Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com> … Continue reading "Waiting for PostgreSQL 14 – Support for OUT parameters in procedures"
PostgreSQL Person of the Week Interview with Andreas Kretschmer: I was born in Meißen, Saxony, Germany, Planet Earth. I’m married and we have 3 wonderful daughters.
You don't need monitoring until you need it. But if you're running anything in production, you always need it.
This is particularly true if you are managing databases. You need to be able to answer questions like "am I running out of disk?" or "why does my application have degraded performance?" to be able to troubleshoot or mitigate problems before they occur.
When I first made a foray into how to monitor PostgreSQL in Kubernetes, let alone in a containerized environment, I learned that a lot of the tools I had used previously did not exactly apply (though keep in mind, that foray was a while back -- things have changed!). I found myself learning a whole new tech stack for monitoring, including open source projects such as Prometheus and Grafana.
I also learned how much I had taken for granted how easy it was to collect information like CPU and memory statistics in other environments. In the container world this was a whole different ballgame, as you needed to get this information from cgroups. Fortunately for me, my colleague Joe Conway built a PostgreSQL extension called pgnodemx that reads these values from within PostgreSQL itself. Read more about pgnodemx.
And then there is the process of getting the metrics stack set up. Even with my earlier experiments on setting up PostgreSQL monitoring with Docker, I knew there was more work to be done to make an easy-to-setup monitoring solution in Kubernetes.
All this, combined with the adoption of the PostgreSQL Operator, made us want to change how we support monitoring PostgreSQL clusters on Kubernetes. We wanted to continue using proven open source solutions for monitoring and analyzing systems in Kubernetes (e.g. Prometheus, Grafana), introduce support for alerting (Alertmanager), and provide accurate host-style metrics for things like CPU, memory, and disk usage.
In PostgreSQL table bloat has been a primary concern since the original MVCC model was conceived. Therefore we have decided to do a series of blog posts discussing this issue in more detail. What is table bloat in the first place? Table bloat means that a table and/or indexes are growing in size even if the amount of data stored in the database does not grow at all. If one wants to support transactions it is absolutely necessary not to overwrite data in case it is modified because one has to keep in mind that people might want to read an old row while it is modified or rollback a transaction.
Therefore bloat is an intrinsic thing related to MVCC in PostgreSQL. However, the way PostgreSQL stores data and handles transactions is not the only way a database can handle transactions and concurrency. Let us see which other options there are:
In MS SQL you will find a thing called tempdb, while Oracle and MySQL put old versions into the redo log. As you might know, PostgreSQL copies rows on UPDATE and stores them in the same table. Firebird also stores old row versions inline.
There are two main points I want to make here:
Getting rid of old rows is hard
No solution is without tradeoffs
Getting rid of rows is definitely an issue. In PostgreSQL removing old rows is usually done by VACUUM. However, in some cases VACUUM cannot keep up or space is growing for some other reasons (usually long transactions). We at CYBERTEC have blogged extensively about that scenario.
“No solution is without tradeoffs” is also an important aspect of storage. There is no such thing as a perfect storage engine – there are only storage engines serving a certain workload well. The same is true for PostgreSQL: The current table format is ideal for many workloads. However, there is also a dark side which leads us back to where we started: Table bloat. If you are running UPDATE-intense workloads it happens more often than not that the size of a table is hard to keep under control. This is especially true if developers and system administrators are not fully aware of the inner workings of PostgreSQL in the first place.
zheap: Keeping bloat under control
zheap is a way to keep table bloat under control by implementing a storage engine capable of running UPDATE-intense workloads a lot more efficiently. The project was originally started by EnterpriseDB, and a lot of effort has already been put into it.
To make zheap production-ready, we are proud to announce that our partners at Heroic Labs have committed to funding further development of zheap and releasing all code to the community. CYBERTEC has decided to double the amount of funding and to contribute additional expertise and manpower to move zheap forward. If there are people, companies, etc. who are also interested in helping move zheap forward, we are eager to team up with everybody willing to make this great technology succeed.
Let us take a look at the key design goals:
Perform UPDATE in place
Have smaller tables (smaller tuple headers, improved alignment)
Reduce writes as much as possible (avoid dirtying pages unless data is modified)
Reuse space more quickly
So let us see how those goals can be achieved in general.
The basic design of zheap
zheap is a completely new storage engine and it therefore makes sense to dive into the basic architecture. Three essential components have to work together:
zheap: The table format
undo: Handling transaction rollback, etc.
WAL: Protect critical writes
Let us take a look at the layout of a zheap page first. As you know PostgreSQL typically sees tables as a sequence of 8k blocks, so the layout of a page is definitely important:
At first glance this image looks almost like a standard PostgreSQL 8k page, but in fact it is not. The first thing you might notice is that tuples are stored in the same order as the item entries at the beginning of the page, to allow for faster scans. The next thing we see is the presence of "slot" entries at the end of the page. In a standard PostgreSQL table, visibility information is stored as part of each row, which needs a lot of space. In zheap, transaction information has been moved to the page level, which significantly reduces the size of the data (which in turn translates to better performance). A transaction slot occupies 16 bytes of storage and contains the following information: transaction id, epoch, and the latest undo record pointer of that transaction. A row points to a transaction slot. The default number of transaction slots per page is 4, which is usually fine for big tables. However, sometimes more transaction slots are needed. In this case, zheap has something called "TPD", which is nothing more than an overflow area to store additional transaction information as needed.
Here is the basic layout:
Sometimes many transaction slots are needed for a single page. TPD offers a flexible way to handle that. The question is: where does zheap store TPD data? The answer is: these special pages are interleaved with the standard data pages. They are simply marked in a special way to ensure that sequential scans won't touch them. zheap uses a meta page to keep track of these special-purpose pages:
TPD is simply a way to make transaction slots more scalable. Having some slots in the block itself reduces the need to touch extra pages. If more are needed, TPD is an elegant way out. In a way, it is the best of both worlds.
Transaction slots can be reused after a transaction ends.
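As a quick illustration of the 16-byte figure mentioned above, the three pieces of slot information pack exactly into 16 bytes. This is only a sketch; the field order shown here is assumed for illustration, not taken from zheap's actual source:

```python
import struct

# Assumed field layout for a transaction slot: 4-byte transaction id,
# 4-byte epoch, 8-byte undo record pointer = 16 bytes total.
TRANSACTION_SLOT = struct.Struct("<IIQ")

print(TRANSACTION_SLOT.size)  # 16

# The four default slots per page therefore cost only 64 bytes of an 8k page.
print(4 * TRANSACTION_SLOT.size)  # 64
```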
zheap: Tuple formats
The next important part of the puzzle is the layout of a single tuple: in PostgreSQL a standard heap tuple has a 20+ byte header, because all the transactional information is stored in the tuple. Not so in this case. All transactional information has been moved to page-level structures (transaction slots). This is super important: the header has therefore been reduced to merely 5 bytes. But there are more optimizations going on here: a standard tuple has to use CPU alignment (padding) between the tuple header and the real data in the row, which can burn some bytes for every single row in the table. zheap does not do that, leading to more tightly packed storage. Additional space is saved by removing the padding from pass-by-value data types. All these optimizations mean that we can save valuable space in every single row of the table.
Here is what a standard PostgreSQL tuple header looks like:
Now let us compare this to a zheap tuple:
As you can see a zheap tuple is a lot smaller than a normal heap tuple. As the transactional information has been unified in the transaction slot machinery, we don’t have to handle visibility on the row level anymore but can do it more efficiently on the page level.
By shrinking the storage footprint zheap will contribute to good performance.
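To put the header numbers into perspective, here is some back-of-the-envelope arithmetic. The 20 and 5 byte header figures come from the paragraphs above; the row count is an arbitrary assumption for illustration:

```python
HEAP_HEADER_BYTES = 20   # lower bound for a standard heap tuple header (per the text)
ZHEAP_HEADER_BYTES = 5   # zheap tuple header (per the text)
ROWS = 100_000_000       # assumed table size

saved_gb = ROWS * (HEAP_HEADER_BYTES - ZHEAP_HEADER_BYTES) / 1024**3
print(f"{saved_gb:.1f} GB saved on headers alone")  # ~1.4 GB
```

And that is before counting the alignment padding zheap also avoids.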
UNDO: Keeping things in order
One of the most important things when talking about zheap is the notion of “undo”. What is the purpose of this thing in the first place? Let us take a look and see: Consider the following operation:
BEGIN;
UPDATE tab SET x = 7 WHERE x = 5;
…
COMMIT / ROLLBACK;
To ensure that transactions can operate correctly UPDATE cannot just overwrite the old value and forget about it. There are two reasons for that: First of all, we want to support concurrency. Many users should be able to read data while it is modified. The second problem is that updating a row does not necessarily mean that it will be committed. Thus we need a way to handle ROLLBACK in a useful way. The classical PostgreSQL storage format will simply copy the row INSIDE standard heap which leads to all those bloat related issues we have discussed on our blog already.
The way zheap approaches things here is a bit different: In case a modification is made the system writes “undo” information to fix it in case the transaction has to be aborted for whatever reason. This is the fundamental concept applicable to INSERT, UPDATE, and DELETE. Let us go through those operations one by one and see how it works:
INSERT: Adding rows
In the case of INSERT, zheap has to allocate a transaction slot and then emit an undo entry to fix things on error. For INSERT, the TID is the most relevant piece of information needed by undo. Space can be reclaimed instantly after an INSERT has been rolled back, which is a major difference between zheap and standard heap tables in PostgreSQL.
UPDATE: Modifying data
An UPDATE statement is far more complicated: There are basically two cases:
The new row fits into the old space
The new row does not fit into the old space
In case the new row fits into the old space we can simply overwrite the old row and emit an undo entry holding the complete old row. In short: we hold the new row in zheap and a copy of the old row in undo, so that we can copy it back in case it is needed.
What happens if the new row does not fit in? In this case performance will be worse because zheap essentially has to perform a DELETE / INSERT operation which is of course not as efficient as an in-place UPDATE.
Space can instantly be reclaimed in the following cases:
When updating a row to a shorter version
When non-inplace UPDATEs are performed
DELETE: Removing rows
Finally there is DELETE. To handle the removal of a row zheap has to emit an undo record to put the old row back in place in case of ROLLBACK. The row has to be removed from the zheap during DELETE.
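The per-operation undo behavior described above can be modeled with a toy in-memory sketch. This is purely illustrative Python, not zheap's actual implementation, and the class and method names are invented: each statement records what is needed to restore the old state, and ROLLBACK replays the records in reverse.

```python
class ToyZHeap:
    """Toy model of zheap-style undo: rows live in one place, and each
    modification appends an undo record describing how to reverse it."""

    def __init__(self):
        self.rows = {}   # tid -> row value (the "zheap")
        self.undo = []   # undo log for the currently open transaction

    def insert(self, tid, row):
        self.rows[tid] = row
        self.undo.append(("remove", tid, None))   # INSERT undo only needs the TID

    def update(self, tid, row):
        self.undo.append(("restore", tid, self.rows[tid]))  # keep complete old row
        self.rows[tid] = row                                # in-place update

    def delete(self, tid):
        self.undo.append(("restore", tid, self.rows.pop(tid)))

    def commit(self):
        self.undo.clear()   # undo no longer needed once the transaction ends

    def rollback(self):
        for action, tid, old in reversed(self.undo):
            if action == "remove":
                del self.rows[tid]      # undo an INSERT
            else:
                self.rows[tid] = old    # undo an UPDATE or DELETE
        self.undo.clear()

t = ToyZHeap()
t.insert(1, "a")
t.commit()
t.update(1, "b")
t.insert(2, "c")
t.delete(1)
t.rollback()
print(t.rows)  # {1: 'a'} - the committed state is restored
```

The real system of course also has to WAL-log the undo writes and survive a crash mid-rollback, which is what the next section discusses.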
UNDO and ROLLBACK in action
Up to now we have spoken quite a bit about undo and rollback. However, let us dive a bit deeper and see how undo, rollback, and so on interact with each other.
In case a ROLLBACK happens the undo has to make sure that the old state of the table is restored. Thus the undo action we have scheduled before has to be executed. In case of errors the undo action is applied as part of a new transaction to ensure success.
Ideally, all undo action associated with a single page is applied at once to cut down on the amount of WAL that has to be written. A nice side effect of this strategy is also that we can reduce page-level locking to the absolute minimum which reduces contention and therefore helps contribute to good performance.
So far this sounds easy but let us consider an important use case: What happens in the event of a really long transaction? What happens if a terabyte of data has to be rolled back at once? End users certainly don’t appreciate never-ending rollbacks. It is also worth keeping in mind that we must also be prepared for a crash during rollback.
What happens is that if undo action is larger than a certain configurable threshold the job is done by a background worker process. This is a really elegant solution that helps to maintain a good end-user experience.
Undo itself can be removed in three cases:
as soon as there is no transaction left that can see the data
as soon as all undo actions have been completed
for committed transactions, once they have become all-visible
Let us take a look at a basic architecture diagram:
As you can see the process is quite sophisticated.
Indexing: A brief remark
To ensure that zheap is a drop-in replacement for the current heap it is important to keep the indexing code untouched. Zheap can work with PostgreSQL’s standard access methods. There is of course room to make things even more efficient. However, at this point no changes to the indexing code are needed. This also implies that all index types available in PostgreSQL are fully available without known restrictions.
Finally …
Currently zheap is still under development and we are glad for the contributions made by Heroic Labs to develop this technology further. So far we have already implemented logical decoding for zheap and added support for PostgreSQL. We will continue to allocate more resources to push the tool to make it production-ready.
If you want to read more about PostgreSQL and VACUUM right now consider checking our previous posts on the subject. In addition, we also want to invite you to keep visiting our blog on a regular basis to learn more about it and other interesting technologies.
Storing a user’s timezone in Postgres can be an interesting task in apps. Often, this is accomplished by generating a hardcoded list of every timezone in the backend (or worse the frontend JavaScript) and storing an item from that list in the database for each user. This hardcoded magic list is both difficult to build accurately and hard to maintain. Timezones change more frequently than you would imagine ...but less frequently than time itself.
Postgres has a system view called pg_timezone_names that holds the list of timezones Postgres uses internally. We can use this view to generate a dropdown or other input type for the user to select their timezone. Because this list is quite large by default, we can filter out some of the timezones in our query (like all the ones that start with "posix/").
select name from pg_timezone_names where name not like 'posix%' and name not ilike 'system%' order by name;
Now, we need to figure out the best way to store the selection from our user. We can do this by adding a column to our users table to hold the timezone. The column is a text type because we are only storing the name value from pg_timezone_names.
alter table users add column timezone text;
Next, we want to query the user's events in the timezone we have stored for the user. We store the start_time for each event in a UTC timestamptz column. This solution leverages the timezone function in Postgres; the equivalent AT TIME ZONE syntax also works.
SELECT events.name, timezone(users.timezone, start_time) FROM events JOIN users ON events.user_id = users.id JOIN pg_timezone_names ON users.timezone = pg_timezone_names.name;
This could be taken a step further by using the to_timestamp function in Postgres to format the dates, and keep the application from formatting dates at all.
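The same conversion can also be done application-side using the stored timezone name, since the names in pg_timezone_names (minus the posix/ ones we filtered out) are largely standard IANA identifiers. A minimal Python sketch, assuming the user's stored value is "America/Chicago":

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# start_time as it comes back from a UTC timestamptz column
start_time = datetime(2020, 10, 1, 12, 0, tzinfo=timezone.utc)

# users.timezone holds a name selected from pg_timezone_names
user_tz = "America/Chicago"
local = start_time.astimezone(ZoneInfo(user_tz))
print(local.isoformat())  # 2020-10-01T07:00:00-05:00
```

Doing it in the database keeps all clients consistent, but the app-side version is handy for things like email scheduling.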
Last week 2nd Quadrant was purchased by EDB.
While this is certainly good news for these companies, it can increase risks to the Postgres community. First, there is an
unwritten rule that the Postgres core team should not have over half of its members from a single company, and the acquisition causes EDB's representation in the
core team to be 60% — the core team is working on a solution for this.
Second, two companies becoming one reduces Postgres users' choice for support and services, especially in the North American and western European markets. Reduced vendor options often result in worse
customer service and less innovation. Since the Postgres community innovates independently, this might not be an issue for community software, but it could be for company-controlled tooling around
Postgres.
Third, there is the risk that an even larger company wanting to hurt Postgres could acquire EDB and take it in a direction that is neutral or negative for the Postgres community. Employee
non-compete agreements and the lack of other Postgres support companies could extend the duration of these effects. There isn't much the community can do to minimize these issues except to be alert
for problems.
A couple years ago (at the pgconf.eu 2014 in Madrid) I presented a talk called “Performance Archaeology” which showed how performance changed in recent PostgreSQL releases. I did that talk as I think the long-term view is interesting and may give us insights that may be very valuable. For people who actually work on PostgreSQL […]
My professional background has been in application development with a strong affinity for developing with PostgreSQL (which I hope comes through in previous articles). However, in many of my roles, I found myself as the "accidental" systems administrator, where I would troubleshoot issues in production and do my best to keep things running and safe.
When it came to monitoring my Postgres databases, I initially took what I knew about monitoring a web application itself, i.e. looking at CPU, memory, and network usage, and used that to try to detect issues. In many cases, it worked: for instance, I could see a CPU spike on a PostgreSQL database and deduce that there was a runaway query slowing down the system.
Over time, I learned about other types of metrics that would make it easier to triage and mitigate PostgreSQL issues. Combined with what I learned as an accidental systems administrator, I've found they make a powerful toolkit that even helps with application construction.
To help share these experiences, I set up a few PostgreSQL clusters with the PostgreSQL Operator monitoring stack. The Postgres Operator monitoring stack uses the Kubernetes support of pgMonitor to collect metrics about the deployment environment (e.g. Pods) and the specific PostgreSQL instances. I'll also add a slight Kubernetes twist, given there are some special considerations you need to make when monitoring Postgres on Kubernetes.
We'll start with my "go to" set of statistics, what I call "the vitals."
The Vital Statistics: CPU, Memory, Disk, and Network Utilization
One fairly common complaint about Postgres is that each connection uses too much memory. It is often made when comparing Postgres' connection model, where each connection has a dedicated process, to one where each connection is assigned a dedicated thread.
To be clear: this is a worthwhile discussion to have, and there are several important improvements we could make to reduce memory usage.
That said, I think one common cause of these concerns is that the easy ways to measure the memory usage of a Postgres backend, like top and ps, are quite misleading.
Watchdog is the high availability component of Pgpool-II. Over the past few releases watchdog has gotten a lot of attention from the Pgpool-II developer community and received lots of upgrades and stability improvements.
One of the weaker areas of Pgpool-II watchdog has been its configuration interface. A watchdog cluster requires quite a few config settings on each node, and it is very easy to get them wrong and hard to debug.
For example, in a three-node Pgpool-II cluster we normally need to configure the following parameters in each pgpool.conf:
# Local node identification (unique for each node)
wd_hostname
wd_port
wd_heartbeat_port
# Node #1 endpoint
other_pgpool_hostname0
other_pgpool_port0
other_wd_port0
# Node #2 endpoint
other_pgpool_hostname1
other_pgpool_port1
other_wd_port1
# Local device setting for sending heartbeat
heartbeat_device
#Node #1 heartbeat endpoint
heartbeat_destination0
heartbeat_destination_port0
#Node #2 heartbeat endpoint
heartbeat_destination1
heartbeat_destination_port1
The main issue here is not the number of parameters. The issue is that the value of almost every one of these parameters is different on each Pgpool-II node, which makes configuring, debugging, and adding or removing a watchdog node difficult to manage.
One pgpool.conf for every node
When we were discussing the features for Pgpool-II 4.2 last year, we decided to make ease of use one of the priorities for the next release. And since the most difficult configuration belongs to the watchdog area, we decided to fix that first.
In the upcoming 4.2 version, Pgpool-II uses a unified watchdog configuration, and unlike previous versions, the same pgpool.conf file can now be used on every Pgpool-II node.
For the same three-node Pgpool-II cluster, the configuration file now only needs these parameters set once, and the same pgpool.conf file can then be used on each node.
Once these parameters are configured, the next step is to specify the unique node id of each Pgpool-II node. For that purpose, Pgpool-II opted to use the same technique as other distributed software like ZooKeeper (the myid file): a separate single-line configuration file that sets the local node id.
Similar to the myid file, Pgpool-II uses a pgpool_node_id file. pgpool_node_id contains a single integer in human-readable ASCII text that represents the local Pgpool-II node id.
# create pgpool_node_id for node #1
echo 1 > etc/pgpool_node_id
Conclusion
Although from the look of it this seems like a small feature, it is a very useful one and makes watchdog cluster deployment a lot easier and less error-prone. On top of that, it is now much easier to add and remove Pgpool-II nodes from the watchdog cluster.
Pgpool-II 4.2 alpha was released just a few days ago, and GA is expected in around a month's time. So I thought I would blog about this feature, as upgrading to 4.2 requires keeping these configuration changes in mind: the watchdog needs to be reconfigured before proceeding with the upgrade. Upgrading to 4.2 without considering these changes would lead to downtime.
Muhammad Usama is a database architect / PostgreSQL consultant at HighGo Software and also Pgpool-II core committer. Usama has been involved with database development (PostgreSQL) since 2006, he is the core committer for open source middleware project Pgpool-II and has played a pivotal role in driving and enhancing the product. Prior to coming to open source development, Usama was doing software design and development with the main focus on system-level embedded development. After joining the EnterpriseDB, an Enterprise PostgreSQL’s company in 2006 he started his career in open source development specifically in PostgreSQL and Pgpool-II. He is a major contributor to the Pgpool-II project and has contributed to many performance and high availability related features.
One common challenge with Postgres for those of you who manage busy Postgres
databases, and those of you who foresee being in that situation, is that
Postgres does not handle large numbers of connections particularly well.
While it is possible to have a few thousand established connections without
running into problems, there are some real and hard-to-avoid problems.
Since
joining Microsoft
last year in the Azure Database for PostgreSQL
team—where I work on open source Postgres—I have spent a lot of
time analyzing and addressing some of the issues with connection scalability in
Postgres.
In this post I will explain why I think it is important to improve Postgres'
handling of large numbers of connections, followed by an analysis of the
different factors limiting connection scalability in Postgres.
In an upcoming post I will show the results of the work we’ve done to improve
connection handling and snapshot scalability in Postgres—and go into
detail about the identified issues and how we have addressed them in Postgres
14.
Why connection scalability in Postgres is important
In some cases problems around connection scalability are caused by
unfamiliarity with Postgres, broken applications, or other issues in the same
vein. And as I already mentioned, some applications can have a few thousand
established connections without running into any problems.
A frequent counter-claim to requests to improve Postgres' handling of large
connection counts is that there is nothing to address: that the desire/need to
handle large numbers of connections is misguided, caused by broken applications
or similar. This is often accompanied by references to the server only having a
limited number of CPU cores.
There certainly are cases where the best approach is to avoid large numbers of
connections, but there are, in my opinion, pretty clear reasons for
needing larger numbers of connections in Postgres. Here are the main ones:
Central state and spiky load require large numbers of connections: It is
common for a database to be the shared state for an application (leaving
non-durable caching services aside). Given the cost of establishing a new
database connection (TLS, latency, and Postgres costs, in that order) it is
obvious that applications need to maintain pools of Postgres connections that
are large enough to handle the inevitable minor spikes in incoming
requests. Often there are many servers running [web-]application code using
one centralized database.
To some degree this issue can be addressed using Postgres connection poolers like
PgBouncer or more recently
Odyssey. To actually reduce the number
of connections to the database server such poolers need to be used in
transaction (or statement)
pooling modes. However, doing so
precludes the use of many useful database features like
prepared statements, temporary
tables, …
Latency and result processing times lead to idle connections: Network
latency and application processing times will often result in individual
database connections being idle the majority of the time, even when the
applications are issuing database requests as fast as they can.
Common OLTP database workloads, and especially web applications, are heavily
biased towards reads. And with OLTP workloads, the majority of SQL queries
are simple enough to be processed well below the network latency between
application and database.
Additionally the application needs to process the results of the database
queries it sent. That often will involve substantial work (e.g. template
processing, communication with cache servers, …).
To drive this home, here is a simple experiment using
pgbench (a simple
benchmarking program that is part of Postgres). In a memory-resident,
read-only pgbench workload (executed on my workstation[1], 20/40
CPU cores/threads) I am comparing the achievable throughput across increasing
client counts between a non-delayed pgbench and a pgbench with simulated
delays. For the simulated delays, I used a 1ms network delay and a 1ms
processing delay. The non-delayed pgbench peaks around 48 clients, the
delayed run around 3000 connections. Even comparing on-machine TCP
connections to a 10GbE link between two physically close machines moves the
peak from around 48 connections closer to 500 connections.
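The exact pgbench scripts used for this experiment are not reproduced here, but the client-side delays can be approximated with a custom script file along these lines, where the `\sleep` stands in for the combined 1 ms network and 1 ms processing delay (run e.g. with `pgbench -n -f delay.sql -c <clients> -j <threads> -T 60`):

```
\set aid random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
\sleep 2ms
```

The `SELECT` mirrors pgbench's built-in select-only workload; only the added `\sleep` differs from an undelayed run.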
Scaling out to allow for higher connection counts can increase cost: Even
in cases where the application’s workload can be distributed over a number of
Postgres instances, the impact of latency combined with low maximum
connection limits will often result in low utilization of the database
servers, while exerting pressure to increase the number of database servers
to handle the required number of connections. That can increase the
operational costs substantially.
Surveying connection scalability issues
My goal in starting this project was to improve Postgres’ ability to handle
substantially larger numbers of connections. To do that—to pick the right
problem to solve—I first needed to understand which problems were most
important, otherwise it would have been easy to end up with micro-optimizations
without improving real-world workloads.
So my first software engineering task was to survey the different aspects of
connection scalability limitations in Postgres, specifically: memory usage,
snapshot scalability, and the connection model with its context switches (each
covered in a section below).
By the end of this deep dive into the connection scalability limitations in
Postgres, I hope you will understand why I concluded that snapshot scalability
should be addressed first.
Memory usage
There are three main aspects to the problems around memory usage of a large
number of connections:
Postgres, as many of you will know, uses a process-based connection model.
When a new connection is established, Postgres’ supervisor process creates a
dedicated process to handle that connection going forward. The use of a “full
blown process” over threads has some advantages, like increased
isolation/robustness, but also some disadvantages.
One common complaint is that each connection uses too much memory. That
complaint arises, at least partially, because it is surprisingly hard to
measure the increase in memory usage caused by an additional connection.
In a recent post about measuring the
memory overhead of a Postgres connection
I show that it is surprisingly hard to accurately measure the memory
overhead. And that in many workloads, and with the right configuration—most
importantly, using
huge_pages—the memory overhead of each connection is
below 2 MiB.
Conclusion: connection memory overhead is acceptable
When each connection only has an overhead of a few MiB, it is quite possible to
have thousands of established connections. It would obviously be good to use
less memory, but memory is not the primary issue around connection scalability.
Cache bloat
Another important aspect of memory-related connection scalability issues can be
that, over time, the memory usage of a connection increases, due to long-lived
resources. This is particularly an issue in workloads that utilize long-lived
connections combined with schema-based multi-tenancy.
Unless applications implement some form of connection <-> tenant association,
each connection over time will access all relations for all tenants. That leads
to Postgres’ internal catalog metadata caches growing beyond a reasonable size,
as currently (as of version 13) Postgres does not prune its metadata caches of
unchanging rarely-accessed contents.
Problem illustration
To demonstrate the issue of cache bloat, I created a simple test bed
with 100k tables, each with a few columns and a single index on a serial
primary key column[2]. It takes a while to create.
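The full test bed definition is not shown here, but it can be recreated with something along the following lines (the column list is a guess; the `foo_` naming matches the access script in the footnotes):

```sql
DO $$
BEGIN
    FOR i IN 1..100000 LOOP
        -- serial PRIMARY KEY gives each table its single index
        EXECUTE format(
            'CREATE TABLE foo_%s (id serial PRIMARY KEY, a int, b text)', i);
    END LOOP;
END;
$$;
```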
With the recently added pg_backend_memory_contexts view it is not too
difficult to see the aggregated memory usage of the various caches (although it
would be nice to see more of the different types of caches broken out into
their own memory contexts)[3].
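The aggregated numbers shown in the following tables can be obtained with a query roughly like this one (column names follow the pg_backend_memory_contexts view as added in PostgreSQL 14; the exact grouping used is an assumption):

```sql
SELECT name,
       parent,
       sum(total_bytes) AS size_bytes,
       pg_size_pretty(sum(total_bytes)) AS size_human,
       count(*) AS num_contexts
FROM pg_backend_memory_contexts
WHERE name IN ('CacheMemoryContext', 'index info', 'relation rules')
GROUP BY name, parent
ORDER BY size_bytes DESC;
```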
In a new Postgres connection, not much memory is used:
| name | parent | size_bytes | size_human | num_contexts |
|------|--------|-----------:|-----------:|-------------:|
| CacheMemoryContext | TopMemoryContext | 524288 | 512 kB | 1 |
| index info | CacheMemoryContext | 149504 | 146 kB | 80 |
| relation rules | CacheMemoryContext | 8192 | 8192 bytes | 1 |
But after forcing all Postgres tables we just created to be accessed[4], this
looks very different:
| name | parent | size_bytes | size_human | num_contexts |
|------|--------|-----------:|-----------:|-------------:|
| CacheMemoryContext | TopMemoryContext | 621805848 | 593 MB | 1 |
| index info | CacheMemoryContext | 102560768 | 98 MB | 100084 |
| relation rules | CacheMemoryContext | 8192 | 8192 bytes | 1 |
As the metadata cache for indexes is created in its own memory context,
num_contexts for the “index info” contexts nicely shows that we accessed the
100k tables (and some system internal ones).
Conclusion: cache bloat is not the major issue at this moment
A common solution for the cache bloat issue is to drop “old” connections from the
application connection pooler after a certain age. Many connection pooler
libraries/web frameworks support that.
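With PgBouncer, for example, the corresponding server-side knob is server_lifetime, which closes and replaces server connections once they reach a certain age; application-side poolers typically offer a similar max-lifetime setting. The value below is arbitrary:

```ini
[pgbouncer]
; close a server connection once it has been open this long (in seconds),
; so that a fresh backend (with empty catalog caches) replaces it
server_lifetime = 1800
```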
As there is a feasible workaround, and as cache bloat is only an issue
for databases with a lot of objects, cache bloat is not the major issue at the
moment (but worthy of improvement, obviously).
Query memory usage
The third aspect is that it is hard to limit the memory used by queries. The
work_mem
setting does not control the memory used by a query as a whole, but only that
of individual parts of a query (e.g. a sort, hash aggregation, or hash join).
That means that a query can end up requiring several times work_mem[5].
That means that one has to be careful setting work_mem in workloads requiring
a lot of connections. With larger work_mem settings, practically required for
analytics workloads, one can’t reasonably use a huge number of concurrent
connections and expect to never hit memory exhaustion related issues
(i.e. errors or the OOM killer).
Luckily most workloads requiring a lot of connections don’t need a high
work_mem setting, and it can be set on the user, database, connection, and
transaction level.
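For reference, those levels can be set roughly like this (the role and database names are made up):

```sql
-- per-user and per-database defaults
ALTER ROLE reporting SET work_mem = '256MB';
ALTER DATABASE analytics SET work_mem = '512MB';

-- per-connection (session) override
SET work_mem = '64MB';

-- per-transaction override, automatically reverted at COMMIT/ROLLBACK
BEGIN;
SET LOCAL work_mem = '1GB';
-- run the memory-intensive query here
COMMIT;
```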
Snapshot scalability
There are a lot of recommendations out there strongly advising against setting
max_connections
for Postgres to a high value, as high values can cause problems. In fact, I’ve
argued that myself many times.
But that is only half the truth.
Setting max_connections to a very high value alone only
leads, at best (worst?), to a very small slowdown in itself, and wastes some
memory. E.g. on my workstation[1] there is no measurable
performance difference for a read-only pgbench between max_connections=100
and a value as extreme as max_connections=100000 (for the same pgbench client
count, 48 in this case). However, the memory required for Postgres does
increase measurably with such an extreme setting: with shared_buffers=16GB,
max_connections=100 uses 16804 MiB and max_connections=100000 uses 21463 MiB
of shared memory. That is a large enough difference to potentially cause a
slowdown indirectly (although most of that memory will never be used, and
therefore not allocated by the OS in common configurations).
The real issue is that currently Postgres does not scale well to having a large
number of established connections, even if nearly all connections are idle.
To showcase this, I used two separate pgbench[6] runs. One of them just
establishes connections that are entirely idle (using a test file that just
contains \sleep 1s, causing a client-side sleep); the other runs a normal
pgbench read-only workload.
This is far from reproducing the worst possible version of the issue, as
normally the set of idle connections varies over time, which makes this issue
considerably worse. This version is much easier to reproduce however.
This is a very useful scenario to test, because it allows us to isolate the
cost of additional connections pretty well. Especially when the count
of active connections is low, the system CPU usage is quite low. If there is a
slowdown when the number of idle connections increases, it is clearly related
to the number of idle connections.
If we instead measured the throughput with a high number of active connections,
it’d be harder to pinpoint whether e.g. the increase in context switches or
lack of CPU cycles is to blame for slowdowns.
[Chart: Throughput of one active connection in the presence of a variable number of idle connections]
[Chart: Throughput of 48 active connections in the presence of a variable number of idle connections]
These results[7] clearly show that the
achievable throughput of active connections decreases significantly when the
number of idle connections increases.
In reality “idle” connections are not entirely idle, but send queries at a
lower rate. To simulate that, I used the script below to make clients only
occasionally send queries:

\sleep 100ms
SELECT 1;
[Chart: Throughput of one active connection in the presence of a variable number of mostly-idle connections]
[Chart: Throughput of 48 active connections in the presence of a variable number of mostly-idle connections]
The results[8] show that this slightly
more realistic scenario slows down the active connections even more.
Cause
Together these results very clearly show that there is a significant issue
handling large connection counts, even when CPU/memory are plentiful. The fact
that a single active connection slows down by more than 2x due to concurrent
idle connections points to a very clear issue.
A CPU profile quickly pinpoints the part of Postgres responsible:
[Profile: one active connection running read-only pgbench concurrently with 5000 idle connections; the bottleneck is clearly in GetSnapshotData()]
Obviously the bottleneck is entirely in the GetSnapshotData() function. That
function performs the bulk of the work necessary to provide readers with
transaction isolation.
GetSnapshotData() builds so called “snapshots” that describe which effects of
concurrent transactions are visible to a transaction, and which are not. These
snapshots are built very frequently (at least once per transaction, very
commonly more often).
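Such a snapshot can even be inspected from SQL. Since PostgreSQL 13, pg_current_snapshot() (txid_current_snapshot() in earlier releases) returns the xmin:xmax:xip_list form of the current snapshot:

```sql
SELECT pg_current_snapshot();
-- returns something like 748:751:748,750 meaning: transactions with
-- xid < 748 have completed, xids 748 and 750 are still in progress, and
-- xids >= 751 had not yet started when the snapshot was taken
```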
Even without knowing its implementation, it does make some intuitive sense (at
least I think so, but I also know what it does) that such a task gets more
expensive the more connections/transactions need to be handled.
Conclusion: Snapshot scalability is a significant limit
A large number of connections clearly reduces the efficiency of other
connections, even when they are idle (which, as explained above, is very
common). Except for reducing the number of concurrent connections and issuing
fewer queries, there is no real workaround for the snapshot scalability issue.
Connection model & context switches
As mentioned above, Postgres uses a
one-process-per-connection model. That works well in a lot of cases, but is a
limiting factor for dealing with 10s to 100s of thousands of connections.
Whenever a query is received by a backend process, the kernel needs to perform
a context switch to that process. That is not cheap. But more importantly, once
the result for the query has been computed, the backend will commonly be idle
for a while—the query result has to traverse the network, be received and
processed by the application, before the application sends a new query. That
means that on a busy server another process/backend/connection will need to be
scheduled—another context switch (a cross-process context switch is more
expensive than a switch between a process and the kernel within the same
process, e.g. as part of a syscall).
Note that switching to a one-thread-per-connection model does not address this
issue to a meaningful degree: while some of the context switches may get cheaper,
context switches still are the major limit. There are reasons to consider
switching to threads, but connection scalability itself is not a major one
(without additional architectural changes, some of which may be easier using
threads).
To handle huge numbers of connections, a different type of connection model is
needed. Instead of using a process/thread-per-connection model, a fixed/limited
number of processes/threads needs to handle all connections. By waiting for
incoming queries on many connections at once, and then processing many queries
without being interrupted by the OS CPU scheduler, efficiency can be improved
very significantly.
This is not a brilliant insight by me. Architectures like this are in wide use,
and have widely been discussed. See e.g. the
C10k problem, coined in 1999.
Besides avoiding context switches, there are many other performance benefits
that can be gained. E.g. on higher core count machines, a lot of performance
can be gained by increasing locality of shared memory, e.g. by binding specific
processes/threads and regions of memory to specific CPU cores.
However, changing Postgres to support a different kind of connection model like this is a huge
undertaking. That does not just require carefully separating many dependencies
between processes and connections, but also user-land scheduling between
different queries, support for asynchronous IO, likely a different query
execution model (to avoid needing a separate stack for each query), and much
more.
Conclusion: Start by improving snapshot scalability in Postgres
In my opinion, the memory usage issues are not as severe as the other issues
discussed. Partially because the memory overhead of connections is smaller
than it initially appears, and partially because issues like Postgres’ caches
using too much memory can be worked around reasonably.
We could, and should, make improvements around memory usage in Postgres,
and there is some low-hanging fruit there. But I don’t think, as things
currently stand, that improving memory usage would, on its own, change the
picture around connection scalability, at least not on a fundamental level.
In contrast, there is no good way to work around the snapshot scalability
issues. Reducing the number of established connections significantly is often
not feasible, as explained above. There aren’t really any other workarounds.
Additionally, as the snapshot scalability issue is very localized, it is quite
feasible to tackle it. There are no fundamental paradigm shifts necessary.
Lastly, there is the aspect of wanting to handle many tens of thousands of
connections, likely by entirely switching the connection model. As outlined,
that is a huge project/fundamental paradigm shift. That doesn’t mean it
should not be tackled, obviously.
Addressing the snapshot scalability issue first thus seems worthwhile,
promising significant benefits on its own.
But there’s also a more fundamental reason for tackling snapshot scalability
first: While e.g. addressing some memory usage issues at the same time,
switching the connection model would not at all address the snapshot issue. We
would obviously still need to provide isolation between the connections, even
if a connection wouldn’t have a dedicated process anymore.
Hopefully now you understand why I chose to focus on Postgres snapshot scalability
first. More about that in my next blog post.
DO $$
DECLARE
    cnt int := 0;
    v record;
BEGIN
    FOR v IN SELECT * FROM pg_class WHERE relkind = 'r' AND relname LIKE 'foo%' LOOP
        EXECUTE format('SELECT count(*) FROM %s', v.oid::regclass::text);
        cnt = cnt + 1;
        IF cnt % 100 = 0 THEN COMMIT; END IF;
    END LOOP;
    RAISE NOTICE 'tables %', cnt;
END;
$$; ↩︎
Even worse, there can also be several queries in progress
at the same time, e.g. due to the use of cursors. It is however not common
to concurrently use many cursors. ↩︎
This is with pgbench modified to wait until all connections are
established. Without that pgbench modification, sometimes a subset of
clients may not be able to connect, particularly before the fixes described
in this article. See this mailing list post for details. ↩︎