The pgpool II community is gearing up to release the alpha version of its next major release, pgpool II 4.2. It is going to be another exciting release of pgpool II, a middleware product that provides mission-critical functionality such as load balancing, high availability, and connection pooling for PostgreSQL servers. We have already written in detail about some of the major features of pgpool II 4.2, such as LDAP authentication support and snapshot isolation mode; the purpose of this blog is to provide a brief description of all the major features in the 4.2 release.
The focus of the last couple of major pgpool II releases was primarily performance and high availability. The focus of this release is security and improving the user experience by extending pgpool II's functionality. One major feature that did not make it into this release due to resource constraints was a GUI interface for configuring, managing, and monitoring a pgpool II cluster. This is a much-needed feature for improving the user experience of pgpool II and making it easy to configure and deploy a pgpool II cluster. Some of the infrastructure needed to support it, such as improved statistics, was added (discussed later in this blog), but the GUI interface itself had to wait.
Below is a summary of most of the major and minor features added in the pgpool II 4.2 release:
Logging Collector
Similar to community PostgreSQL, a logging_collector parameter that accepts a boolean value has been added to pgpool II. The logging collector is a background process that captures log messages sent to stderr and redirects them into log files. Please note that this parameter can only be set at pgpool II start.
log_disconnections (boolean)
This is another parameter added to pgpool II that is analogous to its PostgreSQL counterpart; log_disconnections takes a boolean value. The purpose of this parameter is to log all client terminations with pgpool II to the log destination.
Please note that this parameter can be changed by reloading the pgpool II configuration file.
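A minimal pgpool.conf sketch enabling both logging parameters might look like the following; the log_directory and log_filename values are illustrative, not prescribed by the release:

```
# Capture stderr log messages into rotating log files
logging_collector = on
log_directory = '/var/log/pgpool'
log_filename = 'pgpool-%Y-%m-%d_%H%M%S.log'

# Log every client disconnection
log_disconnections = on
```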
Health Check Improvements
Pgpool-II periodically connects to the configured PostgreSQL servers in order to detect errors or faults on the server or the network. This procedure of periodically checking the state of the server or network is called health check. Please note that health check requires an extra connection, so the user needs to adjust the max_connections parameter of PostgreSQL accordingly.
SHOW POOL_HEALTH_CHECK_STATS
Pgpool-II 4.2 provides the "SHOW POOL_HEALTH_CHECK_STATS" command to show health check statistics; the statistics shown by this command are collected by the process that performs the health checking.
This command is really helpful for a system administrator when diagnosing faults and failures. For example, the admin can easily locate a failover event in the log file by looking at the "last_failed_health_check" column. Another example is finding an unstable connection to a backend by evaluating the "average_retry_count" column; if a particular node shows a higher retry count than the other nodes, there may be a problem with the connection to that node.
Please refer to the link for details on the statistical information shown by the "SHOW POOL_HEALTH_CHECK_STATS" command.
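A sketch of what a session might look like; the values below are illustrative, and the real command returns more columns than shown here:

```
postgres=# SHOW POOL_HEALTH_CHECK_STATS;
 node_id | hostname | port | status | role    | fail_count | average_retry_count | last_failed_health_check
---------+----------+------+--------+---------+------------+---------------------+--------------------------
 0       | pg1      | 5432 | up     | primary | 0          | 0.00                |
 1       | pg2      | 5432 | up     | standby | 2          | 1.20                | 2020-10-01 11:20:34
```

A non-empty last_failed_health_check on node 1 would point the admin at that node's log around that timestamp.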
SHOW POOL_BACKEND_STATS
This is another very useful command for displaying pgpool II backend statistics. The command displays the node id, hostname, port, status, role, and the counts of the following types of queries issued to each backend:
SELECT
INSERT
UPDATE
DELETE
DDL
Other queries
This command is really useful for understanding the type of traffic sent to each backend server; please visit the link below for details on this command.
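A sketch of the output; the column names follow the query types listed above, and the counts are illustrative:

```
postgres=# SHOW POOL_BACKEND_STATS;
 node_id | hostname | port | status | role    | select_cnt | insert_cnt | update_cnt | delete_cnt | ddl_cnt | other_cnt
---------+----------+------+--------+---------+------------+------------+------------+------------+---------+-----------
 0       | pg1      | 5432 | up     | primary | 120        | 10         | 6          | 1          | 0       | 20
 1       | pg2      | 5432 | up     | standby | 340        | 0          | 0          | 0          | 0       | 15
```

With load balancing enabled, read traffic (select_cnt) spreads across nodes while writes land only on the primary, which this output makes immediately visible.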
LDAP Authentication Support
This is a major feature added to pgpool II 4.2: it provides LDAP authentication between the client and the pgpool II server. This was a much-awaited feature, since support for LDAP connectivity between pgpool II and the backend server was already there. With the addition of this feature in pgpool II 4.2, the user can get end-to-end LDAP connectivity from the client through pgpool II to the backend PostgreSQL server, using the same LDAP server throughout.
I have written a detailed blog on how to get LDAP connectivity working with pgpool II; it delves into how to get the LDAP server set up and configured and how to get it working with a pgpool II setup. The blog is really helpful for users trying to get LDAP connectivity working with pgpool II.
pcp_reload_config
This is a minor but very useful feature to reload the configuration file on the local pgpool II node or on all pgpool II nodes. The pcp_reload_config command takes a --scope (or -s) command line switch; passing c to --scope reloads the configuration files on all pgpool II cluster nodes, while passing l only reloads the configuration file on the local pgpool II node.
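A sketch of the two invocations; the host, port, and PCP user below are illustrative and should be adjusted to your own pcp setup:

```
# Reload the configuration on every node in the cluster
pcp_reload_config -h localhost -p 9898 -U pgpool -s c

# Reload only the local node's configuration
pcp_reload_config -h localhost -p 9898 -U pgpool -s l
```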
Snapshot Isolation Mode
This is a major and complex feature added in pgpool II 4.2; it is really critical for a distributed system where transactions span multiple servers. The scale-out solutions being implemented in community PostgreSQL can also learn from this feature implemented in the pgpool middleware.
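Assuming the clustering-mode parameter introduced in the 4.2 configuration (backend_clustering_mode), enabling snapshot isolation mode in pgpool.conf might look like this:

```
# Run pgpool II in native replication mode with snapshot isolation,
# so concurrent transactions see a consistent snapshot across all backends.
backend_clustering_mode = 'snapshot_isolation'
```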
PostgreSQL 13 Parser Support
Every major pgpool II release is made compatible with the parser of the latest PostgreSQL release, and pgpool II 4.2 is compatible with the PostgreSQL 13 parser. Any new grammar rules or features added in the PostgreSQL 13 parser will be recognized by pgpool II 4.2; this means that 4.2 will recognize the new keywords added in PG-13 and deal with them accordingly.
Conclusion
I have given a brief introduction to most of the major and minor features added in the pgpool II 4.2 release; links are provided where necessary to give more details about some of the features.
It is clearly evident that pgpool II has come a long way in terms of functionality, security, and stability in the last few years. It is becoming the middleware of choice for PostgreSQL.
The next major release of pgpool II after 4.2 will focus on ease of use and on providing a graphical user interface that makes the configuration, management, and monitoring of Pgpool-II easy.
Ahsan Hadi is a VP of Development with HighGo Software Inc. Prior to coming to HighGo Software, Ahsan worked at EnterpriseDB for 15 years, most recently as a Senior Director of Product Development. The flagship product of EnterpriseDB is Postgres Plus Advanced Server, which is based on open-source PostgreSQL. Ahsan has vast experience with Postgres and led the development team at EnterpriseDB that built the Oracle-compatible layer of EDB's Postgres Plus Advanced Server. Ahsan has also spent a number of years working with the development team adding horizontal scalability and sharding to Postgres. Initially, he worked with Postgres-XC, a multi-master sharded cluster, and later managed the development of adding horizontal scalability/sharding to Postgres. Ahsan has also worked a great deal with Postgres foreign data wrapper technology and has worked on developing and maintaining FDWs for several SQL and NoSQL databases such as MongoDB, Hadoop, and MySQL.
Prior to EnterpriseDB, Ahsan worked for Fusion Technologies as a Senior Project Manager. Fusion Tech was a US-based consultancy company; Ahsan led the team that developed a Java-based job factory responsible for placing items on shelves at big stores like Walmart. Prior to Fusion Technologies, Ahsan worked at British Telecom as an Analyst/Programmer and developed web-based database applications for network fault monitoring.
Ahsan joined HighGo Software Inc (Canada) in April 2019 and is leading development teams based in multiple geos; the primary responsibility is community-based Postgres development as well as developing the HighGo Postgres server.
PostgreSQL 13 is released with some cool features, such as index enhancements, partition enhancements, and many others. Along with these enhancements, there are some security-related enhancements that require some explanation. There are two major ones: one is related to libpq and the other is related to postgres_fdw. As postgres_fdw is considered a "reference implementation" for other foreign data wrappers, all other foreign data wrappers follow in its footsteps in development. It is a community-supported foreign data wrapper. This blog will explain the security changes in postgres_fdw.
1 – Superusers can permit non-superusers to establish a password-less connection with postgres_fdw
Previously, only a superuser could establish a password-less connection with PostgreSQL using postgres_fdw; no other password-less authentication was allowed. It had been observed that in some cases no password is required, so that limitation does not always make sense. Therefore, PostgreSQL 13 introduces a new option (password_required) through which superusers can permit non-superusers to use a password-less connection with postgres_fdw.
postgres=# CREATE EXTENSION postgres_fdw;
CREATE EXTENSION
postgres=# CREATE SERVER postgres_svr FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname 'postgres');
CREATE SERVER
postgres=# CREATE FOREIGN TABLE foo_for(a INT) SERVER postgres_svr OPTIONS(table_name 'foo');
CREATE FOREIGN TABLE
postgres=# create user MAPPING FOR vagrant SERVER postgres_svr;
CREATE USER MAPPING
postgres=# SELECT * FROM foo_for;
a
---
1
2
3
(3 rows)
When we perform the same query as a non-superuser, we will get this error message:
ERROR: password is required
DETAIL: Non-superusers must provide a password in the user mapping.
postgres=# CREATE USER nonsup;
CREATE ROLE
postgres=# create user MAPPING FOR nonsup SERVER postgres_svr;
CREATE USER MAPPING
postgres=# grant ALL ON foo_for TO nonsup;
GRANT
vagrant@vagrant:/work/data$ psql postgres -U nonsup;
psql (13.0)
Type "help" for help.
postgres=> SELECT * FROM foo_for;
2020-09-28 13:00:02.798 UTC [16702] ERROR: password is required
2020-09-28 13:00:02.798 UTC [16702] DETAIL: Non-superusers must provide a password in the user mapping.
2020-09-28 13:00:02.798 UTC [16702] STATEMENT: SELECT * FROM foo_for;
ERROR: password is required
DETAIL: Non-superusers must provide a password in the user mapping.
Now perform the same query as the non-superuser after setting the new parameter password_required to 'false' while creating the user mapping.
vagrant@vagrant:/work/data$ psql postgres
psql (13.0)
Type "help" for help.
postgres=# DROP USER MAPPING FOR nonsup SERVER postgres_svr;
DROP USER MAPPING
postgres=# CREATE USER MAPPING FOR nonsup SERVER postgres_svr OPTIONS(password_required 'false');
CREATE USER MAPPING
vagrant@vagrant:/work/data$ psql postgres -U nonsup;
psql (13.0)
Type "help" for help.
postgres=> SELECT * FROM foo_for;
a
---
1
2
3
(3 rows)
2 – Authentication via an SSL certificate
A new option is provided to use an SSL certificate for authentication in postgres_fdw. Two new options, sslkey and sslcert, have been added for this purpose.
Step 1: Generate the server key
vagrant@vagrant$ openssl genrsa -des3 -out server.key 1024
Generating RSA private key, 1024 bit long modulus (2 primes)
.+++++
..................+++++
e is 65537 (0x010001)
Enter pass phrase for server.key:
Verifying - Enter pass phrase for server.key:
vagrant@vagrant$ openssl rsa -in server.key -out server.key
Enter pass phrase for server.key:
writing RSA key
Step 2: Change the mode of the server.key
vagrant@vagrant$ chmod og-rwx server.key
Step 3: Generate the certificate
vagrant@vagrant$ openssl req -new -key server.key -days 3650 -out server.crt -x509
-----
Country Name (2 letter code) [AU]:PK
State or Province Name (full name) [Some-State]:ISB
Locality Name (eg, city) []:Islamabad
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Percona
Organizational Unit Name (eg, section) []:Dev
Common Name (e.g. server FQDN or YOUR name) []:localhost
Email Address []:ibrar.ahmad@gmail.com
vagrant@vagrant$ cp server.crt root.crt
Now we need to generate the client certificate.
Step 4: Generate a Client key
vagrant@vagrant$ openssl genrsa -des3 -out /tmp/postgresql.key 1024
Generating RSA private key, 1024 bit long modulus (2 primes)
..........................+++++
.....................................................+++++
e is 65537 (0x010001)
Enter pass phrase for /tmp/postgresql.key:
Verifying - Enter pass phrase for /tmp/postgresql.key:
vagrant@vagrant$ openssl rsa -in /tmp/postgresql.key -out /tmp/postgresql.key
Enter pass phrase for /tmp/postgresql.key:
writing RSA key
vagrant@vagrant$ openssl req -new -key /tmp/postgresql.key -out
-----
Country Name (2 letter code) [AU]:PK
State or Province Name (full name) [Some-State]:ISB
Locality Name (eg, city) []:Islamabad
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Percona
Organizational Unit Name (eg, section) []:Dev
Common Name (e.g. server FQDN or YOUR name) []:127.0.0.1
Email Address []:ibrar.ahmad@gmail.com
Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:pakistan
An optional company name []:Percona
Now we are ready, and we can create a foreign server in PostgreSQL with certificates.
postgres=# CREATE server postgres_ssl_svr foreign data wrapper postgres_fdw options (dbname 'postgres', host 'localhost', port '5555', sslcert '/tmp/postgresql.crt', sslkey '/tmp/postgresql.key', sslrootcert '/tmp/root.crt');
CREATE SERVER
postgres=# create user MAPPING FOR vagrant SERVER postgres_ssl_svr;
CREATE USER MAPPING
postgres=# create foreign table foo_ssl_for(a int) server postgres_ssl_svr options(table_name 'foo');
CREATE FOREIGN TABLE
Now we are ready and set to query a foreign table by postgres_fdw using certificate authentication.
postgres=# select * from foo_ssl_for;
a
---
1
2
3
(3 rows)
Note: Only superusers can modify the sslcert and sslkey user mapping options.
Unlike the indexes we have already become acquainted with, the idea of BRIN is to avoid looking through definitely unsuitable rows rather than to quickly find the matching ones. It is always an inexact index: it does not contain TIDs of table rows at all.
Put simply, BRIN works fine for columns whose values correlate with their physical location in the table. In other words, it fits when a query without an ORDER BY clause returns the column values virtually in increasing or decreasing order (and there are no indexes on that column).
This access method was created in the scope of Axle, the European project for extremely large analytical databases, with an eye on tables that are several terabytes or dozens of terabytes in size. An important feature of BRIN that enables us to create indexes on such tables is its small size and the minimal overhead cost of maintenance.
This works as follows. The table is split into ranges that are several pages (or several blocks, which is the same thing) large - hence the name: Block Range Index, BRIN. The index stores summary information on the data in each range. As a rule, this is the minimal and maximal values, but it can be different, as shown further on. Assume a query is performed that contains a condition on a column: if the sought values do not fall into the stored interval, the whole range can be skipped; but if they do, all rows in all blocks of the range will have to be looked through to choose the matching ones.
It would not be a mistake to treat BRIN not as an index, but as an accelerator of sequential scans. We can regard BRIN as an alternative to partitioning if we consider each range as a "virtual" partition.
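As a minimal sketch of the idea above (table and index names are illustrative), a BRIN index on an append-only, timestamp-ordered table would be created like this:

```sql
-- Append-only log table: created_at grows with the physical row order,
-- which is exactly the correlation BRIN relies on.
CREATE TABLE events (
    id         bigserial PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now(),
    payload    text
);

-- The default summary granularity is 128 pages per range; a smaller
-- pages_per_range gives finer range pruning at the cost of a larger index.
CREATE INDEX events_created_at_brin
    ON events USING brin (created_at)
    WITH (pages_per_range = 32);
```

A range query such as `WHERE created_at BETWEEN ... AND ...` can then skip every 32-page block range whose stored min/max summary does not overlap the interval.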
Now let's discuss the structure of the index in more detail.
This article sets out to compare PostGIS in Rails with Geocoder and to highlight a couple of the areas where you'll want to (or need to) reach for one over the other. I also present some of the terminology and libraries that I found along the way of working on this project and article as I set out to better understand PostGIS and how it is integrated with Rails.
There have been many big features added to PostgreSQL 13, like parallel vacuum, B-tree index deduplication, etc.; a complete list can be found in the PostgreSQL 13 release notes. Along with the big features, there are also small ones, including dropdb --force.
dropdb --force
A new command-line option has been added to the dropdb command, and a similar SQL option, FORCE, has been added to DROP DATABASE. When the option -f or --force is used with the dropdb command, or FORCE with DROP DATABASE, all existing connections to the database are terminated before it is dropped.
In the first terminal, create a database named test and connect to it.
vagrant@vagrant:~$ createdb test;
vagrant@vagrant:~$ psql test
psql (13.0)
Type "help" for help.
In the second terminal, try to drop the test database and you will get the error message that the test database is being used by another user.
vagrant@vagrant:/usr/local/pgsql.13/bin$ psql postgres
psql (13.0)
Type "help" for help.
postgres=# drop database test;
ERROR: database "test" is being accessed by other users
DETAIL: There is 1 other session using the database.
Now try the same command with the FORCE option. You will see that the database is dropped successfully.
postgres=# drop database test WITH ( FORCE );
DROP DATABASE
Note: you can also use the command line dropdb test -f.
The session on the first terminal will be terminated.
test=# \d
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>
WAL is short for Write-Ahead Log. Any change to the data is first recorded in a WAL file. WAL files are mainly used by an RDBMS as a way to achieve durability and consistency while writing data to storage systems.
Before we move forward, let's first see why we need WAL archiving and Point-in-Time Recovery (PITR). What if you have accidentally dropped some tables or deleted some data? How do you recover from such mistakes? WAL archiving and PITR are the answer. A WAL file can be replayed on a server to recreate the recorded changes on that server, hence we can use the WALs to recover from such dangerous situations. PITR, in turn, is a way to stop the replay of WALs at a specified point and have a consistent snapshot of the data at that time, i.e. just before the table was dropped or the data was removed.
How to Perform WAL Archiving
Normally, PostgreSQL keeps the WAL files in the pg_wal directory of $PGDATA. However, these WAL files may get recycled and can be deleted/overwritten by the server. To avoid such scenarios, we keep a copy of the WAL files in a separate directory outside $PGDATA. For that purpose, the PostgreSQL server provides a way to copy each WAL file to a different location as soon as it is generated. This mechanism depends on three configuration parameters, namely archive_mode, archive_command, and wal_level. These options can be set in the $PGDATA/postgresql.conf configuration file.
Archiving Options
PostgreSQL server provides us with some options through which we can control the WAL archiving. Let’s see what these options are and how to use them.
archive_mode signifies whether we want to enable the WAL archiving. It can accept the following values:
on – to enable the archiving
off – disable the archiving
always – normally this option is the same as ‘on’. This enables archiving for a standby server as well. If the standby shares the same path with another server, it may lead to WAL file corruption. So care must be taken in this case.
archive_command specifies how and where to archive (copy) the WAL files. This option accepts a shell command or a shell script, which is executed whenever the server generates a WAL file to archive. The command accepts the following placeholders:
%f – if present it’s replaced with the filename of the WAL file.
%p – if present it is replaced with the pathname of the WAL file.
%% – is replaced with ‘%’
wal_level is another important option. In PostgreSQL version 10+, it defaults to 'replica'; prior to that version it was set to 'minimal' by default. wal_level accepts the following values:
minimal – records only the information required to recover from a crash or immediate shutdown. It is not usable for replication or archiving purposes.
replica – signifies that the WAL will have enough information for WAL archiving and replication.
logical – adds the information required for logical replication.
An example of a WAL archiving setup:
vim $PGDATA/postgresql.conf
archive_mode = on
archive_command = 'cp %p /path/to/archive_dir/%f'
wal_level = replica
Point In Time Recovery
In PostgreSQL, PITR is a way to stop the replay of WAL files at an appropriate point in time. There can be many WAL files in the archive, but we may not want to replay all of them; replaying all WALs would land us back in the same state in which we had made the mistake. There are two important prerequisites for PITR to work.
Availability of a full base backup (usually taken with pg_basebackup)
WAL files (WAL archive)
To achieve PITR, the first step is to restore a previously taken base backup and then create a recovery setup. The setup requires configuring the restore_command and recovery_target options.
restore_command specifies from where to look up the WAL files to replay on this server. This command accepts the same placeholders as archive_command.
recovery_target_time tells the server when to stop the recovery or replay process. The process will stop as soon as the given timestamp is reached.
recovery_target_inclusive controls whether to stop the replay of WALs just after the recovery_target_time is reached (if set to true) or just before it (if set to false).
An example of PITR recovery options:
vim $BACKUP/postgresql.conf
restore_command = 'cp /path/to/archive_dir/%f %p'
recovery_target_time = ''
recovery_target_inclusive = false
Demo
Let’s combine all of the above in a practical demonstration and see how this all works.
# Start server
./pg_ctl -D $PGDATA start
# take a base backup
./pg_basebackup -D $BACKUP -Fp
# connect and put some data
./psql postgres
postgres=# create table foo(c1 int, c2 timestamp default current_timestamp);
CREATE TABLE
postgres=# insert into foo select generate_series(1, 1000000), clock_timestamp();
INSERT 0 1000000
postgres=# select current_timestamp;
current_timestamp
-------------------------------
2020-10-01 18:01:18.157764+05
(1 row)
postgres=# delete from foo;
DELETE 1000000
postgres=# select current_timestamp;
current_timestamp
-------------------------------
2020-10-01 18:01:36.272033+05
(1 row)
Let’s stop the server and create a recovery setup on the backup to stop before the deletion of data occurred.
./pg_ctl -D $PGDATA stop
# tell this cluster to start on recovery mode.
touch $BACKUP/recovery.signal
# edit configuration file to setup recovery options on the backup cluster.
vim $BACKUP/postgresql.conf
# Recovery Options
restore_command = 'cp $HOME/wal_archive/%f %p'
recovery_target_time = '2020-10-01 18:01:18.157764+05'
recovery_target_inclusive = false
# let’s start the backup cluster and start the recovery process.
./pg_ctl -D $BACKUP start
2020-10-01 18:03:17.365 PKT [71219] LOG: listening on IPv4 address "127.0.0.1", port 5432
2020-10-01 18:03:17.366 PKT [71219] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-10-01 18:03:17.371 PKT [71220] LOG: database system was interrupted; last known up at 2020-10-01 18:00:45 PKT
2020-10-01 18:03:17.413 PKT [71220] LOG: starting point-in-time recovery to 2020-10-01 18:01:18.157764+05
2020-10-01 18:03:17.441 PKT [71220] LOG: restored log file "000000010000000000000002" from archive
2020-10-01 18:03:17.463 PKT [71220] LOG: redo starts at 0/2000028
2020-10-01 18:03:17.463 PKT [71220] LOG: consistent recovery state reached at 0/2000100
2020-10-01 18:03:17.463 PKT [71219] LOG: database system is ready to accept read only connections
2020-10-01 18:03:17.501 PKT [71220] LOG: restored log file "000000010000000000000003" from archive
2020-10-01 18:03:18.182 PKT [71220] LOG: restored log file "000000010000000000000004" from archive
2020-10-01 18:03:18.851 PKT [71220] LOG: restored log file "000000010000000000000005" from archive
2020-10-01 18:03:19.539 PKT [71220] LOG: restored log file "000000010000000000000006" from archive
2020-10-01 18:03:20.195 PKT [71220] LOG: restored log file "000000010000000000000007" from archive
2020-10-01 18:03:20.901 PKT [71220] LOG: restored log file "000000010000000000000008" from archive
2020-10-01 18:03:21.574 PKT [71220] LOG: restored log file "000000010000000000000009" from archive
2020-10-01 18:03:22.286 PKT [71220] LOG: restored log file "00000001000000000000000A" from archive
2020-10-01 18:03:22.697 PKT [71220] LOG: recovery stopping before commit of transaction 508, time 2020-10-01 18:01:32.304189+05
2020-10-01 18:03:22.697 PKT [71220] LOG: pausing at the end of recovery
2020-10-01 18:03:22.697 PKT [71220] HINT: Execute pg_wal_replay_resume() to promote.
Connect to this cluster and see whether we still have the data in the foo table:
postgres=# select count(*) from foo;
count
---------
1000000
(1 row)
Since the recovery stopped just before the delete, we still have the data!
Conclusion
A server crash may result in an invalid or corrupt data directory, leading to downtime; that cannot always be avoided, but the resulting data loss can be. PITR is a critical process for recovering important data in such cases.
Keeping an up-to-date full backup along with the WAL files makes for a simple and convenient restoration process. The recovery time will depend on how recent the full backup is and how many WAL files were generated after it.
Asif Rehman is a Senior Software Engineer at HighGo Software. He joined EnterpriseDB, an enterprise PostgreSQL company, in 2005 and started his career in open source development, particularly in PostgreSQL. Asif's contributions range from developing in-house features relating to Oracle compatibility to developing tools around PostgreSQL. He joined HighGo Software in September 2018.
This long article describes the many challenges of managing open
source projects and the mismatch between resource allocation, e.g., money, and the importance of the software to economic activity. It highlights OpenSSL as an
example where limited funding led to developer burnout and security vulnerabilities, even though so much of the Internet's infrastructure relies on it.
With proprietary software, there is usually a connection between software cost and its economic value, though the linkage varies widely. (How much of software's cost goes into software development,
testing, bug fixing, and security analysis has even greater variability.) With open source, there is even less linkage.
The article explores various methods to increase the linkage. It is a complex problem, both to get money, and to distribute money in a way that helps and does not harm open source communities.
Query optimization can take different forms depending on the data represented and the required
needs. In a recent case, we had a large table that we had to query for some non-indexed criteria.
This table was on an appliance that we were unable to modify, so we had to find a way to query
efficiently without indexes that would have made it easier.
The straightforward approach for this query was something along these lines:
SELECT accounts.*
FROM accounts
JOIN logs
ON accounts.id = logs.account_id
WHERE
logs.created_at BETWEEN $1 - interval '1 minute' AND $1 + interval '1 minute' AND
logs.field1 = $2 AND
logs.field2 = $3
FETCH FIRST ROW ONLY
Unfortunately, none of the fields involved in this query were indexed, nor could they be, due to
our access level on this database system. This lack of indexes meant that our query against those
fields would end up doing a sequential scan of the whole table, which made things unacceptably slow.
This specific table held time-series data with ~100k records per 1-minute period over several weeks,
which meant we were dealing with a lot of data.
While we could not create any additional indexes to help us with this query, we could use some
specific properties to help us:
There was a primary key field, id, which was unique and monotonic, i.e., always
increasing in value.
This table was append only; no updates or deletes, so once data existed in the table it was
always the same.
The field we actually care about (created_at) also ends up being monotonic: subsequent records would always have the same or later values.
Since records were created sequentially and the id field was always increasing, the id and
created_at fields would together be generally monotonic; this means there is an indexed field
which we can use as a surrogate stand-in for the target field that we want to treat as indexed.
While due to the nature of logging ingest it is possible that the created_at and id values are not
strictly monotonic (for instance, if there are multiple logging records being created by separate
ingest processes whose ids get assigned in chunks), for our purposes this was close enough,
since we were searching a time window wider than the one in which we actually expected the
message to appear.
Since we are looking for fields matching a specific window of time, we can substitute the
non-indexed clause created_at BETWEEN <timestamp_min> AND <timestamp_max> with an expression
matching the indexed statement id BETWEEN <id of first id gt timestamp_min> AND <id of first id gt
timestamp_max> to get the same effective approach.
In order to find the specific id fields which match the created_at time ranges we are
interested in, we would need to find the first id value which matched the criteria created_at >
'timestamp'::timestamp, as all subsequent id values would match that condition as well. This
would effectively require a binary search of the table to check which records match the criteria,
and return the smallest id value for which this criteria held.
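The binary search described above can be sketched outside the database as well. The following Python sketch models the table as a sorted list of (id, created_at) pairs and finds the smallest id whose created_at exceeds a threshold; the names are illustrative, not from the original system:

```python
def first_id_after(rows, threshold):
    """Return the smallest id whose timestamp is > threshold.

    rows: list of (id, created_at) tuples, monotonic in both fields.
    Returns None if no row qualifies.
    """
    lo, hi = 0, len(rows)
    while lo < hi:
        mid = (lo + hi) // 2
        if rows[mid][1] > threshold:
            hi = mid          # candidate found; keep searching to the left
        else:
            lo = mid + 1      # too early; search to the right
    return rows[lo][0] if lo < len(rows) else None

# Example: ids 1..10 with timestamps 100..1000 in steps of 100.
table = [(i, i * 100) for i in range(1, 11)]
print(first_id_after(table, 450))  # -> 5 (first row with created_at > 450)
```

Running this procedure twice, once for timestamp_min and once for timestamp_max, yields the id range that the SQL version substitutes into the indexed BETWEEN clause.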
So now that we have identified how we can use an indexed surrogate key to substitute for the
non-indexed expression, we need to figure out how to calculate the ranges in question.
Based on some recent discoveries about optimizing simple-looking but poorly-performing queries using
more complicated queries[1], I had a hunch that this could be solved with a WITH RECURSIVE
Common Table Expression. After toying around for a while without coming up with the exact solution,
I ended up visiting the #postgresql channel on the FreeNode IRC network. There, I presented the
problem and got some interested responses, as this is the exact kind of question that database
experts love[2]. The solution that user xocolatl (Vik Fearing) came up with for a basic
binary search is the basis for the rest of my solution:
CREATE TABLE test_table (id integer PRIMARY KEY, ts timestamptz);
INSERT INTO test_table
SELECT g, date 'today' + interval '1s' * g
FROM generate_series(1, 1000000) AS g;
WITH RECURSIVE
search (min, max, middle, level) AS (
SELECT min(id), max(id), (min(id)+max(id))/2, 0
FROM test_table
UNION ALL
SELECT v.min, v.max, (v.min + v.max)/2, s.level+1
FROM search AS s
CROSS JOIN LATERAL (
SELECT *
FROM test_table AS e
WHERE e.id >= s.middle
ORDER BY e.id
FETCH FIRST ROW ONLY
) AS e
CROSS JOIN LATERAL (VALUES (
CASE WHEN e.ts < now() THEN e.id ELSE s.min END,
CASE WHEN e.ts < now() THEN s.max ELSE e.id END
)) AS v (min, max)
WHERE (v.min + v.max)/2 NOT IN (v.min, v.max)
)
SELECT *
FROM search AS s
JOIN test_table AS e ON e.id = s.middle
ORDER BY s.level DESC
FETCH FIRST ROW ONLY;
As expected, the solution involved a WITH RECURSIVE CTE.
The basic explanation here is that the search expression first starts with the min, max, and
middle (mean) values of id for the table (the initialization expression), then iteratively adds
additional rows to the results depending on whether the table row with id >= middle matches our
specific test criteria, and continues until one of the boundaries of the region is hit. (Since we
are using integer division in our terminal expression (v.min + v.max)/2 NOT IN (v.min, v.max), we
are guaranteed to eventually hit one of the boundary conditions in our search.)
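The same convergence logic can be sketched outside SQL. The following Python model is purely illustrative (the `boundary_search` helper and its data layout are invented here, not part of the original post); it mirrors the CTE's halving of the [min, max] id range until the midpoint collides with a boundary:

```python
import bisect

def boundary_search(ids, ts, cutoff):
    """Mimic the CTE: ids is a sorted list of ids (gaps allowed), ts[i] is the
    timestamp of ids[i], and the predicate ts < cutoff is monotone in id.
    Halve the [lo, hi] id range until the midpoint hits a boundary."""
    lo, hi = ids[0], ids[-1]
    middle = (lo + hi) // 2
    while middle not in (lo, hi):            # terminal expression from the CTE
        i = bisect.bisect_left(ids, middle)  # first row with id >= middle
        if ts[i] < cutoff:
            lo = ids[i]   # condition holds: search the upper half
        else:
            hi = ids[i]   # condition fails: search the lower half
        middle = (lo + hi) // 2
    return middle

# With ids 1..1000 and ts equal to id, cutoff 500 converges on id 499: the
# largest id whose ts is strictly below the cutoff.
ids = list(range(1, 1001))
print(boundary_search(ids, ids, 500))  # 499
```

Because integer division shrinks the interval on every step, the loop terminates even when ids have gaps, just as in the SQL version.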
A few other things worthy of note:
This approach uses the check e.ts < now() as the test condition, which means that in this
specific example the answer to this "closest id" query would actually change depending on when you
run the query relative to when the initial test data was populated. However, we can replace that
condition with whatever condition we want to use to test our surrogate non-indexed data.
This approach will work whether or not there are gaps in the sequence. In order to properly
handle gaps, we are selecting the first row with id >= middle ... FETCH FIRST ROW ONLY rather than
just selecting id = middle, which you could do in a gapless sequence.
In addition to not caring about gaps, we are also not trying to ensure this is a
balanced binary search; it would not be worth the computing effort to find the middlest existing row
in an index, as we'd need to know the number of rows in the segment we're searching. Since in
PostgreSQL this would entail a COUNT(*) over a subselect, it would be quite slow and not worth
the trouble.
Since my specific use case was limiting the records considered based on two
created_at values, I needed to calculate a search_min and a search_max to find the start/end
ids for each side of the interval.
Given this, I just modified the CTE to calculate both boundaries and added the additional
conditions we wanted to consider. I also had to turn the result query from a join against the
found id value into a range; the final query is as follows:
WITH RECURSIVE
search_min (min, max, middle, level) AS (
SELECT min(id), max(id), (min(id)+max(id))/2, 0
FROM logs
UNION ALL
SELECT v.min, v.max, (v.min + v.max)/2, s.level+1
FROM search_min AS s
CROSS JOIN LATERAL (
SELECT *
FROM logs AS e
WHERE e.id >= s.middle
ORDER BY e.id
FETCH FIRST ROW ONLY
) AS e
CROSS JOIN LATERAL (VALUES (
CASE WHEN extract(epoch FROM e.created_at)::integer < $1 THEN e.id ELSE s.min END,
CASE WHEN extract(epoch FROM e.created_at)::integer < $1 THEN s.max ELSE e.id END
)) AS v (min, max)
WHERE (v.min + v.max)/2 NOT IN (v.min, v.max)
),
search_max (min, max, middle, level) AS (
SELECT min(id), max(id), (min(id)+max(id))/2, 0
FROM logs
UNION ALL
SELECT v.min, v.max, (v.min + v.max)/2, s.level+1
FROM search_max AS s
CROSS JOIN LATERAL (
SELECT *
FROM logs AS e
WHERE e.id >= s.middle
ORDER BY e.id
FETCH FIRST ROW ONLY
) AS e
CROSS JOIN LATERAL (VALUES (
CASE WHEN extract(epoch FROM e.created_at)::integer < $2 THEN e.id ELSE s.min END,
CASE WHEN extract(epoch FROM e.created_at)::integer < $2 THEN s.max ELSE e.id END
)) AS v (min, max)
WHERE (v.min + v.max)/2 NOT IN (v.min, v.max)
)
SELECT accounts.*
FROM accounts
JOIN logs
ON logs.account_id = accounts.id
WHERE
logs.field1 = $3 AND
logs.field2 = $4 AND
logs.id >= (SELECT middle FROM search_min ORDER BY level DESC FETCH FIRST ROW ONLY) AND
logs.id <= (SELECT middle FROM search_max ORDER BY level DESC FETCH FIRST ROW ONLY)
ORDER BY logs.id
FETCH FIRST ROW ONLY;
The final results were drastically improved. The initial query went from timing out in the
webservice in question to returning results in a fraction of a second. Clearly this technique,
while not as useful as directly indexing data we care about, can come in handy in some
circumstances.
On 11th of September 2020, Alvaro Herrera committed patch: psql: Display stats target of extended statistics The stats target can be set since commit d06215d03, but wasn't shown by psql. Author: Justin Pryzby <justin@telsasoft.com> Discussion: https://postgr.es/m/20200831050047.GG5450@telsasoft.com Reviewed-by: Georgios Kokolatos <gkokolatos@protonmail.com> Reviewed-by: Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> Since PostgreSQL 10 we have so called extended statistics. … Continue reading "Waiting for PostgreSQL 13 – psql: Display stats target of extended statistics"
On 5th of October 2020, Peter Eisentraut committed patch: Support for OUT parameters in procedures Unlike for functions, OUT parameters for procedures are part of the signature. Therefore, they have to be listed in pg_proc.proargtypes as well as mentioned in ALTER PROCEDURE and DROP PROCEDURE. Reviewed-by: Andrew Dunstan <andrew.dunstan@2ndquadrant.com> Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com> … Continue reading "Waiting for PostgreSQL 14 – Support for OUT parameters in procedures"
PostgreSQL Person of the Week Interview with Andreas Kretschmer: I was born in Meißen, Saxony, Germany, Planet Earth. I’m married and we have 3 wonderful daughters.
You don't need monitoring until you need it. But if you're running anything in production, you always need it.
This is particularly true if you are managing databases. You need to be able to answer questions like "am I running out of disk?" or "why does my application have degraded performance?" to be able to troubleshoot or mitigate problems before they occur.
When I first made a foray into how to monitor PostgreSQL in Kubernetes, let alone in a containerized environment, I learned that a lot of the tools I had used previously did not exactly apply (though keep in mind, that foray was a while back -- things have changed!). I found myself learning a whole new tech stack for monitoring, including open source projects such as Prometheus and Grafana.
I also learned how much I had taken for granted how easy it was to collect information like CPU and memory statistics in other environments. In the container world this was a whole different ballgame, as you needed to get this information from cgroups. Fortunately for me, my colleague Joe Conway built a PostgreSQL extension called pgnodemx that reads these values from within PostgreSQL itself. Read more about pgnodemx.
And then there is the process of getting the metrics stack set up. Even with my earlier experiments on setting up PostgreSQL monitoring with Docker, I knew there was more work to be done to make an easy-to-setup monitoring solution in Kubernetes.
All this, combined with the adoption of the PostgreSQL Operator, made us want to change how we support monitoring PostgreSQL clusters on Kubernetes. We wanted to continue using proven open source solutions for monitoring and analyzing systems in Kubernetes (e.g. Prometheus, Grafana), introduce support for alerting (Alertmanager), and provide accurate host-style metrics for things like CPU, memory, and disk usage.
In PostgreSQL table bloat has been a primary concern since the original MVCC model was conceived. Therefore we have decided to do a series of blog posts discussing this issue in more detail. What is table bloat in the first place? Table bloat means that a table and/or indexes are growing in size even if the amount of data stored in the database does not grow at all. If one wants to support transactions it is absolutely necessary not to overwrite data in case it is modified because one has to keep in mind that people might want to read an old row while it is modified or rollback a transaction.
Therefore bloat is an intrinsic thing related to MVCC in PostgreSQL. However, the way PostgreSQL stores data and handles transactions is not the only way a database can handle transactions and concurrency. Let us see which other options there are:
In MS SQL you will find a thing called tempdb, while Oracle and MySQL put old versions into the redo log. As you might know, PostgreSQL copies rows on UPDATE and stores them in the same table. Firebird also stores old row versions inline.
There are two main points I want to make here:
Getting rid of old rows is hard
No solution is without tradeoffs
Getting rid of rows is definitely an issue. In PostgreSQL removing old rows is usually done by VACUUM. However, in some cases VACUUM cannot keep up or space is growing for some other reasons (usually long transactions). We at CYBERTEC have blogged extensively about that scenario.
“No solution is without tradeoffs” is also an important aspect of storage. There is no such thing as a perfect storage engine – there are only storage engines serving a certain workload well. The same is true for PostgreSQL: The current table format is ideal for many workloads. However, there is also a dark side which leads us back to where we started: Table bloat. If you are running UPDATE-intense workloads it happens more often than not that the size of a table is hard to keep under control. This is especially true if developers and system administrators are not fully aware of the inner workings of PostgreSQL in the first place.
zheap: Keeping bloat under control
zheap is a way to keep table bloat under control by implementing a storage engine capable of running UPDATE-intense workloads a lot more efficiently. The project was originally started by EnterpriseDB, and a lot of effort has already been put into it.
To make zheap production-ready, we are proud to announce that our partners at Heroic Labs have committed to funding further development of zheap and releasing all code to the community. CYBERTEC has decided to double the amount of funding and to contribute additional expertise and manpower to move zheap forward. If there are people, companies, etc. who are also interested in helping move zheap forward, we are eager to team up with everybody willing to make this great technology succeed.
Let us take a look at the key design goals:
Perform UPDATE in place
Have smaller tables (smaller tuple headers, improved alignment)
Reduce writes as much as possible (avoid dirtying pages unless data is modified)
Reuse space more quickly
So let us see how those goals can be achieved in general.
The basic design of zheap
zheap is a completely new storage engine and it therefore makes sense to dive into the basic architecture. Three essential components have to work together:
zheap: The table format
undo: Handling transaction rollback, etc.
WAL: Protect critical writes
Let us take a look at the layout of a zheap page first. As you know PostgreSQL typically sees tables as a sequence of 8k blocks, so the layout of a page is definitely important:
At first glance this image looks almost like a standard PostgreSQL 8k page, but in fact it is not. The first thing you might notice is that tuples are stored in the same order as the item entries at the beginning of the page, to allow for faster scans. The next thing we see is the presence of "slot" entries at the end of the page. In a standard PostgreSQL table, visibility information is stored as part of each row, which needs a lot of space. In zheap, transaction information has been moved to the page level, which significantly reduces the size of the data (which in turn translates to better performance). A transaction slot occupies 16 bytes of storage and contains the following information: transaction id, epoch, and the latest undo record pointer of that transaction. A row points to a transaction slot. The default number of transaction slots per page is 4, which is usually fine for big tables. However, sometimes more transaction slots are needed. In this case, zheap has something called "TPD", which is nothing more than an overflow area to store additional transaction information as needed.
Here is the basic layout:
Sometimes many transaction slots are needed for a single page. TPD offers a flexible way to handle that. The question is: where does zheap store TPD data? The answer is: these special pages are interleaved with the standard data pages. They are simply marked in a special way to ensure that sequential scans won't touch them. zheap uses a meta page to keep track of these special-purpose pages:
TPD is simply a way to make transaction slots more scalable. Having some slots in the block itself reduces the need to touch extra pages. If more are needed, TPD is an elegant way out. In a way, it is the best of both worlds.
Transaction slots can be reused after a transaction ends.
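As a quick illustration of the 16-byte figure mentioned above, the three pieces of slot information pack exactly into 16 bytes. This is only a sketch; the field order shown here is assumed for illustration, not taken from zheap's actual source:

```python
import struct

# Assumed field layout for a transaction slot: 4-byte transaction id,
# 4-byte epoch, 8-byte undo record pointer = 16 bytes total.
TRANSACTION_SLOT = struct.Struct("<IIQ")

print(TRANSACTION_SLOT.size)  # 16

# The four default slots per page therefore cost only 64 bytes of an 8k page.
print(4 * TRANSACTION_SLOT.size)  # 64
```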
zheap: Tuple formats
The next important part of the puzzle is the layout of a single tuple: in PostgreSQL a standard heap tuple has a 20+ byte header, because all the transactional information is stored in the tuple. Not so in this case. All transactional information has been moved to page-level structures (transaction slots). This is super important: the header has therefore been reduced to merely 5 bytes. But there are more optimizations going on here: a standard tuple has to use CPU alignment (padding) between the tuple header and the real data in the row, which can burn some bytes for every single row in the table. zheap does not do that, leading to more tightly packed storage. Additional space is saved by removing the padding from pass-by-value data types. All these optimizations mean that we can save valuable space in every single row of the table.
Here is what a standard PostgreSQL tuple header looks like:
Now let us compare this to a zheap tuple:
As you can see a zheap tuple is a lot smaller than a normal heap tuple. As the transactional information has been unified in the transaction slot machinery, we don’t have to handle visibility on the row level anymore but can do it more efficiently on the page level.
By shrinking the storage footprint zheap will contribute to good performance.
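To put the header numbers into perspective, here is some back-of-the-envelope arithmetic. The 20 and 5 byte header figures come from the paragraphs above; the row count is an arbitrary assumption for illustration:

```python
HEAP_HEADER_BYTES = 20   # lower bound for a standard heap tuple header (per the text)
ZHEAP_HEADER_BYTES = 5   # zheap tuple header (per the text)
ROWS = 100_000_000       # assumed table size

saved_gb = ROWS * (HEAP_HEADER_BYTES - ZHEAP_HEADER_BYTES) / 1024**3
print(f"{saved_gb:.1f} GB saved on headers alone")  # ~1.4 GB
```

And that is before counting the alignment padding zheap also avoids.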
UNDO: Keeping things in order
One of the most important things when talking about zheap is the notion of “undo”. What is the purpose of this thing in the first place? Let us take a look and see: Consider the following operation:
BEGIN;
UPDATE tab SET x = 7 WHERE x = 5;
…
COMMIT / ROLLBACK;
To ensure that transactions can operate correctly UPDATE cannot just overwrite the old value and forget about it. There are two reasons for that: First of all, we want to support concurrency. Many users should be able to read data while it is modified. The second problem is that updating a row does not necessarily mean that it will be committed. Thus we need a way to handle ROLLBACK in a useful way. The classical PostgreSQL storage format will simply copy the row INSIDE standard heap which leads to all those bloat related issues we have discussed on our blog already.
The way zheap approaches things here is a bit different: In case a modification is made the system writes “undo” information to fix it in case the transaction has to be aborted for whatever reason. This is the fundamental concept applicable to INSERT, UPDATE, and DELETE. Let us go through those operations one by one and see how it works:
INSERT: Adding rows
In the case of INSERT, zheap has to allocate a transaction slot and then emit an undo entry to fix things on error. For INSERT, the TID is the most relevant piece of information needed by undo. Space can be reclaimed instantly after an INSERT has been rolled back, which is a major difference between zheap and standard heap tables in PostgreSQL.
UPDATE: Modifying data
An UPDATE statement is far more complicated: There are basically two cases:
The new row fits into the old space
The new row does not fit into the old space
In case the new row fits into the old space we can simply overwrite the old row and emit an undo entry holding the complete old row. In short: we hold the new row in zheap and a copy of the old row in undo, so that we can copy it back in case it is needed.
What happens if the new row does not fit in? In this case performance will be worse because zheap essentially has to perform a DELETE / INSERT operation which is of course not as efficient as an in-place UPDATE.
Space can instantly be reclaimed in the following cases:
When updating a row to a shorter version
When non-inplace UPDATEs are performed
DELETE: Removing rows
Finally there is DELETE. To handle the removal of a row zheap has to emit an undo record to put the old row back in place in case of ROLLBACK. The row has to be removed from the zheap during DELETE.
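The per-operation undo behavior described above can be modeled with a toy in-memory sketch. This is purely illustrative Python, not zheap's actual implementation, and the class and method names are invented: each statement records what is needed to restore the old state, and ROLLBACK replays the records in reverse.

```python
class ToyZHeap:
    """Toy model of zheap-style undo: rows live in one place, and each
    modification appends an undo record describing how to reverse it."""

    def __init__(self):
        self.rows = {}   # tid -> row value (the "zheap")
        self.undo = []   # undo log for the currently open transaction

    def insert(self, tid, row):
        self.rows[tid] = row
        self.undo.append(("remove", tid, None))   # INSERT undo only needs the TID

    def update(self, tid, row):
        self.undo.append(("restore", tid, self.rows[tid]))  # keep complete old row
        self.rows[tid] = row                                # in-place update

    def delete(self, tid):
        self.undo.append(("restore", tid, self.rows.pop(tid)))

    def commit(self):
        self.undo.clear()   # undo no longer needed once the transaction ends

    def rollback(self):
        for action, tid, old in reversed(self.undo):
            if action == "remove":
                del self.rows[tid]      # undo an INSERT
            else:
                self.rows[tid] = old    # undo an UPDATE or DELETE
        self.undo.clear()

t = ToyZHeap()
t.insert(1, "a")
t.commit()
t.update(1, "b")
t.insert(2, "c")
t.delete(1)
t.rollback()
print(t.rows)  # {1: 'a'} - the committed state is restored
```

The real system of course also has to WAL-log the undo writes and survive a crash mid-rollback, which is what the next section discusses.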
UNDO and ROLLBACK in action
Up to now we have spoken quite a bit about undo and rollback. However, let us dive a bit deeper and see how undo, rollback, and so on interact with each other.
In case a ROLLBACK happens the undo has to make sure that the old state of the table is restored. Thus the undo action we have scheduled before has to be executed. In case of errors the undo action is applied as part of a new transaction to ensure success.
Ideally, all undo action associated with a single page is applied at once to cut down on the amount of WAL that has to be written. A nice side effect of this strategy is also that we can reduce page-level locking to the absolute minimum which reduces contention and therefore helps contribute to good performance.
So far this sounds easy but let us consider an important use case: What happens in the event of a really long transaction? What happens if a terabyte of data has to be rolled back at once? End users certainly don’t appreciate never-ending rollbacks. It is also worth keeping in mind that we must also be prepared for a crash during rollback.
What happens is that if undo action is larger than a certain configurable threshold the job is done by a background worker process. This is a really elegant solution that helps to maintain a good end-user experience.
Undo itself can be removed in three cases:
as soon as there is no transaction left that can see the data
as soon as all undo actions have been completed
for committed transactions, once they have become all-visible
Let us take a look at a basic architecture diagram:
As you can see the process is quite sophisticated.
Indexing: A brief remark
To ensure that zheap is a drop-in replacement for the current heap it is important to keep the indexing code untouched. Zheap can work with PostgreSQL’s standard access methods. There is of course room to make things even more efficient. However, at this point no changes to the indexing code are needed. This also implies that all index types available in PostgreSQL are fully available without known restrictions.
Finally …
Currently zheap is still under development and we are glad for the contributions made by Heroic Labs to develop this technology further. So far we have already implemented logical decoding for zheap and added support for PostgreSQL. We will continue to allocate more resources to push the tool to make it production-ready.
If you want to read more about PostgreSQL and VACUUM right now consider checking our previous posts on the subject. In addition, we also want to invite you to keep visiting our blog on a regular basis to learn more about it and other interesting technologies.
Storing a user’s timezone in Postgres can be an interesting task in apps. Often, this is accomplished by generating a hardcoded list of every timezone in the backend (or worse the frontend JavaScript) and storing an item from that list in the database for each user. This hardcoded magic list is both difficult to build accurately and hard to maintain. Timezones change more frequently than you would imagine ...but less frequently than time itself.
Postgres has a system view called pg_timezone_names that holds the list of timezones Postgres uses internally. We can use this view to generate a dropdown or other input type for the user to select their timezone. Because this list is quite large by default, we can filter out some of the timezones in our query (like all the ones that start with "posix/").
select name from pg_timezone_names where name not like 'posix%' and name not ilike 'system%' order by name;
Now, we need to figure out the best way to store the selection from our user. We can do this by adding a column to our users table to hold the timezone. The column is a text type because we are only storing the name value from pg_timezone_names.
alter table users add column timezone text;
Next, we want to query the user's events in the timezone we have stored for the user. We store the start_time for each event in a UTC timestamptz column. This solution leverages the timezone function in Postgres; the equivalent AT TIME ZONE syntax also works.
SELECT events.name, timezone(users.timezone, start_time) FROM events JOIN users ON events.user_id = users.id JOIN pg_timezone_names ON users.timezone = pg_timezone_names.name;
This could be taken a step further by using the to_timestamp function in Postgres to format the dates, and keep the application from formatting dates at all.
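The same conversion can also be done application-side using the stored timezone name, since the names in pg_timezone_names (minus the posix/ ones we filtered out) are largely standard IANA identifiers. A minimal Python sketch, assuming the user's stored value is "America/Chicago":

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# start_time as it comes back from a UTC timestamptz column
start_time = datetime(2020, 10, 1, 12, 0, tzinfo=timezone.utc)

# users.timezone holds a name selected from pg_timezone_names
user_tz = "America/Chicago"
local = start_time.astimezone(ZoneInfo(user_tz))
print(local.isoformat())  # 2020-10-01T07:00:00-05:00
```

Doing it in the database keeps all clients consistent, but the app-side version is handy for things like email scheduling.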
Last week 2nd Quadrant was purchased by EDB.
While this is certainly good news for these companies, it can increase risks to the Postgres community. First, there is an
unwritten rule that the Postgres core team should not have over half of its members from a single company, and the acquisition causes EDB's representation in the
core team to be 60% — the core team is working on a solution for this.
Second, two companies becoming one reduces Postgres users' choice for support and services, especially in the North American and western European markets. Reduced vendor options often result in worse
customer service and less innovation. Since the Postgres community innovates independently, this might not be an issue for community software, but it could be for company-controlled tooling around
Postgres.
Third, there is the risk that an even larger company wanting to hurt Postgres could acquire EDB and take it in a direction that is neutral or negative for the Postgres community. Employee
non-compete agreements and the lack of other Postgres support companies could extend the duration of these effects. There isn't much the community can do to minimize these issues except to be alert
for problems.
A couple years ago (at the pgconf.eu 2014 in Madrid) I presented a talk called “Performance Archaeology” which showed how performance changed in recent PostgreSQL releases. I did that talk as I think the long-term view is interesting and may give us insights that may be very valuable. For people who actually work on PostgreSQL […]
My professional background has been in application development with a strong affinity for developing with PostgreSQL (which I hope comes through in previous articles). However, in many of my roles, I found myself as the "accidental" systems administrator, where I would troubleshoot issues in production and do my best to keep things running and safe.
When it came to monitoring my Postgres databases, I initially took what I knew about monitoring a web application itself, i.e. looking at CPU, memory, and network usage, and used that to try to detect issues. In many cases, it worked: for instance, I could see a CPU spike on a PostgreSQL database and deduce that there was a runaway query slowing down the system.
Over time, I learned about other types of metrics that would make it easier to triage and mitigate PostgreSQL issues. Combined with what I learned as an accidental systems administrator, I've found they make a powerful toolkit that even helps with application construction.
To help share these experiences, I set up a few PostgreSQL clusters with the PostgreSQL Operator monitoring stack. The Postgres Operator monitoring stack uses the Kubernetes support of pgMonitor to collect metrics about the deployment environment (e.g. Pods) and the specific PostgreSQL instances. I'll also add a slight Kubernetes twist, given there are some special considerations you need to make when monitoring Postgres on Kubernetes.
We'll start with my "go to" set of statistics, what I call "the vitals."
The Vital Statistics: CPU, Memory, Disk, and Network Utilization
One fairly common complaint about Postgres is that each connection uses too much memory. It is often made when comparing Postgres' connection model, where each connection has a dedicated process, to one where each connection is assigned a dedicated thread.
To be clear: this is a worthwhile discussion to have, and there are several important improvements we could make to reduce memory usage.
That said, I think one common cause of these concerns is that the easy ways to measure the memory usage of a Postgres backend, like top and ps, are quite misleading.
Watchdog is the high availability component of Pgpool-II. Over the past few releases watchdog has gotten a lot of attention from the Pgpool-II developer community and received lots of upgrades and stability improvements.
One of the weaker areas of Pgpool-II watchdog has been its configuration interface. A watchdog cluster requires quite a few config settings on each node, and it is very easy to get them wrong and hard to debug.
For example, in a three-node Pgpool-II cluster we normally need to configure the following parameters in each pgpool.conf:
# Local node identification (unique for each node)
wd_hostname
wd_port
wd_heartbeat_port
# Node #1 endpoint
other_pgpool_hostname0
other_pgpool_port0
other_wd_port0
# Node #2 endpoint
other_pgpool_hostname1
other_pgpool_port1
other_wd_port1
# Local device setting for sending heartbeat
heartbeat_device
#Node #1 heartbeat endpoint
heartbeat_destination0
heartbeat_destination_port0
#Node #2 heartbeat endpoint
heartbeat_destination1
heartbeat_destination_port1
The main issue here is not the number of parameters. The issue is that the value of almost every one of these parameters is different on each Pgpool-II node, which makes configuring, debugging, and adding or removing a watchdog node difficult to manage.
One pgpool.conf for every node
When we were discussing the features for Pgpool-II 4.2 last year, we decided to make ease of use one of the priorities for the next release. And since the most difficult configuration belongs to the watchdog area, we decided to fix that first.
In the upcoming 4.2 version, Pgpool-II uses a unified watchdog configuration, and unlike previous versions, the same pgpool.conf file can now be used on every Pgpool-II node.
For the same three-node Pgpool-II cluster, the configuration file now only needs these parameters set once, and the same pgpool.conf file can then be used on each node.
Once these parameters are configured, the next step is to specify the unique node id of each Pgpool-II node. For that purpose, Pgpool-II opted to use the same technique as other distributed software like ZooKeeper (the myid file): a separate single-line configuration file that sets the local node id.
Similar to the myid file, Pgpool-II uses a pgpool_node_id file. pgpool_node_id contains a single integer in human-readable ASCII text that represents the local Pgpool-II node id.
# create pgpool_node_id for node #1
echo 1 > etc/pgpool_node_id
Conclusion
Although from the look of it this seems like a small feature, it is a very useful one and makes watchdog cluster deployment a lot easier and less error-prone. On top of that, it is now much easier to add and remove Pgpool-II nodes from the watchdog cluster.
Pgpool-II 4.2 alpha was released just a few days ago, and GA is expected in around a month's time. So I thought I would blog about this feature, as upgrading to 4.2 requires keeping these configuration changes in mind: the watchdog needs to be reconfigured before proceeding with the upgrade. Upgrading to 4.2 without considering these changes would lead to downtime.
Muhammad Usama is a database architect / PostgreSQL consultant at HighGo Software and also Pgpool-II core committer. Usama has been involved with database development (PostgreSQL) since 2006, he is the core committer for open source middleware project Pgpool-II and has played a pivotal role in driving and enhancing the product. Prior to coming to open source development, Usama was doing software design and development with the main focus on system-level embedded development. After joining the EnterpriseDB, an Enterprise PostgreSQL’s company in 2006 he started his career in open source development specifically in PostgreSQL and Pgpool-II. He is a major contributor to the Pgpool-II project and has contributed to many performance and high availability related features.
One common challenge with Postgres for those of you who manage busy Postgres
databases, and those of you who foresee being in that situation, is that
Postgres does not handle large numbers of connections particularly well.
While it is possible to have a few thousand established connections without
running into problems, there are some real and hard-to-avoid problems.
Since
joining Microsoft
last year in the Azure Database for PostgreSQL
team—where I work on open source Postgres—I have spent a lot of
time analyzing and addressing some of the issues with connection scalability in
Postgres.
In this post I will explain why I think it is important to improve Postgres'
handling of large numbers of connections, followed by an analysis of the
different factors limiting connection scalability in Postgres.
In an upcoming post I will show the results of the work we’ve done to improve
connection handling and snapshot scalability in Postgres—and go into
detail about the identified issues and how we have addressed them in Postgres
14.
Why connection scalability in Postgres is important
In some cases problems around connection scalability are caused by
unfamiliarity with Postgres, broken applications, or other issues in the same
vein. And as I already mentioned, some applications can have a few thousand
established connections without running into any problems.
A frequent counter-claim to requests to improve Postgres' handling of large
connection counts is that there is nothing to address: that the desire/need to
handle large numbers of connections is misguided, caused by broken applications
or similar. This is often accompanied by references to the server only having a
limited number of CPU cores.
There certainly are cases where the best approach is to avoid large numbers of
connections, but there are, in my opinion, pretty clear reasons for
needing larger numbers of connections in Postgres. Here are the main ones:
Central state and spiky load require large numbers of connections: It is
common for a database to be the shared state for an application (leaving
non-durable caching services aside). Given the cost of establishing a new
database connection (TLS, latency, and Postgres costs, in that order) it is
obvious that applications need to maintain pools of Postgres connections that
are large enough to handle the inevitable minor spikes in incoming
requests. Often there are many servers running [web-]application code using
one centralized database.
To some degree this issue can be addressed using Postgres connection poolers like
PgBouncer or more recently
Odyssey. To actually reduce the number
of connections to the database server such poolers need to be used in
transaction (or statement)
pooling modes. However, doing so
precludes the use of many useful database features like
prepared statements, temporary
tables, …
Latency and result processing times lead to idle connections: Network
latency and application processing times will often result in individual
database connections being idle the majority of the time, even when the
applications are issuing database requests as fast as they can.
Common OLTP database workloads, and especially web applications, are heavily
biased towards reads. And with OLTP workloads, the majority of SQL queries
are simple enough to be processed well below the network latency between
application and database.
Additionally the application needs to process the results of the database
queries it sent. That often will involve substantial work (e.g. template
processing, communication with cache servers, …).
To drive this home, here is a simple experiment using
pgbench (a simple
benchmarking program that is part of Postgres). In a memory-resident,
read-only pgbench workload (executed on my workstation[1], 20/40
CPU cores/threads) I am comparing the achievable throughput across increasing
client counts between a non-delayed pgbench and a pgbench with simulated
delays. For the simulated delays, I used a 1ms network delay and a 1ms
processing delay. The non-delayed pgbench peaks around 48 clients, the
delayed run around 3000 connections. Even comparing on-machine TCP
connections to a 10GbE link between two physically close machines moves the
peak from around 48 connections closer to 500 connections.
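The exact pgbench scripts used for this experiment are not reproduced here, but the client-side delays can be approximated with a custom script file along these lines, where the `\sleep` stands in for the combined 1 ms network and 1 ms processing delay (run e.g. with `pgbench -n -f delay.sql -c <clients> -j <threads> -T 60`):

```
\set aid random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
\sleep 2ms
```

The `SELECT` mirrors pgbench's built-in select-only workload; only the added `\sleep` differs from an undelayed run.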
Scaling out to allow for higher connection counts can increase cost: Even
in cases where the application’s workload can be distributed over a number of
Postgres instances, the impact of latency combined with low maximum
connection limits will often result in low utilization of the database
servers, while exerting pressure to increase the number of database servers
to handle the required number of connections. That can increase the
operational costs substantially.
Surveying connection scalability issues
My goal in starting this project was to improve Postgres’ ability to handle
substantially larger numbers of connections. To do that—to pick the right
problem to solve—I first needed to understand which problems were most
important, otherwise it would have been easy to end up with micro-optimizations
without improving real-world workloads.
So my first software engineering task was to survey the different aspects of
connection scalability limitations in Postgres, specifically: memory usage,
snapshot scalability, and the connection model with its context switches (each
covered in a section below).
By the end of this deep dive into the connection scalability limitations in
Postgres, I hope you will understand why I concluded that snapshot scalability
should be addressed first.
Memory usage
There are three main aspects to the problems around memory usage of a large
number of connections:
Postgres, as many of you will know, uses a process-based connection model.
When a new connection is established, Postgres’ supervisor process creates a
dedicated process to handle that connection going forward. The use of a “full
blown process” over threads has some advantages, like increased
isolation/robustness, but also some disadvantages.
One common complaint is that each connection uses too much memory. That
complaint arises, at least partially, because it is surprisingly hard to
measure the increase in memory usage caused by an additional connection.
In a recent post about measuring the
memory overhead of a Postgres connection
I show that it is surprisingly hard to accurately measure the memory
overhead. And that in many workloads, and with the right configuration—most
importantly, using
huge_pages—the memory overhead of each connection is
below 2 MiB.
Conclusion: connection memory overhead is acceptable
When each connection only has an overhead of a few MiB, it is quite possible to
have thousands of established connections. It would obviously be good to use
less memory, but memory is not the primary issue around connection scalability.
Cache bloat
Another important aspect of memory-related connection scalability issues can be
that, over time, the memory usage of a connection increases, due to long-lived
resources. This is particularly an issue in workloads that utilize long-lived
connections combined with schema-based multi-tenancy.
Unless applications implement some form of connection <-> tenant association,
each connection over time will access all relations for all tenants. That leads
to Postgres’ internal catalog metadata caches growing beyond a reasonable size,
as currently (as of version 13) Postgres does not prune its metadata caches of
unchanging rarely-accessed contents.
Problem illustration
To demonstrate the issue of cache bloat, I created a simple test bed
with 100k tables, each with a few columns and a single index on a serial
primary key column[2]. It takes a while to create.
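The full test bed definition is not shown here, but it can be recreated with something along the following lines (the column list is a guess; the `foo_` naming matches the access script in the footnotes):

```sql
DO $$
BEGIN
    FOR i IN 1..100000 LOOP
        -- serial PRIMARY KEY gives each table its single index
        EXECUTE format(
            'CREATE TABLE foo_%s (id serial PRIMARY KEY, a int, b text)', i);
    END LOOP;
END;
$$;
```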
With the recently added pg_backend_memory_contexts view it is not too
difficult to see the aggregated memory usage of the various caches (although it
would be nice to see more of the different types of caches broken out into
their own memory contexts)[3].
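The aggregated numbers shown in the following tables can be obtained with a query roughly like this one (column names follow the pg_backend_memory_contexts view as added in PostgreSQL 14; the exact grouping used is an assumption):

```sql
SELECT name,
       parent,
       sum(total_bytes) AS size_bytes,
       pg_size_pretty(sum(total_bytes)) AS size_human,
       count(*) AS num_contexts
FROM pg_backend_memory_contexts
WHERE name IN ('CacheMemoryContext', 'index info', 'relation rules')
GROUP BY name, parent
ORDER BY size_bytes DESC;
```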
In a new Postgres connection, not much memory is used:
| name | parent | size_bytes | size_human | num_contexts |
|------|--------|-----------:|-----------:|-------------:|
| CacheMemoryContext | TopMemoryContext | 524288 | 512 kB | 1 |
| index info | CacheMemoryContext | 149504 | 146 kB | 80 |
| relation rules | CacheMemoryContext | 8192 | 8192 bytes | 1 |
But after forcing all Postgres tables we just created to be accessed[4], this
looks very different:
| name | parent | size_bytes | size_human | num_contexts |
|------|--------|-----------:|-----------:|-------------:|
| CacheMemoryContext | TopMemoryContext | 621805848 | 593 MB | 1 |
| index info | CacheMemoryContext | 102560768 | 98 MB | 100084 |
| relation rules | CacheMemoryContext | 8192 | 8192 bytes | 1 |
As the metadata cache for indexes is created in its own memory context,
num_contexts for the “index info” contexts nicely shows that we accessed the
100k tables (and some system internal ones).
Conclusion: cache bloat is not the major issue at this moment
A common solution for the cache bloat issue is to drop “old” connections from the
application connection pooler after a certain age. Many connection pooler
libraries/web frameworks support that.
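With PgBouncer, for example, the corresponding server-side knob is server_lifetime, which closes and replaces server connections once they reach a certain age; application-side poolers typically offer a similar max-lifetime setting. The value below is arbitrary:

```ini
[pgbouncer]
; close a server connection once it has been open this long (in seconds),
; so that a fresh backend (with empty catalog caches) replaces it
server_lifetime = 1800
```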
As there is a feasible workaround, and as cache bloat is only an issue
for databases with a lot of objects, cache bloat is not the major issue at the
moment (but worthy of improvement, obviously).
Query memory usage
The third aspect is that it is hard to limit the memory used by queries. The
work_mem
setting does not control the memory used by a query as a whole, but only that
of individual parts of a query (e.g. a sort, hash aggregation, or hash join).
That means that a query can end up requiring several times work_mem[5].
That means that one has to be careful setting work_mem in workloads requiring
a lot of connections. With larger work_mem settings, practically required for
analytics workloads, one can’t reasonably use a huge number of concurrent
connections and expect to never hit memory exhaustion related issues
(i.e. errors or the OOM killer).
Luckily most workloads requiring a lot of connections don’t need a high
work_mem setting, and it can be set on the user, database, connection, and
transaction level.
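For reference, those levels can be set roughly like this (the role and database names are made up):

```sql
-- per-user and per-database defaults
ALTER ROLE reporting SET work_mem = '256MB';
ALTER DATABASE analytics SET work_mem = '512MB';

-- per-connection (session) override
SET work_mem = '64MB';

-- per-transaction override, automatically reverted at COMMIT/ROLLBACK
BEGIN;
SET LOCAL work_mem = '1GB';
-- run the memory-intensive query here
COMMIT;
```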
Snapshot scalability
There are a lot of recommendations out there strongly advising against setting
max_connections
for Postgres to a high value, as high values can cause problems. In fact, I’ve
argued that myself many times.
But that is only half the truth.
Setting max_connections to a very high value alone only
leads, at best (worst?), to a very small slowdown in itself, and wastes some
memory. E.g. on my workstation[1] there is no measurable
performance difference for a read-only pgbench between max_connections=100
and a value as extreme as max_connections=100000 (for the same pgbench client
count, 48 in this case). However, the memory required for Postgres does
increase measurably with such an extreme setting: with shared_buffers=16GB,
max_connections=100 uses 16804 MiB and max_connections=100000 uses 21463 MiB
of shared memory. That is a large enough difference to potentially cause a
slowdown indirectly (although most of that memory will never be used, and
therefore not allocated by the OS in common configurations).
The real issue is that currently Postgres does not scale well to having a large
number of established connections, even if nearly all connections are idle.
To showcase this, I used two separate pgbench[6] runs. One of them just
establishes connections that are entirely idle (using a test file that just
contains \sleep 1s, causing a client-side sleep); the other runs a normal
pgbench read-only workload.
This is far from reproducing the worst possible version of the issue, as
normally the set of idle connections varies over time, which makes this issue
considerably worse. This version is much easier to reproduce however.
This is a very useful scenario to test, because it allows us to isolate the
cost of additional connections pretty well. Especially when the count
of active connections is low, the system CPU usage is quite low. If there is a
slowdown when the number of idle connections increases, it is clearly related
to the number of idle connections.
If we instead measured the throughput with a high number of active connections,
it’d be harder to pinpoint whether e.g. the increase in context switches or
lack of CPU cycles is to blame for slowdowns.
[Chart: Throughput of one active connection in the presence of a variable number of idle connections]
[Chart: Throughput of 48 active connections in the presence of a variable number of idle connections]
These results[7] clearly show that the
achievable throughput of active connections decreases significantly when the
number of idle connections increases.
In reality “idle” connections are not entirely idle, but send queries at a
lower rate. To simulate that, I used the script below to make clients only
occasionally send queries:

\sleep 100ms
SELECT 1;
[Chart: Throughput of one active connection in the presence of a variable number of mostly-idle connections]
[Chart: Throughput of 48 active connections in the presence of a variable number of mostly-idle connections]
The results[8] show that this slightly
more realistic scenario slows down the active connections even more.
Cause
Together these results very clearly show that there is a significant issue
handling large connection counts, even when CPU/memory are plentiful. The fact
that a single active connection slows down by more than 2x due to concurrent
idle connections points to a very clear issue.
A CPU profile quickly pinpoints the part of Postgres responsible:
[Profile: one active connection running read-only pgbench concurrently with 5000 idle connections; the bottleneck is clearly in GetSnapshotData()]
Obviously the bottleneck is entirely in the GetSnapshotData() function. That
function performs the bulk of the work necessary to provide readers with
transaction isolation.
GetSnapshotData() builds so called “snapshots” that describe which effects of
concurrent transactions are visible to a transaction, and which are not. These
snapshots are built very frequently (at least once per transaction, very
commonly more often).
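Such a snapshot can even be inspected from SQL. Since PostgreSQL 13, pg_current_snapshot() (txid_current_snapshot() in earlier releases) returns the xmin:xmax:xip_list form of the current snapshot:

```sql
SELECT pg_current_snapshot();
-- returns something like 748:751:748,750 meaning: transactions with
-- xid < 748 have completed, xids 748 and 750 are still in progress, and
-- xids >= 751 had not yet started when the snapshot was taken
```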
Even without knowing its implementation, it does make some intuitive sense (at
least I think so, but I also know what it does) that such a task gets more
expensive the more connections/transactions need to be handled.
Conclusion: Snapshot scalability is a significant limit
A large number of connections clearly reduces the efficiency of other
connections, even when they are idle (which, as explained above, is very
common). Except for reducing the number of concurrent connections and issuing
fewer queries, there is no real workaround for the snapshot scalability issue.
Connection model & context switches
As mentioned above, Postgres uses a
one-process-per-connection model. That works well in a lot of cases, but is a
limiting factor for dealing with 10s to 100s of thousands of connections.
Whenever a query is received by a backend process, the kernel needs to perform
a context switch to that process. That is not cheap. But more importantly, once
the result for the query has been computed, the backend will commonly be idle
for a while—the query result has to traverse the network, be received and
processed by the application, before the application sends a new query. That
means that on a busy server another process/backend/connection will need to be
scheduled—another context switch (a cross-process context switch is more
expensive than a switch between a process and the kernel within the same
process, e.g. as part of a syscall).
Note that switching to a one-thread-per-connection model does not address this
issue to a meaningful degree: while some of the context switches may get cheaper,
context switches still are the major limit. There are reasons to consider
switching to threads, but connection scalability itself is not a major one
(without additional architectural changes, some of which may be easier using
threads).
To handle huge numbers of connections, a different type of connection model is
needed. Instead of using a process/thread-per-connection model, a fixed/limited
number of processes/threads needs to handle all connections. By waiting for
incoming queries on many connections at once, and then processing many queries
without being interrupted by the OS CPU scheduler, efficiency can be improved
very significantly.
This is not a brilliant insight by me. Architectures like this are in wide use,
and have widely been discussed. See e.g. the
C10k problem, coined in 1999.
Besides avoiding context switches, there are many other performance benefits
that can be gained. E.g. on higher core count machines, a lot of performance
can be gained by increasing locality of shared memory, e.g. by binding specific
processes/threads and regions of memory to specific CPU cores.
However, changing Postgres to support a different kind of connection model like this is a huge
undertaking. That does not just require carefully separating many dependencies
between processes and connections, but also user-land scheduling between
different queries, support for asynchronous IO, likely a different query
execution model (to avoid needing a separate stack for each query), and much
more.
Conclusion: Start by improving snapshot scalability in Postgres
In my opinion, the memory usage issues are not as severe as the other issues
discussed. Partially because the memory overhead of connections is smaller
than it initially appears, and partially because issues like Postgres’ caches
using too much memory can be worked around reasonably.
We could, and should, make improvements around memory usage in Postgres,
and there is some low-hanging fruit there. But I don’t think, as things
currently stand, that improving memory usage would, on its own, change the
picture around connection scalability, at least not on a fundamental level.
In contrast, there is no good way to work around the snapshot scalability
issues. Reducing the number of established connections significantly is often
not feasible, as explained above. There aren’t really any other workarounds.
Additionally, as the snapshot scalability issue is very localized, it is quite
feasible to tackle it. There are no fundamental paradigm shifts necessary.
Lastly, there is the aspect of wanting to handle many tens of thousands of
connections, likely by entirely switching the connection model. As outlined,
that is a huge project/fundamental paradigm shift. That doesn’t mean it
should not be tackled, obviously.
Addressing the snapshot scalability issue first thus seems worthwhile,
promising significant benefits on its own.
But there’s also a more fundamental reason for tackling snapshot scalability
first: While e.g. addressing some memory usage issues at the same time,
switching the connection model would not at all address the snapshot issue. We
would obviously still need to provide isolation between the connections, even
if a connection wouldn’t have a dedicated process anymore.
Hopefully now you understand why I chose to focus on Postgres snapshot scalability
first. More about that in my next blog post.
DO $$
DECLARE
    cnt int := 0;
    v record;
BEGIN
    FOR v IN SELECT * FROM pg_class WHERE relkind = 'r' AND relname LIKE 'foo%' LOOP
        EXECUTE format('SELECT count(*) FROM %s', v.oid::regclass::text);
        cnt = cnt + 1;
        IF cnt % 100 = 0 THEN COMMIT; END IF;
    END LOOP;
    RAISE NOTICE 'tables %', cnt;
END;
$$; ↩︎
Even worse, there can also be several queries in progress
at the same time, e.g. due to the use of cursors. It is however not common
to concurrently use many cursors. ↩︎
This is with pgbench modified to wait until all connections are
established. Without that pgbench modification, sometimes a subset of
clients may not be able to connect, particularly before the fixes described
in this article. See this mailing list post for details. ↩︎