Planet PostgreSQL

Stefan Fercot: Combining pgBackRest and Streaming Replication


pgBackRest is a well-known powerful backup and restore tool. It offers a lot of possibilities.

While pg_basebackup is commonly used to set up the initial database copy for Streaming Replication, it can be interesting to reuse a previous database backup (e.g. taken with pgBackRest) to perform this initial copy.

Furthermore, the --delta option provided by pgBackRest can help us to re-synchronize an old secondary server without having to rebuild it from scratch.

To reduce the load on the primary server during a backup, pgBackRest even allows backups to be taken from a standby server.

We’ll see in this blog post how to do that.


For the purpose of this post, we’ll use 2 nodes called primary and secondary. Both are running on CentOS 7.

We’ll cover some pgBackRest tips but won’t go deeper in the PostgreSQL configuration, nor in the Streaming Replication best practices.


Installation

On both primary and secondary server, install PostgreSQL and pgBackRest packages directly from the PGDG yum repositories:

$ sudo yum install -y https://download.postgresql.org/pub/repos/yum/11/redhat/\
rhel-7-x86_64/pgdg-centos11-11-2.noarch.rpm
$ sudo yum install -y postgresql11-server postgresql11-contrib pgbackrest

Check that pgBackRest is correctly installed:

$ pgbackrest
pgBackRest 2.07 - General help

Usage:
    pgbackrest [options] [command]

Commands:
    archive-get     Get a WAL segment from the archive.
    archive-push    Push a WAL segment to the archive.
    backup          Backup a database cluster.
    check           Check the configuration.
    expire          Expire backups that exceed retention.
    help            Get help.
    info            Retrieve information about backups.
    restore         Restore a database cluster.
    stanza-create   Create the required stanza data.
    stanza-delete   Delete a stanza.
    stanza-upgrade  Upgrade a stanza.
    start           Allow pgBackRest processes to run.
    stop            Stop pgBackRest processes from running.
    version         Get version.

Use 'pgbackrest help [command]' for more information.

Create a basic PostgreSQL cluster on primary:

$ sudo /usr/pgsql-11/bin/postgresql-11-setup initdb
$ sudo systemctl enable postgresql-11
$ sudo systemctl start postgresql-11

Setup a shared repository between the hosts

To be able to share the backups between the hosts, we’ll create an NFS export from secondary and mount it on primary.

Install and activate nfs server on secondary:

$ sudo yum -y install nfs-utils
$ sudo systemctl enable nfs-server.service 
$ sudo systemctl start nfs-server.service 
$ sudo firewall-cmd --permanent --add-service=nfs
$ sudo firewall-cmd --reload

Create the backup repository and export it:

$ sudo mkdir /mnt/backups
$ sudo chown postgres: /mnt/backups/
$ sudo echo"/mnt/backups primary(rw,sync,no_root_squash)">> /etc/exports
$ sudo exportfs -a

Install nfs client and mount the shared repository on primary:

$ sudo yum -y install nfs-utils
$ sudo echo"secondary:/mnt/backups /mnt/backups nfs rw,sync,hard,intr 0 0">> /etc/fstab
$ sudo mount /mnt/backups/

The storage of your backups is completely up to you. The requirement here is to have that storage available on both servers.

If needed, you might even encrypt your repository. To do that, follow the documentation.
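As a rough sketch, repository encryption comes down to two extra settings in the [global] section of /etc/pgbackrest.conf, set before the stanza is created (the passphrase below is obviously just a placeholder):

[global]
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=a-long-random-passphrase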


Configure pgBackRest to backup the local cluster

By default, the configuration file is /etc/pgbackrest.conf. Let’s make a copy:

$ sudo cp /etc/pgbackrest.conf /etc/pgbackrest.conf.bck

Update the primary configuration:

[global]
repo1-path=/mnt/backups
repo1-retention-full=1
process-max=2
log-level-console=info
log-level-file=debug

[mycluster]
pg1-path=/var/lib/pgsql/11/data

Configure archiving in the postgresql.conf file:

archive_mode = on
archive_command = 'pgbackrest --stanza=mycluster archive-push %p'

The PostgreSQL cluster must be restarted after making these changes and before performing a backup.

Let’s finally create the stanza and check the configuration:

$ sudo -u postgres pgbackrest --stanza=mycluster stanza-create
P00   INFO: stanza-create command begin 2.07: 
			--log-level-console=info --log-level-file=debug  
			--pg1-path=/var/lib/pgsql/11/data --repo1-path=/mnt/backups 
			--stanza=mycluster
P00   INFO: stanza-create command end: completed successfully

$ sudo -u postgres pgbackrest --stanza=mycluster check
P00   INFO: check command begin 2.07: 
			--log-level-console=info --log-level-file=debug 
			--pg1-path=/var/lib/pgsql/11/data --repo1-path=/mnt/backups 
			--stanza=mycluster
P00   INFO: WAL segment 000000010000000000000001 successfully stored in the 
	archive at '/mnt/backups/archive/mycluster/11-1/0000000100000000/
	000000010000000000000001-ee7d07fc95b699231dac05d3b5c9f4b1dda22488.gz'
P00   INFO: check command end: completed successfully

Insert some test data in the database

Using pgbench, let’s create some test data:

$ sudo -iu postgres createdb test
$ sudo -iu postgres /usr/pgsql-11/bin/pgbench -i -s 100 test

Prepare the servers for Streaming Replication

On primary server, add to postgresql.conf:

listen_addresses = '*'

Create a specific user for the replication:

$ sudo -iu postgres psql
postgres=# CREATE ROLE replic_user WITH LOGIN REPLICATION PASSWORD 'mypwd';

Configure pg_hba.conf:

host replication replic_user secondary md5

Restart the cluster and allow the service in the firewall (if needed):

$ sudo systemctl restart postgresql-11.service
$ sudo firewall-cmd --permanent --add-service=postgresql
$ sudo firewall-cmd --reload

Configure ~postgres/.pgpass on secondary servers:

$ echo"*:*:replication:replic_user:mypwd">> ~postgres/.pgpass
$ chown postgres: ~postgres/.pgpass
$ chmod 0600 ~postgres/.pgpass

Perform a backup

Let’s take our first backup on the primary server:

$ sudo -u postgres pgbackrest --stanza=mycluster --type=full backup
P00   INFO: backup command begin 2.07: 
		--log-level-console=info --log-level-file=debug 
		--pg1-path=/var/lib/pgsql/11/data --process-max=2 
		--repo1-path=/mnt/backups --repo1-retention-full=1 
		--stanza=mycluster --type=full
P00   INFO: execute non-exclusive pg_start_backup() with label "...": 
		backup begins after the next regular checkpoint completes
P00   INFO: backup start archive = 000000010000000000000057, lsn = 0/57000028
...
P00   INFO: full backup size = 1.4GB
P00   INFO: execute non-exclusive pg_stop_backup() and wait for all WAL segments 
        to archive
P00   INFO: backup stop archive = 000000010000000000000057, lsn = 0/57000168
P00   INFO: new backup label = 20181127-152908F
P00   INFO: backup command end: completed successfully
P00   INFO: expire command begin
P00   INFO: expire command end: completed successfully

Secondary setup

Configure /etc/pgbackrest.conf:

[global]
repo1-path=/mnt/backups
repo1-retention-full=1
process-max=2
log-level-console=info
log-level-file=debug

[mycluster]
pg1-path=/var/lib/pgsql/11/data
recovery-option=standby_mode=on
recovery-option=primary_conninfo=host=primary user=replic_user
recovery-option=recovery_target_timeline=latest

Usage of the --delta option:

Restore or backup using checksums.

During a restore, by default the PostgreSQL data and tablespace directories are expected to be present but empty. This option performs a delta restore using checksums.

During a backup, this option will use checksums instead of the timestamps to determine if files will be copied.

We use this option here to avoid having to clean the data directory of the secondary server, which can be very helpful in case of huge volumes. If you kept several full backups for example, it could be interesting to use this option for the backups too. Like the other parameters, it can also be set in the configuration files.

Restore the backup taken from the primary server:

$ sudo -u postgres pgbackrest --stanza=mycluster --delta restore
P00   INFO: restore command begin 2.07: 
		--delta --log-level-console=info --log-level-file=debug 
		--pg1-path=/var/lib/pgsql/11/data --process-max=2 
		--recovery-option=standby_mode=on 
		--recovery-option="primary_conninfo=host=primary user=replic_user" 
		--recovery-option=recovery_target_timeline=latest 
		--repo1-path=/mnt/backups --stanza=mycluster
P00   INFO: restore backup set 20181127-152908F
P00   INFO: remove invalid files/paths/links from /var/lib/pgsql/11/data
...
P00   INFO: write /var/lib/pgsql/11/data/recovery.conf
P00   INFO: restore global/pg_control 
		(performed last to ensure aborted restores cannot be started)
P00   INFO: restore command end: completed successfully

Actually, the recovery-option parameters allow pgBackRest to configure the recovery.conf file:

$ cat /var/lib/pgsql/11/data/recovery.conf 
primary_conninfo = 'host=primary user=replic_user'
recovery_target_timeline = 'latest'
standby_mode = 'on'
restore_command ='pgbackrest --stanza=mycluster archive-get %f "%p"'

All we have to do now is to start the PostgreSQL cluster:

$ sudo systemctl enable postgresql-11
$ sudo systemctl start postgresql-11

If the replication setup is correct, you should see those processes on the secondary server:

# ps -ef | grep postgres
postgres 19610     1  ... /usr/pgsql-11/bin/postmaster -D /var/lib/pgsql/11/data/
postgres 19614 19610  ... postgres: startup   recovering 000000010000000000000058
postgres 19621 19610  ... postgres: walreceiver   streaming 0/58000140

We now have a 2-node cluster working with Streaming Replication and archive recovery as a safety net.


Take backups from the secondary server

pgBackRest can perform backups on a standby server instead of the primary. The configuration of both the primary and secondary databases is required, even though the majority of the files will be copied from the secondary to reduce the load on the primary.

To do so, adjust the /etc/pgbackrest.conf file on secondary:

[global]
repo1-path=/mnt/backups
repo1-retention-full=1
process-max=2
log-level-console=info
log-level-file=debug
backup-standby=y
delta=y

[mycluster]
pg1-host=primary
pg1-path=/var/lib/pgsql/11/data
pg2-path=/var/lib/pgsql/11/data
recovery-option=standby_mode=on
recovery-option=primary_conninfo=host=primary user=replic_user
recovery-option=recovery_target_timeline=latest

Options added are:

  • delta: to allow delta backup and restore without using --delta
  • backup-standby
  • pg1-host and pg1-path

Perform a backup from secondary:

$ sudo -u postgres pgbackrest --stanza=mycluster --type=full backup
P00   INFO: backup command begin 2.07: 
		--backup-standby --delta --log-level-console=info 
        --log-level-file=debug 
		--pg1-host=primary --pg1-path=/var/lib/pgsql/11/data 
		--pg2-path=/var/lib/pgsql/11/data --process-max=2 
		--repo1-path=/mnt/backups --repo1-retention-full=1 
		--stanza=mycluster --type=full
P00   INFO: execute non-exclusive pg_start_backup() with label "...": 
		backup begins after the next regular checkpoint completes
P00   INFO: backup start archive = 00000001000000000000005B, lsn = 0/5B000028
P00   INFO: wait for replay on the standby to reach 0/5B000028
P00   INFO: replay on the standby reached 0/5B0000D0, checkpoint 0/5B000060
...
P00   INFO: full backup size = 1.4GB
P00   INFO: execute non-exclusive pg_stop_backup() and wait for all WAL segments 
        to archive
P00   INFO: backup stop archive = 00000001000000000000005B, lsn = 0/5B000130
P00   INFO: new backup label = 20181127-164924F
P00   INFO: backup command end: completed successfully
P00   INFO: expire command begin
...
P00   INFO: expire command end: completed successfully

Even incremental backups can be taken:

$ sudo -u postgres pgbackrest --stanza=mycluster --type=incr backup
P00   INFO: backup command begin 2.07: 
		--backup-standby --delta --log-level-console=info 
        --log-level-file=debug 
		--pg1-host=primary --pg1-path=/var/lib/pgsql/11/data 
		--pg2-path=/var/lib/pgsql/11/data --process-max=2 
		--repo1-path=/mnt/backups --repo1-retention-full=1 
		--stanza=mycluster --type=incr
P00   INFO: last backup label = 20181127-164924F, version = 2.07
P00   INFO: execute non-exclusive pg_start_backup() with label "...": 
		backup begins after the next regular checkpoint completes
P00   INFO: backup start archive = 00000001000000000000005D, lsn = 0/5D000028
P00   INFO: wait for replay on the standby to reach 0/5D000028
P00   INFO: replay on the standby reached 0/5D0000D0, checkpoint 0/5D000060
...
P00   INFO: incr backup size = 1.4GB
P00   INFO: execute non-exclusive pg_stop_backup() and wait for all WAL segments 
        to archive
P00   INFO: backup stop archive = 00000001000000000000005D, lsn = 0/5D000130
P00   INFO: new backup label = 20181127-164924F_20181127-165743I
P00   INFO: backup command end: completed successfully
P00   INFO: expire command begin
P00   INFO: expire command end: completed successfully

$ sudo -u postgres pgbackrest --stanza=mycluster info
stanza: mycluster
    status: ok
    cipher: none

    db (current)
        wal archive min/max (11-1): 
            00000001000000000000005B / 00000001000000000000005D

        full backup: 20181127-164924F
            timestamp start/stop: 2018-11-27 16:49:24 / 2018-11-27 16:50:10
            wal start/stop: 00000001000000000000005B / 00000001000000000000005B
            database size: 1.4GB, backup size: 1.4GB
            repository size: 83.7MB, repository backup size: 83.7MB

        incr backup: 20181127-164924F_20181127-165743I
            timestamp start/stop: 2018-11-27 16:57:43 / 2018-11-27 16:57:54
            wal start/stop: 00000001000000000000005D / 00000001000000000000005D
            database size: 1.4GB, backup size: 50.5KB
            repository size: 83.7MB, repository backup size: 3.9KB
            backup reference list: 20181127-164924F

Conclusion

pgBackRest offers a lot of possibilities. We’ve seen in this post some tips to use in addition to Streaming Replication.

We’ve also seen in previous posts that changes can sometimes happen in PostgreSQL itself (e.g. the integration of recovery.conf into postgresql.conf).

Using a tool supported by the community rather than your own script will also help you keep compatibility with those changes.


Pavel Stehule: Orafce - simple thing that can help

I merged a small patch to the master branch of Orafce. It shows the breadth of PostgreSQL's possibilities and can decrease the work necessary for a migration from Oracle to Postgres.

One small/big difference between Oracle and other databases is the meaning of the empty string. There are a lot of situations where Oracle uses an empty string as NULL, and NULL as an empty string. I don't know of any other database that does this.

Orafce has the native types (not domain types) varchar2 and nvarchar2, which makes it possible to define custom operators for them. I implemented the || concat operator as NULL-safe for these types, so now it is possible to write:
postgres=# select null || 'xxx'::varchar2 || null;
┌──────────┐
│ ?column? │
╞══════════╡
│ xxx      │
└──────────┘
(1 row)

When you port an application from Oracle to Postgres, it is good to disallow empty strings in Postgres. One possible solution is the generic C trigger function replace_empty_string(). This trigger function can check any text-type field in stored rows and replace empty strings with NULLs. Of course, you should fix any check like colname = '' or colname <> '' in your application, and use only colname IS [NOT] NULL. Then the code will be the same on Oracle and PostgreSQL, and you can use automatic translation by ora2pg.
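As a hedged sketch of how such a trigger could be attached (the table t is made up, and the trigger function name is taken from this post as-is):

CREATE TABLE t (id integer, name text);

CREATE TRIGGER t_replace_empty_string
    BEFORE INSERT OR UPDATE ON t
    FOR EACH ROW
    EXECUTE PROCEDURE replace_empty_string();

-- an empty string written to "name" is stored as NULL
INSERT INTO t VALUES (1, '');
SELECT id, name IS NULL AS name_is_null FROM t;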

Hans-Juergen Schoenig: Transactions in PostgreSQL: READ COMMITTED vs. REPEATABLE READ


The ability to run transactions is the core of every modern relational database system. The idea behind a transaction is to allow users to control the way data is written to PostgreSQL. However, a transaction is not only about writing – it is also important to understand the implications on reading data for whatever purpose (OLTP, data warehousing, etc.).

Understanding transaction isolation levels

One important aspect of transactions in PostgreSQL and therefore in all other modern relational databases is the ability to control when a row is visible to a user and when it is not. The ANSI SQL standard proposes 4 transaction isolation levels (READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ and SERIALIZABLE) to allow users to explicitly control the behavior of the database engine. Unfortunately the existence of transaction isolation levels is still not as widely known as it should be, and therefore I decided to blog about this important topic to give more PostgreSQL users the chance to apply this very important, yet under-appreciated feature.

The two most commonly used transaction isolation levels are READ COMMITTED and REPEATABLE READ. In PostgreSQL READ COMMITTED is the default isolation level and should be used for normal OLTP operations. In contrast to other systems such as DB2 or Informix, PostgreSQL does not provide support for READ UNCOMMITTED, which I personally consider to be a thing of the past anyway.

What READ COMMITTED does

In READ COMMITTED mode, every SQL statement will see changes which have already been committed (e.g. new rows added to the database) by some other transactions. In other words: If you run the same SELECT statement multiple times within the same transaction, you might see different results. This is something you have to take into account when writing an application.

However, within a statement the data you see is constant – it does not change. A SELECT statement (or any other statement) will not see changes committed WHILE the statement is running. Within an SQL statement, data and time are basically “frozen”.

What REPEATABLE READ does

In the case of REPEATABLE READ the situation is quite different: A transaction running in REPEATABLE READ mode will never see the effects of transactions committing concurrently – it will keep seeing the same data and offer you a consistent snapshot throughout the entire transaction. If your goal is to do reporting or if you are running some kind of data warehousing workload, REPEATABLE READ is exactly what you need, because it provides consistency. All pages of your report will see exactly the same set of data. There is no need to worry about concurrent transactions.
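A minimal two-session sketch of the difference (the table t and its contents are made up):

-- session 1: take a snapshot under REPEATABLE READ
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM t;   -- say it returns 17

-- session 2: a concurrent transaction commits a new row
INSERT INTO t VALUES (18);

-- session 1: still sees 17, the snapshot is kept for the whole transaction
SELECT count(*) FROM t;
COMMIT;

-- under READ COMMITTED, the second SELECT in session 1 would already see 18 rows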

Transaction isolation in PostgreSQL visualized

Digging through a theoretical discussion might not be what you are looking for. So let us take a look at a picture showing graphically how things work:

PostgreSQL transaction isolation
READ COMMITTED vs. REPEATABLE READ in PostgreSQL

Let us assume that we have 17 rows in a table. In my example three transactions will happen concurrently. A READ COMMITTED, a REPEATABLE READ and a writing transaction. The write happens while our two reads execute their first SELECT statement. The important thing here: The data is not visible to concurrent transactions. This is a really important observation. The situation changes during the second SELECT. The REPEATABLE READ transaction will still see the same data, while the READ COMMITTED transaction will see the changed row count.

REPEATABLE READ is really important for reporting because it is the only way to get a consistent view of the data set even while it is being modified.

The post Transactions in PostgreSQL: READ COMMITTED vs. REPEATABLE READ appeared first on Cybertec.

Alexey Lesovsky: Global shortcuts and PostgreSQL queries.


Using your favorite hotkeys on queries in Linux

One of my colleagues often talks about using hot keys for his favourite SQL queries and commands in iterm2 (e.g. for checking current activity or to view lists of largest tables). 

Usually, I listen to this with only half an ear, because iterm2 is available only for MacOS and I am a committed Linux user. When this topic came up again, I thought this function could perhaps be achieved not only through iterm2 but also through an alternative tool or the desktop environment’s settings.
Being a long-time KDE user, I opened “System settings” and checked all settings related to keyboard, input, hotkeys and so on. What I found is a number of settings that allow emulating text input from the keyboard. Using this feature I configured bindings for my favourite queries. So now, to execute a query I don’t need to search for it in my query collection, copy and paste it… I just press a hotkey and the query appears in the active window, whether it is a psql console, a work chat, a text editor or something else.

Here is how these bindings are configured:
Open the “System settings” application and go to “Shortcuts”. There is a “Custom Shortcuts” menu. Here, optionally, we can create a dedicated group for our shortcuts. I named my group “PostgreSQL hot queries”. When creating a shortcut, select “Global shortcut” and then “Send Keyboard Input”.

Now, we need to set up a new shortcut and give it a name and description. And here is the most interesting part. From various Linux users I have sometimes heard that KDE must have been written by aliens, and that statement had never been completely clear to me until this moment, since I had never had serious issues with KDE. Now, after configuring the shortcuts, I tend to agree with this statement more and more.

Ok, here we go: next we should type the text of the query that should appear when the hotkey is pressed. The catch is that, instead of plain query text, you have to input alien sequences of symbols.

Check out the screenshot with an example of a query that shows current activity from pg_stat_activity.




Ok, when I typed in the first query, I realised that I might have to spend a whole day inputting the other queries. So I took sed and made a one-liner that reads the query text and replaces the most frequent sequences of symbols. But even this sed one-liner is not perfect and I had to fix some errors in the translated query.

sed -e "s/\([a-zA-Z0-9\'\,\.\/\=\-]\{1\}\)/\1:/g" -e 's/::/Shift+;:Shift+;:/g' -e 's/ /Space:/g' -e 's/(/Shift+9:/g' -e 's/\~/Shift+\`:/g' -e 's/)/Shift+0:/g' -e 's/_/Shift+-:/g' -e 's/\^/Shift+6:/g' -e 's/$/Enter:/g' -e 's/||/Shift+\\\:Shift+\\\:/g' -e 's/*/Shift+8:/g' -e 's/</Shift+,:/g' -e 's/>/Shift+.:/g' filename

Huh, it looks like avada kedavra, but it works in most cases, and with sed and the occasional tiny edit I managed to configure hotkeys for my favourite queries.

Of course, we can discuss the convenience of this shortcut configuration, but, unfortunately, that’s what we have available at the moment. Perhaps in future versions of KDE it will be possible to specify a source file with the query instead of typing mutant sequences of symbols.

What I thought may be useful is an example file with configured shortcuts that you can import instead of typing everything in (you should activate the shortcuts after import). But I’m pretty sure you have your own favourite queries that you will want to add, so be ready to spend some time configuring your own hotkeys. If you use my settings you may need to change the hotkey combinations to your preferences (I used Win+1..0 combinations).

I hope this KDE feature will be useful and convenient for you.

Viorel Tabara: What's New in PostgreSQL 11


PostgreSQL 11 was released on October 18th, 2018, on schedule, marking the 23rd anniversary of the increasingly popular open source database.

While a complete list of changes is available in the usual Release Notes, it is worth checking out the revamped Feature Matrix page, which, just like the official documentation, has received a makeover since its first version, making it easier to spot changes before diving into the details.

For example, on the Release Notes page, “Channel binding for SCRAM authentication” is buried under Source Code, while the matrix has it under the Security section. For the curious, here’s a screenshot of the interface:

PostgreSQL Feature Matrix
PostgreSQL Feature Matrix

Additionally, the Bucardo Postgres Release Notes page linked above is handy in its own way, making it easy to search for a keyword across all versions.

What’s New? With literally hundreds of changes, I will go through the differences listed in the Feature Matrix.

Covering Indexes for B-trees (INCLUDE)

CREATE INDEX received the INCLUDE clause which allows indexes to include non-key columns. Its use case for frequent identical queries, is well described in Tom Lane’s commit from November 22nd, which updates the development documentation (meaning that the current PostgreSQL 11 documentation doesn’t have it yet), so for the full text refer to section 11.9. Index-Only Scans and Covering Indexes in the development version.
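As a quick sketch (table and column names are invented here), the non-key column travels inside the index so the query below can be answered by an index-only scan:

CREATE TABLE orders (
    order_id    bigint PRIMARY KEY,
    customer_id bigint,
    total       numeric
);

-- customer_id is the search key; total is stored as a non-key payload column
CREATE INDEX orders_customer_incl_total
    ON orders (customer_id) INCLUDE (total);

-- may be satisfied by an index-only scan, without touching the heap
SELECT customer_id, total FROM orders WHERE customer_id = 42;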

Parallelized CREATE INDEX for B-tree Indexes

As alluded to in the name, this feature is only implemented for B-tree indexes, and from Robert Haas’ commit log we learn that the implementation may be refined in the future. As noted in the CREATE INDEX documentation, while both parallel and concurrent index creation methods take advantage of multiple CPUs, in the case of CONCURRENT only the first table scan will be performed in parallel.

Related to this new feature are the configuration parameters maintenance_work_mem and max_parallel_maintenance_workers.

Lastly, the number of parallel workers can be set per table using the ALTER TABLE command and specifying a value for parallel_workers.
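A small sketch, reusing the hypothetical orders table from above:

-- raise the per-table limit on parallel workers
ALTER TABLE orders SET (parallel_workers = 4);

-- the B-tree build may now use up to 4 parallel workers
-- (also bounded by max_parallel_maintenance_workers and maintenance_work_mem)
CREATE INDEX orders_total_idx ON orders (total);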


Just-In-Time (JIT) Compilation for Expression Evaluation and Tuple Deforming

With its own JIT chapter in the documentation, this new feature relies on PostgreSQL being compiled with LLVM support (use pg_config to verify).

The topic of JIT in PostgreSQL is complex enough (see the JIT README reference in the documentation) to require a dedicated blog, in the meantime, the CitusData blog on JIT is a very good read for those interested to dive deeper into the subject.
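A hedged way to see it from psql (using the hypothetical orders table again; whether JIT actually kicks in also depends on cost thresholds such as jit_above_cost):

SHOW jit;                 -- is JIT allowed for this session?
SET jit = on;
EXPLAIN (ANALYZE) SELECT sum(total) FROM orders;
-- when JIT was used, the plan output ends with a "JIT" section
-- listing the functions generated and the compilation timings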

Parallelized Hash Joins

This performance improvement to parallel queries is the result of adding a shared hash table, which, as Thomas Munro explains in his Parallel Hash for PostgreSQL blog, avoids partitioning the hash table provided that it fits in work_mem; thus far for PostgreSQL this appears to be a better solution than the partition-first algorithm. The same blog describes the PostgreSQL architecture obstacles the author had to overcome in his quest to add parallelism to hash joins, which speaks to the complexity of the work required to implement this feature.

Default Partition

This is a catch-all partition to store rows that do not match any other defined partition. When a new partition is added, a CHECK constraint on the default partition is recommended in order to avoid a scan of the default partition, which can be slow when it contains a large number of rows.

The default partition behavior is explained in the documentation of ALTER TABLE and CREATE TABLE.
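A minimal sketch with an invented measurements table:

CREATE TABLE measurements (
    logdate date,
    value   numeric
) PARTITION BY RANGE (logdate);

CREATE TABLE measurements_2018 PARTITION OF measurements
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

-- rows that match no other partition land here instead of failing
CREATE TABLE measurements_default PARTITION OF measurements DEFAULT;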

Partitioning by a Hash Key

Also called hash partitioning, and as pointed out in the commit message, the feature allows partitioning of tables in such a way that partitions will hold a similar number of rows. This is achieved by providing a modulus, which in the simplest scenario is recommended to be equal to the number of partitions, and a remainder that is different for each partition.

For more details and an example see the CREATE TABLE documentation page.
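A short sketch (the accounts table is hypothetical); each partition receives the rows whose hashed key leaves the given remainder:

CREATE TABLE accounts (
    account_id bigint,
    owner      text
) PARTITION BY HASH (account_id);

CREATE TABLE accounts_p0 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE accounts_p1 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE accounts_p2 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE accounts_p3 PARTITION OF accounts FOR VALUES WITH (MODULUS 4, REMAINDER 3);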

Support for PRIMARY KEY, FOREIGN KEY, Indexes, and Triggers on Partitioned Tables

Table partitioning is already a big step in improving performance of large tables, and the addition of these features addresses the limitations that partitioned tables have had since PostgreSQL 10 when the modern-style “declarative partitioning” was introduced.

Work by Alvaro Herrera is underway to allow foreign keys to reference partitioned tables, and is scheduled for the next PostgreSQL major version, 12.

UPDATE on a Partition Key

As explained in the patch commit log, this change prevents PostgreSQL from throwing an error when an update to the partition key moves a row out of its partition; instead the row is moved to the appropriate partition.
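Continuing the hypothetical measurements example from above:

-- before PostgreSQL 11 this raised an error; now the row is deleted from
-- measurements_2018 and inserted into the partition it belongs to
UPDATE measurements SET logdate = '2019-06-01' WHERE logdate = '2018-06-01';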

Channel Binding for SCRAM Authentication

This is a security measure aimed at preventing man-in-the-middle attacks in SASL authentication and is thoroughly detailed in the author’s blog. The feature requires a minimum of OpenSSL 1.0.2.

CREATE PROCEDURE and CALL Syntax for SQL Stored Procedures

PostgreSQL has had CREATE FUNCTION since 1996 (version 1.0.1); however, functions cannot handle transactions. As mentioned in the documentation, the new CREATE PROCEDURE command is not fully compatible with the SQL standard.
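A small sketch of the new syntax and of in-procedure transaction control (the table t is made up):

CREATE PROCEDURE insert_in_batches()
LANGUAGE plpgsql
AS $$
BEGIN
    FOR i IN 1..10 LOOP
        INSERT INTO t VALUES (i);
        COMMIT;   -- allowed in a procedure, not in a function
    END LOOP;
END;
$$;

CALL insert_in_batches();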

Note: Stay tuned for an upcoming blog which deep dives into this feature

Conclusion

PostgreSQL 11 major updates focus on performance improvements through parallel execution, partitioning and Just-In-Time compilation. Stored procedures allow for full transaction control and can be written in a variety of PL languages.

Bruce Momjian: Data Storage Options


Since I have spent three decades working with relational databases, you might think I believe all storage requires relational storage. However, I have used enough non-relational data stores to know that each type of storage has its own benefits and costs.

It is often complicated to know which data store to use for your data. Let's look at the different storage levels, from simplest to most complex:

  1. Flat files: Flat files are exactly what they sound like — an unstructured stream of bytes stored in a file. There is no structure defined in the file itself — structure must be implemented in the application that reads the file. This is obviously the simplest way to store data, and works well for small data volumes when only a single user and a few well-coordinated applications need access. File locking is required to serialize multi-user access. Changes typically require a complete file rewrite.
  2. Word processing documents: This is similar to flat files, but defines structure in the data file, e.g., highlighting, sections. The same flat file limitations apply.
  3. Spreadsheet: This is similar to word processing documents, but adds more capabilities, including computations and the definition of relationships between data elements. Data is more atomized in this format than in the previous one.
  4. NoSQL stores: This removes many of the limitations from previous data stores. Multi-user access is supported, including locking, and modification of single data elements does not require rewriting all data.
  5. Relational databases: This is the most complex data storage option. Rigid structure is enforced internally, though unstructured options exist. Data access occurs using a declarative language that is dynamically optimized based on that structure. Multi-user and multi-application access is efficient.

You might think that since relational databases have the most features, everything should use it. However, with features come complexity and rigidity. Therefore, all levels are valid for some use cases:

  • Flat files are ideal for read-only, single-user, and single-application access, e.g., configuration files. Modifications are easy using a text editor.
  • How many organizations do you know that use word processing documents or spreadsheets to manage their organizations? This is often used at the departmental level when only a few people must access or modify the data.
  • NoSQL became popular among application developers who didn't want the rigidity of relational systems, and wanted simpler multi-node deployment.
  • Relational databases still win for multi-user, multi-application workloads.


Pavel Stehule: plpgsql_check can detect bad default volatility flag

A common performance problem with plpgsql functions, when they are used from more complex queries, is the default VOLATILE flag. With it, more aggressive optimization of the function call is not possible. plpgsql_check can now detect this issue:

CREATE OR REPLACE FUNCTION public.flag_test1(integer)
RETURNS integer
LANGUAGE plpgsql
STABLE
AS $function$
begin
return $1 + 10;
end;
$function$;

CREATE OR REPLACE FUNCTION public.flag_test2(integer)
RETURNS integer
LANGUAGE plpgsql
VOLATILE
AS $function$
begin
return (select * from fufu where a = $1 limit 1);
end;
$function$;

postgres=# select * from plpgsql_check_function('flag_test1(int)', performance_warnings => true);
┌────────────────────────────────────────────────────────────────────┐
│ plpgsql_check_function │
╞════════════════════════════════════════════════════════════════════╡
│ performance:00000:routine is marked as STABLE, should be IMMUTABLE │
└────────────────────────────────────────────────────────────────────┘
(1 row)

postgres=# select * from plpgsql_check_function('flag_test2(int)', performance_warnings => true);
┌──────────────────────────────────────────────────────────────────────┐
│ plpgsql_check_function │
╞══════════════════════════════════════════════════════════════════════╡
│ performance:00000:routine is marked as VOLATILE, should be STABLE │
└──────────────────────────────────────────────────────────────────────┘
(1 row)

Michael Paquier: Postgres 12 highlight - DOS prevention


A couple of months ago a thread began on the PostgreSQL community mailing lists about a set of problems where it is possible to lock PostgreSQL out of new connections just by running a set of queries as any user; having an open connection to the cluster is enough to cause a denial of service.

For example, in one session do the following by scanning pg_stat_activity in a transaction with any user:

BEGIN;
SELECT count(*) FROM pg_stat_activity;

This has the particularity of taking an access share lock on the system catalog pg_authid, which is a critical catalog used for authentication. Then, with a second session and the same user, run for example VACUUM FULL on pg_authid, like this:

VACUUM FULL pg_authid;

This user is not an owner of the relation, so VACUUM will fail. However, at this stage the second session will be stuck until the first session commits, because it first attempts to take a lock on the relation, and VACUUM FULL takes an exclusive lock, which prevents anything from reading or writing the table. Hence, in this particular case, as pg_authid is used for authentication, no new connections can be made to the instance until the transaction of the first session has committed.

As the thread continued, more commands have been mentioned as having the same kind of issues:

  • As mentioned above, VACUUM FULL is a pattern. In this case, queuing for a lock on a relation for which an operation will fail should not happen. This takes an exclusive lock on the relation.
  • TRUNCATE, for reasons similar to VACUUM FULL.
  • REINDEX on a database or a schema.

The first two cases have been fixed for PostgreSQL 12, with commit a556549 for VACUUM and commit f841ceb for TRUNCATE. Note that similar work was done a couple of years ago with, for example, CLUSTER in commit cbe24a6. In all those cases, the root of the problem is to make sure that the user has the right to take a lock on a relation before attempting to lock it, so this basically required a bit of refactoring so that the code involved makes use of RangeVarGetRelidExtended(), which has a custom callback to do the necessary ownership and/or permission checks beforehand. All this infrastructure has been present in PostgreSQL for a couple of years, added via commit 2ad36c4. Still, getting the patches into the right shape required some thought, as the changes should remain backward-compatible (for example with VACUUM, a non-authorized attempt does not result in an error, but in a warning), and things got a bit trickier with the addition of partitioned tables in Postgres 10.

The case of REINDEX, fixed by commit 661dd23, is a bit more exotic as the root issue is different. A user can run REINDEX SCHEMA/DATABASE on, respectively, a schema or a database they own. The interesting fact is that shared catalogs (like pg_authid) would be included in the list of what gets reindexed even if the user owning the schema/database does not own those shared catalogs, causing a lock conflict. In this case, the fix has been to tighten REINDEX a bit so that shared catalogs don’t get reindexed if the user is not their owner. This patch found its way into PostgreSQL 11 and above, and required a behavior change.

Fixing all those issues would not have been possible without a lot of individuals: first Robert Haas, Alvaro Herrera and Noah Misch, who worked on the infrastructure to improve lock-queueing behavior a couple of years ago, and then the several folks who have spent time arguing over and reviewing the different patches proposed for the three cases mentioned in this post, mainly Nathan Bossart, Kyotaro Horiguchi and more. If more commands can be improved in this area, feel free to report them, for example by referring to the bug report guidelines, and we will get them patched up and improved.


Marco Slot: Why the RDBMS is the future of distributed databases, ft. Postgres and Citus


Around 10 years ago I joined Amazon Web Services and that’s where I first saw the importance of trade-offs in distributed systems. In university I had already learned about the trade-offs between consistency and availability (the CAP theorem), but in practice the spectrum goes a lot deeper than that. Any design decision may involve trade-offs between latency, concurrency, scalability, durability, maintainability, functionality, operational simplicity, and other aspects of the system—and those trade-offs have meaningful impact on the features and user experience of the application, and even on the effectiveness of the business itself.

Perhaps the most challenging problem in distributed systems, in which the need for trade-offs is most apparent, is building a distributed database. When applications began to require databases that could scale across many servers, database developers began to make extreme trade-offs. In order to achieve scalability over many nodes, distributed key-value stores (NoSQL) put aside the rich feature set offered by the traditional relational database management systems (RDBMS), including SQL, joins, foreign keys, and ACID guarantees. Since everyone wants scalability, it would only be a matter of time before the RDBMS would disappear, right? Actually, relational databases have continued to dominate the database landscape. And here’s why:

The most important aspect to consider when making trade-offs in a distributed system (or any system) is development cost.

The trade-offs made by your database software will have significant impact on the development cost of your application. Handling data in an advanced application that needs to be usable, reliable, and performant is a problem that is inherently complex to solve. The number of man hours required to successfully address every little subproblem can be enormous. Fortunately, a database can take care of many of these subproblems, but database developers face the cost problem as well. It actually takes many decades to build the functionality, guarantees, and performance that make a database good enough for most applications. That’s where time and again, established relational databases like PostgreSQL and MySQL come in.

At Citus Data, we’ve taken a different angle to addressing the need for database scalability. My team and I spent the past several years transforming an established RDBMS into a distributed database, without losing its powerful capabilities or forking from the underlying project. In doing so we found that an RDBMS is the perfect foundation for building a distributed database.

Lowering application development costs by building on top of an RDBMS

What makes an RDBMS so attractive for developing an application—especially an open source RDBMS, and especially a cloud RDBMS—is that you can effectively leverage the engineering investments that have been made into the RDBMS over a period of decades and utilise these RDBMS capabilities in your app, lowering your development cost.

An RDBMS provides you with:

  1. meaningful guarantees around data integrity and durability
  2. enormous flexibility to manipulate and query your data
  3. state-of-the-art algorithms and data structures for getting high performance under diverse workloads.

These capabilities matter to almost any non-trivial application, but take a long time to develop. On the other hand, some applications have a workload that is too demanding for a single machine and therefore require horizontal scalability.

Many new distributed databases are being developed and are implementing RDBMS functionality like SQL on top of distributed key-value stores (“NewSQL”). While these newer databases can use the resources of multiple machines, they are still nowhere near the established relational database systems in terms of SQL support, query performance, concurrency, indexing, foreign keys, transactions, stored procedures, etc. Using such a database leaves you with many complex problems to solve in your application.

An alternative that has been employed by many big Internet companies is manual, application-layer sharding of an RDBMS—typically PostgreSQL or MySQL. Manual sharding means that there are many RDBMS nodes and the application decides which node to connect to based on some condition (e.g. the user ID). The application itself is responsible for how to handle data placement, schema changes, querying multiple nodes, replicating tables, etc—so if you’re doing manual sharding, you end up implementing your own distributed database within your application, which is perhaps even more costly.

Fortunately, there is a way around the development cost conundrum.

PostgreSQL has been under development for several decades, with an incredible focus on code quality, modularity, and extensibility. This extensibility offers a unique opportunity: to transform PostgreSQL into a distributed database, without forking it. That’s how we built Citus.

Citus: Becoming the world’s most advanced distributed database

When I joined a start-up called Citus Data around 5 years ago, I was humbled by the challenge of building an advanced distributed database in a competitive market without any existing infrastructure, brand recognition, go to market, capital, or large pool of engineers. The development cost alone seemed like it would be insurmountable. But in the same way that application developers leverage PostgreSQL to build a complex application, we leveraged PostgreSQL to build… distributed PostgreSQL.

Instead of creating a distributed database from scratch, we created Citus, an open source PostgreSQL extension that transparently distributes tables and queries in a way that provides horizontal scale, but with all the PostgreSQL features that application developers need to be successful.

Using internal hooks that Postgres calls when planning a query, we were able to add the concept of a distributed table to Postgres.

Distributing Queries the Citus Way
Citus is loaded into PostgreSQL and hooks into the query planner

The shards of a distributed table are stored in regular PostgreSQL nodes with all their existing capabilities, and Citus sends regular SQL commands to query the shards and then combines the results. We also added the concept of a reference table, which is replicated across all nodes and can therefore be joined with the distributed tables by any column. By further adding support for distributed transactions, query routing, distributed subqueries and CTEs, sequences, upserts and more, we got to a point where most advanced PostgreSQL features just work, but now at scale.

Citus distributed database
The Citus distributed database architecture
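As a hedged illustration (the table names are invented), the user-facing side of this is a pair of Citus functions:

-- shard the table across worker nodes by tenant_id
CREATE TABLE events (tenant_id bigint, event_id bigint, payload jsonb);
SELECT create_distributed_table('events', 'tenant_id');

-- replicate a small table to every node so it can be joined on any column
CREATE TABLE plans (plan_id int PRIMARY KEY, name text);
SELECT create_reference_table('plans');

-- ordinary SQL now runs transparently against the shards
SELECT tenant_id, count(*) FROM events GROUP BY tenant_id;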

Citus is still relatively young but is already one of the most advanced distributed databases in the world by building on top of PostgreSQL. While comparing to the full feature set of PostgreSQL is humbling and there is still a lot to do, the capabilities that Citus provides today and the way it scales out, make it largely unique in the distributed database landscape. Many current Citus users initially built their business on a single node PostgreSQL server, using many of the advanced capabilities in Postgres, and then migrated to Citus with only a few weeks of development effort to convert their database schema to distributed and reference tables. With any other database, such a migration from a single-node database to a distributed database could have taken months—or even years.

Turning Postgres features into superpowers using Citus

An RDBMS like PostgreSQL has an almost endless array of features and a mature SQL engine, which lets you query your data in infinitely many ways. Of course, these features are only useful to an application if they are also fast. Fortunately, PostgreSQL is fast and keeps getting faster with new features such as just-in-time query compilation, but when you have so much data or traffic that a single machine is too slow then those powerful features aren’t quite as useful anymore… unless you can combine the computing power of many machines. That’s where features become superpowers.

By taking PostgreSQL features and scaling them out, Citus has a number of superpowers that enables users to grow their database to any size while maintaining high performance and all its functionality.

  • Query routing means taking (part of) the query and letting the RDBMS node that stores the relevant shards handle the query instead of gathering or reshuffling the intermediate results, which is possible when the query filters and joins by the distribution column. Query routing allows Citus to support all the SQL functionality of the underlying PostgreSQL servers at scale for multi-tenant (SaaS) applications, which typically filter by tenant ID. This approach has minimal overhead in terms of distributed query planning time and network traffic, which enables high concurrency and low latency.

  • Parallel, distributed SELECT across all shards in a distributed table allows you to query large amounts of data in a fraction of the time compared to sequential execution, which means that you can build applications that have consistent response times even as your data and number of customers grow by scaling out your database. The Citus query planner converts SELECT queries that read data from multiple shards into one or more map-reduce-like steps, where each shard is queried in parallel (map) and then the results are merged or reshuffled (reduce). For linear scale, most work should be done in the map steps, which is typically the case for queries that join or group by the distribution column.

  • Joins are an essential part of SQL for two reasons: 1) they give enormous flexibility to query your data in different ways, allowing you to avoid complex data processing logic in your application, 2) they allow you to make your data representation a lot more compact. Without joins, you need to store a lot of redundant information in every row, which drastically increases the amount of hardware you need to store or scan the table, or keep it in memory. With joins, you can store a compact opaque ID and do advanced filtering without having to read all the data.

  • Reference tables appear like any other table but they are transparently replicated across all the nodes in the cluster. In a typical star schema all your dimension tables would be reference tables and your facts table a distributed table. The facts table can then be joined (in parallel!) with any of the dimension tables on any column without moving any data over the network. In multi-tenant applications, reference tables can be used to hold data that is shared among tenants.

  • Subquery pushdown is the marriage between parallel, distributed SELECT, query routing and joins. Queries across all shards that include advanced subquery trees (e.g. joins between subqueries) can be parallelised in a single round through subquery pushdown as long as they join all distributed tables on the distribution column (while reference tables can be joined on any column). This enables very advanced analytical queries that still exhibit linear scalability. Citus can recognise pushdownable subqueries by leveraging the transformations that the PostgreSQL planner already does for all queries, and generate separate plans for all remaining subqueries. This allows all kinds of subqueries and CTEs to be efficiently distributed.

  • Indexes are like the legs of a table. Without them, it’s going to take a lot of effort to get things from the table, and it’s not really a table. PostgreSQL in particular provides very powerful indexing features such as partial indexes, expression indexes, GIN, GiST, BRIN, and covering indexes. This allows queries (including joins!) to remain fast even at massive scale. It’s worth remembering that an index lookup is usually faster than a thousand cores scanning through the data. Citus supports all PostgreSQL index types by indexing individual shards. Power users like Heap and Microsoft especially like to use partial indexes to handle diverse queries over many different event types.

  • Distributed transactions allow you to make a set of changes at once or not at all, which greatly adds to the reliability of your application. Citus can delegate transactions to PostgreSQL nodes using a method that is similar to query pushdown, inheriting its ACID properties. For transactions across shards, Citus uses PostgreSQL’s built-in 2PC machinery and adds a distributed deadlock detector that uses internal PostgreSQL functions to get lock tables from all the nodes.

  • Parallel, distributed DML allows very large amounts of data to be transformed and maintained in relatively little time and in a transactional manner. A common application of distributed DML is an INSERT…SELECT command which aggregates rows from a raw data table into a rollup table. By parallelising the INSERT…SELECT over all available cores, the database will always be fast enough to aggregate the incoming events, while the application (e.g. dashboard) queries the compact, heavily indexed rollup table. Another example is a Citus user who ingested 26 billion rows of bad data and fixed it using distributed updates that modified over 700k rows/sec on average.

  • Bulk loading is an essential feature for applications that analyse large volumes of data. Even on a single node, PostgreSQL’s COPY command can append hundreds of thousands of rows per second to a table, which already beats most distributed database benchmarks. Citus can fan out the COPY stream to append and index many rows in parallel across many PostgreSQL servers, which scales to millions of rows per second.

  • Stored procedures and functions (incl. triggers) provide a programming environment inside your database for implementing business logic that cannot be captured by a single SQL query. The ability to program your database is especially beneficial when you need a set of operations to be transactional, but you have no need to go back-and-forth between your application server and the database. Using a stored procedure simplifies your app and makes the database more efficient since you avoid keeping transactions open while making network round-trips. While it may put slightly more load on your database, this becomes much less of an issue when your database scales.

While most of these features seem essential to develop a complex application which needs to scale, far from all distributed databases support them. Below we give a comparison of some popular distributed databases based on publicly available documentation.

Comparison table

Let our powers combine…

What’s even more important than having superpowers in your distributed database is being able to combine your database superpowers in order to solve a complex use case.

Captain Planet
Image courtesy of Turner Broadcasting. © Turner Broadcasting. Captain Planet created by Barbara Pyle and Ted Turner.

Support for query routing, reference tables, indexes, distributed transactions and stored procedures makes even the most advanced multi-tenant OLTP apps such as Copper able to scale beyond a single PostgreSQL node using Citus without making any sacrifices in their application.

If you use subquery pushdown in combination with parallel, distributed DML, then you can transform large volumes of data inside the database. A common example is building a rollup table using INSERT…SELECT, which can be parallelised to keep up with any kind of data volume. Combined with bulk loading through COPY, indexes, joins, and partitioning, you have a database that is supremely suitable for time series data and real-time analytics applications like the Algolia dashboard at scale.

As Min Wei from Microsoft pointed out in his talk about how Microsoft uses Citus and PostgreSQL to analyse Windows data: Citus enables you to solve large scale OLAP problems using distributed OLTP.

GoodOldSQL

Citus is a little different from other distributed databases, which are often developed from scratch. Citus does not introduce any functionality that wasn’t already in PostgreSQL. The Citus database makes existing functionality scale in a way that satisfies use cases that require scale. What’s important is that most of the PostgreSQL functionality has been developed and battle-tested over decades for a wide range of use cases, and the functional requirements for today’s use cases are ultimately not that different; it’s primarily the scale and size of data that’s different. That is why a distributed database like Citus, that’s built on top of the world’s most advanced open source RDBMS (PostgreSQL!), can be the most powerful tool in your arsenal when building a modern application.

Liaqat Andrabi: Webinar : Introduction to OmniDB [Follow Up]


A database management tool that simplifies what is complex and drives performance. OmniDB is one such tool with which you can connect to several different databases – including PostgreSQL, Oracle, MySQL and others.

2ndQuadrant recently hosted a webinar on this very topic: Introduction to OmniDB. The webinar was presented by OmniDB co-founders and PostgreSQL consultants at 2ndQuadrant, Rafael Castro & William Ivanski.

The recording of the webinar is now available here.

Questions that Rafael and William couldn’t respond to during the live webinar have been answered below.

Q1: There are other open source GUI tools around to manage PostgreSQL. Why are you investing efforts on a new tool?

A1: When OmniDB was created we wanted a web tool, and not all available tools offered this architecture. Also, as advanced SQL developers, we wanted fewer forms and more SQL templates. Finally, we also wanted the freedom to develop features that don’t exist in other tools, or existed but were unmaintained or not up-to-date, such as customizable monitoring dashboard, console tab and the debugger which now supports PG 11 procedures.

Q2: Currently it is not possible to import data from a file into a database. Do you plan to implement such feature?

A2: Yes, we will implement this soon. There will be an interface for the user to upload and configure data to be imported, and also in the Console Tab there will be a new \copy command.

Q3: Is it possible to view the query plan ?

A3: Yes, it is possible to view the query plan using the magnifying glass icons in the Query Tab. The first one will do an EXPLAIN, and the second an EXPLAIN ANALYZE. The output can be seen as a list or as a tree.

Q4: Is it possible to pass parameters in the EXPLAIN command ?

A4: You can always manually execute EXPLAIN with any parameters that you need. However, the graphical component to view the plan only allows EXPLAIN or EXPLAIN ANALYZE. We will investigate the possibility to make the EXPLAIN command customizable for the graphical component.

For any questions, comments, or feedback, please visit our website or send an email to webinar@2ndquadrant.com

Bruce Momjian: Extensibility


Extensibility was built into Postgres from its creation. In the early years, extensibility was often overlooked and made Postgres server programming harder. However, in the last 15 years, extensibility allowed Postgres to adapt to modern workloads at an amazing pace. The non-relational data storage options mentioned in this presentation would not have been possible without Postgres's extensibility.

Bruce Momjian: Views vs. Materialized Views


Views and materialized views are closely related. Views effectively run the view query on every access, while materialized views store the query output in a table and reuse the results on every materialized view reference, until the materialized view is refreshed. This cache effect becomes even more significant when the underlying query or tables are slow, such as analytics queries and foreign data wrapper tables. You can think of materialized views as cached views.
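A minimal sketch of the difference (the orders table and its columns are made up):

-- a plain view: the query below runs again on every reference to the view
CREATE VIEW recent_orders AS
    SELECT * FROM orders WHERE created_at > now() - interval '1 day';

-- a materialized view: the result is stored and reused until refreshed
CREATE MATERIALIZED VIEW daily_totals AS
    SELECT date_trunc('day', created_at) AS day, sum(total) AS total
    FROM orders
    GROUP BY 1;

-- re-run the underlying query and replace the cached result
REFRESH MATERIALIZED VIEW daily_totals;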

Tatsuo Ishii: log_client_messages in Pgpool-II 4.0

Pgpool-II 4.0 adds a new logging feature called "log_client_messages". It allows logging of messages coming from the frontend. Up to 3.7, the only way to log frontend messages was to enable the debug log, which produced a tremendous amount of output.
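The parameter lives in pgpool.conf; a minimal sketch of turning it on (assuming default configuration paths, and that a reload is enough for logging parameters):

# pgpool.conf
log_client_messages = on

$ pgpool reload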

For example, with log_client_messages enabled, "pgbench -S -M prepared -t 2" produces the frontend log below:

2018-12-04 16:43:45: pid 6522: LOG:  Parse message from frontend.
2018-12-04 16:43:45: pid 6522: DETAIL:  statement: "P0_1", query: "SELECT abalance FROM pgbench_accounts WHERE aid = $1;"
2018-12-04 16:43:45: pid 6522: LOG:  Sync message from frontend.
2018-12-04 16:43:45: pid 6522: LOG:  Bind message from frontend.
2018-12-04 16:43:45: pid 6522: DETAIL:  portal: "", statement: "P0_1"
2018-12-04 16:43:45: pid 6522: LOG:  Describe message from frontend.
2018-12-04 16:43:45: pid 6522: DETAIL:  portal: ""
2018-12-04 16:43:45: pid 6522: LOG:  Execute message from frontend.
2018-12-04 16:43:45: pid 6522: DETAIL:  portal: ""
2018-12-04 16:43:45: pid 6522: LOG:  Sync message from frontend.


As you can see, pgbench sends the query "SELECT abalance FROM pgbench_accounts WHERE aid = $1;" using the prepared statement "P0_1", then a Bind message to bind the parameter value to "$1".
It then sends a Describe message to obtain metadata, and finally an Execute message to run the query.

Below is the second execution of the query (remember that we added the "-t 2" parameter to execute 2 transactions).

2018-12-04 16:43:45: pid 6522: LOG:  Bind message from frontend.
2018-12-04 16:43:45: pid 6522: DETAIL:  portal: "", statement: "P0_1"
2018-12-04 16:43:45: pid 6522: LOG:  Describe message from frontend.
2018-12-04 16:43:45: pid 6522: DETAIL:  portal: ""
2018-12-04 16:43:45: pid 6522: LOG:  Execute message from frontend.
2018-12-04 16:43:45: pid 6522: DETAIL:  portal: ""
2018-12-04 16:43:45: pid 6522: LOG:  Sync message from frontend.
2018-12-04 16:43:45: pid 6522: LOG:  Terminate message from frontend.


This time no Parse message is sent because pgbench reuses the named prepared statement "P0_1", which eliminates the parse/analysis step. That is why pgbench runs very fast in this mode compared with the other modes.

In summary, log_client_messages is useful when you want detailed information about what a client is doing.

Alexey Lesovsky: Why avoid long transactions?

The majority of the PostgreSQL community clearly understands why long and idle transactions are “bad”. But when you explain it to newcomers, it’s always a good idea to back up your explanation with some real tests.

While preparing slides for my presentation about vacuum, I made a simple test case with a long transaction using pgbench. Here are the results.

pgbench -c8 -P 60 -T 3600 -U postgres pgbench
starting vacuum...end.
progress: 60.0 s, 9506.3 tps, lat 0.841 ms stddev 0.390
progress: 120.0 s, 5262.1 tps, lat 1.520 ms stddev 0.517
progress: 180.0 s, 3801.8 tps, lat 2.104 ms stddev 0.757
progress: 240.0 s, 2960.0 tps, lat 2.703 ms stddev 0.830
progress: 300.0 s, 2575.8 tps, lat 3.106 ms stddev 0.891


in the end

progress: 3300.0 s, 759.5 tps, lat 10.533 ms stddev 2.554
progress: 3360.0 s, 751.8 tps, lat 10.642 ms stddev 2.604
progress: 3420.0 s, 743.6 tps, lat 10.759 ms stddev 2.655
progress: 3480.0 s, 739.1 tps, lat 10.824 ms stddev 2.662
progress: 3540.0 s, 742.5 tps, lat 10.774 ms stddev 2.579
progress: 3600.0 s, 868.2 tps, lat 9.215 ms stddev 2.569


This is a standard TPC-B pgbench test, running on a small database which completely resides in shared buffers (it removes disk IO influences).

As you can see, the performance measured in transactions per second dropped during the first few minutes of the test and continued to degrade.

Look at the statistics from the vacuum logs:

tuples: 0 removed, 692428 remain, 691693 are dead but not yet removable, oldest xmin: 62109160
tuples: 0 removed, 984009 remain, 983855 are dead but not yet removable, oldest xmin: 62109160
tuples: 0 removed, 1176821 remain, 1176821 are dead but not yet removable, oldest xmin: 62109160
tuples: 0 removed, 1494122 remain, 1494122 are dead but not yet removable, oldest xmin: 62109160
tuples: 0 removed, 2022284 remain, 2022284 are dead but not yet removable, oldest xmin: 62109160
tuples: 0 removed, 2756298 remain, 2756153 are dead but not yet removable, oldest xmin: 62109160
tuples: 0 removed, 3500913 remain, 3500693 are dead but not yet removable, oldest xmin: 62109160
tuples: 0 removed, 4631448 remain, 4631354 are dead but not yet removable, oldest xmin: 62109160
tuples: 0 removed, 5377941 remain, 5374941 are dead but not yet removable, oldest xmin: 62109160


The number of dead rows that can't be cleaned up keeps growing, which means vacuum is unable to do its job: the long open transaction holds the oldest xmin in place, so those dead tuples remain "not yet removable".

So, if you ever need proof for why long transactions are bad for your database, here you have it!
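If you want to hunt down such transactions on your own systems, a query along these lines against pg_stat_activity shows the longest-running ones first:

SELECT pid,
       now() - xact_start AS xact_duration,
       state,
       query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_duration DESC;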

Bruce Momjian: The Meaning of WAL


The write-ahead log (WAL) is very important for Postgres reliability. However, how it works is often unclear.

The "write-ahead" part of the write-ahead log means that all database changes must be written to pg_wal files before commit. However, shared buffers dirtied by a transaction can be written (and fsync'ed) before or after the transaction commits.

Huh? Postgres allows dirty buffers to be written to storage before the transaction commits? Yes. When dirty buffers are written to storage, each modified row is marked with the currently-executing transaction id that modified it. Any session viewing those rows knows to ignore those changes until the transaction commits. If it did not, a long transaction could dirty all the available shared buffers and prevent future database changes.
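As a rough illustration (using a hypothetical accounts table), the marking is visible through the xmin/xmax system columns:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- the new row version carries our transaction id in xmin;
-- other sessions ignore it until this transaction commits
SELECT xmin, xmax, id, balance FROM accounts WHERE id = 1;
COMMIT;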



Tomas Vondra: Databases vs. encryption


Let’s assume you have some sensitive data that you need to protect by encryption. It might be credit card numbers (the usual example), social security numbers, or pretty much anything you consider sensitive. It does not matter whether the encryption is mandated by a standard like PCI DSS or whether you just decided to encrypt the sensitive stuff – in both cases you need to do the encryption right and actually protect the information. Unfortunately, full-disk encryption and pgcrypto are not a good fit for multiple reasons, and application-level encryption reduces the database to “dumb” storage. Let’s look at an alternative approach – offloading the encryption to a separate trusted component, implemented as a custom data type.

Note: A couple of weeks ago at pgconf.eu 2018, I presented a lightning talk introducing a PoC of an alternative approach to encrypting data in a database. I got repeatedly asked about various details since then, so let me just explain it in this blog post.

FDE and pgcrypto

In the PostgreSQL world, people will typically recommend two solutions to this problem – full-disk encryption and pgcrypto. Unfortunately, neither of them really works for this use case :-(

Full-disk encryption (FDE) is great. It’s transparent to the database (and application), so there are no implementation changes needed. The overhead is very low, particularly when your CPU supports AES-NI etc. The problem is it only really protects against someone stealing the disk. It does not protect against OS-level attacks (rogue sysadmin, someone gaining remote access to the box or backups, …). Nor does it protect against database-level attacks (think SQL injection). And most importantly, it’s trivial to leak the plaintext data into server log, various monitoring systems etc. Not great.

pgcrypto addresses some of these issues as the encryption happens in the database. But it means the database has to know the keys, and those are likely part of SQL queries and so the issue with leaking data into server logs and monitoring systems is still there. Actually, it’s worse because this time we’re leaking the encryption keys, not just the plaintext data.
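To illustrate the leak, here is what typical pgcrypto usage might look like (table and key are made up): the key travels inside the SQL text, so anything that logs or captures statements sees it along with the plaintext.

-- hypothetical table; cc_encrypted is a bytea column
INSERT INTO cards (cc_encrypted)
    VALUES (pgp_sym_encrypt('4012888888881881', 'very-secret-key'));

SELECT pgp_sym_decrypt(cc_encrypted, 'very-secret-key') FROM cards;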

Application-level encryption

So neither full-disk encryption nor pgcrypto is a viable solution to the problem at hand. The inherent issue with both solutions is that the database sees the plaintext data (on input), and so can leak it into various output channels. In the case of pgcrypto the database actually sees the keys, and leaking those is even worse.

This is why many practical systems use application-level encryption – all the encryption/decryption happens in the application, and the database only sees encrypted data.

The unfortunate consequence is that the database acts as “dumb” storage, as it can’t do anything useful with the encrypted data. It can’t compare the plaintext values (it can’t even determine whether two plaintext values are equal, due to nonces), etc. That means it’s impossible to build indexes on the encrypted data, do aggregation, or anything else we expect from a decent relational database.

There are workarounds for some of these issues. E.g. you may compute a SHA-1 hash of the credit card number, build an index on it and use it for lookups, but this may weaken the encryption when the encrypted data have low entropy (just like credit card numbers).

This means application-level encryption often results in a lot of the processing moving to the application, which is inefficient and error-prone. There must be a better solution …

Encryption off-loading

The good thing about application-level encryption is that the database does not know the plaintext or the encryption keys, which makes the system safe. The problem is that the database has no way to perform interesting operations on the data, so a lot of the processing moves to the application level.

So let’s start with the application-level encryption, but let’s add another component to the system, performing the important operations on encrypted data on behalf of the database.

This component also knows the encryption keys, but it’s much smaller and simpler than the database. Its only task is to receive encrypted data from the database and perform some predefined operation(s) on it. For example, it might receive two encrypted values, decrypt and compare them, and return -1/0/1 just like a regular comparator.

This way the database still does not know anything sensitive, yet it can meaningfully perform indexing, aggregation and similar tasks. And while there’s another component with knowledge of the encryption keys, it can be much simpler and smaller than the whole RDBMS, with a much smaller attack surface.

The encryption component

But what is a “component” in this context? It might be as simple as a small service running on a different machine, communicating over TCP.

Or it might be a separate process running on the same host, providing better performance due to replacing TCP with some form of IPC communication.

A more elaborate version of this would be running the process in a trusted execution environment, providing additional isolation. Pretty much every mainstream CPU vendor has a way to do this – Intel has SGX, AMD has SEV, ARM has TrustZone. The component might also run on a HSM or a device like usbarmory.

Each solution has a different set of pros / cons / limitations / performance characteristics, of course.

ccnumber

So, how could this be implemented on the database side, without messing with the database internals too much? Thankfully, PostgreSQL is extremely extensible and, among other things, allows implementing custom data types. And that’s exactly what we need here. The experimental ccnumber extension implements such a custom data type, offloading comparisons to the trusted component either over TCP or IPC (using POSIX message queues).

The encryption is done using libsodium (docs), a popular and easy-to-use library providing all the important pieces (authenticated encryption, keyed hashing).
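On the SQL level, usage could look roughly like this – a sketch only, the actual DDL and names in the PoC extension may differ:

CREATE EXTENSION ccnumber;

CREATE TABLE payments (
    id  bigserial PRIMARY KEY,
    cc  ccnumber      -- the database only ever sees the ciphertext
);

-- comparisons needed to build and maintain the index are forwarded
-- to the trusted component
CREATE INDEX ON payments (cc);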

Performance

Offloading operations to a separate component is certainly slower than evaluating them directly, but how much? The extension is merely a PoC and so there’s certainly room for improvement, but a benchmark may provide at least some rough estimates. The following chart shows how long it takes to build an index on ccnumber on a table with 22M rows.

The blue bar represents an index created directly on the bytea value, i.e. on the raw encrypted data. It’s clear the overhead is significant – creating the index is at least an order of magnitude slower (with TCP about twice as slow as the POSIX message queue variant). Furthermore, the custom data type can’t use sorting optimizations like abbreviated keys (unlike the plain bytea type).

But this does not make this approach to encryption impractical. CREATE INDEX is very intensive in terms of number of comparisons, and you do it only very rarely in practice.

What probably matters much more is the impact on inserts and lookups – and those operations actually do very few comparisons. It’s quite rare to see an index with more than 5 or 6 levels, so you usually need very few comparisons to determine which leaf page to look at. And the other stuff (WAL logging, etc.) is not cheap either, so making the comparisons a bit more expensive won’t make a huge difference.

Moreover, an alternative to slightly more expensive index access is sending much more data to the application, and doing the filtering there.

The other observation is that increasing the number of workers does not speed up CREATE INDEX significantly. The reason is that a parallel index build performs about 60% more comparisons than a non-parallel one, and as those extra comparisons happen in the serial part of the algorithm, they limit the speedup (just as Amdahl’s law says).

Another way to look at this is how fast can the crypto component evaluate requests from the database, illustrated by the following chart:

On this particular system (with a Xeon E5-2620 v4 CPU), the TCP-based version handles up to 100k operations per second, and the MQ-based one about 170k. With 8 workers (matching the CREATE INDEX test) they achieve about 630k and 1M ops respectively.

For comparison, I’ve also included usbarmory, which uses a single-core NXP i.MX53 ARM® Cortex™-A8 CPU. Obviously, this CPU is much weaker than the Xeon and handles only about 5000 operations per second. But it could still serve quite well as a custom HSM.

Summary

I hope this post demonstrates the offloading approach to encryption is viable, and solves the FDE and pgcrypto issues.

That is not to say the extension is complete or ready for production use, of course. One of the main missing pieces is key management – obtaining the keys, rotating them when needed, etc. The PoC extension has the keys hard-coded, but clearly that’s not a viable solution. Admittedly, key management is a very challenging topic on its own, but it also depends on how your application already does it – it seems natural to do it the same way here. I’d welcome feedback, suggestions or ideas on how to approach this in a flexible manner.

Hubert 'depesz' Lubaczewski: Waiting for PostgreSQL 12 – Add log_statement_sample_rate parameter

On 29th of November 2018, Alvaro Herrera committed patch: Add log_statement_sample_rate parameter

This allows to set a lower log_min_duration_statement value without incurring excessive log traffic (which reduces performance). This can be useful to analyze workloads with lots of short queries.

Author: Adrien Nayrat
Discussion: https://postgr.es/m/-ee1e-db9f-fa97-@anayrat.info

One of the problems I did encounter …
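As a rough sketch of how the new parameter could be used once PostgreSQL 12 ships (the values here are arbitrary examples):

ALTER SYSTEM SET log_min_duration_statement = '250ms';
ALTER SYSTEM SET log_statement_sample_rate = 0.1;   -- log roughly 10% of statements exceeding the threshold
SELECT pg_reload_conf();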

Kaarel Moppel: PostgreSQL affiliate projects for horizontal multi-terabyte scaling


Some weeks ago I wrote about common concepts and performance hacks for (relatively) easily scaling to a terabyte cluster or more. And based on my experience visiting customers from various industries, 80% of them are not even reaching that threshold … but just to be clear, I wanted to write another post showing that a couple of terabytes are of course not the “end station” for Postgres, provided one is ready to roll up their sleeves and get their “hands dirty”, so to say. So let’s look at some additional Postgres-like projects for cases where you still want to make use of your Postgres know-how and SQL skills over big amounts of data.

But be warned, the road gets bumpy now – we usually need to change the applications and the surrounding bits, and we’re doing sharding, meaning data does not live on a single node anymore, so SQL aggregates over all the data can get quirky. Also, we’re mostly extending the rock-solid core PostgreSQL with 3rd-party extensions or using forks with constraining characteristics, so you might have to re-define and re-import the data, and you might need to learn some new query constructs and forget some standard PostgreSQL ones … so generally be prepared to pull out a bit of hair, if you’ve got any left :) But OK, here are some projects that you should know of.

Postgres extensions/derivatives for multi-terabyte scale-out

  • Sharding via PL/Proxy stored procedures

This kind of “old school” solution was created and battle-tested at Skype (a huge user of Postgres, by the way!) by scaling an important cluster to 32 nodes, so it obviously works pretty well. The main upside is that all data and data access is sharded for you automatically once you pick a stored procedure parameter as the shard key, and you can keep using all of the standard Postgres features … with the downside that, well, all data access needs to go through PL/pgSQL stored procedures, which most developers are not so versed in. In short, PL/Proxy is just some glue to get the stored procedure call to the correct shard, so the performance penalty is minimal. It does not support Postgres 11 yet though …

  • Sharding with Postgres-XL

Postgres-XL could perhaps be described as a “PostgreSQL-based” sharding framework, and it actually lives somewhat under the umbrella of the PostgreSQL Global Development Group. It lags a Postgres major version or two behind, though, and due to its distributed nature – with coordinators, transaction managers and data nodes in the picture – the setup is far from as easy as “apt install postgresql”. But it can help you manage and run queries on tens of terabytes of data with relatively few restrictions! Of course there are some caveats (e.g. no triggers, restrictions on FK constraints) as with any PostgreSQL derivative, and one also can’t expect the same level of community support when encountering technical problems – but nevertheless it’s actively maintained and a good choice if you want to stay “almost” Postgres with your 50TB+ of data. The biggest cluster I’ve heard of, by the way, holds 130 TB of data, so it’s worth checking out!

  • Sharding with Citus

Citus is an extension to standard Postgres, so in that sense a slightly lighter-weight concept than the previous contenders and more “up to date”, but still not a transparent drop-in replacement for every scaling need. It adds “distributed table” features similar to Postgres-XL but with a somewhat simpler architecture (only data and coordinator nodes) and, according to the documentation, is especially well-suited for multi-tenant and “realtime analytics” use cases. It has some caveats like all the others – e.g. “shard-local” constraints, no subqueries in the WHERE clause, no window functions on sharded/partitioned tables – but defining tables works as usual; one just needs a few function calls to activate the distributed behaviour (see the sketch after this list). The project is also under very active development with a decent company behind it for those who require support, so this might be a good choice for your next 50 TB+ project.

  • Greenplum – a PostgreSQL fork for Data Warehousing

Greenplum – a massively parallel processing (MPP) database system – might just be the oldest active Postgres fork alive. It started from version 8.2 and was developed behind closed doors for a long time, but having been open source for a couple of years it’s now making up for lost time and trying to modernize itself to include features from the latest Postgres versions. The architecture, though, seems quite a bit more complex than the above-mentioned alternatives and needs thorough studying. You’d also be giving up on some more advanced/recent SQL features, but I can imagine the architecture decisions were made with certain performance aspects in mind, so it might be a worthwhile trade-off – and behind the product also stands a huge, publicly listed consulting company named Pivotal, so again a serious alternative.
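To make the Citus table-distribution step mentioned above concrete, here is a minimal sketch with a hypothetical multi-tenant table; create_distributed_table() is the documented Citus call for sharding a table by a chosen column:

CREATE EXTENSION citus;

CREATE TABLE events (
    tenant_id  bigint NOT NULL,
    event_id   bigint NOT NULL,
    payload    jsonb
);

-- distribute (shard) the table across the worker nodes by tenant_id
SELECT create_distributed_table('events', 'tenant_id');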

Final words

To conclude – don’t be afraid that you can’t scale with Postgres! There are quite a few options (I’m pretty sure I forgot some products) for every taste, and professional support is also available.

One more remark though to the Postgres community – I think that making the “scaling” topic a bit more discoverable for newcomers would do a lot of good for general Postgres adoption and adding a few words to the official documentation might even be appropriate – currently there’s a bit on HA and replication here, but the word “scaling” is not even mentioned in this context.

The post PostgreSQL affiliate projects for horizontal multi-terabyte scaling appeared first on Cybertec.

Quinn Weaver: BDR talk by Mark Wong of 2nd Quadrant

In the Bay Area? This Wednesday, 2018-12-12, Mark Wong from 2ndQuadrant will be talking about BDR (Bi-Directional Replication, a form of multi-master) for PostgreSQL. This is a great chance to get inside, real-time info on BDR.

Multi-master is one of those features where when you need it, you really, really need it, and if you're in that category, this talk is for you. It's also of interest to anyone trying to figure out the best solution for scaling and redundancy beyond one machine and one data center.

To attend, you must RSVP at Meetup with your full name (for building security's guest list).

Pavel Stehule: New release of Orafce extension

I released a new version of Orafce. Today it is a massive package emulating frequently used parts of Oracle's API, and the emulation goes about as far as is possible.

Orafce now has very nice documentation written by Horikawa Tomohiro (big thanks for his work).

There is not too much news for people who already use Orafce 3.6:
  • possibility to better emulate || operator for varchar2 and nvarchar2 types
  • few bugfixes
  • only PostgreSQL 9.4 and newer are supported
  • support for PostgreSQL 11, current master branch (future PostgreSQL 12) is supported too



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>