Most people in the SQL and PostgreSQL communities have used the LIMIT clause provided by many database engines. However, what many do not know is that LIMIT / OFFSET are not part of the SQL standard and are thus not portable. The proper way to handle LIMIT is basically to use SELECT … FETCH FIRST ROWS. However, there is more than meets the eye.
LIMIT vs. FETCH FIRST ROWS
Before we dig into some of the more advanced features we need to see how LIMIT and FETCH FIRST ROWS can be used. To demonstrate this feature, I have compiled a simple data set:
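The statements used to build the data set are not shown here; judging from the output below, it could have been created roughly like this:
-- plausible reconstruction of the demo data set (values taken from the output below)
CREATE TABLE t_test (id int);
INSERT INTO t_test VALUES (1), (2), (3), (3), (4), (4), (5);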
Our data set has 7 simple rows. Let’s see what happens if we use LIMIT:
test=# SELECT * FROM t_test LIMIT 3;
id
----
1
2
3
(3 rows)
In this case, the first three rows are returned. Note that we are talking about ANY rows here. Whatever can be found first is returned. There is no special order.
The ANSI SQL compatible way of doing things is as follows:
test=# SELECT *
FROM t_test
FETCH FIRST 3 ROWS ONLY;
id
----
1
2
3
(3 rows)
Many of you may never have used or seen this kind of syntax before, but this is actually the “correct” way to handle LIMIT.
However, there is more: What happens if NULL is used inside your LIMIT clause? The result might surprise you:
test=# SELECT * FROM t_test LIMIT NULL;
id
----
1
2
3
3
4
4
5
(7 rows)
The database engine does not know when to stop returning rows. Remember, NULL is undefined, so it does not mean zero. Therefore, all rows are returned. You have to keep that in mind in order to avoid unpleasant surprises…
FETCH FIRST … ROWS WITH TIES
WITH TIES was introduced in PostgreSQL 13 and addresses a common problem: handling duplicates. If you fetch the first couple of rows, PostgreSQL stops after a fixed number of rows. However, what happens if the same value shows up again and again? Here is an example:
test=# SELECT *
FROM t_test
ORDER BY id
FETCH FIRST 3 ROWS WITH TIES;
id
----
1
2
3
3
(4 rows)
In this case, we’ve actually got 4 rows, not just 3. The reason is that the last value shows up again after 3 rows, so PostgreSQL decided to include it as well. What is important to mention here is that an ORDER BY clause is needed, because otherwise, the result would be quite random. WITH TIES is therefore important if you want to include all rows of a certain kind – without stopping at a fixed number of rows.
Suppose one more row is added:
test=# INSERT INTO t_test VALUES (2);
INSERT 0 1
test=# SELECT *
FROM t_test
ORDER BY id
FETCH FIRST 3 ROWS WITH TIES;
id
----
1
2
2
(3 rows)
In this case, we indeed get only 3 rows, because WITH TIES is not about returning 3 distinct values; it only adds extra rows when the value of the last row is repeated right at the cut-off, which is not the case here.
WITH TIES: Managing additional columns
So far we have looked at the simplest case using just one column. However, that is far from practical. In a real-world application, you will certainly have more than a single column. So let us add one:
test=# ALTER TABLE t_test
ADD COLUMN x numeric DEFAULT random();
ALTER TABLE
test=# TABLE t_test;
id | x
----+--------------------
1 | 0.258814135879447
2 | 0.561647200043165
3 | 0.340481941960185
3 | 0.999635345010109
4 | 0.467043266494571
4 | 0.742426363498449
5 | 0.0611112678267247
2 | 0.496917052156565
(8 rows)
In the case of LIMIT nothing changes. However, WITH TIES is a bit special here:
test=# SELECT *
FROM t_test
ORDER BY id
FETCH FIRST 4 ROWS WITH TIES;
id | x
----+-------------------
1 | 0.258814135879447
2 | 0.561647200043165
2 | 0.496917052156565
3 | 0.999635345010109
3 | 0.340481941960185
(5 rows)
What you can see here is that 5 rows are returned. The fifth row is added because id = 3 appears more than once. Mind the ORDER BY clause: We are ordering by id. For that reason, the id column is relevant to WITH TIES.
Let’s take a look at what happens when the ORDER BY clause is extended:
test=# SELECT *
FROM t_test
ORDER BY id, x
FETCH FIRST 4 ROWS WITH TIES;
id | x
----+-------------------
1 | 0.258814135879447
2 | 0.496917052156565
2 | 0.561647200043165
3 | 0.340481941960185
(4 rows)
We are ordering by two columns. Therefore WITH TIES is only going to add rows if both columns are identical, which is not the case in my example.
LIMIT… Or finally…
WITH TIES is a wonderful new feature provided by PostgreSQL. However, it is not only there to limit data. If you are a fan of windowing functions you can also make use of WITH TIES as shown in one of my other blog posts covering advanced SQL features provided by PostgreSQL.
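For completeness, here is a sketch of how a window function can emulate WITH TIES. This is a generic formulation, not necessarily the query used in the post referenced above:
-- rank() assigns equal ranks to ties, so filtering on rank() <= 4
-- returns the same rows as FETCH FIRST 4 ROWS WITH TIES (ordered by id)
SELECT id, x
FROM (
    SELECT id, x,
           rank() OVER (ORDER BY id) AS r
    FROM t_test
) AS sub
WHERE r <= 4
ORDER BY id;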
Too often, web tiers are full of boilerplate that does nothing except convert a result set into JSON. A middle tier could be as simple as a function call that returns JSON. All we need is an easy way to convert result sets into JSON in the database.
PostgreSQL has built-in JSON generators that can be used to create structured JSON output right in the database, upping performance and radically simplifying web tiers.
Fortunately, PostgreSQL has such functions, which run right next to the data, for better performance and lower bandwidth usage.
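As a small illustration (the table and column names are invented for this example), built-in generators such as row_to_json() and json_agg() can turn a result set into a single JSON document directly in the database:
-- aggregate a whole result set into one JSON array of objects
SELECT json_agg(row_to_json(t)) AS result
FROM (
    SELECT id, name, created_at
    FROM customers
    ORDER BY id
) AS t;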
Some time ago, PostgreSQL developers added multiranges, that is, a datatype that can be used to store multiple ranges in a single column. The thing is that it wasn't really simple to get the list of ranges from within such a multirange. There was no operator, no way to split it. A month ago Alexander … Continue reading "How to get list of elements from multiranges?"
We've already discussed some object-level locks (specifically, relation-level locks), as well as row-level locks with their connection to object-level locks and also explored wait queues, which are not always fair.
We have a hodgepodge this time. We'll start with deadlocks (actually, I planned to discuss them last time, but that article was excessively long in itself), then briefly review the remaining object-level locks, and finally discuss predicate locks.
Deadlocks
When using locks, we can confront a deadlock. It occurs when one transaction tries to acquire a resource that is already in use by another transaction, while the second transaction tries to acquire a resource that is in use by the first. The figure on the left below illustrates this: solid-line arrows indicate acquired resources, while dashed-line arrows show attempts to acquire a resource that is already in use.
To visualize a deadlock, it is convenient to build the wait-for graph. To do this, we remove specific resources, leave only transactions, and indicate which transaction waits for which other. If the graph contains a cycle (that is, starting from a vertex we can get back to it by following the arrows), this is a deadlock.
The usability of the libpq feature to trace an application's server/client communications has been enhanced in PostgreSQL 14, with an improved format and an option to control the output.
Today I released pspg 5.1.0. Mostly this is a bugfix and refactoring release, but there is one, I hope, interesting feature. You can press Ctrl-o to temporarily switch to the terminal's primary screen. In the primary screen you can see your psql session. After pressing any key, the terminal switches back to the alternate screen with pspg.
Thanks to Thomas Munro's work, the psql \watch command will support pagers (in PostgreSQL 15). At this time only pspg can do this work (in streaming mode). When you set the environment variable PSQL_WATCH_PAGER, the \watch command redirects output to the specified pager (for pspg: export PSQL_WATCH_PAGER="pspg --stream"). Then you can run a command like:
select * from pg_stat_database \watch 5
or
select * from pg_stat_activity where state='active' \watch 1
In December 2020, you might have seen an article from CentOS about shifting their focus towards CentOS Stream, which is the upstream version of RHEL. CentOS also mentioned that version 8 would be EOL (end of life) by the end of 2021. This means that it will no longer receive any updates and fixes from its upstream version of RHEL. A few days after this announcement, Rocky Linux was announced by the CentOS founder, Gregory Kurtzer, as 100% bug-for-bug compatible with RHEL. The Rocky Linux project quickly gained a lot of attention, and also got sponsors from cloud vendors like AWS, Google Cloud and Microsoft. We wanted to take this opportunity to write an article about a CentOS vs Rocky Linux benchmark with PostgreSQL.
CentOS
As of now, CentOS is widely used in production, because it was a downstream version of RHEL. This means that CentOS was receiving all the critical RHEL bug fixes for free, which made CentOS as robust and reliable as RHEL. The future of the CentOS project is its Stream version, which is the upstream of RHEL. This means that CentOS Stream may not receive such critical bug fixes from RHEL; instead, CentOS Stream bug fixes will be pushed down to the RHEL project.
CentOS vs Rocky Linux Benchmark
After seeing the Rocky Linux project announcement, we wanted to run some PostgreSQL benchmarks and see whether we get the same performance on Rocky Linux 8 as on CentOS 8. To run this benchmark, we chose the phoronix tool, which offers a big list of benchmarking test suites. Using this phoronix tool, we ran a few general benchmarks in addition to PostgreSQL's pgbench.
Phoronix test suites
Phoronix is a system benchmarking tool which offers a big list of test suites. It provides test suites for CPU, memory, disk, compilation, etc. for most operating systems. We will be using this tool to perform the benchmarking on both Rocky Linux and CentOS. For the purpose of this article, we have considered running the following test suites.
Compiler
CPU
Memory
Disk
Stress
PostgreSQL (pgbench)
Stress & PostgreSQL
By running the above benchmarks on these two instances, we will be able to understand whether Rocky Linux is going to be a true CentOS replacement.
Install phoronix test suite
After spinning up two dedicated instances hosted on Linode, we installed the phoronix tool on them. No other additional software or configuration changes were made, since we do not want any drift in the benchmarking results.
To install phoronix, we executed these 3 steps on both the Rocky Linux and CentOS instances.
# tar -zxf phoronix-test-suite-10.4.0.tar.gz
# cd phoronix-test-suite
# ./install-sh
which: no xdg-mime in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
Phoronix Test Suite Installation Completed
Executable File: /usr/bin/phoronix-test-suite
Documentation: /usr/share/doc/phoronix-test-suite/
Phoronix Test Suite Files: /usr/share/phoronix-test-suite/
CentOS vs Rocky Hardware configuration
Before performing the benchmarking, let us look at the system capacities of both instances. As you can see in the specifications table below, both systems have the same configuration and the only difference is the operating system.
Spec             | CentOS                                       | Rocky
-----------------+----------------------------------------------+----------------------------------------------
Processor        | 4 x AMD EPYC 7501 32-Core                    | 4 x AMD EPYC 7501 32-Core
Core Count       | 4                                            | 4
Extensions       | SSE 4.2 + AVX2 + AVX + RDRAND + FSGSBASE     | SSE 4.2 + AVX2 + AVX + RDRAND + FSGSBASE
Cache Size       | 16 MB                                        | 16 MB
Microcode        | 0x1000065                                    | 0x1000065
Core Family      | Zen                                          | Zen
Memory           | 1 x 8 GB RAM QEMU                            | 1 x 8 GB RAM QEMU
Motherboard      | QEMU Standard PC                             | QEMU Standard PC
BIOS Version     | rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org | rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org
Chipset          | Intel 82G33/G31/P35/P31 + ICH9               | Intel 82G33/G31/P35/P31 + ICH9
Network          | Red Hat Virtio device                        | Red Hat Virtio device
Disk             | 171GB QEMU HDD + QEMU HDD                    | 171GB QEMU HDD + QEMU HDD
File-System      | ext4                                         | ext4
Mount Options    | relatime rw seclabel                         | relatime rw seclabel
Disk Scheduler   | MQ-DEADLINE                                  | MQ-DEADLINE
Disk Details     | Block Size: 4096                             | Block Size: 4096
Operating System | CentOS Linux 8                               | Rocky Linux 8.4
Kernel           | 4.18.0-305.3.1.el8.x86_64 (x86_64)           | 4.18.0-305.3.1.el8_4.x86_64 (x86_64)
Compiler         | GCC 8.4.1 20200928                           | GCC 8.4.1 20200928
System Layer     | KVM                                          | KVM
Security         | SELinux                                      | SELinux
Compiler Benchmarking
Phoronix provides a compiler benchmarking test suite, and we will be using this suite to compare the benchmark results between CentOS and Rocky Linux. To perform the compiler benchmarking, let us build the kernel (linux-5.10.20) on both operating systems and compare the completion times.
OS     | Build Time (seconds) | Details
-------+----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
CentOS |                      | OS: CentOS Linux 8, Kernel: 4.18.0-305.3.1.el8.x86_64 (x86_64), Compiler: GCC 8.4.1 20200928, File-System: ext4, Screen Resolution: 1024x768, System Layer: KVM
Rocky  | 546.10               | OS: Rocky Linux 8.4, Kernel: 4.18.0-305.3.1.el8_4.x86_64 (x86_64), Compiler: GCC 8.4.1 20200928, File-System: ext4, Screen Resolution: 1024x768, System Layer: KVM
By default, the phoronix tool runs each test 3 times and reports the average duration. The values listed above are the averages of the 3 runs. As per the compiler test results, Rocky Linux completed the kernel build a little faster than CentOS, with no huge gap between the results.
CPU Benchmarking
As both instances have the same CPU capacity, let us run sysbench with the CPU profile and see the number of events the CPU processes per second.
# phoronix-test-suite benchmark sysbench
Here are the results
OS     | Events Per Second | Details
-------+-------------------+------------------------------------------------
CentOS | 2687.02           | Processor: 4 x AMD EPYC 7501 32-Core (4 Cores)
Rocky  | 2701.93           | Processor: 4 x AMD EPYC 7501 32-Core (4 Cores)
Rocky Linux processed slightly more events per second (14.91 more) than CentOS, and this value is the average of the 3 runs. Again, there is not a huge gap between Rocky Linux and CentOS.
Memory Benchmarking
Both instances have the same memory capacity, and this time let us use the sysbench memory profile to generate load on the system. Here, sysbench initiates a block of memory in RAM and performs seek tests on it.
# phoronix-test-suite benchmark sysbench
Following are the results -
OS     | MB/sec  | Details
-------+---------+---------------------------
CentOS | 7907.09 | Memory: 1 x 8 GB RAM QEMU
Rocky  | 7798.74 | Memory: 1 x 8 GB RAM QEMU
From the results, CentOS performs slightly more operations per second than Rocky Linux. Again, this is the average result from the 3 test runs.
Disk Benchmarking
When it comes to disk benchmarking, we have multiple factors to consider, such as direct vs. buffered I/O, block size, and sequential and random reads. The phoronix tool provides an advanced benchmarking suite, flexible I/O (fio), and we ran it on these two machines. As there are many possible benchmarks, we limited it to a small set of test cases.
It seems that Rocky Linux, with the same ext4 filesystem, gives somewhat better throughput than CentOS, even though the configuration is the same.
Stress Benchmarking
This time, let us use the popular stress-ng test to check system stability. Phoronix provides a stress-ng benchmarking suite, which initiates a number of benchmarking tests.
# phoronix-test-suite benchmark stress-ng
Following are the results -
Test (op/sec)            | CentOS     | Rocky
-------------------------+------------+------------
MMAP                     | 9.03       | 10.22
NUMA                     | 34.69      | 33.70
MEMFD                    | 59.03      | 61.09
Atomic                   | 320701.15  | 317368.35
Crypto                   | 329.47     | 329.09
Malloc                   | 5394403.52 | 5886840.34
Forking                  | 13126.39   | 13116.39
SENDFILE                 | 22847.99   | 22754.75
CPU Cache                | 2.64       | 3.43
CPU Stress               | 461.63     | 457.55
Semaphores               | 267137.47  | 267230.52
Matrix Math              | 8479.07    | 8510.55
Vector Math              | 11581.05   | 11578.17
Memory Copying           | 1623.55    | 1722.36
Socket Activity          | 1460.90    | 1519.78
Context Switching        | 787841.94  | 797083.22
Glibc C String Functions | 152539.11  | 153499.17
Glibc Qsort Data Sorting | 21.85      | 19.23
System V Message Passing | 1686145.73 | 1739696.00
As per the test results, Rocky Linux almost matches most of the CentOS results, and seems to give better performance in a few areas such as memory allocations, socket activity, context switching, semaphores, and System V message passing. CentOS has a slight lead in forking and atomic operations.
PGBench Benchmarking
Finally, we ran the much-awaited PostgreSQL benchmarking on these two machines. Let us use the phoronix pgbench test suite and run both the read-only and read-write tests. Use the following command to run pgbench and see the results. This suite will install PostgreSQL 13 and run the default pgbench test cases.
# phoronix-test-suite benchmark pgbench
Following are the results for Rocky Linux and CentOS with PostgreSQL -
OS     | Read Only | Avg Latency | Read Write | Avg Latency | Details
-------+-----------+-------------+------------+-------------+-------------------------------------
CentOS | 17570 TPS | 5.693 ms    | 3402 TPS   | 29.40 ms    | Scaling Factor: 1000 - Clients: 100
Rocky  | 17751 TPS | 5.635 ms    | 3464 TPS   | 28.88 ms    | Scaling Factor: 1000 - Clients: 100
This benchmarking was done with a default installation of PostgreSQL, with default settings on both machines. No changes were made to any configuration parameter. Rocky Linux matches the CentOS benchmark results, with slightly better TPS.
Stress & PostgreSQL
This is an additional benchmarking case we performed to see how PostgreSQL behaves on Rocky Linux and CentOS while the system is already under load. For this test, we ran 2 phoronix suites in parallel: one suite running stress-ng and the other running pgbench. While running these suites, we followed the same test parameters as in the earlier tests. These tests were performed on the two hosts at almost the same time.
Session1
# phoronix-test-suite benchmark stress-ng
Session2
# phoronix-test-suite benchmark pgbench
Following are the results -
OS     | Read Only | Avg Latency | Read Write | Avg Latency | Details
-------+-----------+-------------+------------+-------------+-------------------------------------
CentOS | 12287 TPS | 8.291 ms    | 2234 TPS   | 45.67 ms    | Scaling Factor: 1000 - Clients: 100
Rocky  | 15051 TPS | 6.781 ms    | 2606 TPS   | 39.27 ms    | Scaling Factor: 1000 - Clients: 100
From the above results, it is a bit of a surprise to see more TPS from Rocky Linux than from CentOS when running PostgreSQL with additional load on the system.
Conclusion
Going through all these benchmarking results, Rocky Linux gives slightly better throughput than CentOS, though the difference is not big. This suggests that Rocky Linux is going to be a true replacement for CentOS, and the benchmark numbers already back that up. PostgreSQL is also pretty stable on Rocky Linux, producing the same (or slightly higher) throughput as on CentOS.
We look forward to your valuable thoughts and input.
By the way, if you are interested in migrating proprietary databases like Oracle or SQL Server to PostgreSQL, please feel free to contact us. We also provide Remote DBA services and Performance Assessments for PostgreSQL databases.
Logical Replication was introduced in PostgreSQL-10 and has been improved with each version since then. Logical Replication is a method to replicate data selectively, unlike physical replication where the data of the entire cluster is copied. It can be used to build a multi-master or bi-directional replication solution. One of the main differences compared with physical replication is that a transaction is replicated only at commit time. This leads to apply lag for large transactions, where we need to wait until the transaction is finished before transferring the data. In the upcoming PostgreSQL-14 release, we are introducing a mechanism to stream large in-progress transactions. We have seen replication performance improve by a factor of two or more for large transactions, especially due to early filtering. See the performance test results reported on hackers and in another blog on the same topic. This will reduce the apply lag to a good degree.
The first thing we needed for this feature was to decide when to start streaming the WAL contents. One could ask: if we have such a technology, why not stream each change of a transaction separately, as and when we retrieve it from the WAL? That would actually lead to sending much more data across the network, because we would need to send additional transaction information with each change so that the apply side can recognize the transaction to which the change belongs. To address this, in PostgreSQL-13, we introduced a new GUC parameter logical_decoding_work_mem which allows users to specify the maximum amount of memory to be used by logical decoding, before some of the decoded changes are either written to local disk or streamed to the subscriber. The parameter is also used to control the memory used by logical decoding as explained in the blog.
The next thing that prevented incremental decoding was the delay in finding the association of a subtransaction with its top-level XID. During logical decoding, we accumulate all changes along with their (sub)transaction. Now, while sending the changes to the output plugin or streaming them to the other node, we need to combine all the changes that happened in the transaction, which requires us to find the association of each top-level transaction with its subtransactions. Before PostgreSQL-14, we built this association at the XLOG_XACT_ASSIGNMENT WAL record, which we normally log after 64 subtransactions or at commit time, because these are the only two times when we get such an association in the WAL. To find this association as it happens, we now also write the assignment info into the WAL immediately, as part of the first WAL record for each subtransaction. This is done only when wal_level=logical to minimize the overhead.
Yet another thing required for incremental decoding was to process invalidations at each command end. The basic idea of invalidations is that they keep the caches (like the relation cache) up to date, allowing the next command to use an up-to-date schema. This is required to correctly decode WAL incrementally, as while decoding we will use the relation attributes from the caches. For this, when wal_level=logical, we write invalidations at the command end into the WAL so that decoding can use this information. The invalidations are decoded and accumulated in the top-level transaction, and then executed during replay. This obviates the need to decode the invalidations as part of a commit record.
The previous paragraphs explain the enhancements required in the server infrastructure to allow incremental decoding. The next step was to provide APIs (stream methods) for out-of-core logical replication to stream large in-progress transactions. We added seven callbacks to the output plugin API to allow this: five required ones (stream_start_cb, stream_stop_cb, stream_abort_cb, stream_commit_cb and stream_change_cb) and two optional ones (stream_message_cb and stream_truncate_cb). For details about these APIs, refer to the PostgreSQL docs.
When streaming an in-progress transaction, the changes (and messages) are streamed in blocks demarcated by stream_start_cb and stream_stop_cb callbacks. Once all the decoded changes are transmitted, the transaction can be committed using the stream_commit_cb callback (or possibly aborted using the stream_abort_cb callback). One example sequence of streaming transaction may look like the following:
/* Change logical_decoding_work_mem to 64kB in the session */
postgres=# show logical_decoding_work_mem;
 logical_decoding_work_mem
---------------------------
 64kB
(1 row)

postgres=# CREATE TABLE stream_test(data text);
CREATE TABLE
postgres=# SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 ?column?
----------
 init
(1 row)

postgres=# INSERT INTO stream_test SELECT repeat('a', 6000) || g.i FROM generate_series(1, 500) g(i);
INSERT 0 500

postgres=# SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '1', 'stream-changes', '1');
                       data
--------------------------------------------------
 opening a streamed block for transaction TXN 741
 streaming change for TXN 741
 streaming change for TXN 741
 streaming change for TXN 741
 ...
 ...
 streaming change for TXN 741
 streaming change for TXN 741
 streaming change for TXN 741
 closing a streamed block for transaction TXN 741
 opening a streamed block for transaction TXN 741
 streaming change for TXN 741
 streaming change for TXN 741
 streaming change for TXN 741
 ...
 ...
 streaming change for TXN 741
 streaming change for TXN 741
 closing a streamed block for transaction TXN 741
 committing streamed transaction TXN 741
(505 rows)
The actual sequence of callback calls may be more complicated depending on the server operations. There may be blocks for multiple streamed transactions, some of the transactions may get aborted, etc.
Note that streaming is triggered when the total amount of changes decoded from the WAL (for all in-progress transactions) exceeds the limit defined by the logical_decoding_work_mem setting. At that point, the largest top-level transaction (measured by the amount of memory currently used for decoded changes) is selected and streamed. However, in some cases we still have to spill to disk even if streaming is enabled because we exceed the memory threshold but still have not decoded the complete tuple e.g., only decoded toast table insert but not the main table insert or decoded speculative insert but not the corresponding confirm record. However, as soon as we get the complete tuple we stream the transaction including the serialized changes.
While streaming in-progress transactions, the concurrent aborts may cause failures when the output plugin (or decoding of WAL records) consults catalogs (both system and user-defined). Let me explain this with an example, suppose there is one catalog tuple with (xmin: 500, xmax: 0). Now, the transaction 501 updates the catalog tuple and after that we will have two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0). Now, if 501 is aborted and some other transaction say 502 updates the same catalog tuple then the first tuple will be changed to (xmin: 500, xmax: 502). So, the problem is that when we try to decode the tuple inserted/updated in 501 after the catalog update, we will see the catalog tuple with (xmin: 500, xmax: 502) as visible because it will consider that the tuple is deleted by xid 502 which is not visible to our snapshot. And when we will try to decode with that catalog tuple, it can lead to a wrong result or a crash. So, it is necessary to detect concurrent aborts to allow streaming of in-progress transactions. For detecting the concurrent abort, during catalog scan we can check the status of the xid and if it is aborted we will report a specific error so that we can stop streaming current transaction and discard the already streamed changes on such an error. We might have already streamed some of the changes for the aborted (sub)transaction, but that is fine because when we decode the abort we will stream the abort message to truncate the changes in the subscriber.
To add support for streaming of in-progress transactions into the built-in logical replication, we need to primarily do four things:
(a) Extend the logical replication protocol to identify in-progress transactions, and allow adding additional bits of information (e.g. XID of subtransactions). Refer to PostgreSQL docs for the protocol details.
(b) Modify the output plugin (pgoutput) to implement the new stream API callbacks, by leveraging the extended replication protocol.
(c) Modify the replication apply worker, to properly handle streamed in-progress transaction by spilling the data to disk and then replaying them on commit.
(d) Provide a new option for streaming while creating a subscription.
The below example demonstrates how to set up the streaming via built-in logical replication:
Publisher node:
Set logical_decoding_work_mem = '64kB';

# Set up publication with some initial data
CREATE TABLE test_tab (a int primary key, b varchar);
INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar');
CREATE PUBLICATION tap_pub FOR TABLE test_tab;
Subscriber node:
CREATE TABLE test_tab (a int primary key, b varchar);
CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost port=5432 dbname=postgres' PUBLICATION tap_pub WITH (streaming = on);
Publisher Node:
# Ensure the corresponding replication slot is created on the publisher node
select slot_name, plugin, slot_type from pg_replication_slots;
 slot_name |  plugin  | slot_type
-----------+----------+-----------
 tap_sub   | pgoutput | logical
(1 row)

# Confirm there are no streamed bytes yet
postgres=# SELECT slot_name, stream_txns, stream_count, stream_bytes FROM pg_stat_replication_slots;
 slot_name | stream_txns | stream_count | stream_bytes
-----------+-------------+--------------+--------------
 tap_sub   |           0 |            0 |            0
(1 row)

# Insert, update and delete enough rows to exceed the logical_decoding_work_mem (64kB) limit.
BEGIN;
INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
DELETE FROM test_tab WHERE mod(a,3) = 0;

Subscriber Node:
# The streamed data is still not visible.
select * from test_tab;
 a |  b
---+-----
 1 | foo
 2 | bar
(2 rows)

Publisher Node:
# Commit the large transaction
COMMIT;

Subscriber Node:
# The data must be visible on the subscriber
select count(*) from test_tab;
 count
-------
  3334
(1 row)
This feature was proposed in 2017 and committed in 2020 as part of various commits 0bead9af48, c55040ccd0, 45fdc9738b, 7259736a6e, and 464824323e. It took a long time to complete this feature because of the various infrastructure pieces required to achieve this. I would really like to thank all the people involved in this feature especially Tomas Vondra who has initially proposed it and then Dilip Kumar who along with me had completed various remaining parts and made it a reality. Then also to other people like Neha Sharma, Mahendra Singh Thalor, Ajin Cherian, and Kuntal Ghosh who helped throughout the project to do reviews and various tests. Also, special thanks to Andres Freund and other community members who have suggested solutions to some of the key problems of this feature. Last but not least, thanks to EDB and Fujitsu's management who encouraged me and some of the other members to work on this feature.
Some time ago I wrote about new options for explains – one that prints settings that were modified from default. This looks like this: Aggregate (cost=35.36..35.37 rows=1 width=8) -> Index Only Scan using pg_class_oid_index on pg_class (cost=0.27..34.29 rows=429 width=0) Settings: enable_seqscan = 'off' Finally, today, I pushed a change that displays them on explain.depesz.com. To … Continue reading "Display “settings” from plans on explain.depesz.com"
A quickstart guide to create a web map with the Python-based web framework Django using its module GeoDjango, the PostgreSQL database with its spatial extension PostGIS and Leaflet, a JavaScript library for interactive maps.
PostgreSQL Person of the Week Interview with Rafia Sabih: I am basically from India, currently living in Berlin, Germany. I started my PostgreSQL journey in my masters and continued it by joining EDB, India. Then, I increased my spectrum to Postgres on Kubernetes working at Zalando.
One of the less visible improvements coming in PostGIS 3.2 (via the GEOS 3.10 release) is a new algorithm for repairing invalid polygons and multipolygons.
Algorithms like polygon intersection, union and difference rely on guarantees that the structure of inputs follows certain rules. We call geometries that follow those rules "valid" and those that do not "invalid".
zheap has been designed as a new storage engine to handle UPDATE in PostgreSQL more efficiently. A lot has happened since my last report on this important topic, and I thought it would make sense to give readers a bit of a status update – to see how things are going, and what the current status is.
zheap: What has been done since last time
Let’s take a look at the most important things we’ve achieved since our last status report:
logical decoding
work on UNDO
patch reviews for UNDO
merging codes
countless fixes and improvements
zheap: Logical decoding
The first thing on the list is definitely important. Most people might be familiar with PostgreSQL’s capability to do logical decoding. What that means is that the transaction log (= WAL) is transformed back to SQL so that it can be applied on some other machine, leading to identical results on the second server. The capability to do logical decoding is not just a given. Code has to be written which can decode zheap records and turn them into readable output. So far this implementation looks good. We are not aware of bugs in this area at the moment.
Work on UNDO
zheap is just one part of the equation when it comes to new storage engines. As you might know, a standard heap table in PostgreSQL holds all necessary versions of a row inside the same physical files. In zheap this is not the case. It is heavily based on a feature called "UNDO", which works similarly to what Oracle and some other database engines do. The idea is to move old versions of a row out of the table and then, in case of a ROLLBACK, put them back in.
What has been achieved is that the zheap code is now compatible with the new UNDO infrastructure suggested by the community (which we hope to see in core by version 15). The general idea here is that UNDO should not only be focused on zheap, but provide a generic infrastructure other storage engines will be able to use in the future as well. That’s why preparing the zheap code for a future UNDO feature of PostgreSQL is essential to success. If you want to follow the discussion on the mailing list, here is where you can find some more detailed information about zheap and UNDO.
Fixing bugs and merging
As you can imagine, a major project such as zheap will also cause some serious work on the quality management front. Let’s look at the size of the code:
[hs@node1 zheap_postgres]$ cd src/backend/access/zheap/
[hs@node1 zheap]$ ls -l *c
-rw-rw-r--. 1 hs hs 14864 May 27 04:25 prunetpd.c
-rw-rw-r--. 1 hs hs 27935 May 27 04:25 prunezheap.c
-rw-rw-r--. 1 hs hs 11394 May 27 04:25 rewritezheap.c
-rw-rw-r--. 1 hs hs 96748 May 27 04:25 tpd.c
-rw-rw-r--. 1 hs hs 13997 May 27 04:25 tpdxlog.c
-rw-rw-r--. 1 hs hs 285703 May 27 04:25 zheapam.c
-rw-rw-r--. 1 hs hs 59175 May 27 04:25 zheapam_handler.c
-rw-rw-r--. 1 hs hs 62970 May 27 04:25 zheapam_visibility.c
-rw-rw-r--. 1 hs hs 61636 May 27 04:25 zheapamxlog.c
-rw-rw-r--. 1 hs hs 16608 May 27 04:25 zheaptoast.c
-rw-rw-r--. 1 hs hs 16218 May 27 04:25 zhio.c
-rw-rw-r--. 1 hs hs 21039 May 27 04:25 zmultilocker.c
-rw-rw-r--. 1 hs hs 16480 May 27 04:25 zpage.c
-rw-rw-r--. 1 hs hs 43128 May 27 04:25 zscan.c
-rw-rw-r--. 1 hs hs 27760 May 27 04:25 ztuple.c
-rw-rw-r--. 1 hs hs 55849 May 27 04:25 zundo.c
-rw-rw-r--. 1 hs hs 51613 May 27 04:25 zvacuumlazy.c
[hs@node1 zheap]$ cat *c | wc -l
29696
For those of you out there who are anxiously awaiting a production-ready version of zheap, I have to point out that this is really a major undertaking which is not trivial to do. You can already try out and test zheap. However, keep in mind that we are not quite there yet. It will take more time, and especially feedback from the community, to make this engine production-ready and capable of handling any workload reliably and bug-free.
I won’t go into the details of what has been fixed, but we had a couple of issues including bugs, compiler warnings, and so on.
What has also been done was to merge the zheap code with current versions of PostgreSQL, to make sure that we’re up to date with all the current developments.
Next steps to improve zheap
As far as the next steps are concerned, there are a couple of things on the list. One of the first things will be to work on the discard worker. Now what is that? Consider the following listing:
test=# BEGIN
BEGIN
test=*# CREATE TABLE sample (x int) USING zheap;
CREATE TABLE
test=*# INSERT INTO sample SELECT * FROM generate_series(1, 1000000) AS x;
INSERT 0 1000000
test=*# SELECT * FROM pg_stat_undo_chunks;
logno | start | prev | size | discarded | type | type_header
--------+------------------+------+----------+-----------+------+-----------------------------------
000001 | 000001000021AC3D | | 57 | f | xact | (xid=745, dboid=16384, applied=f)
000001 | 000001000021AC76 | | 44134732 | f | xact | (xid=748, dboid=16384, applied=f)
(2 rows)
test=*# COMMIT;
COMMIT
test=# SELECT * FROM pg_stat_undo_chunks;
logno | start | prev | size | discarded | type | type_header
--------+------------------+------+----------+-----------+------+-----------------------------------
000001 | 000001000021AC3D | | 57 | f | xact | (xid=745, dboid=16384, applied=f)
000001 | 000001000021AC76 | | 44134732 | f | xact | (xid=748, dboid=16384, applied=f)
(2 rows)
What we see here is that the UNDO chunks do not go away. They keep piling up. At the moment, it is possible to purge them manually:
As you can see, the UNDO has gone away. The goal here is that the cleanup should happen automatically – using a “discard worker”. Implementing this process is one of the next things on the list.
Community feedback is currently one of the bottlenecks. We invite everybody with an interest in zheap to join forces and help to push this forward. Everything from load testing to feedback on the design is welcome – and highly appreciated! zheap is important for UPDATE-heavy workloads, and it’s important to move this one forward.
Trying it all out
If you want to get involved, or just try out zheap, we have created a tarball for you which can be downloaded from our website. It contains our latest zheap code (as of May 27th, 2021).
Simply compile PostgreSQL normally:
./configure --prefix=/your_path/pg --enable-debug --with-cassert
make install
cd contrib
make install
Then you can create a database instance, start the server normally and start playing. Make sure that you add “USING zheap” when creating a new table, because otherwise PostgreSQL will create standard “heap” tables (so not zheap ones).
Finally …
We want to say thank you to Heroic Labs for providing us with all the support we have to make zheap work. They are an excellent partner and we recommend checking out their services. Their commitment has allowed us to allocate so many resources to this project, which ultimately benefits the entire community. A big thanks goes out to those guys.
If you want to know more about zheap, we suggest checking out some of our other posts on this topic. Here is more about zheap and storage consumption.
My colleague Kat Batuigas recently wrote about using the powerful open-source QGIS desktop GIS to import data into PostGIS from an ArcGIS Feature Service. This is a great first step toward moving your geospatial stack onto the performant, open source platform provided by PostGIS. And there's no need to stop there! Crunchy Data has developed a suite of spatial web services that work natively with PostGIS to expose your data to the web, using industry-standard protocols. These include:
pg_tileserv - allows mapping spatial data using the MVT vector tile format
In order to get a better idea about how WAL settings can change the situation within the WAL management, I decided to run a kind of automated test and store the results into a table, so that I can query them back later.
The idea is the same as in my previous article: produce some workload, measure the differences in the Log Sequence Numbers, and see how the size of the WALs changes depending on some settings. This is not accurate research, it's just a quick and dirty experiment.
At the end, I decided to share my numbers so that you can have a look at them and elaborate a bit more. For example, I’m no good at all at doing graphs (I know only the very minimum about gnuplot!).
!!! WARNING !!!
WARNING: this is not a guide on how to tune WAL settings! This is not even a real and comprehensive set of experiments; it is just what I've played with to see how much traffic can be generated for a certain amount of workload.
Your case and situation could be, and probably is, different from the very simple test I've done, and I do not pretend to be right about the small and obvious conclusions I come up with at the end. In case you see or know something that can help make what I write in the following clearer, please comment or contact me!
Set up
First of all, I decided to run an INSERT-only workload, so that the size of the resulting table does not include any bloat and is therefore comparable to the amount of WAL records generated.
No other database activity was ongoing, so that the only generated WAL traffic was about my own workload.
Each time the configuration was changed, the system was restarted, so that every workload started with the same (empty) clean situation and without any need to reason about ongoing checkpoints. Of course, checkpoints were happening as usual, but not at the beginning of the workload.
I used two tables to run the test:
wal_traffic stores the results of each run;
wal_traffic_data is used to store the data about every workload, that is tuples inserted in the database.
The wal_traffic_data table was dropped and re-created every time a new run was started, so as to avoid data bloat.
It is interesting to note that any workload setup activity is performed before the server is restarted, so that the only WAL traffic measured is as close as possible to the workload only.
The wal_traffic table is defined as follows:
The workload field stores the text string about the executed query.
The lsn_xxx fields store the location within the WAL, in particular:
lsn_start and lsn_end store the result of the pg_current_wal_lsn() function invoked at the beginning and at the end of the workload;
lsn_insert_start and lsn_insert_end store the result of the pg_current_wal_insert_lsn() function invoked at the beginning and at the end of the workload.
I decided to store both pieces of information to be able to examine differences more accurately; however, for this kind of experiment the differences between the values are not really significant.
The data_size column contains the result of pg_relation_size(), that is, a rough estimation of the volume of data produced during the workload.
The columns wal_size, wal_data_ratio, and wal_insert_data_ratio are generated, and contain respectively the amount of generated WAL records and the ratio between the size of the actual data and that of the WAL records.
Last, the settings column contains a jsonb representation of the settings used to run the test, like for example the value for wal_level, wal_compression and so on.
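Putting those descriptions together, a plausible sketch of the table definition could look as follows; the original definition is not reproduced above, so the column types and generation expressions are assumptions:
-- Reconstructed sketch of the wal_traffic results table (types are assumptions)
CREATE TABLE wal_traffic (
    run                   serial PRIMARY KEY,
    workload              text,
    ts_start              timestamp,
    ts_end                timestamp,
    lsn_start             pg_lsn,
    lsn_end               pg_lsn,
    lsn_insert_start      pg_lsn,
    lsn_insert_end        pg_lsn,
    data_size             bigint,   -- pg_relation_size() of the workload table
    wal_size              numeric GENERATED ALWAYS AS ( lsn_end - lsn_start ) STORED,
    wal_data_ratio        numeric GENERATED ALWAYS AS
        ( round( ( lsn_end - lsn_start ) / nullif( data_size, 0 )::numeric * 100, 2 ) ) STORED,
    wal_insert_data_ratio numeric GENERATED ALWAYS AS
        ( round( ( lsn_insert_end - lsn_insert_start ) / nullif( data_size, 0 )::numeric * 100, 2 ) ) STORED,
    settings              jsonb
);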
There is also a view to quickly get results about the workload size:
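The view definition is not shown either; based on how it is queried below, it presumably just exposes the settings keys as regular columns, roughly like this:
-- Assumed definition of the wal_traffic_results view
CREATE VIEW wal_traffic_results AS
SELECT run,
       workload,
       wal_size,
       data_size,
       wal_data_ratio,
       wal_insert_data_ratio,
       ts_end - ts_start               AS wall_clock,
       settings ->> 'wal_level'        AS wal_level,
       settings ->> 'wal_log_hints'    AS wal_log_hints,
       settings ->> 'wal_compression'  AS wal_compression,
       settings ->> 'full_page_writes' AS full_page_writes
FROM wal_traffic;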
I’ve prepared two different workloads, both based on INSERTs.
The first workload does two transactions: the first one inserts a certain amount of tuples, while the second inserts a smaller amount of tuples. In particular, the first transaction inserts a number of tuples specified by $workload_scale, while the second transaction inserts 1/5 of the same value.
The $workload_scale variable assumes the values ranging from 100 to 10 million growing by a factor of ten (e.g., 100, 1000, 10000 and so on).
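For reference, here is the first workload with a scale of 100,000, taken verbatim from the workload strings recorded during the runs:
BEGIN;
INSERT INTO wal_traffic_workload SELECT v, md5( v::text )::text || random()::text
FROM generate_series( 1, 100000 ) v;
COMMIT;

BEGIN;
INSERT INTO wal_traffic_workload
SELECT v + v, t || ' - ' || t || random()::text
FROM wal_traffic_workload
WHERE v % 5 = 0;
COMMIT;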
The second workload type is shorter, and does the following:
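As recorded during the runs, the loop-based workload looks like the following (again with a scale of 100,000; the loop bound is $workload_scale):
DO $wl$ DECLARE
  i int;
BEGIN
  FOR i IN 1 .. 100000 LOOP
    INSERT INTO wal_traffic_workload SELECT 1, md5( random()::text )::text;
  END LOOP;
END $wl$;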
It therefore performs the same number of tuple insertions as the previous workload, but it does so by looping.
The final effect is that the first workload executes a single INSERT statement (per transaction), while the second workload executes several INSERT statements.
The usage of random() within the INSERT statements is to generate some more traffic on logical decoding.
The Workload Workflow
In order to do the tests, I wrote an ugly shell script with the following workflow:
truncate the wal_traffic_data table, so that its size on disk does not include previous experiments;
execute a few ALTER SYSTEM to set some configuration on WAL related parameters (wal_level, full_page_writes, wal_compression and so on);
restart the PostgreSQL system, so as to ensure every test starts from a clean and clear situation;
get the current WAL position (pg_current_wal_lsn() and pg_current_wal_insert_lsn());
execute the workload with the right scale;
get the current WAL position (pg_current_wal_lsn() and pg_current_wal_insert_lsn());
insert the result tuple with WAL differences into wal_traffic;
loop with a different scaling factor.
The Results
It is now time to have a look at the test results.
As you can see, for 1.2 GB of data the system produced roughly 2.1 GB of WAL records. And the situation is even worse when wal_compression is off (as you could expect):
This time, for the same amount of data, the WAL size is almost double that of the real data.
Changing the setting of wal_level to logical or replica does not change the situation very much.
It is possible to get the best ratio between the WAL produced and the data stored:
Please note that I’ve split the jsonb field into a set of columns with a query like the following one, that produced the CSV file:
% psql -A --csv -h miguel \
       -c 'select run, workload, wal_size, data_size, wal_data_ratio,
                  ts_end - ts_start as wall_clock, x.*
           from wal_traffic
           cross join lateral jsonb_to_record( settings )
                as x( wal_level text, wal_log_hints text,
                      wal_compression text, full_page_writes text );' \
       testdb >! wal_traffic.csv
More Results
From the de-jsonb'ed representation of the results, it is easier to get a glance at the WAL ratio by workload type:
testdb=> select workload, min(wal_data_ratio), max(wal_data_ratio),
                max(wal_data_ratio) - min(wal_data_ratio) as diff
         from wal_traffic_results
         group by workload
         order by 4 asc;

Workload (abbreviated)                                  |  min   |  max   |  diff
--------------------------------------------------------+--------+--------+--------
two-transaction INSERT, generate_series( 1, 1000000 )   | 125.55 | 125.60 |   0.05
DO loop, 1 .. 1000000 single-row INSERTs                | 125.76 | 125.82 |   0.06
DO loop, 1 .. 10000000 single-row INSERTs               | 125.76 | 125.92 |   0.16
two-transaction INSERT, generate_series( 1, 100000 )    | 125.49 | 125.73 |   0.24
DO loop, 1 .. 100000 single-row INSERTs                 | 125.72 | 125.97 |   0.25
two-transaction INSERT, generate_series( 1, 10000 )     | 124.99 | 126.55 |   1.56
DO loop, 1 .. 10000 single-row INSERTs                  | 125.14 | 127.47 |   2.33
two-transaction INSERT, generate_series( 1, 10000000 )  | 178.27 | 199.46 |  21.19
two-transaction INSERT, generate_series( 1, 1000 )      | 121.58 | 152.01 |  30.43
DO loop, 1 .. 1000 single-row INSERTs                   | 118.45 | 167.37 |  48.92
two-transaction INSERT, generate_series( 1, 100 )       | 101.56 | 247.46 | 145.90
DO loop, 1 .. 100 single-row INSERTs                    | 124.02 | 289.65 | 165.63
There are certain workloads (by type and size) that do not produce any significant variation in the amount of WAL produced, while, for example, the last workload with a small number of tuples produces a very wide range of WAL record writes.
We could also query to search for a trend in the ratio:
testdb=> select wal_data_ratio, wal_level, wal_log_hints, wal_compression, full_page_writes
         from wal_traffic_results
         where workload like '%FOR i IN 1 .. 100 LOOP%'
         order by 1 desc;
-[ RECORD 1 ]----+--------
wal_data_ratio   | 289.65
wal_level        | replica
wal_log_hints    | off
wal_compression  | off
full_page_writes | on
...
-[ RECORD 6 ]----+--------
wal_data_ratio   | 256.35
wal_level        | logical
wal_log_hints    | on
wal_compression  | off
full_page_writes | on
...
-[ RECORD 19 ]---+--------
wal_data_ratio   | 150.78
wal_level        | replica
wal_log_hints    | on
wal_compression  | on
full_page_writes | off
...
-[ RECORD 36 ]---+--------
wal_data_ratio   | 124.02
wal_level        | replica
wal_log_hints    | on
wal_compression  | on
full_page_writes | on
The above confirms how much wal_compression is going to reduce the WAL traffic.
And again, the wal_level is not going to influence the WAL size too much:
testdb=> select min(wal_data_ratio), max(wal_data_ratio), wal_level
         from wal_traffic_results
         where workload like '%FOR i IN 1 .. 100 LOOP%'
         group by wal_level
         order by 1 desc, 2 desc;
-[ RECORD 1 ]------
min       | 146.00
max       | 289.16
wal_level | minimal
-[ RECORD 2 ]------
min       | 124.12
max       | 284.47
wal_level | logical
-[ RECORD 3 ]------
min       | 124.02
max       | 289.65
wal_level | replica
Conclusions
Even a small amount of real data can produce quite a large amount of WAL records, and this is good, because those records contain all the information PostgreSQL needs to keep our data safe, which is, after all, our final goal.
WAL-related settings can, of course, influence the amount of generated data. The idea behind this article is not to provide an exhaustive guide to tuning WAL, but rather to show how you can measure your WAL traffic depending on the workload you are facing.
This should then help you to decide the right way to tune your WALs.
In case you find something wrong in the approach described above, or want to add to it or share your experience, please comment or contact me.
The former, pg_extension, provides information about which extensions are installed in the current database, while the latter, pg_available_extensions, provides information about which extensions are available to the cluster.
The difference is simple: to be usable, an extension must first appear in pg_available_extensions, which means it has been installed on the cluster (e.g., via pgxnclient). From this point on, the extension can be installed into a database by means of a CREATE EXTENSION statement; as a result, the extension will appear in the pg_extension catalog.
The above list represents all the available extensions installed on the cluster, that is, those I can execute CREATE EXTENSION against.
The pg_available_extensions catalog has an installed_version field that provides the version number of the extension installed in the current database, or NULL if the extension is not installed in the current database. Therefore, in order to know whether an extension is installed in a database, you can run a query like the following:
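A query of roughly this shape does the job; pg_available_extensions exposes the name, default_version and installed_version columns, and the extension name below is just an example:
SELECT name,
       default_version,
       installed_version,
       installed_version IS NOT NULL AS is_installed
FROM pg_available_extensions
WHERE name = 'ltree';   -- replace with the extension you are interested in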
This is a little too much effort, and since extensions could have been installed with different flags in different databases, the pg_extension catalog provides more detailed and narrow information: it lists all extensions that have been installed in the current database.
Therefore, to see what a database can use, that is, which extensions it has access to, I need to use the pg_extension catalog:
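The original query is not reproduced here; a minimal version could be:
-- every row is an extension installed in the current database
SELECT extname, extversion
FROM pg_extension
ORDER BY extname;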
The current database has a much smaller list of available extensions.
Extension Version Numbers
As you know, an extension can come in different version numbers, and the beauty of this mechanism is that it is easy to upgrade an extension from one version to another.
The pg_available_extensions catalog provides only the latest (i.e., newest) version of an available extension. Let's try with a very popular extension, pg_stat_statements:
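The query output is not reproduced above, but a lookup like the following shows the behaviour described next; in the author's cluster it reported a default_version of 1.8 and a NULL installed_version:
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'pg_stat_statements';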
The extension could be installed at version 1.8 and is currently not installed in the current database.
But what about other version numbers?
The pg_available_extension_versions catalog provides a list of all the versions in which an extension is currently available:
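Again, the original query is not shown; something along these lines lists every available version, using the documented columns of the catalog:
SELECT name, version, installed, superuser, trusted, relocatable
FROM pg_available_extension_versions
WHERE name = 'pg_stat_statements'
ORDER BY version;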
As you can see, the extension is available in five different versions, and I can choose the version that best fits my requirements.
This catalog provides different information, in particular it can give you an idea if the extension can be installed only by superusers (field superuser) or by a user with appropriate privileges (field trusted), as well as other required extensions (field requires_name), and relocatability.
Having explored one fork in the path (Elasticsearch and Kibana) in the previous pipeline blog series (here is part 5), in this blog we backtrack to the junction to explore the alternative path (PostgreSQL and Apache Superset). But we’ve lost our map (aka the JSON Schema)—so let’s hope we don’t get lost or attacked by mutant radioactive monsters.
Fork in an abandoned uranium mine (Source: Shutterstock)
Just to recap how we got here, here’s the Kafka source connector -> Kafka -> Kafka sink connector -> Elasticsearch -> Kibana technology pipeline we built in the previous blog series:
And here’s the blueprint for the new architecture, with PostgreSQL replacing Elasticsearch as the target sink data store, and Apache Superset replacing Kibana for analysis and visualization:
Well that’s the plan, but you never know what surprises may be lurking in the unmapped regions of a disused uranium mine!
1. Step 1: PostgreSQL database
In the news recently was Instaclustr’s acquisition of Credativ, experts in the open source PostgreSQL database (and other technologies). As a result, Instaclustr has managed PostgreSQL on its roadmap, and I was lucky to get access to the internal preview (for staff only) a few weeks ago. This has all the features you would expect from our managed services, including easy configuration and provisioning (via the console and REST API), server connection information and examples, and built-in monitoring with key metrics available (via the console and REST API).
Having a fully functional PostgreSQL database available in only a few minutes is great, but what can you do with it?
The first thing is to decide how to connect to it for testing. There are a few options for client applications, including psql (a terminal-based front-end) and a GUI such as pgAdmin4 (which is what I used). Once you have pgAdmin4 running on your desktop you can easily create a new connection to the Instaclustr managed PostgreSQL server as described here, using the server IP address, username, and password, all obtained from the Instaclustr console.
You also need to ensure that the IP address of the machine you are using is added to the firewall rules (and saved) in the console (and update this if your IP address changes, which mine does regularly when working from home). Once connected you can create tables with the GUI, and run arbitrary SQL queries to write and read data to test things out.
I asked my new Credativ colleagues what else you can do with PostgreSQL, and they came up with some unexpectedly cool things to try—apparently PostgreSQL is Turing Equivalent, not “just” a database:
GTS2— A whole game in PostgreSQL and pl/python where you can drive around in python-rendered OSM-Data with a car, with collision detection and all the fun stuff!
A Turing Machine implemented in PostgreSQL. This is an example I found, just to prove the Turing machine point. It was probably easier to build than the one below, which is an actual working Turing machine, possibly related to this one. Although the simplest implementation is just a roll of toilet paper and some rocks.
So that’s enough PostgreSQL fun for the time being; let’s move on to the real reason we’re using it, which is to explore the new pipeline architecture.
2. Step 2: Kafka and Kafka Connect Clusters
The next step was to recreate the Kafka and Kafka Connect setup that I had for the original pipeline blogs, as these are the common components that are also needed for the new experiment.
So, first I created a Kafka cluster. There’s nothing special about this cluster configuration (although you should ensure that all the clusters are created in the same AWS region to reduce latency and costs—all Instaclustr clusters are provisioned so they are spread over all AWS availability zones for high availability).
Then I created a Kafka Connect Cluster targeting the Kafka cluster. There are a couple of extra configuration steps required (one before provisioning, and one after).
If you are planning on bringing your own (BYO) connectors, then you have to tick the “Use Custom Connectors” checkbox and add the details for the S3 bucket where your connectors have been uploaded to. You can find the bucket and details in your AWS console, and you need the AWS access key id, AWS secret access key, and the S3 bucket name. Here are the instructions for using AWS S3 for custom Kafka connectors.
Because we are going to use sink connectors that connect to PostgreSQL, you’ll also have to configure the Kafka Connect cluster to allow access to the PostgreSQL server we created in Step 1, using the “Connected Clusters” view as described here.
Finally, ensure that the IP address of your local computer is added to the firewall rules for the Kafka and Kafka Connect clusters, and remember to keep a record of the usernames/passwords for each cluster (as the Instaclustr console only holds them for a few days for security reasons).
3. Step 3: Kafka Connectors
This sink-hole in Australia (Umpherston Sinkhole, Mount Gambier) leads to something more pleasant than an abandoned uranium mine. (Source: Shutterstock)
Before we can experiment with streaming data out of Kafka into PostgreSQL, we need to replicate the mechanism we used in the earlier blogs to get the NOAA tidal data into Kafka, using a Kafka REST source connector as described in section 5 of this blog. Remember that you need to run a separate connector for every station ID that you want to collect data from; I’m just using a small subset for this experiment. I checked with the kafka-console-consumer (you’ll need to set up a Kafka properties file with the Kafka cluster credentials from the Instaclustr console for this to work), and the sensor data was arriving in the Kafka topic that I’d set up for this purpose.
But now we need to select a Kafka Connect sink connector. This part of the journey was fraught with some dead ends, so if you want to skip over the long and sometimes dangerous journey to the end of the tunnel, hop in a disused railway wagon for a short cut to the final section (3.5) which reveals “the answer”!
3.1 Open Source Kafka Connect PostgreSQL Sink Connectors
Previously I used an open source Kafka Connect Elasticsearch sink connector to move the sensor data from the Kafka topic to an Elasticsearch cluster. But this time around, I want to replace this with an open source Kafka Connect sink connector that will write the data into a PostgreSQL server. However, such connectors appear to be as rare as toilet paper on shop shelves in some parts of the world in 2020 (possibly because some monster Turing Machine needed more memory). I did finally track down (only) one example:
3.2 Open Source Kafka Connect JDBC Sink Connectors
Why is there a shortage of PostgreSQL sink connectors? The reason is essentially that PostgreSQL is just an example of the class of SQL databases, and SQL databases typically have support for Java Database Connectivity (JDBC) drivers. Searching for open source JDBC sink connectors resulted in more options.
So, searching in the gloom down the mine tunnel I found the following open source JDBC sink connector candidates, with some initial high-level observations:
How should you go about selecting a connector to trial? I had two criteria in mind. The first factor was “easy to build and generate an uber jar file” so I could upload it to the AWS S3 bucket to get it into the Instaclustr Managed Kafka Connect cluster I was using. The second factor relates to how the connectors map the data from the Kafka topic to the PostgreSQL database, tables, and columns, i.e. the data mappings. Luckily I had a close look at the PostgreSQL JSON side of this puzzle in my previous blog, where I discovered that you can store a schemaless JSON object into PostgreSQL in a single column, of type jsonb, with a gin index.
Taking another look at my tidal sensor data, I was reminded that it is structured JSON, but without an explicit JSON schema. Here’s an example of one record:
It has two fields, metadata and data. The data field is an array with one element (although it can have more elements if you request a time range rather than just the latest datum). However, it doesn’t have an explicit schema.
What are the theoretically possible options for where to put JSON schemas for Kafka sink connectors to use?
No schema
Explicit schema in each Kafka record (this wastes space and requires the producer to write a record with the correct schema as well as the actual data)
Explicit schema in the connector configuration. This is a logical possibility, but would potentially limit the connector to working with only one topic at a time. You’d need multiple configurations and therefore connectors to work with topics with different schemas.
Explicit schema stored somewhere else, for example, in a schema registry
But why/when is a schema needed? Looking at the code of some of the connectors, it appears that the schema is primarily used to auto-create a table with the correct columns and types, but this assumes that you want to transform/extract the JSON fields into multiple columns.
And is the data mapping separate from the schema? Does it use the schema? Let’s look more closely at a typical example: the IBM and Aiven connectors. Both are based on the Confluent approach, which (1) requires an explicit schema in each Kafka record, and (2) requires Kafka record values to be structs with primitive fields.
Assuming we could “flatten” and remove some of the unwanted fields, then the JSON tidal data would look like this:
So we’ve essentially drawn our own schema/map; let’s see if it helps us to find a way out of the mine!
I constructed some example records with this format and put them into a test Kafka topic with the kafka-console-producer.
The connector configuration requires a PostgreSQL connection URL, user and password, the topic to read from and the table name to write to. The schema is used to auto-create a table with the required columns and types if one doesn’t exist, and then the payload values are just inserted into the named columns.
Here’s an example PostgreSQL configuration from the IBM connector documentation:
Note that for PostgreSQL the URL needs the IP address, port number, and the database name (postgres in this example).
Building the IBM connector was easy, and I was able to upload the resulting uber jar to my AWS S3 bucket and sync the Instaclustr managed Kafka Connect cluster to see that it was available in the cluster (just remember to refresh the browser to get the updated list of connectors).
Here’s the CURL command used to configure and run this connector on the Instaclustr Managed Kafka Connect (KC) cluster (note that you need the credentials for both the Kafka Connect cluster REST API and the PostgreSQL database):
You can easily check that the connector is running in the Instaclustr console (under “Active Connectors”), and you should ensure that the Kafka Connect logs are shipped to an error topic in case something goes wrong (it probably will, as it often takes several attempts to configure a Kafka connector correctly).
The IBM connector was easy to build, configure, and start running; however, it had a few issues. As a result of the first error I had a look at the code and discovered that the table name must also include the database schema, so the correct value for “table.name.format” above is “public.tides_table” (for my example). Unfortunately, the next error was impossible to resolve (“org.apache.kafka.connect.errors.DataException: Cannot list fields on non-struct type”), and we concluded it was a problem with the Kafka record deserialization.
I also tried the Aiven sink connector (which has a very similar configuration to the IBM one), but I didn’t have any success with it either, given the lack of details on how to build it, a dependency on Oracle Java 11, the use of Gradle (which I’m not familiar with), and finally this error: “java.sql.SQLException: No suitable driver found for jdbc:postgresql:///…”, which possibly indicates that the PostgreSQL driver needs to be included in the build and/or packaged into the uber jar.
3.4 The Schemaless Approach
Time to burn our hand drawn map (schema) (Source: Shutterstock)
So that’s two out of four connectors down, two to go.
Given the requirement to have an explicit schema in the Kafka records, the IBM and Aiven connectors really weren’t suitable for my use case anyway, so no great loss so far. And even though I’d had success with the Apache Camel connectors in the previous blogs, this time around the documentation for the Camel JDBC sink connector didn’t have any configuration examples, so it wasn’t obvious how it would work or whether it needed a schema. This left the first (PostgreSQL-specific) connector as the only option remaining, so let’s throw away our hand-drawn map and try the schemaless idea out.
The idea behind this connector is that elements from a JSON Kafka record message are parsed out into column values, specified by a list of columns, and a list of parse paths in the connector configuration.
Here’s an example configuration for my JSON tidal data to extract the name, t and v values from the JSON record, and insert them into name, time, and value columns:
This connector was also hard to build (it was missing some jars, which I found in an earlier release). The documentation doesn’t say anything about how the table is created, so given the lack of schema and type information, I assumed the table had to be manually created before use. Unfortunately, I wasn’t even able to start this connector running in the Instaclustr Kafka Connect cluster, as it resulted in this error when trying to configure and run it:
ERROR Uncaught exception in REST call to /connectors (org.apache.kafka.connect.runtime.rest.errors.ConnectExceptionMapper:61)
3.5 Success With a Customized IBM Connector
So with our last candle burning low and the sounds of ravenous monsters in the darkness of the mine growing louder by the second, it was time to get creative.
The IBM connector had been the easiest to build, so it was time to rethink my requirements. Based on my previous experience with JSON and PostgreSQL, all I really needed the connector to do was insert the entire JSON Kafka record value into a single jsonb column in a specified table. The table would also need a single automatically incremented id field, and a gin index on the jsonb column. It turned out to be straightforward to modify the behavior of the IBM connector to do this.
The table and gin index (the index gets a unique name that includes the table name) are created automatically if they don’t already exist, and the column names are currently hardcoded to “id” (integer) and “json_object” (type jsonb). I did have to modify the table-creation SQL for it to work correctly with PostgreSQL, and customize the insert SQL.
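To make that concrete, here is a sketch of the kind of DDL the modified connector ends up issuing. The table name tides_jsonb is hypothetical; the id and json_object column names are the hardcoded ones mentioned above, and the exact statements in the connector may differ:

-- table with an auto-incrementing id and a single jsonb payload column
CREATE TABLE IF NOT EXISTS tides_jsonb (
    id          serial PRIMARY KEY,
    json_object jsonb
);

-- gin index named after the table, for querying inside the JSON payload
CREATE INDEX IF NOT EXISTS tides_jsonb_json_object_gin
    ON tides_jsonb USING gin (json_object);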
Based on my previous experiences with sink connectors failing due to badly formed JSON, or JSON with a different schema than expected, I was pleasantly surprised to find that this connector was very robust. This is because PostgreSQL refuses to insert badly formed JSON into a jsonb column and throws an exception, and the IBM connector doesn’t fail under these circumstances; it just logs an error to the Kafka error topic and moves on to the next available record.
However, my previous experiments failed to find a way to prevent the JSON error messages from being indexed into Elasticsearch, so I wondered if there was a solution for PostgreSQL that would not cause the connector to fail.
The solution was actually pretty simple, as PostgreSQL allows constraints on json columns. This constraint (currently executed manually after the table is created, but it could be done automatically in the connector code by analyzing the fields in the first record found) uses one of the PostgreSQL JSON existence operators (‘?&’) to ensure that ‘metadata’ and ‘data’ exist as top-level keys in the JSON record:
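The exact constraint from the post isn’t reproduced here, but a sketch of it could look like the following (table and column names carried over from the hypothetical DDL above); ?& checks that all keys in the array are present as top-level keys of the jsonb value:

ALTER TABLE tides_jsonb
    ADD CONSTRAINT json_object_has_metadata_and_data
    CHECK (json_object ?& ARRAY['metadata', 'data']);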
This excludes error records like this: {“error”:”error message”}, but doesn’t exclude records which have ‘metadata’ and ‘data’ plus superfluous extra fields. However, extra fields can just be ignored when processing the jsonb column later on, so it looks like we have found light at the end of the mine tunnel and are ready to try out the next part of the experiment, which includes building, deploying, and running Apache Superset, configuring it to access PostgreSQL, and then graphing the tidal data.
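As a final illustration of why the extra fields don’t matter, pulling values back out of the jsonb column later is just a matter of JSON operators. A hedged example (names follow the hypothetical table above, and the name/t/v fields follow the tidal record structure described earlier):

SELECT json_object -> 'metadata' ->> 'name'          AS station_name,
       json_object -> 'data' -> 0 ->> 't'            AS observation_time,
       (json_object -> 'data' -> 0 ->> 'v')::numeric AS water_level
FROM tides_jsonb
LIMIT 10;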
One theme of the 3.2 release is new analytical functionality in the raster module, and access to cloud-based rasters via the "out-db" option for rasters. Let's explore two new functions and exercise cloud raster support at the same time.
I was working on PostgreSQL storage-related features recently, and I found that PostgreSQL has an elegant storage addressing mechanism: the Buffer Tag. In this blog, I want to share my understanding of the Buffer Tag and some potential uses for it.
2. Buffer Tag
I was always curious to know how PostgreSQL can find the tuple data blocks so quickly when I first started to use PostgreSQL in an IoT project, but I never got the chance to look into it in detail, even though I knew PostgreSQL is a very well organized open-source project. That changed when I recently got a task that required solving a storage-related issue in PostgreSQL. There is a very detailed explanation of how the Buffer Manager works in the online book for developers The Internals of PostgreSQL, one of the best PostgreSQL books I would recommend to anyone beginning PostgreSQL development.
The Buffer Tag, in simple words, is just five numbers. Why five? First, all objects, including the storage files, are managed by Object Identifiers (OIDs). For example, when a user creates a table, the table name is mapped to an OID; when a user creates a database, the name is mapped to an OID; and when the corresponding data needs to be persisted to disk, the files are also named using OIDs. Second, when a table requires more pages to store more tuples, each page of the same table is managed by a page number in sequence. For example, when PostgreSQL needs to estimate the table size before deciding what kind of scan should be used to find the tuples fastest, it needs to know the number of blocks. Third, data tuples are the main user data that need to be stored, but in order to manage them, PostgreSQL needs other information as well, such as visibility to track the status of the tuples and free space to optimize file usage; this is what the fork number distinguishes. So this ends up with five numbers: Tablespace, Database, Table, ForkNumber, and BlockNumber.
Given these five numbers, PostgreSQL can always find out where the data tuples are stored, which file is used, and what size the table is etc.
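Although these numbers live inside the server, you can see most of them from SQL too. A small illustrative query (the table name my_table is hypothetical) shows the tablespace, database, and relation identifiers together with the file path derived from them:

SELECT (SELECT oid FROM pg_database WHERE datname = current_database()) AS database_oid,
       c.reltablespace AS tablespace_oid,  -- 0 means the database's default tablespace
       c.oid           AS relation_oid,
       c.relfilenode,
       pg_relation_filepath(c.oid::regclass) AS file_path
FROM pg_class AS c
WHERE c.relname = 'my_table';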
3. How Buffer Tag is used
Now we have the buffer tag, five numbers, but how is it used in PostgreSQL? One typical use case of the buffer tag is to help the buffer manager track the location of memory blocks in the buffer pool/array. In this case, a hash table is used to resolve the mapping between a buffer tag and the location of a memory block in the buffer pool/array. Here is a picture showing the relationship among the buffer tag, hash table, buffer descriptor, and buffer pool/array.
For example, the first green buffer tag {(1663, 13589, 16387), 0, 0} indicates tablespace 1663, database 13589, and table 16387, with forkNumber 0 (the main fork for tuples), stored in the file at block 0. This buffer tag hashes to 1536704684, which at this moment has been assigned to memory block 0 managed by the buffer manager. Since the buffer descriptor is addressed using the same slot number as the memory block in the buffer pool/array, both share slot number 0 in this case.
With the above relationship, PostgreSQL can find the memory block location for a particular buffer tag, or assign a buffer slot to a new memory block.
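If you want to see this mapping in action, the pg_buffercache extension exposes essentially the buffer tag plus the buffer id for each slot in shared buffers. A hedged example query (my_table is again a hypothetical table name):

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT bufferid, reltablespace, reldatabase, relfilenode,
       relforknumber, relblocknumber
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('my_table')
LIMIT 5;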
The other typical use case of the buffer tag is to help the storage manager manage the corresponding files. In this case, the blockNumber is always 0, since multiple blocks are not required to store more data. From here, you can build your own use cases for the buffer tag. For example, use buf_id as the number of blocks for each database relation instead of using it to indicate the memory block location. Moreover, you can also use the buffer tag plus the hash table to manage several pieces of information at once, such as keeping buf_id for the location of the memory block and adding a new attribute total to track the number of blocks. You can achieve this by defining a different buffer lookup entry. For example:
The original buffer lookup entry uses the buffer tag to look up the location of a memory block:
/* entry for buffer lookup hashtable */
typedef struct
{
    BufferTag key;  /* Tag of a disk page */
    int       id;   /* Associated buffer ID */
} BufferLookupEnt;
Below is an example of a buffer lookup entry that can look up both the location of the memory block and the number of blocks used by a particular buffer tag:
typedef struct
{
    BufferTag key;    /* Tag of a disk page */
    int       id;     /* Associated buffer ID */
    int       total;  /* the total number of X */
} RelLookupEnt;
4. Summary
In this blog, we discussed what the Buffer Tag is, how it is used in PostgreSQL, and some potential uses of the Buffer Tag to address the mapping between tables and storage files, as well as lookup problems.
A software developer specializing in C/C++ programming, with experience in hardware, firmware, software, databases, networking, and system architecture. Now working at HighGo Software Inc. as a senior PostgreSQL architect.
PostgreSQL Person of the Week Interview with Roman Druzyagin: Name’s Roman. I was born in Russia and have lived most of my life there. I grew up near Moscow, and since 2002 I’ve been residing in St. Petersburg, with plans to relocate to the European Union in the near future. I am currently 33 years old.