Most people in the SQL and PostgreSQL communities have used the LIMIT clause provided by many database engines. However, what many do not know is that LIMIT / OFFSET are not part of the SQL standard and are thus not portable. The proper way to handle LIMIT is basically to use SELECT … FETCH FIRST ROWS. However, there is more than meets the eye.
LIMIT vs. FETCH FIRST ROWS
Before we dig into some of the more advanced features we need to see how LIMIT and FETCH FIRST ROWS can be used. To demonstrate this feature, I have compiled a simple data set:
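The statements used to build the data set are not shown here; judging from the output below, it could have been created roughly like this:
-- plausible reconstruction of the demo data set (values taken from the output below)
CREATE TABLE t_test (id int);
INSERT INTO t_test VALUES (1), (2), (3), (3), (4), (4), (5);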
Our data set has 7 simple rows. Let’s see what happens if we use LIMIT:
test=# SELECT * FROM t_test LIMIT 3;
id
----
1
2
3
(3 rows)
In this case, the first three rows are returned. Note that we are talking about ANY rows here. Whatever can be found first is returned. There is no special order.
The ANSI SQL compatible way of doing things is as follows:
test=# SELECT *
FROM t_test
FETCH FIRST 3 ROWS ONLY;
id
----
1
2
3
(3 rows)
Many of you may never have used or seen this kind of syntax before, but this is actually the “correct” way to handle LIMIT.
However, there is more: What happens if NULL is used inside your LIMIT clause? The result might surprise you:
test=# SELECT * FROM t_test LIMIT NULL;
id
----
1
2
3
3
4
4
5
(7 rows)
The database engine does not know when to stop returning rows. Remember, NULL is undefined, so it does not mean zero. Therefore, all rows are returned. You have to keep that in mind in order to avoid unpleasant surprises…
FETCH FIRST … ROWS WITH TIES
WITH TIES was introduced in PostgreSQL 13 and addresses a common problem: handling duplicates. If you fetch the first couple of rows, PostgreSQL stops after a fixed number of rows. However, what happens if the same value shows up again and again? Here is an example:
test=# SELECT *
FROM t_test
ORDER BY id
FETCH FIRST 3 ROWS WITH TIES;
id
----
1
2
3
3
(4 rows)
In this case, we’ve actually got 4 rows, not just 3. The reason is that the last value shows up again after 3 rows, so PostgreSQL decided to include it as well. What is important to mention here is that an ORDER BY clause is needed, because otherwise, the result would be quite random. WITH TIES is therefore important if you want to include all rows of a certain kind – without stopping at a fixed number of rows.
Suppose one more row is added:
test=# INSERT INTO t_test VALUES (2);
INSERT 0 1
test=# SELECT *
FROM t_test
ORDER BY id
FETCH FIRST 3 ROWS WITH TIES;
id
----
1
2
2
(3 rows)
In this case, we indeed get only 3 rows, because WITH TIES is not about returning 3 distinct values; it only adds extra rows when the value of the last row is repeated right at the cut-off, which is not the case here.
WITH TIES: Managing additional columns
So far we have looked at the simplest case using just one column. However, that is far from practical. In a real-world application, you will certainly have more than a single column. So let us add one:
test=# ALTER TABLE t_test
ADD COLUMN x numeric DEFAULT random();
ALTER TABLE
test=# TABLE t_test;
id | x
----+--------------------
1 | 0.258814135879447
2 | 0.561647200043165
3 | 0.340481941960185
3 | 0.999635345010109
4 | 0.467043266494571
4 | 0.742426363498449
5 | 0.0611112678267247
2 | 0.496917052156565
(8 rows)
In the case of LIMIT nothing changes. However, WITH TIES is a bit special here:
test=# SELECT *
FROM t_test
ORDER BY id
FETCH FIRST 4 ROWS WITH TIES;
id | x
----+-------------------
1 | 0.258814135879447
2 | 0.561647200043165
2 | 0.496917052156565
3 | 0.999635345010109
3 | 0.340481941960185
(5 rows)
What you can see here is that 5 rows are returned. The fifth row is added because id = 3 appears more than once. Mind the ORDER BY clause: We are ordering by id. For that reason, the id column is relevant to WITH TIES.
Let’s take a look at what happens when the ORDER BY clause is extended:
test=# SELECT *
FROM t_test
ORDER BY id, x
FETCH FIRST 4 ROWS WITH TIES;
id | x
----+-------------------
1 | 0.258814135879447
2 | 0.496917052156565
2 | 0.561647200043165
3 | 0.340481941960185
(4 rows)
We are ordering by two columns. Therefore WITH TIES is only going to add rows if both columns are identical, which is not the case in my example.
LIMIT… Or finally…
WITH TIES is a wonderful new feature provided by PostgreSQL. However, it is not only there to limit data. If you are a fan of windowing functions you can also make use of WITH TIES as shown in one of my other blog posts covering advanced SQL features provided by PostgreSQL.
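For completeness, here is a sketch of how a window function can emulate WITH TIES. This is a generic formulation, not necessarily the query used in the post referenced above:
-- rank() assigns equal ranks to ties, so filtering on rank() <= 4
-- returns the same rows as FETCH FIRST 4 ROWS WITH TIES (ordered by id)
SELECT id, x
FROM (
    SELECT id, x,
           rank() OVER (ORDER BY id) AS r
    FROM t_test
) AS sub
WHERE r <= 4
ORDER BY id;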
Too often, web tiers are full of boilerplate that does nothing except convert a result set into JSON. A middle tier could be as simple as a function call that returns JSON. All we need is an easy way to convert result sets into JSON in the database.
PostgreSQL has built-in JSON generators that can be used to create structured JSON output right in the database, upping performance and radically simplifying web tiers.
Fortunately, PostgreSQL has such functions, which run right next to the data, for better performance and lower bandwidth usage.
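As a small illustration (the table and column names are invented for this example), built-in generators such as row_to_json() and json_agg() can turn a result set into a single JSON document directly in the database:
-- aggregate a whole result set into one JSON array of objects
SELECT json_agg(row_to_json(t)) AS result
FROM (
    SELECT id, name, created_at
    FROM customers
    ORDER BY id
) AS t;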
Some time ago, PostgreSQL developers added multiranges, that is, a datatype that can be used to store multiple ranges in a single column. The thing is that it wasn't really simple to get the list of ranges from within such a multirange. There was no operator, no way to split it. A month ago Alexander … Continue reading "How to get list of elements from multiranges?"
We've already discussed some object-level locks (specifically, relation-level locks), as well as row-level locks with their connection to object-level locks and also explored wait queues, which are not always fair.
We have a hodgepodge this time. We'll start with deadlocks (actually, I planned to discuss them last time, but that article was excessively long in itself), then briefly review the remaining object-level locks, and finally discuss predicate locks.
Deadlocks
When using locks, we can confront a deadlock. It occurs when one transaction tries to acquire a resource that is already in use by another transaction, while the second transaction tries to acquire a resource that is in use by the first. The figure on the left below illustrates this: solid-line arrows indicate acquired resources, while dashed-line arrows show attempts to acquire a resource that is already in use.
To visualize a deadlock, it is convenient to build the wait-for graph. To do this, we remove specific resources, leave only transactions, and indicate which transaction waits for which other. If the graph contains a cycle (that is, starting from a vertex we can get back to it by following the arrows), this is a deadlock.
The usability of the libpq feature to trace an application's server/client communications has been enhanced in PostgreSQL 14, with an improved format and an option to control the output.
Today I released pspg 5.1.0. Mostly this is a bugfix and refactoring release, but there is one, I hope, interesting feature. You can press Ctrl-o to temporarily switch to the terminal's primary screen. In the primary screen you can see your psql session. After pressing any key, the terminal switches back to the alternate screen with pspg.
Thanks to Thomas Munro's work, the psql \watch command will support pagers (in PostgreSQL 15). At this time only pspg can do this work (in streaming mode). When you set the environment variable PSQL_WATCH_PAGER, the \watch command redirects output to the specified pager (for pspg: export PSQL_WATCH_PAGER="pspg --stream"). Then you can run a command like:
select * from pg_stat_database \watch 5
or
select * from pg_stat_activity where state='active' \watch 1
In December 2020, you might have seen an article from CentOS about shifting their focus towards CentOS Stream, which is the upstream version of RHEL. CentOS also mentioned that version 8 would be EOL (end of life) by the end of 2021. This means that it will no longer receive any updates and fixes from its upstream version of RHEL. A few days after this announcement, Rocky Linux was announced by the CentOS founder, Gregory Kurtzer, as 100% bug-for-bug compatible with RHEL. The Rocky Linux project quickly gained a lot of attention, and also got sponsors from cloud vendors like AWS, Google Cloud and Microsoft. We wanted to take this opportunity to write an article about a CentOS vs Rocky Linux benchmark with PostgreSQL.
CentOS
As of now, CentOS is widely used in production, because it was a downstream version of RHEL. This means that CentOS was receiving all the critical RHEL bug fixes for free, which made CentOS as robust and reliable as RHEL. The future of the CentOS project is its Stream version, which is the upstream of RHEL. This means that CentOS Stream may not receive such critical bug fixes from RHEL; instead, CentOS Stream bug fixes will be pushed down to the RHEL project.
CentOS vs Rocky Linux Benchmark
After seeing the Rocky Linux project announcement, we wanted to run some PostgreSQL benchmarks and see whether we get the same performance on Rocky Linux 8 as on CentOS 8. To run this benchmark, we chose the phoronix tool, which offers a big list of benchmarking test suites. Using this phoronix tool, we ran a few general benchmarks in addition to PostgreSQL's pgbench.
Phoronix test suites
Phoronix is a system benchmarking tool which offers a big list of test suites. It provides test suites for CPU, memory, disk, compilation, etc. for most operating systems. We will be using this tool to perform the benchmarking on both Rocky Linux and CentOS. For the purpose of this article, we have considered running the following test suites.
Compiler
CPU
Memory
Disk
Stress
PostgreSQL (pgbench)
Stress & PostgreSQL
By running the above benchmarks on these two instances, we will be able to understand whether Rocky Linux is going to be a true CentOS replacement.
Install phoronix test suite
After spinning up two dedicated instances hosted on Linode, we installed the phoronix tool on them. No other additional software or configuration changes were made, since we do not want any drift in the benchmarking results.
To install phoronix, we executed these 3 steps on both the Rocky Linux and CentOS instances.
# tar -zxf phoronix-test-suite-10.4.0.tar.gz
# cd phoronix-test-suite
# ./install-sh
which: no xdg-mime in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
Phoronix Test Suite Installation Completed
Executable File: /usr/bin/phoronix-test-suite
Documentation: /usr/share/doc/phoronix-test-suite/
Phoronix Test Suite Files: /usr/share/phoronix-test-suite/
CentOS vs Rocky Hardware configuration
Before performing the benchmarking, let us look at the system capacities of both instances. As you can see in the specifications table below, both systems have the same configuration and the only difference is the operating system.
Spec             | CentOS                                       | Rocky
-----------------+----------------------------------------------+----------------------------------------------
Processor        | 4 x AMD EPYC 7501 32-Core                    | 4 x AMD EPYC 7501 32-Core
Core Count       | 4                                            | 4
Extensions       | SSE 4.2 + AVX2 + AVX + RDRAND + FSGSBASE     | SSE 4.2 + AVX2 + AVX + RDRAND + FSGSBASE
Cache Size       | 16 MB                                        | 16 MB
Microcode        | 0x1000065                                    | 0x1000065
Core Family      | Zen                                          | Zen
Memory           | 1 x 8 GB RAM QEMU                            | 1 x 8 GB RAM QEMU
Motherboard      | QEMU Standard PC                             | QEMU Standard PC
BIOS Version     | rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org | rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org
Chipset          | Intel 82G33/G31/P35/P31 + ICH9               | Intel 82G33/G31/P35/P31 + ICH9
Network          | Red Hat Virtio device                        | Red Hat Virtio device
Disk             | 171GB QEMU HDD + QEMU HDD                    | 171GB QEMU HDD + QEMU HDD
File-System      | ext4                                         | ext4
Mount Options    | relatime rw seclabel                         | relatime rw seclabel
Disk Scheduler   | MQ-DEADLINE                                  | MQ-DEADLINE
Disk Details     | Block Size: 4096                             | Block Size: 4096
Operating System | CentOS Linux 8                               | Rocky Linux 8.4
Kernel           | 4.18.0-305.3.1.el8.x86_64 (x86_64)           | 4.18.0-305.3.1.el8_4.x86_64 (x86_64)
Compiler         | GCC 8.4.1 20200928                           | GCC 8.4.1 20200928
System Layer     | KVM                                          | KVM
Security         | SELinux                                      | SELinux
Compiler Benchmarking
Phoronix provides a compiler benchmarking test suite, and we will be using this suite to compare the benchmark results between CentOS and Rocky Linux. To perform the compiler benchmarking, let us build the kernel (linux-5.10.20) on both operating systems and compare the completion times.
OS     | Build Time (seconds) | Details
-------+----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
CentOS |                      | OS: CentOS Linux 8, Kernel: 4.18.0-305.3.1.el8.x86_64 (x86_64), Compiler: GCC 8.4.1 20200928, File-System: ext4, Screen Resolution: 1024x768, System Layer: KVM
Rocky  | 546.10               | OS: Rocky Linux 8.4, Kernel: 4.18.0-305.3.1.el8_4.x86_64 (x86_64), Compiler: GCC 8.4.1 20200928, File-System: ext4, Screen Resolution: 1024x768, System Layer: KVM
By default, the phoronix tool runs each test 3 times and reports the average duration. The values listed above are the averages of the 3 runs. As per the compiler test results, Rocky Linux completed the kernel build a little faster than CentOS, with no huge gap between the results.
CPU Benchmarking
As both instances have the same CPU capacity, let us run sysbench with the CPU profile and see the number of events the CPU processes per second.
# phoronix-test-suite benchmark sysbench
Here are the results
OS     | Events Per Second | Details
-------+-------------------+------------------------------------------------
CentOS | 2687.02           | Processor: 4 x AMD EPYC 7501 32-Core (4 Cores)
Rocky  | 2701.93           | Processor: 4 x AMD EPYC 7501 32-Core (4 Cores)
Rocky Linux processed slightly more events per second (14.91 more) than CentOS, and this value is the average of the 3 runs. Again, there is not a huge gap between Rocky Linux and CentOS.
Memory Benchmarking
Both instances have the same memory capacity, and this time let us use the sysbench memory profile to generate load on the system. Here, sysbench initiates a block of memory in RAM and performs seek tests on it.
# phoronix-test-suite benchmark sysbench
Following are the results -
OS     | MB/sec  | Details
-------+---------+---------------------------
CentOS | 7907.09 | Memory: 1 x 8 GB RAM QEMU
Rocky  | 7798.74 | Memory: 1 x 8 GB RAM QEMU
From the results, CentOS performs slightly more operations per second than Rocky Linux. Again, this is the average result from the 3 test runs.
Disk Benchmarking
When it comes to disk benchmarking, we have multiple factors to consider, such as direct vs. buffered I/O, block size, and sequential and random reads. The phoronix tool provides an advanced benchmarking suite, flexible I/O (fio), and we ran it on these two machines. As there are many possible benchmarks, we limited it to a small set of test cases.
It seems that Rocky Linux, with the same ext4 filesystem, gives somewhat better throughput than CentOS, even though the configuration is the same.
Stress Benchmarking
This time, let us use the popular stress-ng test to check system stability. Phoronix provides a stress-ng benchmarking suite, which initiates a number of benchmarking tests.
# phoronix-test-suite benchmark stress-ng
Following are the results -
Test (op/sec)            | CentOS     | Rocky
-------------------------+------------+------------
MMAP                     | 9.03       | 10.22
NUMA                     | 34.69      | 33.70
MEMFD                    | 59.03      | 61.09
Atomic                   | 320701.15  | 317368.35
Crypto                   | 329.47     | 329.09
Malloc                   | 5394403.52 | 5886840.34
Forking                  | 13126.39   | 13116.39
SENDFILE                 | 22847.99   | 22754.75
CPU Cache                | 2.64       | 3.43
CPU Stress               | 461.63     | 457.55
Semaphores               | 267137.47  | 267230.52
Matrix Math              | 8479.07    | 8510.55
Vector Math              | 11581.05   | 11578.17
Memory Copying           | 1623.55    | 1722.36
Socket Activity          | 1460.90    | 1519.78
Context Switching        | 787841.94  | 797083.22
Glibc C String Functions | 152539.11  | 153499.17
Glibc Qsort Data Sorting | 21.85      | 19.23
System V Message Passing | 1686145.73 | 1739696.00
As per the test results, Rocky Linux almost matches most of the CentOS results, and seems to give better performance in a few areas such as memory allocations, socket activity, context switching, semaphores, and System V message passing. CentOS has a slight lead in forking and atomic operations.
PGBench Benchmarking
Finally, we ran the much-awaited PostgreSQL benchmarking on these two machines. Let us use the phoronix pgbench test suite and run both the read-only and read-write tests. Use the following command to run pgbench and see the results. This suite will install PostgreSQL 13 and run the default pgbench test cases.
# phoronix-test-suite benchmark pgbench
Following are the results for Rocky Linux and CentOS with PostgreSQL -
OS     | Read Only | Avg Latency | Read Write | Avg Latency | Details
-------+-----------+-------------+------------+-------------+-------------------------------------
CentOS | 17570 TPS | 5.693 ms    | 3402 TPS   | 29.40 ms    | Scaling Factor: 1000 - Clients: 100
Rocky  | 17751 TPS | 5.635 ms    | 3464 TPS   | 28.88 ms    | Scaling Factor: 1000 - Clients: 100
This benchmarking was done with a default installation of PostgreSQL, with default settings on both machines. No changes were made to any configuration parameter. Rocky Linux matches the CentOS benchmark results, with slightly better TPS.
Stress & PostgreSQL
This is an additional benchmarking case we performed to see how PostgreSQL behaves on Rocky Linux and CentOS while the system is already under load. For this test, we ran 2 phoronix suites in parallel: one suite running stress-ng and the other running pgbench. While running these suites, we followed the same test parameters as in the earlier tests. These tests were performed on the two hosts at almost the same time.
Session1
# phoronix-test-suite benchmark stress-ng
Session2
# phoronix-test-suite benchmark pgbench
Following are the results -
OS     | Read Only | Avg Latency | Read Write | Avg Latency | Details
-------+-----------+-------------+------------+-------------+-------------------------------------
CentOS | 12287 TPS | 8.291 ms    | 2234 TPS   | 45.67 ms    | Scaling Factor: 1000 - Clients: 100
Rocky  | 15051 TPS | 6.781 ms    | 2606 TPS   | 39.27 ms    | Scaling Factor: 1000 - Clients: 100
From the above results, it is a bit of a surprise to see more TPS from Rocky Linux than from CentOS when running PostgreSQL with additional load on the system.
Conclusion
Going through all these benchmarking results, Rocky Linux gives slightly better throughput than CentOS, though the difference is not big. This suggests that Rocky Linux is going to be a true replacement for CentOS, and the benchmark numbers already back that up. PostgreSQL is also pretty stable on Rocky Linux, producing the same (or slightly higher) throughput as on CentOS.
We look forward to your valuable thoughts and input.
By the way, if you are interested in migrating proprietary databases like Oracle or SQL Server to PostgreSQL, please feel free to contact us. We also provide Remote DBA services and Performance Assessments for PostgreSQL databases.
Logical Replication was introduced in PostgreSQL-10 and has been improved with each version since then. Logical Replication is a method to replicate data selectively, unlike physical replication where the data of the entire cluster is copied. It can be used to build a multi-master or bi-directional replication solution. One of the main differences compared with physical replication is that a transaction is replicated only at commit time. This leads to apply lag for large transactions, where we need to wait until the transaction is finished before transferring the data. In the upcoming PostgreSQL-14 release, we are introducing a mechanism to stream large in-progress transactions. We have seen replication performance improve by a factor of two or more for large transactions, especially due to early filtering. See the performance test results reported on hackers and in another blog on the same topic. This will reduce the apply lag to a good degree.
The first thing we needed for this feature was to decide when to start streaming the WAL contents. One could ask: if we have such a technology, why not stream each change of a transaction separately, as and when we retrieve it from the WAL? That would actually lead to sending much more data across the network, because we would need to send additional transaction information with each change so that the apply side can recognize the transaction to which the change belongs. To address this, in PostgreSQL-13, we introduced a new GUC parameter logical_decoding_work_mem which allows users to specify the maximum amount of memory to be used by logical decoding, before some of the decoded changes are either written to local disk or streamed to the subscriber. The parameter is also used to control the memory used by logical decoding as explained in the blog.
The next thing that prevented incremental decoding was the delay in finding the association of a subtransaction with its top-level XID. During logical decoding, we accumulate all changes along with their (sub)transaction. Now, while sending the changes to the output plugin or streaming them to the other node, we need to combine all the changes that happened in the transaction, which requires us to find the association of each top-level transaction with its subtransactions. Before PostgreSQL-14, we built this association at the XLOG_XACT_ASSIGNMENT WAL record, which we normally log after 64 subtransactions or at commit time, because these are the only two times when we get such an association in the WAL. To find this association as it happens, we now also write the assignment info into the WAL immediately, as part of the first WAL record for each subtransaction. This is done only when wal_level=logical to minimize the overhead.
Yet another thing required for incremental decoding was to process invalidations at each command end. The basic idea of invalidations is that they keep the caches (like the relation cache) up to date, allowing the next command to use an up-to-date schema. This is required to correctly decode WAL incrementally, as while decoding we will use the relation attributes from the caches. For this, when wal_level=logical, we write invalidations at the command end into the WAL so that decoding can use this information. The invalidations are decoded and accumulated in the top-level transaction, and then executed during replay. This obviates the need to decode the invalidations as part of a commit record.
The previous paragraphs explain the enhancements required in the server infrastructure to allow incremental decoding. The next step was to provide APIs (stream methods) for out-of-core logical replication to stream large in-progress transactions. We added seven callbacks to the output plugin API to allow this: five required ones (stream_start_cb, stream_stop_cb, stream_abort_cb, stream_commit_cb and stream_change_cb) and two optional ones (stream_message_cb and stream_truncate_cb). For details about these APIs, refer to the PostgreSQL docs.
When streaming an in-progress transaction, the changes (and messages) are streamed in blocks demarcated by stream_start_cb and stream_stop_cb callbacks. Once all the decoded changes are transmitted, the transaction can be committed using the stream_commit_cb callback (or possibly aborted using the stream_abort_cb callback). One example sequence of streaming transaction may look like the following:
/* Change logical_decoding_work_mem to 64kB in the session */
postgres=# show logical_decoding_work_mem;
 logical_decoding_work_mem
---------------------------
 64kB
(1 row)

postgres=# CREATE TABLE stream_test(data text);
CREATE TABLE
postgres=# SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 ?column?
----------
 init
(1 row)

postgres=# INSERT INTO stream_test SELECT repeat('a', 6000) || g.i FROM generate_series(1, 500) g(i);
INSERT 0 500

postgres=# SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '1', 'stream-changes', '1');
                       data
--------------------------------------------------
 opening a streamed block for transaction TXN 741
 streaming change for TXN 741
 streaming change for TXN 741
 streaming change for TXN 741
 ...
 ...
 streaming change for TXN 741
 streaming change for TXN 741
 streaming change for TXN 741
 closing a streamed block for transaction TXN 741
 opening a streamed block for transaction TXN 741
 streaming change for TXN 741
 streaming change for TXN 741
 streaming change for TXN 741
 ...
 ...
 streaming change for TXN 741
 streaming change for TXN 741
 closing a streamed block for transaction TXN 741
 committing streamed transaction TXN 741
(505 rows)
The actual sequence of callback calls may be more complicated depending on the server operations. There may be blocks for multiple streamed transactions, some of the transactions may get aborted, etc.
Note that streaming is triggered when the total amount of changes decoded from the WAL (for all in-progress transactions) exceeds the limit defined by the logical_decoding_work_mem setting. At that point, the largest top-level transaction (measured by the amount of memory currently used for decoded changes) is selected and streamed. However, in some cases we still have to spill to disk even if streaming is enabled because we exceed the memory threshold but still have not decoded the complete tuple e.g., only decoded toast table insert but not the main table insert or decoded speculative insert but not the corresponding confirm record. However, as soon as we get the complete tuple we stream the transaction including the serialized changes.
While streaming in-progress transactions, the concurrent aborts may cause failures when the output plugin (or decoding of WAL records) consults catalogs (both system and user-defined). Let me explain this with an example, suppose there is one catalog tuple with (xmin: 500, xmax: 0). Now, the transaction 501 updates the catalog tuple and after that we will have two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0). Now, if 501 is aborted and some other transaction say 502 updates the same catalog tuple then the first tuple will be changed to (xmin: 500, xmax: 502). So, the problem is that when we try to decode the tuple inserted/updated in 501 after the catalog update, we will see the catalog tuple with (xmin: 500, xmax: 502) as visible because it will consider that the tuple is deleted by xid 502 which is not visible to our snapshot. And when we will try to decode with that catalog tuple, it can lead to a wrong result or a crash. So, it is necessary to detect concurrent aborts to allow streaming of in-progress transactions. For detecting the concurrent abort, during catalog scan we can check the status of the xid and if it is aborted we will report a specific error so that we can stop streaming current transaction and discard the already streamed changes on such an error. We might have already streamed some of the changes for the aborted (sub)transaction, but that is fine because when we decode the abort we will stream the abort message to truncate the changes in the subscriber.
To add support for streaming of in-progress transactions into the built-in logical replication, we need to primarily do four things:
(a) Extend the logical replication protocol to identify in-progress transactions, and allow adding additional bits of information (e.g. XID of subtransactions). Refer to PostgreSQL docs for the protocol details.
(b) Modify the output plugin (pgoutput) to implement the new stream API callbacks, by leveraging the extended replication protocol.
(c) Modify the replication apply worker, to properly handle streamed in-progress transaction by spilling the data to disk and then replaying them on commit.
(d) Provide a new option for streaming while creating a subscription.
The below example demonstrates how to set up the streaming via built-in logical replication:
Publisher node:
Set logical_decoding_work_mem = '64kB';

# Set up publication with some initial data
CREATE TABLE test_tab (a int primary key, b varchar);
INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar');
CREATE PUBLICATION tap_pub FOR TABLE test_tab;
Subscriber node:
CREATE TABLE test_tab (a int primary key, b varchar);
CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost port=5432 dbname=postgres' PUBLICATION tap_pub WITH (streaming = on);
Publisher Node:
# Ensure the corresponding replication slot is created on the publisher node
select slot_name, plugin, slot_type from pg_replication_slots;
 slot_name |  plugin  | slot_type
-----------+----------+-----------
 tap_sub   | pgoutput | logical
(1 row)

# Confirm there are no streamed bytes yet
postgres=# SELECT slot_name, stream_txns, stream_count, stream_bytes FROM pg_stat_replication_slots;
 slot_name | stream_txns | stream_count | stream_bytes
-----------+-------------+--------------+--------------
 tap_sub   |           0 |            0 |            0
(1 row)

# Insert, update and delete enough rows to exceed the logical_decoding_work_mem (64kB) limit.
BEGIN;
INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
DELETE FROM test_tab WHERE mod(a,3) = 0;

Subscriber Node:
# The streamed data is still not visible.
select * from test_tab;
 a |  b
---+-----
 1 | foo
 2 | bar
(2 rows)

Publisher Node:
# Commit the large transaction
COMMIT;

Subscriber Node:
# The data must be visible on the subscriber
select count(*) from test_tab;
 count
-------
  3334
(1 row)
This feature was proposed in 2017 and committed in 2020 as part of various commits 0bead9af48, c55040ccd0, 45fdc9738b, 7259736a6e, and 464824323e. It took a long time to complete this feature because of the various infrastructure pieces required to achieve this. I would really like to thank all the people involved in this feature especially Tomas Vondra who has initially proposed it and then Dilip Kumar who along with me had completed various remaining parts and made it a reality. Then also to other people like Neha Sharma, Mahendra Singh Thalor, Ajin Cherian, and Kuntal Ghosh who helped throughout the project to do reviews and various tests. Also, special thanks to Andres Freund and other community members who have suggested solutions to some of the key problems of this feature. Last but not least, thanks to EDB and Fujitsu's management who encouraged me and some of the other members to work on this feature.
Some time ago I wrote about new options for explains – one that prints settings that were modified from default. This looks like this: Aggregate (cost=35.36..35.37 rows=1 width=8) -> Index Only Scan using pg_class_oid_index on pg_class (cost=0.27..34.29 rows=429 width=0) Settings: enable_seqscan = 'off' Finally, today, I pushed a change that displays them on explain.depesz.com. To … Continue reading "Display “settings” from plans on explain.depesz.com"
A quickstart guide to create a web map with the Python-based web framework Django using its module GeoDjango, the PostgreSQL database with its spatial extension PostGIS and Leaflet, a JavaScript library for interactive maps.
PostgreSQL Person of the Week Interview with Rafia Sabih: I am basically from India, currently living in Berlin, Germany. I started my PostgreSQL journey in my masters and continued it by joining EDB, India. Then, I increased my spectrum to Postgres on Kubernetes working at Zalando.
One of the less visible improvements coming in PostGIS 3.2 (via the GEOS 3.10 release) is a new algorithm for repairing invalid polygons and multipolygons.
Algorithms like polygon intersection, union and difference rely on guarantees that the structure of inputs follows certain rules. We call geometries that follow those rules "valid" and those that do not "invalid".
zheap has been designed as a new storage engine to handle UPDATE in PostgreSQL more efficiently. A lot has happened since my last report on this important topic, and I thought it would make sense to give readers a bit of a status update – to see how things are going, and what the current status is.
zheap: What has been done since last time
Let’s take a look at the most important things we’ve achieved since our last status report:
logical decoding
work on UNDO
patch reviews for UNDO
merging codes
countless fixes and improvements
zheap: Logical decoding
The first thing on the list is definitely important. Most people might be familiar with PostgreSQL’s capability to do logical decoding. What that means is that the transaction log (= WAL) is transformed back to SQL so that it can be applied on some other machine, leading to identical results on the second server. The capability to do logical decoding is not just a given. Code has to be written which can decode zheap records and turn them into readable output. So far this implementation looks good. We are not aware of bugs in this area at the moment.
Work on UNDO
zheap is just one part of the equation when it comes to new storage engines. As you might know, a standard heap table in PostgreSQL holds all necessary versions of a row inside the same physical files. In zheap this is not the case. It is heavily based on a feature called "UNDO", which works similarly to what Oracle and some other database engines do. The idea is to move old versions of a row out of the table and then, in case of a ROLLBACK, put them back in.
What has been achieved is that the zheap code is now compatible with the new UNDO infrastructure suggested by the community (which we hope to see in core by version 15). The general idea here is that UNDO should not only be focused on zheap, but provide a generic infrastructure other storage engines will be able to use in the future as well. That’s why preparing the zheap code for a future UNDO feature of PostgreSQL is essential to success. If you want to follow the discussion on the mailing list, here is where you can find some more detailed information about zheap and UNDO.
Fixing bugs and merging
As you can imagine, a major project such as zheap will also cause some serious work on the quality management front. Let’s look at the size of the code:
[hs@node1 zheap_postgres]$ cd src/backend/access/zheap/
[hs@node1 zheap]$ ls -l *c
-rw-rw-r--. 1 hs hs 14864 May 27 04:25 prunetpd.c
-rw-rw-r--. 1 hs hs 27935 May 27 04:25 prunezheap.c
-rw-rw-r--. 1 hs hs 11394 May 27 04:25 rewritezheap.c
-rw-rw-r--. 1 hs hs 96748 May 27 04:25 tpd.c
-rw-rw-r--. 1 hs hs 13997 May 27 04:25 tpdxlog.c
-rw-rw-r--. 1 hs hs 285703 May 27 04:25 zheapam.c
-rw-rw-r--. 1 hs hs 59175 May 27 04:25 zheapam_handler.c
-rw-rw-r--. 1 hs hs 62970 May 27 04:25 zheapam_visibility.c
-rw-rw-r--. 1 hs hs 61636 May 27 04:25 zheapamxlog.c
-rw-rw-r--. 1 hs hs 16608 May 27 04:25 zheaptoast.c
-rw-rw-r--. 1 hs hs 16218 May 27 04:25 zhio.c
-rw-rw-r--. 1 hs hs 21039 May 27 04:25 zmultilocker.c
-rw-rw-r--. 1 hs hs 16480 May 27 04:25 zpage.c
-rw-rw-r--. 1 hs hs 43128 May 27 04:25 zscan.c
-rw-rw-r--. 1 hs hs 27760 May 27 04:25 ztuple.c
-rw-rw-r--. 1 hs hs 55849 May 27 04:25 zundo.c
-rw-rw-r--. 1 hs hs 51613 May 27 04:25 zvacuumlazy.c
[hs@node1 zheap]$ cat *c | wc -l
29696
For those of you out there who are anxiously awaiting a production-ready version of zheap, I have to point out that this is really a major undertaking which is not trivial to do. You can already try out and test zheap. However, keep in mind that we are not quite there yet. It will take more time, and especially feedback from the community, to make this engine production-ready and capable of handling any workload reliably and bug-free.
I won’t go into the details of what has been fixed, but we had a couple of issues including bugs, compiler warnings, and so on.
What has also been done was to merge the zheap code with current versions of PostgreSQL, to make sure that we’re up to date with all the current developments.
Next steps to improve zheap
As far as the next steps are concerned, there are a couple of things on the list. One of the first things will be to work on the discard worker. Now what is that? Consider the following listing:
test=# BEGIN
BEGIN
test=*# CREATE TABLE sample (x int) USING zheap;
CREATE TABLE
test=*# INSERT INTO sample SELECT * FROM generate_series(1, 1000000) AS x;
INSERT 0 1000000
test=*# SELECT * FROM pg_stat_undo_chunks;
logno | start | prev | size | discarded | type | type_header
--------+------------------+------+----------+-----------+------+-----------------------------------
000001 | 000001000021AC3D | | 57 | f | xact | (xid=745, dboid=16384, applied=f)
000001 | 000001000021AC76 | | 44134732 | f | xact | (xid=748, dboid=16384, applied=f)
(2 rows)
test=*# COMMIT;
COMMIT
test=# SELECT * FROM pg_stat_undo_chunks;
logno | start | prev | size | discarded | type | type_header
--------+------------------+------+----------+-----------+------+-----------------------------------
000001 | 000001000021AC3D | | 57 | f | xact | (xid=745, dboid=16384, applied=f)
000001 | 000001000021AC76 | | 44134732 | f | xact | (xid=748, dboid=16384, applied=f)
(2 rows)
What we see here is that the UNDO chunks do not go away. They keep piling up. At the moment, it is possible to purge them manually:
As you can see, the UNDO has gone away. The goal here is that the cleanup should happen automatically – using a “discard worker”. Implementing this process is one of the next things on the list.
Community feedback is currently one of the bottlenecks. We invite everybody with an interest in zheap to join forces and help to push this forward. Everything from load testing to feedback on the design is welcome – and highly appreciated! zheap is important for UPDATE-heavy workloads, and it’s important to move this one forward.
Trying it all out
If you want to get involved, or just try out zheap, we have created a tarball for you which can be downloaded from our website. It contains our latest zheap code (as of May 27th, 2021).
Simply compile PostgreSQL normally:
./configure --prefix=/your_path/pg --enable-debug --with-cassert
make install
cd contrib
make install
Then you can create a database instance, start the server normally and start playing. Make sure that you add “USING zheap” when creating a new table, because otherwise PostgreSQL will create standard “heap” tables (so not zheap ones).
Finally …
We want to say thank you to Heroic Labs for providing us with all the support we have to make zheap work. They are an excellent partner and we recommend checking out their services. Their commitment has allowed us to allocate so many resources to this project, which ultimately benefits the entire community. A big thanks goes out to those guys.
If you want to know more about zheap, we suggest checking out some of our other posts on this topic. Here is more about zheap and storage consumption.
My colleague Kat Batuigas recently wrote about using the powerful open-source QGIS desktop GIS to import data into PostGIS from an ArcGIS Feature Service. This is a great first step toward moving your geospatial stack onto the performant, open source platform provided by PostGIS. And there's no need to stop there! Crunchy Data has developed a suite of spatial web services that work natively with PostGIS to expose your data to the web, using industry-standard protocols. These include:
pg_tileserv - allows mapping spatial data using the MVT vector tile format
In order to get a better idea about how WAL settings can change the situation within the WAL management, I decided to run a kind of automated test and store the results into a table, so that I can query them back later.
The idea is the same as in my previous article: produce some workload, measure the differences in the Log Sequence Numbers, and see how the size of the WALs changes depending on some settings. This is not accurate research, it's just a quick and dirty experiment.
At the end, I decided to share my numbers so that you can have a look at them and elaborate a bit more. For example, I’m no good at all at doing graphs (I know only the very minimum about gnuplot!).
!!! WARNING !!!
WARNING: this is not a guide on how to tune WAL settings! This is not even a real and comprehensive set of experiments; it is just what I've played with to see how much traffic can be generated for a certain amount of workload.
Your case and situation could be, and probably is, different from the very simple test I've done, and I do not pretend to be right about the small and obvious conclusions I come up with at the end. In case you see or know something that can help make what I write in the following clearer, please comment or contact me!
Set up
First of all, I decided to run an INSERT-only workload, so that the size of the resulting table does not include any bloat and is therefore comparable to the amount of WAL records generated.
No other database activity was ongoing, so that the only generated WAL traffic was about my own workload.
Each time the configuration was changed, the system was restarted, so that every workload started with the same (empty) clean situation and without any need to reason about ongoing checkpoints. Of course, checkpoints were happening as usual, but not at the beginning of the workload.
I used two tables to run the test:
wal_traffic stores the results of each run;
wal_traffic_data is used to store the data about every workload, that is tuples inserted in the database.
The wal_traffic_data table was dropped and re-created every time a new run was started, so as to avoid data bloat.
It is interesting to note that any workload setup activity is performed before the server is restarted, so that the only WAL traffic measured is as close as possible to the workload only.
The wal_traffic table is defined as follows:
The workload field stores the text string about the executed query.
The lsn_xxx fields store the location within the WAL, in particular:
lsn_start and lsn_end store the result of the pg_current_wal_lsn() function invoked at the beginning and at the end of the workload;
lsn_insert_start and lsn_insert_end store the result of the pg_current_wal_insert_lsn() function invoked at the beginning and at the end of the workload.
I decided to store both pieces of information to be able to examine differences more accurately; however, for this kind of experiment the differences between the values are not really significant.
The data_size column contains the result of pg_relation_size(), that is, a rough estimation of the volume of data produced during the workload.
The columns wal_size, wal_data_ratio, and wal_insert_data_ratio are generated, and contain respectively the amount of generated WAL records and the ratio between the size of the actual data and that of the WAL records.
Last, the settings column contains a jsonb representation of the settings used to run the test, like for example the value for wal_level, wal_compression and so on.
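Putting those descriptions together, a plausible sketch of the table definition could look as follows; the original definition is not reproduced above, so the column types and generation expressions are assumptions:
-- Reconstructed sketch of the wal_traffic results table (types are assumptions)
CREATE TABLE wal_traffic (
    run                   serial PRIMARY KEY,
    workload              text,
    ts_start              timestamp,
    ts_end                timestamp,
    lsn_start             pg_lsn,
    lsn_end               pg_lsn,
    lsn_insert_start      pg_lsn,
    lsn_insert_end        pg_lsn,
    data_size             bigint,   -- pg_relation_size() of the workload table
    wal_size              numeric GENERATED ALWAYS AS ( lsn_end - lsn_start ) STORED,
    wal_data_ratio        numeric GENERATED ALWAYS AS
        ( round( ( lsn_end - lsn_start ) / nullif( data_size, 0 )::numeric * 100, 2 ) ) STORED,
    wal_insert_data_ratio numeric GENERATED ALWAYS AS
        ( round( ( lsn_insert_end - lsn_insert_start ) / nullif( data_size, 0 )::numeric * 100, 2 ) ) STORED,
    settings              jsonb
);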
There is also a view to quickly get results about the workload size:
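The view definition is not shown either; based on how it is queried below, it presumably just exposes the settings keys as regular columns, roughly like this:
-- Assumed definition of the wal_traffic_results view
CREATE VIEW wal_traffic_results AS
SELECT run,
       workload,
       wal_size,
       data_size,
       wal_data_ratio,
       wal_insert_data_ratio,
       ts_end - ts_start               AS wall_clock,
       settings ->> 'wal_level'        AS wal_level,
       settings ->> 'wal_log_hints'    AS wal_log_hints,
       settings ->> 'wal_compression'  AS wal_compression,
       settings ->> 'full_page_writes' AS full_page_writes
FROM wal_traffic;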
I’ve prepared two different workloads, both based on INSERTs.
The first workload does two transactions: the first one inserts a certain amount of tuples, while the second inserts a smaller amount of tuples. In particular, the first transaction inserts a number of tuples specified by $workload_scale, while the second transaction inserts 1/5 of the same value.
The $workload_scale variable assumes the values ranging from 100 to 10 million growing by a factor of ten (e.g., 100, 1000, 10000 and so on).
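For reference, here is the first workload with a scale of 100,000, taken verbatim from the workload strings recorded during the runs:
BEGIN;
INSERT INTO wal_traffic_workload SELECT v, md5( v::text )::text || random()::text
FROM generate_series( 1, 100000 ) v;
COMMIT;

BEGIN;
INSERT INTO wal_traffic_workload
SELECT v + v, t || ' - ' || t || random()::text
FROM wal_traffic_workload
WHERE v % 5 = 0;
COMMIT;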
The second workload type is shorter, and does the following:
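As recorded during the runs, the loop-based workload looks like the following (again with a scale of 100,000; the loop bound is $workload_scale):
DO $wl$ DECLARE
  i int;
BEGIN
  FOR i IN 1 .. 100000 LOOP
    INSERT INTO wal_traffic_workload SELECT 1, md5( random()::text )::text;
  END LOOP;
END $wl$;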
It therefore performs the same number of tuple insertions as the previous workload, but it does so by looping.
The final effect is that the first workload executes a single INSERT statement (per transaction), while the second workload executes several INSERT statements.
The usage of random() within the INSERT statements is to generate some more traffic on logical decoding.
The Workload Workflow
In order to do the tests, I wrote an ugly shell script with the following workflow:
truncate the wal_traffic_data table, so that its size on disk does not include previous experiments;
execute a few ALTER SYSTEM to set some configuration on WAL related parameters (wal_level, full_page_writes, wal_compression and so on);
restart the PostgreSQL system, so as to ensure every test starts from a clean and clear situation;
get the current WAL position (pg_current_wal_lsn() and pg_current_wal_insert_lsn());
execute the workload with the right scale;
get the current WAL position (pg_current_wal_lsn() and pg_current_wal_insert_lsn());
insert the result tuple with WAL differences into wal_traffic;
loop with a different scaling factor.
The Results
It is now time to have a look at the test results.
As you can see, for 1.2 GB of data the system produced roughly 2.1 GB of WAL records. And the situation is even worse when wal_compression is off (as you could expect):
This time, for the same amount of data, the WAL size is almost double that of the real data.
Changing the setting of wal_level to logical or replica does not change the situation very much.
It is possible to get the best ratio between the WAL produced and the data stored:
Please note that I’ve split the jsonb field into a set of columns with a query like the following one, that produced the CSV file:
% psql -A --csv -h miguel \
       -c 'select run, workload, wal_size, data_size, wal_data_ratio,
                  ts_end - ts_start as wall_clock, x.*
           from wal_traffic
           cross join lateral jsonb_to_record( settings )
                as x( wal_level text, wal_log_hints text,
                      wal_compression text, full_page_writes text );' \
       testdb >! wal_traffic.csv
More Results
From the de-jsonb'ed representation of the results, it is easier to get a glance at the WAL ratio by workload type:
testdb=> select workload, min(wal_data_ratio), max(wal_data_ratio),
                max(wal_data_ratio) - min(wal_data_ratio) as diff
         from wal_traffic_results
         group by workload
         order by 4 asc;

Workload (abbreviated)                                  |  min   |  max   |  diff
--------------------------------------------------------+--------+--------+--------
two-transaction INSERT, generate_series( 1, 1000000 )   | 125.55 | 125.60 |   0.05
DO loop, 1 .. 1000000 single-row INSERTs                | 125.76 | 125.82 |   0.06
DO loop, 1 .. 10000000 single-row INSERTs               | 125.76 | 125.92 |   0.16
two-transaction INSERT, generate_series( 1, 100000 )    | 125.49 | 125.73 |   0.24
DO loop, 1 .. 100000 single-row INSERTs                 | 125.72 | 125.97 |   0.25
two-transaction INSERT, generate_series( 1, 10000 )     | 124.99 | 126.55 |   1.56
DO loop, 1 .. 10000 single-row INSERTs                  | 125.14 | 127.47 |   2.33
two-transaction INSERT, generate_series( 1, 10000000 )  | 178.27 | 199.46 |  21.19
two-transaction INSERT, generate_series( 1, 1000 )      | 121.58 | 152.01 |  30.43
DO loop, 1 .. 1000 single-row INSERTs                   | 118.45 | 167.37 |  48.92
two-transaction INSERT, generate_series( 1, 100 )       | 101.56 | 247.46 | 145.90
DO loop, 1 .. 100 single-row INSERTs                    | 124.02 | 289.65 | 165.63
There are certain workloads (by type and size) that do not produce any significant variation in the amount of WAL produced, while, for example, the last workload with a small number of tuples produces a very wide range of WAL record writes.
We could also query to search for a trend in the ratio:
testdb=> select wal_data_ratio, wal_level, wal_log_hints, wal_compression, full_page_writes
         from wal_traffic_results
         where workload like '%FOR i IN 1 .. 100 LOOP%'
         order by 1 desc;
-[ RECORD 1 ]----+--------
wal_data_ratio   | 289.65
wal_level        | replica
wal_log_hints    | off
wal_compression  | off
full_page_writes | on
...
-[ RECORD 6 ]----+--------
wal_data_ratio   | 256.35
wal_level        | logical
wal_log_hints    | on
wal_compression  | off
full_page_writes | on
...
-[ RECORD 19 ]---+--------
wal_data_ratio   | 150.78
wal_level        | replica
wal_log_hints    | on
wal_compression  | on
full_page_writes | off
...
-[ RECORD 36 ]---+--------
wal_data_ratio   | 124.02
wal_level        | replica
wal_log_hints    | on
wal_compression  | on
full_page_writes | on
The above confirms how much wal_compression is going to reduce the WAL traffic.
And again, the wal_level is not going to influence the WAL size too much:
testdb=> select min(wal_data_ratio), max(wal_data_ratio), wal_level
         from wal_traffic_results
         where workload like '%FOR i IN 1 .. 100 LOOP%'
         group by wal_level
         order by 1 desc, 2 desc;
-[ RECORD 1 ]------
min       | 146.00
max       | 289.16
wal_level | minimal
-[ RECORD 2 ]------
min       | 124.12
max       | 284.47
wal_level | logical
-[ RECORD 3 ]------
min       | 124.02
max       | 289.65
wal_level | replica
Conclusions
Even a small amount of real data can produce quite a large amount of WAL records, and this is good, because those records contain all the information PostgreSQL needs to keep our data safe, which is, after all, our final goal.
WAL-related settings can, of course, influence the amount of generated data. The idea behind this article is not to provide an exhaustive guide to tuning WAL, but rather to show how you can measure your WAL traffic depending on the workload you are facing.
This should then help you to decide the right way to tune your WALs.
In case you find something wrong in the approach described above, or want to add to it or share your experience, please comment or contact me.
The former, pg_extension, provides information about which extensions are installed in the current database, while the latter, pg_available_extensions, provides information about which extensions are available to the cluster.
The difference is simple: to be usable, an extension must first appear in pg_available_extensions, which means it has been installed on the cluster (e.g., via pgxnclient). From this point on, the extension can be installed into a database by means of a CREATE EXTENSION statement; as a result, the extension will appear in the pg_extension catalog.
The above list represents all the available extensions installed on the cluster, that is, those I can execute CREATE EXTENSION against.
The pg_available_extensions catalog has an installed_version field that provides the version number of the extension installed in the current database, or NULL if the extension is not installed in the current database. Therefore, in order to know whether an extension is installed in a database, you can run a query like the following:
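A query of roughly this shape does the job; pg_available_extensions exposes the name, default_version and installed_version columns, and the extension name below is just an example:
SELECT name,
       default_version,
       installed_version,
       installed_version IS NOT NULL AS is_installed
FROM pg_available_extensions
WHERE name = 'ltree';   -- replace with the extension you are interested in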
This is a little too much effort, and since extensions could have been installed with different flags in different databases, the pg_extension catalog provides more detailed and narrow information: it lists all extensions that have been installed in the current database.
Therefore, to see what a database can use, that is, which extensions it has access to, I need to use the pg_extension catalog:
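The original query is not reproduced here; a minimal version could be:
-- every row is an extension installed in the current database
SELECT extname, extversion
FROM pg_extension
ORDER BY extname;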
The current database has a much smaller list of available extensions.
Extension Version Numbers
As you know, an extension can come in different version numbers, and the beauty of this mechanism is that it is easy to upgrade an extension from one version to another.
The pg_available_extensions catalog provides only the latest (i.e., newest) version of an available extension. Let's try with a very popular extension, pg_stat_statements:
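The query output is not reproduced above, but a lookup like the following shows the behaviour described next; in the author's cluster it reported a default_version of 1.8 and a NULL installed_version:
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'pg_stat_statements';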
The extension could be installed at version 1.8 and is currently not installed in the current database.
But what about other version numbers?
The pg_available_extension_versions catalog provides a list of all the versions in which an extension is currently available:
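Again, the original query is not shown; something along these lines lists every available version, using the documented columns of the catalog:
SELECT name, version, installed, superuser, trusted, relocatable
FROM pg_available_extension_versions
WHERE name = 'pg_stat_statements'
ORDER BY version;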
As you can see, the extension is available in five different versions, and I can choose the version that best fits my requirements.
This catalog provides different information, in particular it can give you an idea if the extension can be installed only by superusers (field superuser) or by a user with appropriate privileges (field trusted), as well as other required extensions (field requires_name), and relocatability.
Having explored one fork in the path (Elasticsearch and Kibana) in the previous pipeline blog series (here is part 5), in this blog we backtrack to the junction to explore the alternative path (PostgreSQL and Apache Superset). But we’ve lost our map (aka the JSON Schema)—so let’s hope we don’t get lost or attacked by mutant radioactive monsters.
Fork in an abandoned uranium mine (Source: Shutterstock)
Just to recap how we got here, here’s the Kafka source connector -> Kafka -> Kafka sink connector -> Elasticsearch -> Kibana technology pipeline we built in the previous blog series:
And here’s the blueprint for the new architecture, with PostgreSQL replacing Elasticsearch as the target sink data store, and Apache Superset replacing Kibana for analysis and visualization:
Well that’s the plan, but you never know what surprises may be lurking in the unmapped regions of a disused uranium mine!
1. Step 1: PostgreSQL database
In the news recently was Instaclustr’s acquisition of Credativ, experts in the open source PostgreSQL database (and other technologies). As a result, Instaclustr has managed PostgreSQL on its roadmap, and I was lucky to get access to the internal preview (for staff only) a few weeks ago. This has all the features you would expect from our managed services, including easy configuration and provisioning (via the console and REST API), server connection information and examples, and built-in monitoring with key metrics available (via the console and REST API).
Having a fully functional PostgreSQL database available in only a few minutes is great, but what can you do with it?
The first thing is to decide how to connect to it for testing. There are a few options for client applications, including psql (a terminal-based front-end) and a GUI such as pgAdmin4 (which is what I used). Once you have pgAdmin4 running on your desktop you can easily create a new connection to the Instaclustr managed PostgreSQL server as described here, using the server IP address, username, and password, all obtained from the Instaclustr console.
You also need to ensure that the IP address of the machine you are using is added to the firewall rules (and saved) in the console (and update this if your IP address changes, which mine does regularly when working from home). Once connected you can create tables with the GUI, and run arbitrary SQL queries to write and read data to test things out.
I asked my new Credativ colleagues what else you can do with PostgreSQL, and they came up with some unexpectedly cool things to try—apparently PostgreSQL is Turing Equivalent, not “just” a database:
GTS2— A whole game in PostgreSQL and pl/python where you can drive around in python-rendered OSM-Data with a car, with collision detection and all the fun stuff!
A Turing Machine implemented in PostgreSQL. This is an example I found, just to prove the Turing machine point. It was probably easier to build than the one below, which is an actual working Turing machine, possibly related to this one. Although the simplest implementation is just a roll of toilet paper and some rocks.
So that’s enough PostgreSQL fun for the time being; let’s move on to the real reason we’re using it, which is to explore the new pipeline architecture.
2. Step 2: Kafka and Kafka Connect Clusters
The next step was to recreate the Kafka and Kafka Connect setup that I had for the original pipeline blogs, as these are the common components that are also needed for the new experiment.
So, first I created a Kafka cluster. There’s nothing special about this cluster configuration (although you should ensure that all the clusters are created in the same AWS region to reduce latency and costs—all Instaclustr clusters are provisioned so they are spread over all AWS availability zones for high availability).
Then I created a Kafka Connect Cluster targeting the Kafka cluster. There are a couple of extra configuration steps required (one before provisioning, and one after).
If you are planning on bringing your own (BYO) connectors, then you have to tick the “Use Custom Connectors” checkbox and add the details for the S3 bucket where your connectors have been uploaded to. You can find the bucket and details in your AWS console, and you need the AWS access key id, AWS secret access key, and the S3 bucket name. Here are the instructions for using AWS S3 for custom Kafka connectors.
Because we are going to use sink connectors that connect to PostgreSQL, you’ll also have to configure the Kafka Connect cluster to allow access to the PostgreSQL server we created in Step 1, using the “Connected Clusters” view as described here.
Finally, ensure that the IP address of your local computer is added to the firewall rules for the Kafka and Kafka Connect clusters, and remember to keep a record of the usernames/passwords for each cluster (as the Instaclustr console only holds them for a few days for security reasons).
3. Step 3: Kafka Connectors
This sink-hole in Australia (Umpherston Sinkhole, Mount Gambier) leads to something more pleasant than an abandoned uranium mine. (Source: Shutterstock)
Before we can experiment with streaming data out of Kafka into PostgreSQL, we need to replicate the mechanism we used in the earlier blogs to get the NOAA tidal data into Kafka, using a Kafka REST source connector as described in section 5 of this blog. Remember that you need to run a separate connector for every station ID that you want to collect data from; I’m just using a small subset for this experiment. I checked with the kafka-console-consumer (you’ll need to set up a Kafka properties file with the Kafka cluster credentials from the Instaclustr console for this to work), and the sensor data was arriving in the Kafka topic that I’d set up for this purpose.
But now we need to select a Kafka Connect sink connector. This part of the journey was fraught with some dead ends, so if you want to skip over the long and sometimes dangerous journey to the end of the tunnel, hop in a disused railway wagon for a short cut to the final section (3.5) which reveals “the answer”!
3.1 Open Source Kafka Connect PostgreSQL Sink Connectors
Previously I used an open source Kafka Connect Elasticsearch sink connector to move the sensor data from the Kafka topic to an Elasticsearch cluster. But this time around, I want to replace this with an open source Kafka Connect sink connector that will write the data into a PostgreSQL server. However, such connectors appear to be as rare as toilet paper on shop shelves in some parts of the world in 2020 (possibly because some monster Turing Machine needed more memory). I did finally track down (only) one example:
3.2 Open Source Kafka Connect JDBC Sink Connectors
Why is there a shortage of PostgreSQL sink connectors? The reason is essentially that PostgreSQL is just an example of the class of SQL databases, and SQL databases typically have support for Java Database Connectivity (JDBC) drivers. Searching for open source JDBC sink connectors resulted in more options.
So, searching in the gloom down the mine tunnel I found the following open source JDBC sink connector candidates, with some initial high-level observations:
How should you go about selecting a connector to trial? I had two criteria in mind. The first factor was “easy to build and generate an uber jar file” so I could upload it to the AWS S3 bucket to get it into the Instaclustr Managed Kafka Connect cluster I was using. The second factor relates to how the connectors map the data from the Kafka topic to the PostgreSQL database, tables, and columns, i.e. the data mappings. Luckily I had a close look at the PostgreSQL JSON side of this puzzle in my previous blog, where I discovered that you can store a schemaless JSON object into PostgreSQL in a single column, of type jsonb, with a gin index.
Taking another look at my tidal sensor data, I was reminded that it is structured JSON, but without an explicit JSON schema. Here’s an example of one record:
It has two fields, metadata and data. The data field is an array with one element (although it can have more elements if you request a time range rather than just the latest datum). However, it doesn’t have an explicit schema.
What are the theoretically possible options for where to put JSON schemas for Kafka sink connectors to use?
No schema
Explicit schema in each Kafka record (this wastes space and requires the producer to write a record with the correct schema as well as the actual data)
Explicit schema in the connector configuration. This is a logical possibility, but would potentially limit the connector to working with only one topic at a time. You’d need multiple configurations and therefore connectors to work with topics with different schemas.
Explicit schema stored somewhere else, for example, in a schema registry
But why/when is a schema needed? Looking at the code of some of the connectors, it appears that the schema is primarily used to auto-create a table with the correct columns and types, but this assumes that you want to transform/extract the JSON fields into multiple columns.
And is the data mapping separate from the schema? Does it use the schema? Let’s look more closely at a typical example: the IBM and Aiven connectors. Both are based on the Confluent approach, which (1) requires an explicit schema in each Kafka record, and (2) requires Kafka record values to be structs with primitive fields.
Assuming we could “flatten” and remove some of the unwanted fields, then the JSON tidal data would look like this:
So we’ve essentially drawn our own schema/map; let’s see if it helps us to find a way out of the mine!
I constructed some example records with this format and put them into a test Kafka topic with the kafka-console-producer.
The connector configuration requires a PostgreSQL connection URL, user and password, the topic to read from and the table name to write to. The schema is used to auto-create a table with the required columns and types if one doesn’t exist, and then the payload values are just inserted into the named columns.
Here’s an example PostgreSQL configuration from the IBM connector documentation:
Note that for PostgreSQL the URL needs the IP address, port number, and the database name (postgres in this example).
Building the IBM connector was easy, and I was able to upload the resulting uber jar to my AWS S3 bucket and sync the Instaclustr managed Kafka Connect cluster to see that it was available in the cluster (just remember to refresh the browser to get the updated list of connectors).
Here’s the CURL command used to configure and run this connector on the Instaclustr Managed Kafka Connect (KC) cluster (note that you need the credentials for both the Kafka Connect cluster REST API and the PostgreSQL database):
You can easily check that the connector is running in the Instaclustr console (under “Active Connectors”), and you should ensure that the Kafka Connect logs are shipped to an error topic in case something goes wrong (it probably will, as it often takes several attempts to configure a Kafka connector correctly).
The IBM connector was easy to build, configure, and start running; however, it had a few issues. As a result of the first error I had a look at the code and discovered that the table name must also include the database schema, so the correct value for “table.name.format” above is “public.tides_table” (for my example). Unfortunately, the next error was impossible to resolve (“org.apache.kafka.connect.errors.DataException: Cannot list fields on non-struct type”), and we concluded it was a problem with the Kafka record deserialization.
I also tried the Aiven sink connector (which has a very similar configuration to the IBM one), but I didn’t have any success with it either, given the lack of details on how to build it, a dependency on Oracle Java 11, the use of Gradle (which I’m not familiar with), and finally this error: “java.sql.SQLException: No suitable driver found for jdbc:postgresql:///…”, which possibly indicates that the PostgreSQL driver needs to be included in the build and/or packaged into the uber jar.
3.4 The Schemaless Approach
Time to burn our hand drawn map (schema) (Source: Shutterstock)
So that’s two out of four connectors down, two to go.
Given the requirement to have an explicit schema in the Kafka records, the IBM and Aiven connectors really weren’t suitable for my use case anyway, so no great loss so far. And even though I’d had success with the Apache Camel connectors in the previous blogs, this time around the documentation for the Camel JDBC sink connector didn’t have any configuration examples, so it wasn’t obvious how it would work or whether it needed a schema. This left the first (PostgreSQL-specific) connector as the only option remaining, so let’s throw away our hand-drawn map and try the schemaless idea out.
The idea behind this connector is that elements from a JSON Kafka record message are parsed out into column values, specified by a list of columns, and a list of parse paths in the connector configuration.
Here’s an example configuration for my JSON tidal data to extract the name, t and v values from the JSON record, and insert them into name, time, and value columns:
This connector was also hard to build (it was missing some jars, which I found in an earlier release). The documentation doesn’t say anything about how the table is created, so given the lack of schema and type information, I assumed the table had to be manually created before use. Unfortunately, I wasn’t even able to start this connector running in the Instaclustr Kafka Connect cluster, as it resulted in this error when trying to configure and run it:
ERROR Uncaught exception in REST call to /connectors (org.apache.kafka.connect.runtime.rest.errors.ConnectExceptionMapper:61)
3.5 Success With a Customized IBM Connector
So with our last candle burning low and the sounds of ravenous monsters in the darkness of the mine growing louder by the second, it was time to get creative.
The IBM connector had been the easiest to build, so it was time to rethink my requirements. Based on my previous experience with JSON and PostgreSQL, all I really needed the connector to do was insert the entire JSON Kafka record value into a single jsonb column in a specified table. The table would also need a single automatically incremented id field, and a gin index on the jsonb column. It turned out to be straightforward to modify the behavior of the IBM connector to do this.
The table and gin index (the index gets a unique name that includes the table name) are created automatically if they don’t already exist, and the column names are currently hardcoded to “id” (integer) and “json_object” (type jsonb). I did have to modify the table-creation SQL for it to work correctly with PostgreSQL, and customize the insert SQL.
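To make that concrete, here is a sketch of the kind of DDL the modified connector ends up issuing. The table name tides_jsonb is hypothetical; the id and json_object column names are the hardcoded ones mentioned above, and the exact statements in the connector may differ:

-- table with an auto-incrementing id and a single jsonb payload column
CREATE TABLE IF NOT EXISTS tides_jsonb (
    id          serial PRIMARY KEY,
    json_object jsonb
);

-- gin index named after the table, for querying inside the JSON payload
CREATE INDEX IF NOT EXISTS tides_jsonb_json_object_gin
    ON tides_jsonb USING gin (json_object);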
Based on my previous experiences with sink connectors failing due to badly formed JSON, or JSON with a different schema than expected, I was pleasantly surprised to find that this connector was very robust. This is because PostgreSQL refuses to insert badly formed JSON into a jsonb column and throws an exception, and the IBM connector doesn’t fail under these circumstances; it just logs an error to the Kafka error topic and moves on to the next available record.
However, my previous experiments failed to find a way to prevent the JSON error messages from being indexed into Elasticsearch, so I wondered if there was a solution for PostgreSQL that would not cause the connector to fail.
The solution was actually pretty simple, as PostgreSQL allows constraints on json columns. This constraint (currently executed manually after the table is created, but it could be done automatically in the connector code by analyzing the fields in the first record found) uses one of the PostgreSQL JSON existence operators (‘?&’) to ensure that ‘metadata’ and ‘data’ exist as top-level keys in the JSON record:
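The exact constraint from the post isn’t reproduced here, but a sketch of it could look like the following (table and column names carried over from the hypothetical DDL above); ?& checks that all keys in the array are present as top-level keys of the jsonb value:

ALTER TABLE tides_jsonb
    ADD CONSTRAINT json_object_has_metadata_and_data
    CHECK (json_object ?& ARRAY['metadata', 'data']);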
This excludes error records like this: {“error”:”error message”}, but doesn’t exclude records which have ‘metadata’ and ‘data’ plus superfluous extra fields. However, extra fields can just be ignored when processing the jsonb column later on, so it looks like we have found light at the end of the mine tunnel and are ready to try out the next part of the experiment, which includes building, deploying, and running Apache Superset, configuring it to access PostgreSQL, and then graphing the tidal data.
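As a final illustration of why the extra fields don’t matter, pulling values back out of the jsonb column later is just a matter of JSON operators. A hedged example (names follow the hypothetical table above, and the name/t/v fields follow the tidal record structure described earlier):

SELECT json_object -> 'metadata' ->> 'name'          AS station_name,
       json_object -> 'data' -> 0 ->> 't'            AS observation_time,
       (json_object -> 'data' -> 0 ->> 'v')::numeric AS water_level
FROM tides_jsonb
LIMIT 10;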
One theme of the 3.2 release is new analytical functionality in the raster module, and access to cloud-based rasters via the "out-db" option for rasters. Let's explore two new functions and exercise cloud raster support at the same time.
I was working on PostgreSQL storage-related features recently, and I found that PostgreSQL has an elegant storage addressing mechanism: the Buffer Tag. In this blog, I want to share my understanding of the Buffer Tag and some potential uses for it.
2. Buffer Tag
I was always curious to know how PostgreSQL can find the tuple data blocks so quickly when I first started to use PostgreSQL in an IoT project, but I never got the chance to look into it in detail, even though I knew PostgreSQL is a very well organized open-source project. That changed when I recently got a task that required solving a storage-related issue in PostgreSQL. There is a very detailed explanation of how the Buffer Manager works in the online book for developers The Internals of PostgreSQL, one of the best PostgreSQL books I would recommend to anyone beginning PostgreSQL development.
The Buffer Tag, in simple words, is just five numbers. Why five? First, all objects, including the storage files, are managed by Object Identifiers (OIDs). For example, when a user creates a table, the table name is mapped to an OID; when a user creates a database, the name is mapped to an OID; and when the corresponding data needs to be persisted to disk, the files are also named using OIDs. Second, when a table requires more pages to store more tuples, each page of the same table is managed by a page number in sequence. For example, when PostgreSQL needs to estimate the table size before deciding what kind of scan should be used to find the tuples fastest, it needs to know the number of blocks. Third, data tuples are the main user data that need to be stored, but in order to manage them, PostgreSQL needs other information as well, such as visibility to track the status of the tuples and free space to optimize file usage; this is what the fork number distinguishes. So this ends up with five numbers: Tablespace, Database, Table, ForkNumber, and BlockNumber.
Given these five numbers, PostgreSQL can always find out where the data tuples are stored, which file is used, and what size the table is etc.
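Although these numbers live inside the server, you can see most of them from SQL too. A small illustrative query (the table name my_table is hypothetical) shows the tablespace, database, and relation identifiers together with the file path derived from them:

SELECT (SELECT oid FROM pg_database WHERE datname = current_database()) AS database_oid,
       c.reltablespace AS tablespace_oid,  -- 0 means the database's default tablespace
       c.oid           AS relation_oid,
       c.relfilenode,
       pg_relation_filepath(c.oid::regclass) AS file_path
FROM pg_class AS c
WHERE c.relname = 'my_table';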
3. How Buffer Tag is used
Now we have the buffer tag, five numbers, but how is it used in PostgreSQL? One typical use case of the buffer tag is to help the buffer manager track the location of memory blocks in the buffer pool/array. In this case, a hash table is used to resolve the mapping between a buffer tag and the location of a memory block in the buffer pool/array. Here is a picture showing the relationship among the buffer tag, hash table, buffer descriptor, and buffer pool/array.
For example, the first green buffer tag {(1663, 13589, 16387), 0, 0} indicates tablespace 1663, database 13589, and table 16387, with forkNumber 0 (the main fork for tuples), stored in the file at block 0. This buffer tag hashes to 1536704684, which at this moment has been assigned to memory block 0 managed by the buffer manager. Since the buffer descriptor is addressed using the same slot number as the memory block in the buffer pool/array, both share slot number 0 in this case.
With the above relationship, PostgreSQL can find the memory block location for a particular buffer tag, or assign a buffer slot to a new memory block.
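If you want to see this mapping in action, the pg_buffercache extension exposes essentially the buffer tag plus the buffer id for each slot in shared buffers. A hedged example query (my_table is again a hypothetical table name):

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT bufferid, reltablespace, reldatabase, relfilenode,
       relforknumber, relblocknumber
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('my_table')
LIMIT 5;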
The other typical use case of the buffer tag is to help the storage manager manage the corresponding files. In this case, the blockNumber is always 0, since multiple blocks are not required to store more data. From here, you can build your own use cases for the buffer tag. For example, use buf_id as the number of blocks for each database relation instead of using it to indicate the memory block location. Moreover, you can also use the buffer tag plus the hash table to manage several pieces of information at once, such as keeping buf_id for the location of the memory block and adding a new attribute total to track the number of blocks. You can achieve this by defining a different buffer lookup entry. For example:
The original buffer lookup entry uses the buffer tag to look up the location of a memory block:
/* entry for buffer lookup hashtable */
typedef struct
{
    BufferTag key;  /* Tag of a disk page */
    int       id;   /* Associated buffer ID */
} BufferLookupEnt;
Below is an example of a buffer lookup entry that can look up both the location of the memory block and the number of blocks used by a particular buffer tag:
typedef struct
{
    BufferTag key;    /* Tag of a disk page */
    int       id;     /* Associated buffer ID */
    int       total;  /* the total number of X */
} RelLookupEnt;
4. Summary
In this blog, we discussed what the Buffer Tag is, how it is used in PostgreSQL, and some potential uses of the Buffer Tag to address the mapping between tables and storage files, as well as lookup problems.
A software developer specializing in C/C++ programming, with experience in hardware, firmware, software, databases, networking, and system architecture. Now working at HighGo Software Inc. as a senior PostgreSQL architect.
PostgreSQL Person of the Week Interview with Roman Druzyagin: Name’s Roman. I was born in Russia and have lived most of my life there. I grew up near Moscow, and since 2002 I’ve been residing in St. Petersburg, with plans to relocate to the European Union in the near future. I am currently 33 years old.