The PostgreSQL Performance Puzzle was, perhaps, too easy – it didn’t take long for someone to guess the correct answer!
But I didn’t see much discussion about why there was a difference or what was actually happening. My emphasis on the magnitude of the difference was a tip-off that there’s much more than meets the eye with this puzzle challenge – and one reason I published it is that I’d already dug around and thought there were some very interesting things happening beneath the surface.
So let’s dig!
There are several layers. But – as with all SQL performance questions – we always begin simply with EXPLAIN.
[root@ip-172-31-36-129 ~]# sync; echo 1 > /proc/sys/vm/drop_caches
[root@ip-172-31-36-129 ~]# service postgresql-14 restart;
Redirecting to /bin/systemctl restart postgresql-14.service
pg-14.4 rw root@db1=# select 1;
...
The connection to the server was lost. Attempting reset: Succeeded.
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
pg-14.4 rw root@db1=# set track_io_timing=on;
SET
pg-14.4 rw root@db1=# set log_executor_stats=on;
SET
pg-14.4 rw root@db1=# set client_min_messages=log;
SET
pg-14.4 rw root@db1=# \timing on
Timing is on.
pg-14.4 rw root@db1=# explain (analyze,verbose,buffers,settings) select count(mydata) from test where mynumber1 < 500000;
LOG: 00000: EXECUTOR STATISTICS
DETAIL: ! system usage stats:
! 0.107946 s user, 0.028062 s system, 0.202054 s elapsed
! [0.113023 s user, 0.032292 s system total]
! 7728 kB max resident size
! 89984/0 [92376/360] filesystem blocks in/out
! 0/255 [1/1656] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 197/0 [260/0] voluntary/involuntary context switches
LOCATION: ShowUsage, postgres.c:4898
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=20080.19..20080.20 rows=1 width=8) (actual time=190.273..190.278 rows=1 loops=1)
Output: count(mydata)
Buffers: shared hit=4 read=5533
I/O Timings: read=82.114
-> Index Scan using test_mynumber1 on public.test (cost=0.57..18803.24 rows=510781 width=18) (actual time=0.389..155.537 rows=499999 loops=1)
Output: mydata, mynumber1, mynumber2
Index Cond: (test.mynumber1 < 500000)
Buffers: shared hit=4 read=5533
I/O Timings: read=82.114
Settings: effective_cache_size = '8GB', max_parallel_workers_per_gather = '0'
Query Identifier: 970248584308223886
Planning:
Buffers: shared hit=58 read=29
I/O Timings: read=9.996
Planning Time: 10.529 ms
Execution Time: 190.819 ms
(16 rows)
Time: 213.768 ms
pg-14.4 rw root@db1=#
That’s the EXPLAIN output for the first (ordered) column. It’s worth noting that even though the getrusage() data reported by log_executor_stats is gathered at a different level (by the operating system for the individual database PID), this data actually correlates very closely with the direct output of EXPLAIN ANALYZE on several dimensions:
- Explain reports Execution Time: 190.819 ms and getrusage() reports 0.202054 s elapsed (very close)
- Explain reports total I/O Timings: read=82.114 (subtracted from execution time, this suggests 108.705 ms not doing I/O) and getrusage() reports 0.107946 s user (very close, suggesting we didn’t spend time off CPU for anything other than I/O)
- Explain reports Buffers: shared ... read=5533 (database blocks are 8k and filesystem blocks are 512 bytes, so this suggests 88528 filesystem blocks) and getrusage() reports 89984/... filesystem blocks in/... (again very close)
- Average latency of a PostgreSQL read request (based on EXPLAIN): 82.114 ms / 5533 blocks = 0.0148 ms
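If you want to double-check those comparisons yourself, the conversions are just unit juggling. Here’s a quick shell sketch using the numbers from the run above (nothing PostgreSQL-specific here – it just assumes bc is installed):

# 8k database blocks -> 512-byte filesystem blocks: 5533 shared reads = 88528 fs blocks
echo "5533 * 8192 / 512" | bc

# average latency per PostgreSQL read request: 82.114 ms / 5533 reads = ~0.0148 ms
echo "scale=4; 82.114 / 5533" | bc

# time not spent in I/O: 190.819 ms execution - 82.114 ms of reads = ~108.7 ms (vs ~108 ms user CPU)
echo "scale=3; 190.819 - 82.114" | bc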
On a side note… this is one reason I appreciate that PostgreSQL uses a per-process model rather than the threaded model of MySQL. There are other advantages to the threaded model – but as far as I’m aware, operating systems don’t track resource usage data per thread.
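As a small sketch of what per-process accounting gives you beyond getrusage(): on Linux, the kernel also exposes per-PID I/O counters under /proc. (The PID below is only an example; grab the real one for your session with pg_backend_pid().)

# find the PID of the backend serving the current session
psql -Atc "select pg_backend_pid();"

# per-process I/O counters maintained by the kernel:
#   rchar      = bytes passed back to the process by read() calls (page cache hits included)
#   read_bytes = bytes the process actually caused to be fetched from the storage layer
cat /proc/2033/io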
Repeating the same steps with the second (random) column:
pg-14.4 rw root@db1=# explain (analyze,verbose,buffers,settings) select count(mydata) from test where mynumber2 < 500000;
LOG: 00000: EXECUTOR STATISTICS
DETAIL: ! system usage stats:
! 1.117137 s user, 3.740477 s system, 117.880721 s elapsed
! [1.124219 s user, 3.742426 s system total]
! 7716 kB max resident size
! 11550752/0 [11553144/360] filesystem blocks in/out
! 0/1710 [1/3112] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 254529/2 [254593/2] voluntary/involuntary context switches
LOCATION: ShowUsage, postgres.c:4898
LOG: 00000: duration: 117891.942 ms statement: explain (analyze,verbose,buffers,settings) select count(mydata) from test where mynumber2 < 500000;
LOCATION: exec_simple_query, postgres.c:1306
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1608908.23..1608908.24 rows=1 width=8) (actual time=117869.626..117869.633 rows=1 loops=1)
Output: count(mydata)
Buffers: shared hit=123528 read=378035
I/O Timings: read=116996.143
-> Index Scan using test_mynumber2 on public.test (cost=0.57..1607599.58 rows=523462 width=18) (actual time=0.417..117799.607 rows=500360 loops=1)
Output: mydata, mynumber1, mynumber2
Index Cond: (test.mynumber2 < 500000)
Buffers: shared hit=123528 read=378035
I/O Timings: read=116996.143
Settings: effective_cache_size = '8GB', max_parallel_workers_per_gather = '0'
Query Identifier: 6487474355041514631
Planning:
Buffers: shared hit=58 read=29
I/O Timings: read=9.137
Planning Time: 9.673 ms
Execution Time: 117870.258 ms
(16 rows)
Time: 117892.603 ms (01:57.893)
pg-14.4 rw root@db1=#
Again, we see that the execution time and I/O timings are very close between the getrusage() and EXPLAIN output. It takes over 100 seconds, as I mentioned in my previous article.
And immediately, a huge difference jumps out: the number of blocks being read.
Our first query only needed to read 5,533 database blocks to find the 500,000 rows it was looking for. But – in the same table – this second query needs to read 378,035 blocks to find the 500,000 rows it’s looking for!
This, of course, is simply due to the way the data is physically laid out on disk – which is a product of the query we used to populate the table. Note that in PostgreSQL, a whole new row version is created every time you update even a single byte, so you generally won’t see neat and clean sequential data like this on real-world production systems.
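As a quick sketch of how to see this layout difference from inside the database (assuming the table has a reasonably fresh ANALYZE): the planner keeps a per-column correlation statistic, where values near 1 or -1 mean the physical row order tracks the column order, and values near 0 mean the layout is essentially random.

# physical-ordering statistic for the two columns (requires a recent ANALYZE on "test")
psql -c "select attname, correlation from pg_stats
         where tablename = 'test' and attname in ('mynumber1','mynumber2');"

On this table I’d expect mynumber1 to come back very close to 1 and mynumber2 very close to 0 – the same ordered-versus-random difference we just saw in the buffer counts.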
I’d be remiss right now if I didn’t mention great articles already written by other people on this topic. Two come to mind right away. Two years ago, Ryan Lambert at RustProof Labs wrote an article about almost exactly the test case I’ve reproduced here. His table also has one text column and two numeric columns, populated with ordered and non-ordered data. Ryan’s article does a great job of walking through the fundamentals on this. And just last week, Michael Christofides at pgMustard wrote an excellent technical article about the importance of the BUFFERS metric when studying query performance. I think it’s spot on and well worded, and I pretty much agree with the whole article. I certainly agree that it would be great if BUFFERS were enabled by default for EXPLAIN output in a future version of PostgreSQL.
Side Note: PostgreSQL Optimization on Repeat Buffer Visits
Grant me one brief digression for a moment. Speaking of BUFFERS, there’s a glaring inconsistency in the two explain plans above. The second (random) query shows 123,528 shared hits and 378,035 shared reads. That adds up to 501,563 buffer accesses and this makes perfect sense for 500,000 rows.
But the first (ordered) query shows only 5,537 buffers in total. At first, I was a little surprised to see that “shared hit” was not incremented when buffers were re-visited for multiple rows. Given PostgreSQL’s recursive node-based executor implementation, I wasn’t sure how easy it would be to code optimizations that span multiple nodes of a plan – but the obvious explanation seemed to be a PostgreSQL optimization when doing an index range scan and accessing the same block repeatedly for multiple tuples.
I did a little searching and found that this is exactly the case. It’s an optimization that applies to “heap” type relations when they are accessed through an index. More specifically, the heapam_index_fetch_tuple() function uses the ReleaseAndReadBuffer() call to switch buffers. That call checks whether the block being requested is the same one being released, and saves considerable work by short-circuiting much of the code that normally needs to be called. This is part of the “why” behind Ryan Lambert’s observation (which I also corroborated) that even for fully cached queries accessing the same number of blocks, sequential access patterns are faster – using less CPU – than random access patterns.
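If you want to find this code path yourself, a quick grep against a PostgreSQL source checkout will do it. (The paths below are from the version 14 source tree and assume you’re sitting at the top of the checkout.)

# heapam_index_fetch_tuple() switches heap buffers with ReleaseAndReadBuffer()
grep -n "ReleaseAndReadBuffer" src/backend/access/heap/heapam_handler.c

# ...and the same-block short-circuit sits at the top of ReleaseAndReadBuffer() itself
grep -n -A 30 "^ReleaseAndReadBuffer" src/backend/storage/buffer/bufmgr.c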
Comparing with Equal Number of Blocks
So we expect a little less time on CPU for sequential access patterns, but of course my test here is designed to spend most of the time in I/O. So in the interest of research, let’s reduce the number of rows in the second query so that it accesses roughly the same number of blocks as the first one. I’m curious what we’ll see.
Note that since the random pattern touches roughly one heap block per row, matching ~5,500 blocks means cutting the query down to roughly 5,400 rows (mynumber2 < 5400) – and at that row count, PostgreSQL switched to a bitmap heap scan. In the interest of a more apples-to-apples comparison, I’m going to disable bitmap scans so that PostgreSQL stays with the index range scan.
pg-14.4 rw root@db1=# set enable_bitmapscan=off;
SET
pg-14.4 rw root@db1=# explain (analyze,verbose,buffers,settings) select count(mydata) from test where mynumber2 < 5400;
LOG: 00000: EXECUTOR STATISTICS
DETAIL: ! system usage stats:
! 0.020220 s user, 0.056453 s system, 2.208467 s elapsed
! [0.030746 s user, 0.060501 s system total]
! 7712 kB max resident size
! 92352/0 [94744/360] filesystem blocks in/out
! 0/254 [1/1657] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 5489/0 [5553/0] voluntary/involuntary context switches
LOCATION: ShowUsage, postgres.c:4898
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=22701.59..22701.60 rows=1 width=8) (actual time=2197.749..2197.755 rows=1 loops=1)
Output: count(mydata)
Buffers: shared hit=19 read=5477
I/O Timings: read=2185.404
-> Index Scan using test_mynumber2 on public.test (cost=0.57..22687.46 rows=5652 width=18) (actual time=0.446..2196.772 rows=5479 loops=1)
Output: mydata, mynumber1, mynumber2
Index Cond: (test.mynumber2 < 5400)
Buffers: shared hit=19 read=5477
I/O Timings: read=2185.404
Settings: effective_cache_size = '8GB', enable_bitmapscan = 'off', max_parallel_workers_per_gather = '0'
Query Identifier: 6487474355041514631
Planning:
Buffers: shared hit=58 read=29
I/O Timings: read=9.359
Planning Time: 9.892 ms
Execution Time: 2197.851 ms
(16 rows)
Time: 2220.479 ms (00:02.220)
pg-14.4 rw root@db1=#
Now that is interesting!
The getrusage() CPU numbers are very close, and the operating system (512-byte) block count is pretty close to the database block read count of 5477. But look at the time! We’re reading 5,477 blocks in 2,185 ms, which means the average latency of a PostgreSQL read request is 0.4 ms.
Now let me be honest. I find 0.4 ms read latency from a GP2 volume to be a completely reasonable number. On the other hand, our first (sequential) query reported a PostgreSQL read request latency of 0.0148 ms. I don’t believe that for a second. POSTGRESQL, I’M CALLING YOUR BLUFF, YOU LIAR.
How do I call the bluff? Easy – just take a quick look at iostat output while we run the query on my otherwise idle test system.
First, let’s get the numbers for the random access pattern to make sure they check out.
pg-14.4 rw root@db1=# explain (analyze,verbose,buffers,settings) select count(mydata) from test where mynumber2 < 5400;
[root@ip-172-31-36-129 ~]# iostat -x 4
avg-cpu: %user %nice %system %iowait %steal %idle
2.63 0.00 7.51 23.53 0.00 66.33
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme1n1 0.00 0.00 1383.75 0.00 11644.00 0.00 16.83 0.48 0.34 0.34 0.00 0.36 49.52
That was the second (random) query. The iostat numbers are per-second, so multiplying by 4 (seconds) I see that iostat reports a total of 5535 I/O requests with await of 0.34 ms. Above, PostgreSQL claimed 5,477 database blocks at 0.4 ms. Survey says… PostgreSQL is telling the truth!
pg-14.4 rw root@db1=# explain (analyze,verbose,buffers,settings) select count(mydata) from test where mynumber1 < 500000;
[root@ip-172-31-36-129 ~]# iostat -x 4
avg-cpu: %user %nice %system %iowait %steal %idle
1.38 0.00 0.38 0.88 0.00 97.38
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme1n1 0.00 3.50 104.75 0.50 11344.00 16.00 215.87 0.08 0.80 0.80 1.50 0.51 5.38
Now this was the first (sequential) query. The iostat numbers show await of 0.80 ms. Above, PostgreSQL claimed 0.0148 ms – which as I said was clearly unrealistic.
There you have it, PostgreSQL clearly LIED to us!
But there’s something interesting here. Multiplying by 4, why only 419 I/O requests? The total rkB for all 4 seconds would be 45376 (11344 times 4) which translates to 5672 database blocks at 8k each (45376 divided by 8), so that seems right – but the number of I/O requests is strangely small.
What’s going on here?
In fact, Frits Hoogland nailed it on the head when I first posted this performance puzzle on Twitter:
Linux Readahead
Of course, PostgreSQL isn’t actually lying. It is telling the truth from its perspective, but PostgreSQL reads are not the same thing as physical device reads. There is an operating system in the middle – Linux – which does an awful lot more than many people realize. Linux maintains its own “page” cache and applies its own optimizations to reads and writes in the I/O path. (I made a simple sketch of this architecture for the 2019 article about PostgreSQL invalid pages.) One of those optimizations is called readahead.
My favorite technical write-up that I’ve found so far about Linux readahead is the article that Neil Brown contributed to LWN.net just three months ago. I think PostgreSQL developers will especially appreciate this author’s take, with his emphasis on the importance in the code of clear concepts, language & naming and his emphasis on the value of documentation-writing that’s intertwined and integrated with code-writing.
A core idea in readahead is to take a risk and read more than was requested. If that risk brings rewards and the extra data is accessed, then that justifies a further risk of reading even more data that hasn’t been requested.
Neil Brown
Linux readahead is a heuristic process. The operating system has no direct visibility into which blocks PostgreSQL will request in the future (whereas the database itself could know this – for example, by looking ahead in the index to see which heap blocks will be requested later), but Linux has an algorithm that makes guesses and factors the success rate of past guesses into future ones. I’m only going to scratch the surface of the interactions between PostgreSQL and the Linux readahead algorithm, but there is still a surprise or two (for me) herein.
As a starting point, I’ll use a couple of very simple techniques to look at readahead behavior with individual PostgreSQL queries on my dedicated & idle test system:
- Generate histograms of read request sizes at the operating system level with blktrace | blkiomon
- Generate histograms at the PostgreSQL level with Bertrand Drouvot’s BCC-based pg_ebpf script (plus a few tweaks I made)
- Use blockdev to set the readahead value to zero and compare that behavior with the default value of 256 (see the note on units just below)
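One note on units before we start, since it trips people up: blockdev works in 512-byte sectors while sysfs exposes the same setting in kilobytes, so the default of 256 sectors shows up as 128 kB. A quick sketch of checking both, using the device name from my test system:

# readahead in 512-byte sectors (default: 256)
blockdev --getra /dev/nvme1n1

# the same setting expressed in kB (default: 128)
cat /sys/block/nvme1n1/queue/read_ahead_kb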
Note: Andres Freund has also published a few eBPF scripts for PostgreSQL, including one that directly reports pagecache hits – but they rely on bpftrace, which isn’t yet available in the standard yum repositories for RHEL7, as far as I can tell at the moment. I started with Bertrand’s script simply because I’m lazy and BCC was available from the standard yum repos… but I do want to look closer at Andres’ scripts too.
Linux Readahead Histograms
Bertrand’s script will take any function in the PostgreSQL code base as an input, and for that function it will track every single call, and how long the function takes to run. It will then generate a histogram visualizing the data. I’m going to tap into the PostgreSQL function FileRead which, in this case, captures all of the block reads that I’m interested in. One of my tweaks adds an option to NOT zero the call counts between display updates (a la perf-top), which is convenient for this particular test.
[root@ip-172-31-36-129 pg_ebpf]# ./get_pg_calls_durations.py -i 2 -f FileRead -x /usr/pgsql-14/bin/postgres -p 2033 -n
I don’t have an option with blkiomon to avoid resetting call counts between display updates, so I’m going to set an update interval a little longer than my known query runtime and run the query right after an update, so that I can get a single display with all the information about the query. (And I’ll make sure the next display has all zeros.) A four-second refresh rate should do the trick.
[root@ip-172-31-36-129 ~]# blktrace /dev/nvme1n1 -a issue -a complete -o- | blkiomon -I 4 -h -
Let’s take a look at the first (sequential) query:
[Screenshot: pg_ebpf FileRead histograms (left) and blkiomon block device histograms (right) for the sequential query]
A few observations from the block device stats (on the right):
- The “d2c” histogram shows only 3 reads in the 128-256 microsecond bucket, and the vast majority of reads seem to complete around a half millisecond (either 256-512 microseconds or 512-1024 microseconds). None were less than 128 microseconds.
- There are 347 read requests issued by Linux at a size of 128k which matches the default readahead setting of 256 sectors (a sector is 512 bytes). Contrast that with only 38 read requests for 8k (DB block size) and 66 read requests for 4k (OS page size).
- In the “sizes read (bytes)” summary line, we see sum=46718976 bytes, which very closely matches the total rKB that we saw from iostat earlier of 45376 kB.
- In the “d2c read (usec)” summary line, we see avg=787.1 microseconds, which very closely matches the await time that we saw from iostat earlier of 0.80 ms.
- Both summary lines show num=484 which is close to the 419 I/O requests we saw from iostat in the earlier test.
A few observations from the eBPF histogram (on the left):
- The total for the top (count) histogram means that the FileRead() function was called 5,599 times. This is very close to the 5,533 “shared read” statistic in the EXPLAIN output.
- The total for the bottom (time) histogram means all executions of the FileRead() function ran for a cumulative total of 119,134 microseconds. This invocation reported I/O Timings: read=108.129 in the EXPLAIN output. These numbers are very close. (The original run at the beginning of this article reported 82.114 ms.)
- The top histogram shows that exactly 152 calls to FileRead() completed in more than 256 microseconds, whereas 5447 calls to FileRead() completed in less than 256 microseconds.
When you think about it, it’s amazing how fast this happens. This query starts with a cold pagecache and completes in one tenth of a second. And yet, in the blink of an eye, Linux figures out what’s happening and asynchronously pre-populates its page cache in the background with 98% of the data that PostgreSQL asks for – before PostgreSQL asks for it. We see more than 5000 reads that PostgreSQL issues (to an ostensibly “cold” page cache) served back in single-digit microseconds.
The other thing that amazes me: the tiny percentage (2%) of reads that seem to require physical OS reads (over 256 microseconds) – they account for a cumulative total of 68974 microseconds – that’s 58% of the execution time.
Just imagine how slow this would be without readahead!
Linux Readahead Disabled
Actually, scratch that – instead of imagining it, let’s just try it. (And it’s not as bad as I made it sound.)
[root@ip-172-31-36-129 pg_ebpf]# blockdev --report /dev/nvme1n1
RO RA SSZ BSZ StartSec Size Device
rw 256 512 4096 0 107374182400 /dev/nvme1n1
[root@ip-172-31-36-129 pg_ebpf]# blockdev --setra 0 /dev/nvme1n1
[root@ip-172-31-36-129 pg_ebpf]# blockdev --report /dev/nvme1n1
RO RA SSZ BSZ StartSec Size Device
rw 0 512 4096 0 107374182400 /dev/nvme1n1
[root@ip-172-31-36-129 pg_ebpf]#
Gathering the same histograms:
[Screenshot: the same pg_ebpf and blkiomon histograms, this time with readahead disabled]
Overall execution time of 2847 ms – which is close to the execution time for our query that ran with randomly ordered data! (Not surprising since it did the same amount of I/O, presumably without benefiting from readahead.)
Furthermore:
- Nearly all block-device I/O requests are made at the OS page size of 4k
- At both the block device and PostgreSQL layers, the majority of I/O requests are completing in less than a half millisecond. Interestingly, from Linux’s perspective this GP2 volume seemed to serve most requests in less than a quarter millisecond for this test – and it claims one (likely 4k page) request was served in less than 128 microseconds! (Dang.)
- The PostgreSQL histogram shows that not a single 8k block I/O completed in less than 256 microseconds, which corroborates our interpretation of the earlier histogram involving readahead.
The summary lines from blkiomon both show a total of 11,198 I/O requests. Of course these are 4k OS pages, and that number is double the number of read() calls that PostgreSQL is making to the operating system.
And this is the first thing that surprised me. It means that on all workloads – including completely random access patterns – something in the Linux readahead code path is what keeps PostgreSQL’s 8k reads intact as (at least) 8k requests at the block device, instead of being split into 4k pages.
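(One bit of housekeeping I should mention: the next test only makes sense with readahead back at its default, so assume the setting gets restored first – it’s just the reverse of the earlier blockdev command.)

# put the device readahead back to the default of 256 sectors (128 kB)
blockdev --setra 256 /dev/nvme1n1
blockdev --report /dev/nvme1n1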
Linux Readahead Wastefulness and Dumb Luck
So clearly, Linux readahead code isn’t just a “nice-to-have” but rather it’s integral to an efficient I/O path with any PostgreSQL workload.
But there’s another thing that I thought was surprising.
You may not have noticed, but I glossed over one little detail back in the EXPLAIN output of the original query with a random data layout. I quietly left out any comments about the getrusage()-reported filesystem block reads. I explicitly called this out on the sequential query. Go back up and look again… I’m sneaky, right?
pg-14.4 rw root@db1=# explain (analyze,verbose,buffers,settings) select count(mydata) from test where mynumber2 < 500000;
...
...
! 11550752/0 [11553144/360] filesystem blocks in/out
...
...
Buffers: shared hit=123528 read=378035
Explain reports Buffers: shared ... read=378035 (database blocks are 8k and filesystem blocks are 512 bytes, so this suggests 6048560 filesystem blocks) and getrusage() reports 11550752/... filesystem blocks in/...
That is NOT close. In fact, getrusage() is telling us the operating system reads almost DOUBLE the amount of data being returned to PostgreSQL.
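The “almost double” claim is easy to verify with the same unit juggling as before:

# bytes the OS pulled in from storage vs bytes PostgreSQL actually asked for: ~1.9x
echo "scale=2; (11550752 * 512) / (378035 * 8192)" | bc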
Now that was a surprise to me, since this is a completely random read pattern for the bulk of the I/O requests going to the table/heap.
Let’s take a look at the histograms. (It kills me to wait… but I’ll increase the blkiomon interval to 150!)
[Screenshot: pg_ebpf and blkiomon histograms for the original random query]
Observations:
- PostgreSQL reports 378,060 calls to FileRead() and the summary line in blkiomon shows num=355373, which is about the same number of I/O requests.
- There were more than 100,000 block device read requests in the 16k, 32k, 64k, 128k and 256k buckets. (Readahead doing more than 8k.)
- The summary line “sizes read (bytes)” shows sum=5914181632, which corroborates the earlier iostat numbers (divided by 512, that would be 11551136 filesystem blocks).
- The 378,060 blocks (8k each) that PostgreSQL read = a total of 3097067520 bytes.
So it appears that – yes – Linux readahead is wastefully reading almost double the amount of data into the page cache, and almost half of that data is not returned to PostgreSQL.
We also see that the SQL command “SET log_executor_stats” can be used in PostgreSQL to infer [with getrusage() data] if Linux readahead happens to be badly polluting your OS page cache.
But here’s what surprised me the most: look at the PostgreSQL eBPF histogram. On this random workload, more than a quarter of PostgreSQL FileRead() calls [well over 100,000] completed in single or double-digit microsecond time, indicating pagecache hits due to readahead. And with a minor tweak of the way I populate this table, I saw Linux readahead serve almost 90% of random reads from page cache, starting cold!! (New puzzle… I challenge you to reproduce that! Click the thumbnail on the right for a screenshot!)
Talk about dumb luck.
I don’t care what anyone else says about you, Linux Readahead. Next time I go to Vegas, I’m taking you with me, friend.