Channel: Planet PostgreSQL

Andrew Dunstan: new pg_partman release

Keith Fiske's pg_partman is a pretty nice tool for managing partitions of tables. I've recommended it recently to a couple of clients, and it's working well for them.

Based on that I have made a couple of suggestions for improvement, and today he's made a release including one of them. Previously, the script to rebuild the child indexes essentially removed them all and rebuilt them. Now it only removes those that are no longer on the parent table, and only adds those that are on the parent but not on the child, so if you just add or delete one index on the parent that's all that gets done on the children too.

I'm happy to say that he's also working on my other, more significant suggestion, which is to have a hybrid partitioning mode where the trigger has static inserts for the likely common tables and dynamic inserts for the rest. That will mean you don't have to make nasty choices between flexibility and speed. I'm looking forward to it.

Michael Paquier: sslyze, a SSL scanner supporting Postgres


The last few months have shown a couple of vulnerabilities in OpenSSL, so it is sometimes handy to get a status of how SSL is used on a given instance. For this purpose, there is a nice tool called sslyze that can help scan SSL usage on a given server, and it happens to support the SSLRequest handshake that PostgreSQL embeds in its protocol (see here for more details about the SSLRequest message).

Invoking the Postgres pre-handshake, support for which was added recently with this commit, is done using --starttls=postgres. With "auto", the utility guesses that the Postgres protocol needs to be used based on the port number.

Something important to note is that, at the time of writing, the latest release of sslyze does not include support for the Postgres protocol, so it is necessary to fetch the raw code from GitHub and to add nassl/ in the root of the git code tree to get the utility working (simply fetch it from the latest release build, for example).

Once the utility is ready, simply scan a server with SSL enabled using a command similar to this:

python sslyze.py --regular --starttls=postgres $SERVER_IP:$PORT

PORT would normally be 5432.

Now let's take the case of, for example, EXPORT ciphers, which are not included by default in the list of ciphers available on the server. When they are enabled, the scanner is able to detect their presence:

$ python sslyze.py --regular --starttls=postgres 192.168.172.128:5432 | grep EXP
EXP-EDH-RSA-DES-CBC-SHA       DH-512 bits    40 bits
EXP-EDH-RSA-DES-CBC-SHA       DH-512 bits    40 bits
EXP-EDH-RSA-DES-CBC-SHA       DH-512 bits    40 bits
$ psql -c 'show ssl_ciphers'
         ssl_ciphers
------------------------------
 HIGH:MEDIUM:+3DES:EXP:!aNULL
(1 row)
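Disabling the EXPORT ciphers is then just a matter of adjusting ssl_ciphers. A simple sketch using ALTER SYSTEM on 9.4, with the target value being the one shown in the next example (the parameter requires a server restart to take effect):

ALTER SYSTEM SET ssl_ciphers = 'HIGH:MEDIUM:!EXP:+3DES:!aNULL';
-- then restart the server, e.g. with pg_ctl restart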

And once they are disabled, the opposite happens:

$ python sslyze.py --regular --starttls=postgres 192.168.172.128:5432 | grep EXP
$ psql -c 'show ssl_ciphers'
          ssl_ciphers
-------------------------------
 HIGH:MEDIUM:!EXP:+3DES:!aNULL
(1 row)

This makes sslyze quite a handy tool for getting a full SSL status of a given server.

Tomas Vondra: Performance since PostgreSQL 7.4 / pgbench


In the introduction post I briefly described the motivation for this whole effort, the hardware and benchmarks used, etc. So let's see the results for the first benchmark - the well-known pgbench - and how they have evolved since PostgreSQL 7.4.

If there's one chart in this blog post that you should remember, it's probably the following one - it shows throughput (transactions per second, y-axis) for various numbers of clients (x-axis), for PostgreSQL releases since 7.4.

pgbench-medium-ro-xeon.png

Note: If a version is not shown, it means "same (or almost the same) as the previous version" - for example 9.3 and 9.4 give almost the same performance as 9.2 in this particular test, so only 9.2 is on the chart.

So on 7.4 we could do up to 10k transactions per second (although the scalability was poor, and with more clients the throughput quickly dropped to ~3k tps). Since then the performance gradually improved, and on 9.2 we can do more than 70k tps, and keep this throughput even for higher client counts. Not bad ;-)

Also note that up until PostgreSQL 9.1 (inclusive), we've seen a significant performance drop once we exceeded the number of CPUs (the machine has 8 cores in total, which is exactly the point where the number of transactions per second starts decreasing on the chart).

Of course, these are the results of just one test (read-only on a medium dataset), and gains in other tests may be more modest, but it nicely illustrates the amount of improvement that has happened since PostgreSQL 7.4.

This post is a bit long, but don't worry - most of it is pretty images ;-)

pgbench

But first let's talk a bit more about pgbench, because the short description in the intro post was a bit too brief. So, what is "select-only" mode, "scale", client count and so on?

If you know these pgbench basics, you may just skip to the "Results" section, presenting the results of the benchmarks.

Mode

Pgbench has two main modes - read-write and read-only (there's also a third mode, and an option to use custom scripts, but I won't talk about those).

read-write (default)

The read-write mode executes a simple transaction, simulating a withdrawal from an account - updating the balance in the "accounts" table, selecting the current balance, updating tables representing a branch and a teller, and recording a row in a history table.

BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

As you can see, all the conditions are on a PK column, always work with a single row, etc. Simple.

read-only (SELECT-only)

The read-only mode is even simpler - it simply executes only the "SELECT" query from the read-write mode, so it's pretty much this:

SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

Again, access through PK column - very simple.

Scale

In short, scale determines the size of the database as a number of rows in the main "accounts" table - the supplied value gets multiplied by 100,000 and that's how many rows you get in that table. This of course determines the size on disk, as every 100,000 rows correspond to roughly 15MB on disk (including indexes etc.).
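For example, after initializing with a given scale, the actual size can be verified directly from SQL (a quick check, assuming the default pgbench table names):

-- size of the accounts table (including indexes) and of the whole database
SELECT pg_size_pretty(pg_total_relation_size('pgbench_accounts')) AS accounts_size,
       pg_size_pretty(pg_database_size(current_database())) AS database_size;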

When choosing the scale for your benchmark, you have three basic choices, each testing something slightly different.

small

  • usually scale between 1-10 (15-150MB databases)
  • only a small fraction of RAM (assuming regular hardware)
  • usually exposes locking contention, problems with CPU caches and similar issues not visible with larger scales (where it gets overshadowed by other kinds of overhead - most often I/O)

medium

  • scales corresponding to ~50% of RAM (e.g. 200-300 on systems with 8GB RAM)
  • the database fits into RAM (assuming there's enough free memory for queries)
  • often exposes issues with CPU utilization (especially on read-only workloads) or locking

large

  • scales corresponding to ~200% of RAM, or more (so 1000 on systems with 8GB RAM)
  • the database does not fit into RAM, so both modes (read-only and read-write) hit the I/O subsystem
  • exposes issues with inefficient disk access (e.g. because of missing index) and I/O bottlenecks

Client counts

The third important parameter I'd like to discuss is client count, determining the number of concurrent connections used to execute queries.

Let's assume you measure performance with a single client, and you get e.g. 1000 transactions per second. Now, what performance will you get with 2 clients? In an ideal world, you'd get 2000 tps, twice the performance with a single client.

Similarly, for N clients you'd get N times the performance of a single client - but it's clear it doesn't work like that, because sooner or later you'll run into some bottleneck. Either the CPUs become fully utilized, the I/O subsystem can't handle more I/O requests, you saturate network or memory bandwidth, or you hit some other bottleneck.

As the number of clients is increased, we usually see three segments on the charts:

  • linear scalability at the beginning (so N clients really give nearly N-times the throughput of a single client), until saturating at least one of the resources (CPU, disk, ...)
  • after saturating one of the resources, the system should maintain constant throughput, i.e. adding more clients should result in linear increase of latencies
  • eventually the system runs into even worse issues, either at the DB or kernel level (excessive locking, process management, context switching etc.), usually resulting in a quick performance drop (where latencies grow exponentially)

On a simple chart, it might look like this:

pgbench-client-count.png

Of course, this is how it would look in an ideal world. In practice, the initial growth is not perfectly linear, and the throughput in the second part is not constant but decreases - sometimes quickly (bad), sometimes gradually (better). We'll see examples of this.

For the sake of clarity and brevity, I've left the latencies out of the charts in this blog post. A proper latency analysis would require a considerable amount of space (to show averages, percentiles, ...), and the results turned out to be especially dull and boring in this case. You'll have to trust me that the transaction rate (and the implied average latency) is a sufficient description of the results here.

Jobs count (pgbench threads)

By default, all the pgbench clients (connections) are handled by a single thread - the thread submits new queries, collects results, writes statistics etc. When the queries are long, or when the number of connections is low, this is fine. But as the queries get shorter (e.g. with read-only test on small dataset), or when the number of connections grows (on many-core systems), the management overhead becomes significant.

That's why pgbench has the --jobs N (or -j N) option to specify the number of threads. The value has to be a divisor of the number of clients, so for example with 8 clients you may use 1, 2, 4 or 8 threads, but not 5 (because 8 % 5 = 3, not 0). There are various smart strategies for setting the value, but the easiest and most straightforward way is to use the same value for both options, and that's what I used for this benchmark.

pgbench tweaks

The pgbench tool evolves together with PostgreSQL, taking advantage of new features as they are added, which means new pgbench versions may not work with older PostgreSQL releases (e.g. releases not supporting the fillfactor storage option, which pgbench uses when initializing tables). There are two ways to overcome this - either use the pgbench version matching the PostgreSQL release, or use the new version but patch it so that it works with older releases (by skipping features not present in that PostgreSQL version).

I've used the latter approach, mostly because I wanted to use some of the features available only in recent pgbench versions (especially --aggregate-interval).

The patch I used is available here. It changes three things:

  • adds -X option to disable IF EXISTS when dropping table (because that's not available on old releases)
  • adds -Q option to entirely skip cleanup (because the script simply recreates the database anyway)
  • disables fillfactor on all the tables (so the default version is used)

None of the changes should have any impact on the results, because the first two just remove new syntax (which is not necessary), and after removing the explicit fillfactor option the default is used (which is 100 on all versions).

If you're interested in the script driving the testing, it's available here. It's a bit hackish, but it should give you an idea of how the testing was done.

Benchmarked combinations

So, given all the possible parameters (mode, scale, client count), what combinations were benchmarked?

  • scales: small, medium, large
  • modes: read-only, read-write
  • clients: 1, 2, 3, 4, 6, 8, 10, 12, 14, 16, 20, 24, 28, 32 (14 values)

That gives 84 combinations in total, and for each combination there were 3 runs, 30 minutes each. That's ~126 hours per version (not including initialization and warmup), so ~6 days per PostgreSQL version. There were 12 versions benchmarked (7.4, 8.0, 8.1, 8.2, 8.3, 8.4, 9.0, 9.1, 9.2, 9.3, 9.4b2, head) so the total runtime is close to 70 days.

Hopefully that convinces you this was not a rushed unreliable benchmark.

postgresql.conf

There were only minimal changes to the configuration file:

# machine has 16GB of RAM in total
shared_buffers = 2GB
work_mem = 64MB
maintenance_work_mem = 512MB

checkpoint_segments = 64
checkpoint_completion_target = 0.9
effective_cache_size = 8GB

# when executed on the SSD (otherwise 4)
random_page_cost = 2

Of course, there were some tweaks (e.g. because older versions don't have checkpoint_completion_target), but no extensive per-version tuning was performed, as that would make the testing much more time consuming (to the point that it'd be impossible).

However, I did run a limited number of tests to see how big an impact such tuning may have - I'll show you a chart illustrating that later (spoiler: the older versions may be much faster with a bit of tuning, but the new ones are still faster).

Results

So, let's see the results! I'll only show some of the charts - the rest of them are usually quite similar.

medium, read-only

First, the chart we've already seen:

pgbench-medium-ro-xeon.png

This is a read-only workload, with a dataset that completely fits into RAM. So this is CPU-bound, with locking etc. Yet PostgreSQL 9.4 is 7-20x faster than 7.4. Not bad, I guess.

The other interesting observation is that it scales linearly with the number of cores - in this case it's actually a bit better (the performance on 8 cores is more than 2x of 4 cores), most likely thanks to caching effects. During the 9.2 development this was thoroughly tested (by Robert Haas) and the conclusion was that we do scale linearly up to 64 cores (and probably beyond, but we didn't have machines with more cores).

There's also an excellent talk by Heikki Linnakangas about all the improvements done in PostgreSQL 9.2, presented at FOSDEM 2012.

This is of course dependent on workload - the improvements in locking are most visible on read-only workloads on small datasets, because those are mostly cached and shouldn't do much I/O. Once you start hitting any kind of I/O, it will probably overshadow these improvements. It's also related to improvements made in the Linux kernel, for example the lseek scalability improvements committed into kernel 3.2. And finally, all this is of course dependent on the exact CPU type - on older CPUs the improvements may not be as great.

large, read-only

Let's see how the read-only workload performs on a large dataset (thus not fitting into RAM), on an SSD drive:

pgbench-large-ro-xeon.png

This is clearly I/O bound, because pgbench generates a lot of random I/O (although the SSD handles that much better than rotational drives). Because of the I/O bottleneck the difference is not as significant as with the medium dataset, but 2-4x speedup is still really nice.

There were two releases that significantly improved the performance - first 8.0, then 8.1 (and that's mostly what we get on 9.4, performance-wise).

large, read-write

So, what happens when we keep the large dataset, but do a read-write test?

pgbench-large-rw-xeon.png

Interesting. 7.4 is doing rather poorly (just ~500 transactions per second on an SSD). The 8.x releases perform much better, with roughly 2x the performance, but the throughput is mostly constant after reaching 8 clients (which is the number of CPUs in the machine).

But since 9.1, we see quite interesting behavior - the throughput grows even after 8 clients (where it's mostly equal to 8.x), up to ~2500 tps with 32 clients. And the chart suggests it would probably grow even further with more clients (had they been part of the benchmark). This can happen because SSDs are quite good at handling parallel requests (thanks to an internal architecture that uses multiple independent channels), and since 9.1 we can saturate them much better.

Note 1: The read-write results on medium dataset are very similar to this.

Note 2: If you're looking for a nice but thorough explanation of how SSDs work, I recommend this article on codecapsule.com.

large, read-write on SAS drives

If you're wondering how the comparison would look on traditional rotational drives, this is the answer:

pgbench-large-rw-xeon-sas-drives.png

For the record, these are 10k SAS drives, connected to a HP P400 controller (RAID10 on 6 drives).

It's interesting that 8.4 is slightly faster than 9.4 for higher client counts. We're still doing better than 7.4, but clearly the I/O is a significant bottleneck that's difficult to beat at the code level.

large, read-only / Xeon vs. i5-2500k

Another interesting question is how much this depends on hardware, so here is a comparison with results from the machine with i5-2500k CPU.

pgbench-large-ro-i5.png

Wow! On PostgreSQL 7.4 the difference is "only" 2x, but on 9.4 the i5 really crushes the Xeon, giving nearly ~4x the performance. Also, the i5 has only 4 cores while the Xeon has 8 cores (in 2 sockets), so that's ~8x the performance per socket, while the frequencies are about the same (3 GHz vs. 3.3 GHz).

Just for the record, this difference is not only thanks to the CPU - the i5 system has a better memory throughput, the SSD is not connected using a lousy PCIe controller (which limits the bandwidth), etc. But clearly, modern hardware can make a huge difference.

large, read-only / postgresql.conf tweaks

I also mentioned that each PostgreSQL version may have different "optimal" configuration values. That makes sense, because the development happens in the context of the current hardware. When 7.4 was released 10 years ago, most machines had only 4GB of RAM (not to mention that everything was still 32-bit and thus capped to 2GB), so there was no point in optimizing for large shared_buffers values, for example.

So, what happens if we use much lower settings for shared_buffers (only 128MB instead of the 2GB used in all the previous benchmarks)? This:

pgbench-large-ro-i5-shared-buffers.png

On 7.4, using shared_buffers=128MB improved the performance by a factor of ~3x compared to shared_buffers=2GB. On 8.0 there's still some difference, but it's much smaller (~5%). The nice thing is that 9.4b1 is faster even with the 2GB setting.

Note: There are also other reasons to use small shared_buffers values on older versions (absence of checkpoint_completion_target and usually running them on older filesystems like ext3, with poor fsync behavior).

small, read-write

Those were the medium and large datasets, but what about the small one? The read-only results look quite similar to the medium dataset, but the read-write results seem interesting on their own:

pgbench-small-ro-xeon.png

Wow! 8.2 really improved this, boosting the performance 5x, compared to 8.1. The following releases further improved the results by ~25%, but 8.2 was clearly a huge jump forward.

small / Xeon vs. i5-2500k

And finally, let's do two more comparisons of the Xeon and i5 based systems, with read-only and read-write benchmarks on the small dataset.

pgbench-small-ro-xeon-vs-i5.png

The solid lines are Xeon, the dashed lines are i5. This time the Xeon actually wins over i5 (unlike with the large dataset), at least on 9.4 - having 2x the cores matters here, apparently. It's also nice to see that 9.4 fixed the performance degradation on the Xeon, with client counts higher than the number of CPU cores (on the i5 this did not happen at all).

pgbench-small-rw-xeon-vs-i5.png

On read-write workload, i5 wins again (despite the smaller number of cores).

Summary

Let me summarize the main points we've learned from the benchmark results:

  • The newer the PostgreSQL release, the better the performance. Yay!
  • Depending on the test, some releases make a huge improvement - e.g. PostgreSQL 8.1 improves read-write workloads on small datasets a lot, while PostgreSQL 9.2 makes a big improvement in other workloads.
  • Older releases perform much better with more conservative settings (smaller shared_buffers etc.), because that's the context of their development.
  • Newer hardware matters - if your workload is bound by I/O and you're stuck with the same storage system, locking-related improvements may have only limited impact.

Josh Berkus: KaiGai and PG-Strom March 18th


main-image

One of the hot topics for this year's pgCon will be parallel processing in postgres, including using Postgres with GPU processing. One such project is PG-Strom, led by Tokyo-based PostgreSQL contributor KaiGai Kohei. KaiGai, who works for NEC and is also the leading contributor behind SEPostgres and Row-Level Security, will be visiting the Bay Area and presenting about PG-Strom for SFPUG. RSVP now to join us.

This SFPUG meeting will be hosted by crowd-funding platform Tilt (which runs on Postgres).

Also, registration for next week's pgDaySF is still open! Join us on Tuesday in Burlingame.

Shaun M. Thomas: PG Phriday: Materialized Views, Revisited


Materialized views are a great improvement to performance in many cases. Introduced in PostgreSQL 9.3, they finally added an easy method for turning a view into a transient table that could be indexed, mined for statistics for better planner performance, and easily rebuilt. Unfortunately, refreshing a materialized view in PostgreSQL 9.3 caused a full exclusive lock, blocking any use until the process was complete. In 9.4, this can finally be done concurrently, though there are still a couple caveats.

Fake data is always best to illustrate, so let’s create a very basic table and a materialized view based on its contents. Heck, why not use the table from last week’s PG Phriday? For the purposes of this article, we’ll modify it slightly so there’s more than one day of orders.

CREATE TABLE sys_order
(
    order_id     SERIAL     NOT NULL,
    product_id   INT        NOT NULL,
    item_count   INT        NOT NULL,
    order_dt     TIMESTAMP  NOT NULL DEFAULT now()
);

INSERT INTO sys_order (product_id, item_count, order_dt)
SELECT (a.id % 100) + 1, (random()*10)::INT + 1,
       CURRENT_DATE - ((random()*30)::INT || 'days')::INTERVAL
  FROM generate_series(1, 1000000) a(id);

ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id
      PRIMARY KEY (order_id);

ANALYZE sys_order;

Now, how would we create a materialized view from this? There are a million rows constructed of 100 products, with varying order totals. A good way to use materialized views is to collect the underlying data into some kind of aggregate. How about product order totals for the entire day?

CREATE MATERIALIZED VIEW mv_daily_orders AS
SELECT order_dt, product_id, sum(item_count) AS item_total
  FROM sys_order
 GROUP BY order_dt, product_id;

CREATE UNIQUE INDEX udx_daily_orders_order_dt
    ON mv_daily_orders (order_dt, product_id);

ANALYZE mv_daily_orders;

Note that we added a unique index to the order date and product ID. This is a subtle but required element to concurrently refresh a materialized view. Without it, we would get this error trying a concurrent refresh:

ERROR:  cannot refresh materialized view "public.mv_daily_orders" concurrently
HINT:  Create a unique index with no WHERE clause on one or more columns of the materialized view.

In addition, we also analyzed the table. It’s always good practice to ANALYZE a table after significant modifications so statistics are fresh. The planner and your query run times will thank you. There is an automatic daemon that will analyze tables after a certain threshold of changes, but this threshold may be too low, especially for extremely large tables that are on the receiving end of nightly insert jobs.
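The daemon's thresholds can be inspected, and overridden per table if the defaults don't fit (a sketch; the 0.01 scale factor is just an illustrative value):

-- global defaults
SHOW autovacuum_analyze_threshold;
SHOW autovacuum_analyze_scale_factor;

-- trigger an automatic analyze after a smaller fraction of sys_order has changed
ALTER TABLE sys_order SET (autovacuum_analyze_scale_factor = 0.01);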

The materialized view is much smaller than the original, making it theoretically a better choice for reports and summaries. Let’s take a look at the last five days of sales for product ID number 1:

SELECT order_dt, item_total
  FROM mv_daily_orders
 WHERE order_dt >= CURRENT_DATE - INTERVAL '4 days'
   AND product_id = 1
 ORDER BY order_dt;

      order_dt       | item_total
---------------------+------------
 2015-03-02 00:00:00 |       1875
 2015-03-03 00:00:00 |       1977
 2015-03-04 00:00:00 |       2150
 2015-03-05 00:00:00 |       1859
 2015-03-06 00:00:00 |       1003

Great! The 6th isn’t over yet, so let’s insert some more orders.

INSERT INTO sys_order (product_id, item_count, order_dt)
SELECT (a.id % 100) + 1, (random()*10)::INT + 1, CURRENT_DATE
  FROM generate_series(1, 1000) a(id);

Materialized views do not update when their parent table is modified. This means we have to update the view ourselves by calling REFRESH MATERIALIZED VIEW. In order to illustrate that 9.4 doesn't lock the view during a refresh, we'll send these commands to the first connection:

BEGIN;
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_orders;

Notice that we didn’t end the transaction? That will preserve any locks PostgreSQL allocates while we attempt to access data in the materialized view. Next, we need a second connection where we try to use the view:

SELECT count(1) FROM mv_daily_orders;

 count
-------
  3100

If this were a 9.3 database, we couldn’t use the CONCURRENTLY keyword, and the second connection would have hung until the first connection issued a COMMIT statement. Otherwise, we have normal transaction isolation controls. The view contents would look as if they had not been refreshed until the transaction is committed. Let’s look at those totals again:

      order_dt       | item_total
---------------------+------------
 2015-03-02 00:00:00 |       1875
 2015-03-03 00:00:00 |       1977
 2015-03-04 00:00:00 |       2150
 2015-03-05 00:00:00 |       1859
 2015-03-06 00:00:00 |       1067

Now we have 64 more sales for product number 1.

But what about those caveats? First of all, this is from the PostgreSQL Wiki regarding the concurrent update process:

Instead of locking the materialized view up, it instead creates a temporary updated version of it, compares the two versions, then applies INSERTs and DELETEs against the materialized view to apply the difference. This means queries can still use the materialized view while it’s being updated.

This means a concurrent update needs to re-run the definition query, compare it against the existing rows of the view, and then merge in the changes using INSERT and DELETE statements. Note that in our case, it would have been much more efficient to simply ‘top off’ the view by getting the totals for the current day and replacing them manually. A “real” sys_order table would be much larger than a short 30-day window, and such volume would impart a profound performance impact.

But we can’t do that with the built-in materialized view structure. All manual attempts to modify the view produce this error:

ERROR:  cannot change materialized view "mv_daily_orders"

Drat. Granted, such variance would not be reflected in our view definition and would therefore not be recommended. However, it would be nice if we could define a function and bind that as a callback when REFRESH MATERIALIZED VIEW is invoked, especially when rebuilding the entire data set from scratch is slow and inefficient. The assumption in such a case would be that the function would leave the view in an accurate state when complete.
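Until something like that exists, a plain summary table maintained by a small "top off" function is a workable stand-in when a full rebuild is too expensive. A minimal sketch using the schema above - tab_daily_orders and refresh_daily_orders_today are hypothetical names, and this is an ordinary table rather than a real materialized view:

-- a regular table standing in for the materialized view
CREATE TABLE tab_daily_orders AS
SELECT order_dt, product_id, sum(item_count) AS item_total
  FROM sys_order
 GROUP BY order_dt, product_id;

-- replace only the current day's totals instead of rebuilding everything
CREATE OR REPLACE FUNCTION refresh_daily_orders_today()
RETURNS void LANGUAGE sql AS $f$
  DELETE FROM tab_daily_orders WHERE order_dt >= CURRENT_DATE;
  INSERT INTO tab_daily_orders (order_dt, product_id, item_total)
  SELECT order_dt, product_id, sum(item_count)
    FROM sys_order
   WHERE order_dt >= CURRENT_DATE
   GROUP BY order_dt, product_id;
$f$;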

A concurrent refresh does not lock the view for use, but it can require significant resources and time to rebuild and merge. Why was this approach chosen? The PostgreSQL TRUNCATE command is also transaction safe and entirely atomic: a new table is created based on the definition of the current one, but empty, and the new structures replace the old ones when no transactions refer to them any longer. This same process could have been used for the concurrent view updates, considering the definition query has to be re-run anyway. There are probably some complicated internals that would have made this difficult, but I still wonder.

Which brings us to the second caveat: VACUUM. Since the contents of the materialized view are directly modified with DELETE and INSERT behind the scenes, that leaves dead rows according to PostgreSQL’s MVCC storage mechanism. This means the view has to be vacuumed following a refresh or there’s a risk the structure will bloat over time. Now, the autovacuum daemon should take care of this for us in the long run, but it’s still an element to consider.
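If refreshes run on a schedule and you don't want to rely on autovacuum timing, a manual pass right after the refresh keeps the bloat in check (assuming the view from above):

REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_orders;
VACUUM ANALYZE mv_daily_orders;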

In the end, the CONCURRENTLY option is a vastly needed improvement. Once a materialized view is created, there's a high likelihood it will require periodic updates. If end users can't treat materialized views the same way they would a normal table or view, they'll eventually migrate away from using them at all. That's the path of least resistance, after all; why use a materialized view when it could be exclusively locked for long periods of time?

Such behavior directly conflicts with what MVCC normally provides, and what end users have come to expect: writers do not block readers. Since it's only the data being modified, and not the structure of the view, I personally wonder why CONCURRENTLY isn't the default behavior. As such, if you're using a 9.4 database in conjunction with materialized views, I strongly encourage using the CONCURRENTLY keyword whenever performing refreshes.

Here’s hoping materialized views become even more viable in future releases!

Josh Berkus: Fancy SQL Friday: subtracting arrays

Here's one which just came up:  how to see all of the elements in a new array which were not in the old array.  This isn't currently supported by any of PostgreSQL's array operators, but thanks to UNNEST() and custom operators, you can create your own:

    create or replace function diff_elements_text (
        text[], text[] )
    returns text[]
    language sql
    immutable
    as $f$
    SELECT array_agg(DISTINCT new_arr.elem)
    FROM
        unnest($2) as new_arr(elem)
        LEFT OUTER JOIN
        unnest($1) as old_arr(elem)
        ON new_arr.elem = old_arr.elem
    WHERE old_arr.elem IS NULL;
    $f$;

    create operator - (
        procedure = diff_elements_text,
        leftarg = text[],
        rightarg = text[]
    );


Now you can just subtract text arrays:

    ktsf=# select array['a','n','z'] - array['n','z','d','e'];
    ?column?
    ----------
    {d,e}
    (1 row)


Unfortunately, you'll need to create a new function and operator for each base type; I haven't been able to get it to work with "anyarray". But this should save you some time/code on array comparisons. Enjoy!
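For instance, an integer-array version is the same function and operator with the types swapped. A sketch following the pattern above (note it would conflict with the intarray extension's - operator if that happens to be installed):

    create or replace function diff_elements_int (
        int[], int[] )
    returns int[]
    language sql
    immutable
    as $f$
    SELECT array_agg(DISTINCT new_arr.elem)
    FROM
        unnest($2) as new_arr(elem)
        LEFT OUTER JOIN
        unnest($1) as old_arr(elem)
        ON new_arr.elem = old_arr.elem
    WHERE old_arr.elem IS NULL;
    $f$;

    create operator - (
        procedure = diff_elements_int,
        leftarg = int[],
        rightarg = int[]
    );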


Pavel Golub: MERGE in PostgreSQL


Found a cool trick today on how to implement Oracle's MERGE in PostgreSQL:

Oracle statement:

MERGE INTO acme_obj_value d
USING (SELECT object_id
         FROM acme_state_tmp
      ) s
ON (d.object_id = s.object_id)
  WHEN MATCHED THEN
    UPDATE SET d.date_value = LEAST(l_dt, d.date_value)
  WHEN NOT MATCHED THEN
    INSERT (d.id, d.object_id, d.date_value)
    VALUES (acme_param_sequence.NEXTVAL, s.object_id, l_dt)

PostgreSQL statement:

WITH s AS (
     SELECT object_id
     FROM   acme_state_tmp
),
upd AS (
     UPDATE acme_obj_value
     SET    date_value = LEAST(l_dt, acme_obj_value.date_value)
     FROM   s
     WHERE  acme_obj_value.object_id = s.object_id
     RETURNING acme_obj_value.object_id
)
INSERT INTO acme_obj_value (id, object_id, date_value)
SELECT nextval('acme_param_sequence'), s.object_id, l_dt
FROM   s
WHERE  s.object_id NOT IN (SELECT object_id FROM upd)
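Since l_dt is a PL/SQL variable in the Oracle version, the PostgreSQL statement needs that value bound somehow; one way is to wrap the writable CTE in a function taking it as a parameter. A sketch, assuming the tables and sequence from the example exist:

CREATE OR REPLACE FUNCTION merge_obj_value(l_dt timestamp)
RETURNS void
LANGUAGE sql
AS $f$
WITH s AS (
     SELECT object_id
     FROM   acme_state_tmp
),
upd AS (
     UPDATE acme_obj_value
     SET    date_value = LEAST(l_dt, acme_obj_value.date_value)
     FROM   s
     WHERE  acme_obj_value.object_id = s.object_id
     RETURNING acme_obj_value.object_id
)
INSERT INTO acme_obj_value (id, object_id, date_value)
SELECT nextval('acme_param_sequence'), s.object_id, l_dt
FROM   s
WHERE  s.object_id NOT IN (SELECT object_id FROM upd);
$f$;

-- usage: SELECT merge_obj_value(now()::timestamp);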


Filed under: Coding, PostgreSQL Tagged: PostgreSQL, SQL, trick

Marco Nenciarini: JSONB type performance in PostgreSQL 9.4



The 9.4 version of PostgreSQL introduces the JSONB data type, a specialised representation of the JSON data, allowing PostgreSQL to be competitive in managing the “lingua franca” of the moment for the exchange of data via web services. It is useful to perform a number of tests to verify its actual performance.

Test data

We will run our tests using the customer reviews data from Amazon for the year 1998 in JSON format. The file customer_reviews_nested_1998.json.gz can be downloaded from the website of Citus Data.

The file, once unzipped, takes up 209MB and contains approximately 600,000 records in JSON format with a structure similar to the following:

{
  "customer_id": "ATVPDKIKX0DER",
  "product": {
    "category": "Arts & Photography",
    "group": "Book",
    "id": "1854103040",
    "sales_rank": 72019,
    "similar_ids": ["1854102664", "0893815381", "0893816493", "3037664959", "089381296X"],
    "subcategory": "Art",
    "title": "The Age of Innocence"
  },
  "review": {
    "date": "1995-08-10",
    "helpful_votes": 5,
    "rating": 5,
    "votes": 12
  }
}

Size

The data can be loaded into a PostgreSQL database using the JSONB data type with the following commands:

CREATE TABLE reviews (review jsonb);
\copy reviews FROM 'customer_reviews_nested_1998.json'
VACUUM ANALYZE reviews;

The resulting table will take up approximately 268MB, with an additional cost of disk storage of around 28%. If we try to load the same data using the JSON type, which stores it as text, the result will be a table of 233MB, with an increase in space of roughly 11%. The reason for this difference is that the internal structures of JSONB, which are used to access the data without analysing the entire document each time, have a cost in terms of space.

Data access

Once the data is stored in the database, it is necessary to create an index in order to access it efficiently. Before the 9.4 version of PostgreSQL, the only method of indexing the contents of a JSON field was to use a B-tree index on a specific search expression. For example, if we want to perform a search by product category we will use:

CREATE INDEX ON reviews ((review #>> '{product,category}'));

The newly created index takes up 21MB, or approximately 10% of the original data, and will be used for queries whose WHERE clause contains the exact search expression review #>> '{product,category}', such as:

SELECT
    review #>> '{product,title}' AS title,
    avg((review #>> '{review,rating}')::int)
FROM reviews
WHERE review #>> '{product,category}' = 'Fitness & Yoga'
GROUP BY 1 ORDER BY 2;

                       title                       |        avg
---------------------------------------------------+--------------------
 Kathy Smith - New Yoga Challenge                  | 1.6666666666666667
 Pumping Iron 2                                    | 2.0000000000000000
 Kathy Smith - New Yoga Basics                     | 3.0000000000000000
 Men Are from Mars, Women Are from Venus           | 4.0000000000000000
 Kathy Smith - Functionally Fit - Peak Fat Burning | 4.5000000000000000
 Kathy Smith - Pregnancy Workout                   | 5.0000000000000000
(6 rows)

The query takes approximately 0.180ms to be performed on the test machine, but the index that has been created is highly specific and cannot be used for different searches.

Starting with version 9.4, the JSONB data type supports the use of inverted indexes (GIN, or Generalized Inverted Indexes), which allow indexing of the components of a complex object.

Let’s create a GIN index on our table reviews with the following command:

CREATE INDEX ON reviews USING GIN (review);

The resulting index takes up 64MB of disk, which is approximately 30% of the size of the original table. This index can be used to speed up searches using the following operators:

  • JSON @> JSON - is a subset
  • JSON ? TEXT - contains a value
  • JSON ?& TEXT[] - contains all the values
  • JSON ?| TEXT[] - contains at least one value

The above query can thus be rewritten using the operator @> to search for rows that contain '{"product": {"category": "Fitness & Yoga"}}':

SELECT
    review #>> '{product,title}' AS title,
    avg((review #>> '{review,rating}')::int)
FROM reviews
WHERE review @> '{"product": {"category": "Fitness & Yoga"}}'
GROUP BY 1 ORDER BY 2;

The query takes approximately 1.100ms to be performed on the test machine, and the index that has been created is flexible and can be used for any search within the JSON data.

In fact, it is often the case that the only operation used in the applications is the search for a subset. In this case it is possible to use a different GIN index which only supports the @> operator and is therefore considerably smaller. The syntax for creating this type of “optimised” index is as follows:

 

CREATE INDEX ON reviews USING GIN (review jsonb_path_ops);

The resulting index occupies 46MB – only 22% of the size of the original data – and, thanks to its smaller size, it is used more efficiently by PostgreSQL.

This allows the above query to be run in just 0.167ms, with a performance increase of 650% compared to the original GIN index and 8% compared to the specific B-tree index initially used: all this without loss of generality with regards to the possible search operation.
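To check which index a given query actually uses, EXPLAIN is enough (the exact plan and timings will of course depend on your data and on which of the indexes above exist):

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM reviews
WHERE review @> '{"product": {"category": "Fitness & Yoga"}}';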

Conclusions

With the introduction of the JSONB type and GIN indexes using the jsonb_path_ops operator class, PostgreSQL combines the flexibility of the JSON format with amazing data access speed.

Today it is thus possible to store and process data in JSON format with high performance while enjoying the robustness and flexibility that PostgreSQL has habitually provided us with over the years.


Julien Rouhaud: pg_stat_kcache 2.0


Some history

My colleague Thomas created the first version of pg_stat_kcache about a year ago. This extension is based on getrusage, which provides some useful metrics, not available in PostgreSQL until now:

  • CPU usage (user and system)
  • Disk access (read and write)

PostgreSQL already has its own wrapper around getrusage (see pg_rusage.c), but it's only used in a few places like VACUUM/ANALYZE execution statistics, and only to display CPU usage and execution time; that wasn't enough for our needs.

The first version of the extension gave access to these metrics, but only with the granularity of the query operation (SELECT, UPDATE, INSERT…). It was interesting but still not enough. However, that’s all that could be done with the existing infrastructure.

But then this patch was committed: Expose query ID in pg_stat_statements view. That means that, starting with PostgreSQL 9.4, we now have a way to aggregate statistics per query, database and user, as long as pg_stat_statements is installed, which is far more useful. That's what the new version 2.0 of pg_stat_kcache is all about.

Content

As I said just before, this version of pg_stat_kcache relies on pg_stat_statements:

# CREATE EXTENSION pg_stat_kcache;
ERROR:  required extension "pg_stat_statements" is not installed
# CREATE EXTENSION pg_stat_statements;
CREATE EXTENSION
# CREATE EXTENSION pg_stat_kcache;
CREATE EXTENSION
# \dx
                                 List of installed extensions
        Name        | Version |   Schema   |                        Description
--------------------+---------+------------+-----------------------------------------------------------
 pg_stat_kcache     | 2.0     | public     | Kernel cache statistics gathering
 pg_stat_statements | 1.2     | public     | track execution statistics of all SQL statements executed
 plpgsql            | 1.0     | pg_catalog | PL/pgSQL procedural language
(3 rows)

What does the extension provide?

# \dx+ pg_stat_kcache
Objects in extension "pg_stat_kcache"
       Object Description
---------------------------------
 function pg_stat_kcache()
 function pg_stat_kcache_reset()
 view pg_stat_kcache
 view pg_stat_kcache_detail
(4 rows)

There are two functions:

  • pg_stat_kcache(): returns the metric values, grouped by query, database and user.
  • pg_stat_kcache_reset(): reset the metrics.

And two views on top of the first function:

  • pg_stat_kcache: provide the metrics, aggregated by database only
  • pg_stat_kcache_detail: provide the same information as the pg_stat_kcache() function, but with the actual query text, database and user names.

Here are the units:

  • reads: in bytes
  • reads_blks: raw output of getrusage, the unit is 512-byte blocks on Linux
  • writes: in bytes
  • writes_blks: raw output of getrusage, the unit is 512-byte blocks on Linux
  • user_time: in seconds
  • system_time: in seconds
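Since reads and writes are plain byte counts, they can be passed through pg_size_pretty() for readability (a sketch against the pg_stat_kcache_detail view described above; ordering by writes is just an example):

SELECT datname, rolname, query,
       pg_size_pretty(reads)  AS reads,
       pg_size_pretty(writes) AS writes,
       user_time, system_time
FROM pg_stat_kcache_detail
ORDER BY writes DESC
LIMIT 5;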

Usage

So now, let’s see in detail all this stuff.

Let’s first generate some activity to see all that counters going up:

(postgres@127.0.0.1:59412) [postgres] =# CREATE TABLE big_table (id integer, val text);
CREATE TABLE
=# \timing
=# INSERT INTO big_table SELECT i, repeat('line ' || i, 50) FROM generate_series(1, 1000000) i;
INSERT 0 1000000
Time: 62368.157 ms
=# SELECT i, md5(concat(i::text, md5('line' || i))) FROM generate_series(1, 1000000) i;
[...]
Time: 5135.980 ms

Which gives us:

# \x
# SELECT * FROM pg_stat_kcache_detail;
-[ RECORD 1 ]---------------------------------------------------------------------------------
query       | INSERT INTO big_table SELECT i, repeat(? || i, ?) FROM generate_series(?, ?) i;
datname     | kcache
rolname     | rjuju
reads       | 0
reads_blks  | 0
writes      | 933814272
writes_blks | 107753
user_time   | 7.592
system_time | 0.86
-[ RECORD 2 ]---------------------------------------------------------------------------------
query       | SELECT i, md5(concat(i::text, md5(? || i))) FROM generate_series(?, ?) i;
datname     | kcache
rolname     | rjuju
reads       | 0
reads_blks  | 0
writes      | 14000128
writes_blks | 1709
user_time   | 5.032
system_time | 0.088
[...]

The INSERT query had a runtime of about 1 minute. We see that it used 7.6s of CPU, and wrote 890 MB on disk. Without any surprise, this query is I/O bound.

The SELECT query had a runtime of 5.1s, and it consumed 5s of CPU time. As expected, using md5() is CPU expensive, so the bottleneck here is the CPU. Also, we see that this query wrote 14000128 bytes. Why would a simple SELECT query without any aggregate write 13MB on disk? The answer is generate_series(), which uses a temporary file if the data doesn't fit in work_mem:

# SHOW work_mem;
 work_mem
----------
 10MB

# EXPLAIN (analyze, buffers) SELECT * FROM generate_series(1, 1000000);
                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series  (cost=0.00..10.00 rows=1000 width=4) (actual time=253.849..462.864 rows=1000000 loops=1)
   Buffers: temp read=1710 written=1709
 Planning time: 0.050 ms
 Execution time: 548.298 ms

-- How many bytes are 1709 blocks?
# SELECT 1709 * 8192;
 ?column?
----------
 14000128
(1 row)

Time: 0.753 ms

And we find the exact amount of writes :)

Going further

As we now have the number of bytes physically read from disk, and pg_stat_statements provides the bytes read on shared_buffers, read outside the shared_buffers and written, we can compute many things, like:

  • an exact hit-ratio, meaning having:
    • what was read from the shared_buffers
    • what was read in the filesystem cache
    • what was read from disk

And, thanks to pg_stat_statements, we can compute this exact hit-ratio per query and/or user and/or database!

For instance, getting these metrics on all databases on a server:

# SELECT datname,
         query,
         shared_hit * 100 / int8larger(1, shared_hit + shared_read) AS shared_buffer_hit,
         (shared_read - reads) * 100 / int8larger(1, shared_hit + shared_read) AS system_cache_hit,
         reads * 100 / int8larger(1, shared_hit + shared_read) AS physical_disk_read
  FROM (
        SELECT userid, dbid, queryid, query,
               shared_blks_hit * 8192 AS shared_hit,
               shared_blks_read * 8192 AS shared_read
        FROM pg_stat_statements
       ) s
  JOIN pg_stat_kcache() k USING (userid, dbid, queryid)
  JOIN pg_database d ON s.dbid = d.oid
  ORDER BY 1, 2

Or getting the 5 queries consuming the most I/O writes per database:

# SELECT datname, query, writes
  FROM (
        SELECT datname, query, writes,
               row_number() OVER (PARTITION BY datname ORDER BY writes DESC) num
        FROM pg_stat_statements s
        JOIN pg_stat_kcache() k USING (userid, dbid, queryid)
        JOIN pg_database d ON s.dbid = d.oid
       ) sql
  WHERE num <= 5
  ORDER BY 1 ASC, 3 DESC

As you can see, this new extension is really helpful for getting a lot of information about physical resource consumption on a PostgreSQL server, which wasn't possible to retrieve before.

But you’ll get much more if you use it with PoWA, as it will gather all the required informations periodically, and will do all the maths to show you nice graphs and charts to ease the interpretation of all these metrics.

It mean that you’ll have all these informations, sampled on a few minutes interval. So, knowing which queries use the most CPU between 2 and 3 AM will just be a few clicks away from you.

If you want to take a look at this interface, you can check out the official demo at http://demo-powa.dalibo.com (credentials: powa // demo).

Have fun!

pg_stat_kcache 2.0 was originally published by Julien Rouhaud at rjuju's home on March 04, 2015.

Michael Paquier: Hack to calculate CPU usage of a Postgres backend process


When working on testing WAL compression, I developed a simple hack to calculate the amount of CPU used by a single Postgres backend process during its lifetime, using getrusage invoked at process startup and shutdown. This is not aimed at integration into core, but it may still be useful for people who need to measure the amount of CPU used for a given set of SQL queries when working on a patch. Here is the patch, no more than 20 lines:

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 33720e8..d96a6c6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -32,6 +32,8 @@
 #include <sys/resource.h>
 #endif

+#include <sys/resource.h>
+
 #ifndef HAVE_GETRUSAGE
 #include "rusagestub.h"
 #endif
@@ -174,6 +176,10 @@ static bool RecoveryConflictPending = false;
 static bool RecoveryConflictRetryable = true;
 static ProcSignalReason RecoveryConflictReason;

+/* Amount of user and system time used, tracked at start */
+static struct timeval user_time;
+static struct timeval system_time;
+
 /* ----------------------------------------------------------------
  *     decls for routines only used in this file
  * ----------------------------------------------------------------
@@ -3555,6 +3561,12 @@ PostgresMain(int argc, char *argv[],
    StringInfoData input_message;
    sigjmp_buf  local_sigjmp_buf;
    volatile bool send_ready_for_query = true;
+   struct rusage r;
+
+   /* Get start usage for reference point */
+   getrusage(RUSAGE_SELF, &r);
+   memcpy((char *) &user_time, (char *) &r.ru_utime, sizeof(user_time));
+   memcpy((char *) &system_time, (char *) &r.ru_stime, sizeof(system_time));

    /* Initialize startup process environment if necessary. */
    if (!IsUnderPostmaster)
@@ -4228,6 +4240,14 @@ PostgresMain(int argc, char *argv[],
            case 'X':
            case EOF:

+               /* Get stop status of process and log comparison with start */
+               getrusage(RUSAGE_SELF, &r);
+               elog(LOG,"user diff: %ld.%06ld, system diff: %ld.%06ld",
+                    (long) (r.ru_utime.tv_sec - user_time.tv_sec),
+                    (long) (r.ru_utime.tv_usec - user_time.tv_usec),
+                    (long) (r.ru_stime.tv_sec - system_time.tv_sec),
+                    (long) (r.ru_stime.tv_usec - system_time.tv_usec));
+
                /*
                 * Reset whereToSendOutput to prevent ereport from attempting
                 * to send any more messages to client.

Once the backend code is compiled with it, the logs will be filled with entries showing the amount of user and system CPU consumed during each process's lifetime. Do not forget to use log_line_prefix with %p to identify the PID of the process whose resource usage is calculated. For example, let's take the following test case:

=# show log_line_prefix;
 log_line_prefix
-----------------
 PID %p:
(1 row)
=# select pg_backend_pid();
 pg_backend_pid
----------------
           7502
(1 row)
=# CREATE TABLE huge_table AS SELECT generate_series(1,1000000);
SELECT 1000000

It results in a log entry with the wanted result once the connection is ended:

PID 7502: LOG:  user diff: 1.329707, system diff: 0.107755

Developers, feel free to use it for your own stuff.

Peter Eisentraut: The history of replication in PostgreSQL


2001: PostgreSQL 7.1: write-ahead log

PostgreSQL 7.1 introduced the write-ahead log (WAL). Before that release, all open data files had to be fsynced on every commit, which is very slow. Slow fsyncing is still a problem today, but now we're only worried about fsyncing the WAL, and fsyncing the data files during the checkpoint process. Back then, we had to fsync everything all the time.

In the original design of university POSTGRES, the lack of a log was intentional, and contrasted with heavily log-based architectures such as Oracle. In Oracle, you need the log to roll back changes. In PostgreSQL, the nonoverwriting storage system takes care of that. But probably nobody thought about implications for fsyncing back then.

Note that the WAL was really just an implementation detail at this point. You couldn’t read or archive it.

2004: Slony

Just for context: Slony-I 1.0 was released in July 2004.

2005: PostgreSQL 8.0: point-in-time recovery

PostgreSQL 8.0 added the possibility to copy the WAL somewhere else, and later play it back, either all the way or to a particular point in time, hence the name point-in-time recovery (PITR) for this feature. This feature was mainly intended to relieve pg_dump as a backup method. Until then, the only backup method was a full dump, which would get impractical as databases grew. Hence this method to take an occasional base backup, which is the expensive part, and then add on parts of the WAL, which is cheaper.

The basic configuration mechanisms that we still use today, for example the recovery.conf file, were introduced as part of this feature.

But still no replication here.

2008: PostgreSQL 8.3: pg_standby

Crafty people eventually figured that if you archived WAL on one server and at the same time “recovered” endlessly on another, you’d have a replication setup. You could probably have set this up with your own scripts as early as 8.0, but PostgreSQL 8.3 added the pg_standby program into contrib, which gave everyone a standard tool. So, arguably, 8.3 is the first release that contained a semblance of a built-in replication solution.

The standby server was in permanent recovery until promotion, so it couldn’t be read from as it was replicating. This is what we’d now call a warm standby.

I think a lot of PostgreSQL 8.3 installations refuse to die, because this is the first version where you could easily have a reasonably up-to-date reserve server without resorting to complicated and sometimes problematic tools like Slony or DRBD.

2010: PostgreSQL 9.0: hot standby, streaming replication

In PostgreSQL 9.0, two important replication features arrived completely independently. First, the possibility to connect to a standby server in read-only mode, making it a so-called hot standby. Whereas before, a standby server was really mainly useful only as a reserve in case the primary server failed, with hot standby you could use secondary servers to spread out read-only loads. Second, instead of relying solely on the WAL archive and recovery functionalities to transport WAL data, a standby server could connect directly to the primary server via the existing libpq protocol and obtain WAL data that way, so-called streaming replication. The primary use in this release was that the standby could be more up to date, possibly within seconds, rather than several minutes with the archive-based approach. For a robust setup, you would still need to set up an archive. But streaming replication was also a forward-looking feature that would eventually make replication setups easier, by reducing the reliance on the old archiving mechanisms.

PostgreSQL 9.0 was the first release where one could claim that PostgreSQL “supports replication” without having to make qualifications or excuses. Although it is scheduled to go EOL later this year, I expect this release will continue to live for a long time.

2011: PostgreSQL 9.1: pg_basebackup, synchronous replication

pg_basebackup was one of the features facilitated by streaming replication that made things easier. Instead of having to use external tools like rsync for base backups, pg_basebackup would use a normal libpq connection to pull down a base backup, thus avoiding complicated connection and authentication setups for external tools. (Some people continue to favor rsync because it is faster for them.)

PostgreSQL 9.1 also added synchronous replication, which ensures that data is replicated to the designated synchronous standby before a COMMIT reports success. This feature is frequently misunderstood by users. While it ensures that your data is on at least two servers at all times, it might actually reduce the availability of your system, because if the standby server goes down, the primary will also go down, unless you have a third server available to take over the synchronous standby duty.

Less widely known perhaps is that PostgreSQL 9.1 also added the pg_last_xact_replay_timestamp function for easy monitoring of standby lag.
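A common way to use it is to compute an approximate replay lag directly on the standby (a simple sketch; note that if the primary is idle the value keeps growing even though nothing is actually missing):

-- run on the standby
SELECT now() - pg_last_xact_replay_timestamp() AS approximate_replay_lag;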

In my experience, the availability of pg_basebackup and pg_last_xact_replay_timestamp makes PostgreSQL 9.1 the first release where managing replication was reasonably easy. Go back further, and you might feel constrained by the available tools. But in 9.1, it's not that much different from what is available in the most recent releases.

2012: PostgreSQL 9.2: cascading replication

Not as widely acclaimed, more for the Slony buffs perhaps, PostgreSQL 9.2 allowed standbys to fetch their streaming replication data from other standbys. A particular consequence of that is that pg_basebackup could copy from a standby server, thus taking the load off the primary server for setting up a new standby or standalone copy.

2013: PostgreSQL 9.3: standby can follow timeline switch

This did not even make it into the release note highlights. In PostgreSQL 9.3, when a primary has two standbys and one of the standbys is promoted, the other standby can just keep following the new primary. In previous releases, the second standby would have to be rebuilt. This improvement makes dynamic infrastructure changes much simpler. Not only does it eliminate the time, annoyance, and performance impact of setting up a new standby, more importantly it avoids the situation where, after a promotion, you don't have any up-to-date standbys at all for a while.

2014: PostgreSQL 9.4: replication slots, logical decoding

Logical decoding got all the press for PostgreSQL 9.4, but I think replication slots are the major feature, possibly the biggest replication feature since PostgreSQL 9.0. Note that while streaming replication has gotten more sophisticated over the years, you still needed a WAL archive for complete robustness. That is because the primary server didn’t actually keep a list of its supposed standby servers, it just streamed whatever WAL happened to be requested if it happened to have it. If the standby server fell behind sufficiently far, streaming replication would fail, and recovery from the archive would kick in. If you didn’t have an archive, the standby would then no longer be able to catch up and would have to be rebuilt. And this archiving mechanism has essentially been unchanged since version 8.0, when it was designed for an entirely different purpose. So a replication setup is actually quite messy: You have to configure an access path from the primary to the standby (for archiving) and an access path from the standby to the primary (for streaming). And if you wanted to do multiple standbys or cascading, maintaining the archive could get really complicated. Moreover, I think a lot of archiving setups have problematic archive_command settings. For example, does your archive_command fsync the file on the receiving side? Probably not.

No more: In PostgreSQL 9.4, you can set up so-called replication slots, which effectively means that you register a standby with the primary, and the primary keeps around the WAL for each standby until the standby has fetched it. With this, you can completely get rid of the archiving, unless you need it as a backup.
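Creating a physical slot is a single function call on the primary; the standby then references it via primary_slot_name in its recovery.conf (a sketch; the slot name is arbitrary):

-- on the primary
SELECT * FROM pg_create_physical_replication_slot('standby1');

-- check slot status and how far back WAL is being retained
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;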

2015? PostgreSQL 9.5? pg_rewind?

One of the remaining problems is that promoting a standby leaves the old primary unable to change course and follow the new primary. If you fail over because the old primary died, then that’s not an issue. But if you just want to swap primary and standby, perhaps because the standby has more powerful hardware, then the old primary, now standby, needs to be rebuilt completely from scratch. Transforming an old primary into a new standby without a completely new base backup is a rather intricate problem, but a tool that can do it (currently named pg_rewind) is proposed for inclusion into the next PostgreSQL release.

Beyond

One of the problems that this evolution of replication has created is that the configuration is rather idiosyncratic, quite complicated to get right, and almost impossible to generalize sufficiently for documentation, tutorials, and so on. Dropping archiving with 9.4 might address some of these points, but configuring even just streaming replication is still weird, even weirder if you don’t know how it got here. You need to change several obscure configuration parameters, some on the primary, some on the standby, some of which require a hard restart of the primary server. And then you need to create a new configuration file recovery.conf, even though you don’t want to recover anything. Making changes in this area is mostly a complex political process, because the existing system has served people well over many years, and coming up with a new system that is obviously better and addresses all existing use cases is cumbersome.

Another issue is that all of this functionality has been bolted on to the write-ahead log mechanism, and that constrains all the uses of the write-ahead log in various ways. For example, there are optimizations that skip WAL logging in certain circumstances, but if you want replication, you can’t use them. Who doesn’t want replication? Also, the write-ahead log covers an entire database system and is all or nothing. You can’t replicate only certain tables, for example, or consolidate logs from two different sources.
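
A concrete example of such an optimization (a sketch, not an exhaustive list): with wal_level = minimal, a COPY into a table created in the same transaction can skip WAL entirely, which is impossible once the WAL also has to feed replication:

-- skips WAL only when wal_level = minimal, i.e. no streaming replication
BEGIN;
CREATE TABLE bulk_load (id int, payload text);
COPY bulk_load FROM '/path/to/bulk.csv' WITH (FORMAT csv);  -- hypothetical file path
COMMIT;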

How about not bolting all of this on to the WAL? Have two different logs for two different purposes. This was discussed, especially around the time streaming replication was built. But then you’d need two logs that are almost the same. And the WAL is by design a bottleneck, so creating another log would probably create performance problems.

Logical decoding breaks many of these restrictions and will likely be the foundation for the next round of major replication features. Examples include partial replication and multimaster replication, some of which are being worked on right now.

What can we expect from plain WAL logging in the meantime? Easier configuration is certainly a common request. But can we expect major leaps on functionality? Who knows. At one point, something like hot standby was thought to be nearly impossible. So there might be surprises still.

Keith Fiske: PG Partman – Sub-partitioning

After my talk at PGCon 2014 where I discussed pg_partman, someone I met at the bar track said they’d use it in a heartbeat if it supported sub-partitioning. Discussing this with others and reading online, I found that there is quite a demand for this feature, and the partitioning methods in MySQL & Oracle both support it as well. So I set out to see if I could incorporate it. I thought I’d had it figured out pretty easily and started writing this blog post a while ago (last October) to precede the release of version 1.8.0. Then I started working on the examples here and realized this is a trickier problem to manage than I anticipated. The tricky part is managing the context relationship between the top level parent and its child sub-partitions in a general manner that works for all partitioning types pg_partman supports. When I first started working on the feature, I’d get things like this:

...
  partman_test.time_static_table_p2013_p2013_12_30,
  partman_test.time_static_table_p2013_p2013_12_31,
  partman_test.time_static_table_p2013_p2014_09_06,
  partman_test.time_static_table_p2013_p2014_09_07,
  partman_test.time_static_table_p2013_p2014_09_08,
  partman_test.time_static_table_p2013_p2014_09_09,
...

Obviously, having 2014 child sub-partitions in the 2013 parent partition set doesn’t make sense. I believe I’ve gotten this figured out and handled now in version 1.8.0 and fixed several issues encountered since then in 1.8.1 and 1.8.2 (thanks to the users that reported them!). Also if the parent is serial and the child is time (or vice versa), there’s no contextual relationship and all the child tables will be created in every partition set.

When I first saw that all sub-partitioning did was move the data further down into the inheritance tree, I wondered what the expected gains were, outside of just organizing the data better. The use case of the person I mentioned in the first sentence gave a bit of a hint. If I remember correctly, they had an extremely large amount of time-series data that needed to be queried as efficiently as possible. One of the advantages of partitioning is the constraint exclusion feature (see my other post for more details on this) which allows the query plan to skip tables that it knows don’t contain that data. But postgres still has to do some work in order to figure out that those tables can be excluded in the first place. For very large partition sets, even this is a noticeable performance hit. Sub-partitioning, with a known naming pattern to the child tables, allows an application to target the exact child tables it needs directly and avoid even the minor overhead of constraint exclusion in the query plan.
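
As a rough illustration (using the daily naming pattern shown further down in this post; the date range is arbitrary), an application that knows the pattern can do either of these:

-- hit the known daily sub-partition directly, skipping constraint exclusion entirely
SELECT count(*) FROM partman_test.time_static_table_p2015_p2015_03_06;

-- or query the top-level parent and let constraint exclusion prune the other children
SELECT count(*) FROM partman_test.time_static_table
WHERE col3 >= '2015-03-06' AND col3 < '2015-03-07';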

Let’s see how it works.

First I create a standard, yearly partitioned table set

keith=# CREATE SCHEMA partman_test;
CREATE SCHEMA
Time: 5.125 ms
keith=# CREATE TABLE partman_test.time_static_table (col1 serial primary key, col2 text, col3 timestamptz NOT NULL DEFAULT now());
CREATE TABLE
Time: 25.398 ms
keith=# CREATE INDEX ON partman_test.time_static_table (col3);
CREATE INDEX
Time: 15.003 ms
keith=# SELECT partman.create_parent('partman_test.time_static_table', 'col3', 'time-static', 'yearly');
 create_parent 
---------------
 t
(1 row)

keith=# \d+ partman_test.time_static_table
                                                          Table "partman_test.time_static_table"
 Column |           Type           |                                   Modifiers                                   | Storage  | Stats target | Description 
--------+--------------------------+-------------------------------------------------------------------------------+----------+--------------+-------------
 col1   | integer                  | not null default nextval('partman_test.time_static_table_col1_seq'::regclass) | plain    |              | 
 col2   | text                     |                                                                               | extended |              | 
 col3   | timestamp with time zone | not null default now()                                                        | plain    |              | 
Indexes:
    "time_static_table_pkey" PRIMARY KEY, btree (col1)
    "time_static_table_col3_idx" btree (col3)
Triggers:
    time_static_table_part_trig BEFORE INSERT ON partman_test.time_static_table FOR EACH ROW EXECUTE PROCEDURE partman_test.time_static_table_part_trig_func()
Child tables: partman_test.time_static_table_p2011,
              partman_test.time_static_table_p2012,
              partman_test.time_static_table_p2013,
              partman_test.time_static_table_p2014,
              partman_test.time_static_table_p2015,
              partman_test.time_static_table_p2016,
              partman_test.time_static_table_p2017,
              partman_test.time_static_table_p2018,
              partman_test.time_static_table_p2019

Next some data is added and I check that everything looks right

keith=# INSERT INTO partman_test.time_static_table (col3) VALUES (generate_series('2011-01-01 00:00:00'::timestamptz, CURRENT_TIMESTAMP, '1 hour'::interval));
INSERT 0 0
Time: 1000.392 ms
keith=# SELECT count(*) FROM partman_test.time_static_table;
 count 
-------
 36613
(1 row)

Time: 12.293 ms
keith=# select min(col3), max(col3) FROM partman_test.time_static_table;
          min           |          max           
------------------------+------------------------
 2011-01-01 00:00:00-05 | 2015-03-06 12:00:00-05
(1 row)

Time: 3.915 ms
keith=# select min(col3), max(col3) FROM partman_test.time_static_table_p2013;
          min           |          max           
------------------------+------------------------
 2013-01-01 00:00:00-05 | 2013-12-31 23:00:00-05
(1 row)

Time: 1.794 ms
keith=# select min(col3), max(col3) FROM partman_test.time_static_table_p2015;
          min           |          max           
------------------------+------------------------
 2015-01-01 00:00:00-05 | 2015-03-06 12:00:00-05
(1 row)

Time: 1.785 ms

Say now we want to subpartition by day to better organize our data because we’re expecting to get A LOT of it. The new create_sub_parent() function works just like create_parent() except the first parameter is instead an already existing parent table whose children we want to partition. In this case, we’ll be telling it we want each yearly child table to be further partitioned by day.

keith=# SELECT partman.create_sub_parent('partman_test.time_static_table', 'col3', 'time-static', 'daily');
 create_sub_parent 
-------------------
 t
(1 row)

Time: 3408.460 ms

Hopefully you’ve realized that all the data we inserted isn’t partitioned to the new daily tables yet. It all still resides in each one of the yearly sub-parent tables, and the only daily tables that were created in 2015 are the ones around the current date of March 6th, 2015 +/- 4 days (since the premake config value is set to 4). For previous and future years, only a single partition was created for the lowest possible values. All parent tables in a partition set managed by pg_partman, at all partitioning levels, have at least one child, even if they have no data. You can see that for 2014 below. I don’t yet have things figured out for the data partitioning functions & scripts to handle sub-partitioning, but in the meantime, a query like the one below the table definitions can generate the script lines for every sub-parent table for a given parent.

keith=# \d+ partman_test.time_static_table_p2014
                                                       Table "partman_test.time_static_table_p2014"
 Column |           Type           |                                   Modifiers                                   | Storage  | Stats target | Description 
--------+--------------------------+-------------------------------------------------------------------------------+----------+--------------+-------------
 col1   | integer                  | not null default nextval('partman_test.time_static_table_col1_seq'::regclass) | plain    |              | 
 col2   | text                     |                                                                               | extended |              | 
 col3   | timestamp with time zone | not null default now()                                                        | plain    |              | 
Indexes:
    "time_static_table_p2014_pkey" PRIMARY KEY, btree (col1)
    "time_static_table_p2014_col3_idx" btree (col3)
Check constraints:
    "time_static_table_p2014_partition_check" CHECK (col3 >= '2014-01-01 00:00:00-05'::timestamp with time zone AND col3 < '2015-01-01 00:00:00-05'::timestamp with time zone)
Triggers:
    time_static_table_p2014_part_trig BEFORE INSERT ON partman_test.time_static_table_p2014 FOR EACH ROW EXECUTE PROCEDURE partman_test.time_static_table_p2014_part_trig_func()
Inherits: partman_test.time_static_table
Child tables: partman_test.time_static_table_p2014_p2014_01_01


keith=# \d+ partman_test.time_static_table_p2015
                                                       Table "partman_test.time_static_table_p2015"
 Column |           Type           |                                   Modifiers                                   | Storage  | Stats target | Description 
--------+--------------------------+-------------------------------------------------------------------------------+----------+--------------+-------------
 col1   | integer                  | not null default nextval('partman_test.time_static_table_col1_seq'::regclass) | plain    |              | 
 col2   | text                     |                                                                               | extended |              | 
 col3   | timestamp with time zone | not null default now()                                                        | plain    |              | 
Indexes:
    "time_static_table_p2015_pkey" PRIMARY KEY, btree (col1)
    "time_static_table_p2015_col3_idx" btree (col3)
Check constraints:
    "time_static_table_p2015_partition_check" CHECK (col3 >= '2015-01-01 00:00:00-05'::timestamp with time zone AND col3 < '2016-01-01 00:00:00-05'::timestamp with time zone)
Triggers:
    time_static_table_p2015_part_trig BEFORE INSERT ON partman_test.time_static_table_p2015 FOR EACH ROW EXECUTE PROCEDURE partman_test.time_static_table_p2015_part_trig_func()
Inherits: partman_test.time_static_table
Child tables: partman_test.time_static_table_p2015_p2015_03_02,
              partman_test.time_static_table_p2015_p2015_03_03,
              partman_test.time_static_table_p2015_p2015_03_04,
              partman_test.time_static_table_p2015_p2015_03_05,
              partman_test.time_static_table_p2015_p2015_03_06,
              partman_test.time_static_table_p2015_p2015_03_07,
              partman_test.time_static_table_p2015_p2015_03_08,
              partman_test.time_static_table_p2015_p2015_03_09,
              partman_test.time_static_table_p2015_p2015_03_10

keith=# SELECT DISTINCT 'partition_data.py -p '|| inhparent::regclass ||' -t time -c host=localhost'
FROM pg_inherits
WHERE inhparent::regclass::text ~ 'partman_test.time_static_table'
ORDER BY 1;
                                      ?column?                                       
-------------------------------------------------------------------------------------
 partition_data.py -p partman_test.time_static_table_p2011 -t time -c host=localhost
 partition_data.py -p partman_test.time_static_table_p2012 -t time -c host=localhost
 partition_data.py -p partman_test.time_static_table_p2013 -t time -c host=localhost
 partition_data.py -p partman_test.time_static_table_p2014 -t time -c host=localhost
 partition_data.py -p partman_test.time_static_table_p2015 -t time -c host=localhost
 partition_data.py -p partman_test.time_static_table_p2016 -t time -c host=localhost
 partition_data.py -p partman_test.time_static_table_p2017 -t time -c host=localhost
 partition_data.py -p partman_test.time_static_table_p2018 -t time -c host=localhost
 partition_data.py -p partman_test.time_static_table_p2019 -t time -c host=localhost
 partition_data.py -p partman_test.time_static_table -t time -c host=localhost
(10 rows)

Time: 3.539 ms

After running the partitioning script for each parent, you can see it automatically created 365 child partitions for 2014 (because there was data in the tables for every day) and only 69 for 2015 since we’re only partway into the year. It did so for the other years as well, but I figured showing one should be proof enough it worked.

keith=# \d partman_test.time_static_table_p2014
                                   Table "partman_test.time_static_table_p2014"
 Column |           Type           |                                   Modifiers                                   
--------+--------------------------+-------------------------------------------------------------------------------
 col1   | integer                  | not null default nextval('partman_test.time_static_table_col1_seq'::regclass)
 col2   | text                     | 
 col3   | timestamp with time zone | not null default now()
Indexes:
    "time_static_table_p2014_pkey" PRIMARY KEY, btree (col1)
    "time_static_table_p2014_col3_idx" btree (col3)
Check constraints:
    "time_static_table_p2014_partition_check" CHECK (col3 >= '2014-01-01 00:00:00-05'::timestamp with time zone AND col3 < '2015-01-01 00:00:00-05'::timestamp with time zone)
Triggers:
    time_static_table_p2014_part_trig BEFORE INSERT ON partman_test.time_static_table_p2014 FOR EACH ROW EXECUTE PROCEDURE partman_test.time_static_table_p2014_part_trig_func()
Inherits: partman_test.time_static_table
Number of child tables: 365 (Use \d+ to list them.)

keith=# \d partman_test.time_static_table_p2015
                                   Table "partman_test.time_static_table_p2015"
 Column |           Type           |                                   Modifiers                                   
--------+--------------------------+-------------------------------------------------------------------------------
 col1   | integer                  | not null default nextval('partman_test.time_static_table_col1_seq'::regclass)
 col2   | text                     | 
 col3   | timestamp with time zone | not null default now()
Indexes:
    "time_static_table_p2015_pkey" PRIMARY KEY, btree (col1)
    "time_static_table_p2015_col3_idx" btree (col3)
Check constraints:
    "time_static_table_p2015_partition_check" CHECK (col3 >= '2015-01-01 00:00:00-05'::timestamp with time zone AND col3 < '2016-01-01 00:00:00-05'::timestamp with time zone)
Triggers:
    time_static_table_p2015_part_trig BEFORE INSERT ON partman_test.time_static_table_p2015 FOR EACH ROW EXECUTE PROCEDURE partman_test.time_static_table_p2015_part_trig_func()
Inherits: partman_test.time_static_table
Number of child tables: 69 (Use \d+ to list them.)

Since this is the first sub-partition level, the parent table argument to create_sub_parent() just happens to be the same as we originally used for create_parent(). If you then wanted to further sub-partition one of the new child tables, you would feed that child table to create_sub_parent() and it would be different.
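
For example (purely illustrative - hourly is one of the intervals pg_partman supports), sub-partitioning a single daily child even further might look like:

SELECT partman.create_sub_parent('partman_test.time_static_table_p2015_p2015_03_06', 'col3', 'time-static', 'hourly');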

I’ve also included a howto.md file in pg_partman now that gives some more detailed instructions on this and also how to undo such partitioning as well. If anyone has any issues with this feature, I’d appreciate the feedback.

Also, as a sneak preview for what’s currently in development, I believe I’ve gotten a very simple background worker process to handle partition maintenance working. This means, for the general maintenance where you call run_maintenance() with no parent table argument, you will no longer need an external scheduler such as cron! Just set a few variables in postgresql.conf and pg_partman will take care of things all within postgres itself. This does mean the next major version of pg_partman (2.0.0) will be 9.4+ only (I’m creating a dynamic BGW), but it also allows me to simplify a lot of code I’d been keeping around for 9.1 compatibility and add more features that are only available in later versions. So, think of this as more motivation to get your systems upgraded if you want to keep up with new features in this extension!

Josh Berkus: See you at pgDay SF 2015 tomorrow

pgDay SF 2015 is tomorrow. We've got an exciting lineup of talks, some awesome t-shirts, and of course all the FOSS4G goodness for those of you who can stay the whole week. Last I checked, there were still spots open for onsite registration. See you there!

US PostgreSQL Association: Whatcom PgDay @ LinuxFestNorthwest April 25th & 26th

JD says:

It is that time of year and once again, PostgreSQL will be at LinuxFest Northwest. LinuxFest Northwest is a high attendance (1500+) conference covering Linux and other Open Source technologies. It is a free event (although there are paid options). The PgDay as part of United States PostgreSQL has the following talks!

* Web-Scale PostgreSQL: The Best of the JSON and Relational Worlds. Speaker: Jonathan Katz
* Shootout at the PAAS Corral. Speaker: Josh Berkus
* Vacuum 101. Speaker: Gabrielle Roth


gabrielle roth: PDXPUG Lab Recap: postgres_fdw

Back in January, PDXPUG had a lab night to try out the postgres_fdw. Our labs are even more casual than our meetings: I don’t set an agenda, I invite the attendees to choose specific questions or topics they want to investigate. Here’s what we came up with for our FDW lab: – What is it […]

Tomas Vondra: Performance since PostgreSQL 7.4 / TPC-DS

About a week ago, I posted comparison of pgbench results since PostgreSQL 7.4, which was one of the benchmarks done for my pgconf.eu 2014 talk. For an explanation of the whole quest, please see the first post.

Now it's time to discuss results of the second benchmark - TPC-DS, which models an analytical workload, i.e. queries processing large amounts of data, with aggregations, large sorts, joins of large tables, TOP-N queries etc. That is very different from pgbench, executing queries manipulating individual rows mostly through primary keys, etc.

The one chart you should remember from this post is this one, illustrating how long it takes to execute 41 queries (subset of TPC-DS queries compatible with PostgreSQL since 7.4) on a 16GB dataset (raw CSV size, after loading into database it occupies about 5x the size because of overhead, indexes etc.).

tpc-ds-duration-16-gb.png

The numbers are runtime in seconds (on the i5-2500k machine), and apparently while on PostgreSQL 8.0 it took ~5100 seconds (85 minutes), on 9.4 it takes only ~1200 seconds (20 minutes). That's a huge improvement.

Notice the 8.0 results are marked with a star - that's because on 8.0 one of the queries did not complete within an arbitrary limit of 30 minutes, so it was cancelled and was counted as taking 1h. Based on several experiments, I believe the actual runtime would be even longer than that - in any case it was much longer than on PostgreSQL 8.1, where this particular query got significantly improved by bitmap index scans.

Considering that this is quite I/O intensive (the database size is ~5x the RAM), that's a huge improvement. As we'll see later, with smaller datasets (that completely fit into RAM), the speedup is even larger.

BTW I've had trouble making this work on PostgreSQL 7.4 (without making the results difficult to compare), so I'll only present results for PostgreSQL 8.0 and newer releases.

TPC-DS

But let's talk a bit about the TPC-DS benchmark, because the brief description in the introduction is not really detailed enough. TPC-DS is another benchmark from TPC. It represents analytical workloads (reporting, data analysis, decision support systems and so on), so the queries process large amounts of data, performing aggregations (GROUP BY), various joins, etc.

It effectively extends and deprecates TPC-H benchmark, improving it in multiple ways to make it more representative of actual workloads. Firstly, it makes the schema more complex (e.g. more tables), and uses less uniform distributions of the data (which makes cardinality estimations way more difficult). It also increases the number of query templates from 22 to 99, and uses modern features like CTEs, window functions and grouping sets.

Of course, presenting all the 99 query templates here would be pointless, but one of the simpler ones looks like this:

select ca_zip, sum(cs_sales_price)
  from catalog_sales, customer, customer_address, date_dim
 where cs_bill_customer_sk = c_customer_sk
   and c_current_addr_sk = ca_address_sk
   and (substr(ca_zip, 1, 5) in ('85669', '86197', '88274', '83405', '86475',
                                 '85392', '85460', '80348', '81792')
        or ca_state in ('CA', 'WA', 'GA')
        or cs_sales_price > 500)
   and cs_sold_date_sk = d_date_sk
   and d_qoy = 2
   and d_year = 2000
 group by ca_zip
 order by ca_zip
 limit 100;

This particular query joins 4 tables, uses non-trivial WHERE conditions, aggregation and finally selects only results for the first 100 ZIP codes. The other templates are often way more complex.

Some of the templates are incompatible with PostgreSQL, because they rely on not-yet-implemented features (e.g. CUBE/ROLLUP). Some of the templates also seem broken, as the query generator fails on them. There are 61 queries working fine since PostgreSQL 8.4 (when CTEs and window functions were added to PostgreSQL), and 41 queries are compatible with versions since 7.4. And those 41 queries were used for this benchmark.

Note 1: Most of the remaining queries may be rewritten to make them work, but I haven't done that. Those queries were designed specifically to test those features, and the rewritten versions would benchmark the same things as the remaining queries anyway. I'm pretty sure it's possible to rewrite some of the "compatible" queries to get better performance, but I haven't done that for the same reason.

Note 2: The TPC-DS benchmark also includes tests for maintenance/management tasks (e.g. ALTER TABLE, various DDL etc.) but I have not performed this part.

Dataset Sizes

As with all benchmarks, dataset sizes really matter. I've used two sizes, 1GB and 16GB, specified as the size of the CSV files generated by the TPC tooling.

small / 1GB

  • 5-6GB after loading into database (still fits into RAM)
  • too small for publication of results (according to TPC-DS specification)
  • still interesting, because many datasets are actually quite small

large / 16GB

  • ~80GB after loading into database (a multiple of RAM)
  • non-standard scale (TPC-DS requires 10, 100, ... to make comparisons easier)
  • we don't really care about comparison with other databases anyway (in this benchmark)

Schema

I haven't really spent too much time optimizing the schema - I simply used the schema provided with the TPC-DS suite, and after a few rounds of benchmarking created suitable indexes. I have used the same schema for all PostgreSQL versions, which probably puts the newer versions at a disadvantage, as those might benefit from tweaking the indexes to be suitable for index-only scans, for example.

PostgreSQL Configuration

The PostgreSQL configuration was mostly default, with only minimal changes. All the versions used the same configuration:

shared_buffers = 512MB
work_mem = 128MB
maintenance_work_mem = 128MB
effective_cache_size = 1GB
checkpoint_segments = 32

Similarly to pgbench we could probably get a bit better performance by tuning the values for each version. The values used here are quite conservative already, so don't expect an effect similar to pgbench, when lowering the values resulted in significant speedup on older versions.

Tooling

If you want to review the tooling I used, it's available here. It's a bit hackish (mostly a bunch of shell scripts) and certainly is not "ready to run" in various ways - it does not include the data and query generators (you can get them at tpc.org), and you'll have to modify a few paths in the scripts (to the data directory etc.), but it shouldn't be difficult. Let me know by e-mail if you run into problems.

This also includes DDL (schema including indexes), PostgreSQL config files, query templates and actual queries used for the benchmark.

Data Load

The first thing you often need to do is loading the data into database. The load process is very simple:

  1. COPY - load data from the CSV files into 'fresh' tables (with only primary keys)
  2. CREATE INDEX - all other indexes
  3. VACUUM FULL - compact the table (not really needed)
  4. VACUUM FREEZE - mark all tuples as visible
  5. ANALYZE - collect statistics

and the results for a 1GB dataset (raw size of CSV files) look like this:

tpc-ds-duration.png

Clearly, something changed in 9.0, because the VACUUM FULL step takes much longer, and indeed - that's when VACUUM FULL was rewritten to use the same implementation as CLUSTER, i.e. completely rewriting the table. On older releases it used a different approach, that was more efficient for tables with only small fraction of dead tuples (which is the case here, right after a load), but much less efficient on tables with a large portion of dead tuples (which is exactly when you want to do a CLUSTER).

That means the VACUUM FULL is rather misleading here, because it's used in exactly the context where you should not use it (and instead let autovacuum do its job), so let's remove it from the chart.

tpc-ds-duration-no-vacuum-full.png

Much better, I guess. While it took ~1000 seconds on 8.0, on 9.4 it only takes ~500 seconds - not bad, I guess. The main improvement happened in 8.1, and 8.2 was a minor regression, followed by small incremental improvements.

Let's see the performance with a larger (16GB) dataset:

tpc-ds-load-16-gb.png

Again, about 2x the speedup between 8.0 and 9.4 (just like with the small dataset), but the pattern is slightly different - both 8.1 and 8.2 improved the performance about equally (no regression in 8.2), followed by releases keeping a stable performance.

So far I've been talking about "raw size" of the datasets, i.e. the size of the CSV files produced by the TPC-DS generator. But what does that mean for the database size? After loading the small (1GB) dataset, you'll get about this:

tpc-ds-size-per-1-gb.png

That's a ~5-6GB database - PostgreSQL 9.4 needs about 15% less space compared to 8.0, which is certainly nice. About 60% of that are indexes, leaving ~2.5GB for the tables. Applying this to the 16GB dataset, it will require ~40GB on disk for the tables, and an additional ~60GB for the indexes.
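
If you want to check that table/index split yourself, a rough query for it (assuming the TPC-DS tables live in the public schema) is:

SELECT pg_size_pretty(sum(pg_relation_size(c.oid))) AS tables,
       pg_size_pretty(sum(pg_indexes_size(c.oid))) AS indexes
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.relkind = 'r' AND n.nspname = 'public';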

Querying

And finally, the query performance. On the small dataset (which fits into memory), the average runtime of the 41 queries looks like this:

tpc-ds-duration-1-gb.png

and on the large dataset, you'll get this:

tpc-ds-duration-16-gb.png

Clearly, PostgreSQL 8.1 was a significant improvement. It's also possible to look at duration broken down per query (so that each query gets the same color on all versions):

tpc-ds-duration-16-gb-per-query.png

The regressions on 8.1 and 8.4 - 9.1 are clearly visible - I haven't looked much into them though.

If you want to look into the results, a complete set of results (including logs, EXPLAIN, EXPLAIN ANALYZE and such) for the 16GB dataset is available here. Feel free to point out any inconsistencies or errors.

Improvements

If you look into the release notes, two major features introduced in 8.1 should catch your eye:

  • Allow index scans to use an intermediate in-memory bitmap (Tom)
  • Automatically use indexes for MIN() and MAX() (Tom)

PostgreSQL 9.2 was another release significantly improving performance (almost 2x compared to 9.1), most likely because of this feature (see release notes):

  • Allow queries to retrieve data only from indexes, avoiding heap access (Robert Haas, Ibrar Ahmed, Heikki Linnakangas, Tom Lane)

You can also notice that further improvement happened in 9.4, by about 10%. That's likely thanks to optimization of Numeric aggregates:

  • Improve speed of aggregates that use numeric state values (Hadi Moshayedi)

There are of course many other performance-related improvements in all the releases since 8.0, but those are related to other kinds of queries.

Summary

So, what's the conclusion?

  • Loading is much faster than on 8.0 - about 2x as fast. Most of the speedup happened in 8.1 / 8.2, and performance has stayed about the same since then.
  • The query speedup is even better - PostgreSQL 9.4 is about 7x faster than 8.0. Again, most of the speedup happened in 8.1/8.2, but there are significant improvements in the following releases too.

Ernst-Georg Schmid: The Long Tail - vertical table partitioning III

"In part III I'll try to give a raw estimate how big the performance penalty is when the partitioned table switches from fast to slow storage during a query. And there is another important problem to be solved..."

It took some time...

Since I don't have a SMR HDD, I used a standard 5400 RPM 2TB HDD together with a 128 GB SSD instead. Both drives are attached to SATA2 ports.

The test machine is a Fujitsu Workstation with one Intel i7-4770, 16 GB RAM, running Windows 8.1 64 Bit and PostgreSQL 9.4.1 64 Bit.

CrystalDiskMark gives the following performance data for the pair:

HDD:

          Sequential Read :   139.764 MB/s
          Sequential Write :   128.897 MB/s
         Random Read 512KB :    17.136 MB/s
        Random Write 512KB :    71.074 MB/s
    Random Read 4KB (QD=1) :     0.280 MB/s [    68.3 IOPS]
   Random Write 4KB (QD=1) :     0.642 MB/s [   156.8 IOPS]
   Random Read 4KB (QD=32) :     0.999 MB/s [   243.8 IOPS]
  Random Write 4KB (QD=32) :     0.889 MB/s [   217.0 IOPS]

SSD:

           Sequential Read :   431.087 MB/s
          Sequential Write :   299.641 MB/s
         Random Read 512KB :   268.955 MB/s
        Random Write 512KB :   293.199 MB/s
    Random Read 4KB (QD=1) :    24.519 MB/s [  5986.0 IOPS]
   Random Write 4KB (QD=1) :    67.369 MB/s [ 16447.6 IOPS]
   Random Read 4KB (QD=32) :   328.456 MB/s [ 80189.5 IOPS]
  Random Write 4KB (QD=32) :   205.667 MB/s [ 50211.6 IOPS]


As you can see, the SSD is about 3x faster reading sequentially and about 85x - 328x faster reading random blocks, depending on the command queue depth.

PostgreSQL is running with

shared_buffers = 128kB    # min 128kB

to minimize cache hits since I want to see how the disks perform.

For the 'benchmark' I first set up two tablespaces, hdd and ssd. Then the long tailed table was created as shown in the previous posts:

CREATE UNLOGGED TABLE fast
(
  id serial NOT NULL,
  value real
)
WITH (
  OIDS=FALSE
)
TABLESPACE ssd;

CREATE UNLOGGED TABLE slow
(
  
)
INHERITS (fast)
TABLESPACE hdd;
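
The tablespace creation itself is simple; a minimal sketch (the locations are placeholders for wherever the SSD and HDD are mounted) would be:

CREATE TABLESPACE ssd LOCATION 'D:/pgdata_ssd';
CREATE TABLESPACE hdd LOCATION 'E:/pgdata_hdd';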

Then I created one billion rows in fast and slow:

INSERT INTO fast (value) SELECT random()*1000000000 FROM generate_series(1,1000000000);

INSERT INTO slow SELECT * FROM fast;

First, I wanted to see how each table performs with full table scans. All these numbers are ten-run averages, that's one reason why it took some time :-)

SELECT avg(value) FROM ONLY slow; -- 210 sec

SELECT avg(value) FROM ONLY fast; -- 90 sec

Which pretty much reflects the 3/1 ratio from CrystalDiskMark.

For the random read test, I created primary keys on the id columns of each table, but put their underlying indexes on the SSD to be fair. Then, 10000 random rows were selected from the whole table:

SELECT avg(value) FROM ONLY fast WHERE id IN (SELECT 1+floor(random()*1000000000)::integer FROM generate_series(1,10000)); -- 6 sec

SELECT avg(value) FROM ONLY slow WHERE id IN (SELECT 1+floor(random()*1000000000)::integer FROM generate_series(1,10000)); -- 100 sec

Here, the HDD is about 16x slower than the SSD.

And from the top 20% of each table:

SELECT avg(value) FROM ONLY fast WHERE id IN (SELECT 800000001+floor(random()*200000000)::integer FROM generate_series(1,10000)); -- 5 sec


SELECT avg(value) FROM ONLY slow WHERE id IN (SELECT 800000001+floor(random()*200000000)::integer FROM generate_series(1,10000)); -- 80 sec

Again, the HDD is about 16x slower than the SSD.

Knowing how each table performs, I then moved the top 20% of rows into fast and left the remaining 80% in slow, thus creating the long tailed table.
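
One way to produce this split (a sketch; the id boundary matches the 80/20 ranges used in the queries above) is:

-- keep only the newest 20% (id > 800000000) on the SSD ...
DELETE FROM ONLY fast WHERE id <= 800000000;
-- ... and only the older 80% on the HDD
DELETE FROM ONLY slow WHERE id > 800000000;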

SELECT avg(value) FROM fast; -- 178 sec

Surprise, surprise: 210*0.8=168, 90*0.2=18, 168+18=186. The long tailed table is not slower than its individual parts!

And with random reads?

Whole table:

SELECT avg(value) FROM fast WHERE id IN (SELECT 1+floor(random()*1000000000)::integer FROM generate_series(1,10000)); -- 50 sec

It's way faster than the table on the SSD alone. This seems to be an anomaly I cannot explain at the moment. Either it helps a lot to have two indexes instead of one, or most of the rows were selected from the SSD part.

Top 20% only:

SELECT avg(value) FROM fast WHERE id IN (SELECT 800000001+floor(random()*200000000)::integer FROM generate_series(1,10000)); -- 4 sec

A bit faster than having the whole table on SSD.

Conclusion:

Aside from the (positive) anomaly with random reads on the whole long tailed table, using a long tailed table is at least not slower than a vanilla table, but you can put your data graveyard on slow but inexpensive storage while keeping the hot rows and the indexes on the fast drives.

However, one question remains...

Is it possible to ask PostgreSQL which rows of a table are hot, independent of the application? Then, the background worker could balance the long tailed table without having to know a specific, application dependent access pattern!

And that would be the icing on the cake...

A quick glance over pg_stat* and pg_statio* didn't show anything usable for this task, but I'm open for suggestions. :-)

Greg Sabino Mullane: Postgres searchable release notes - one page with all versions

The inability to easily search the Postgres release notes has been a long-standing annoyance of mine, and a recent thread on the pgsql-general mailing list showed that others share the same frustration. One common example is when a new client comes to End Point with a mysterious Postgres problem. Since it is rare that a client is running the latest Postgres revision (sad but true), the first order of business is to walk through all the revisions to see if a simple Postgres update will cure the problem. Currently, the release notes are arranged on the postgresql.org web site as a series of individual HTML pages, one per version. Reading through them can be very painful - especially if you are trying to search for a specific item. I whipped up a Perl script to gather all of the information, reformat it, clean it up, and summarize everything on one giant HTML page. This is the result: https://bucardo.org/postgres_all_versions.html

Please feel free to use this page however you like. It will be updated as new versions are released. You may notice there are some differences from the original separate pages:

  • All 270 versions are now on a single page. Create a local greppable version with:
    links -dump https://bucardo.org/postgres_all_versions.html > postgres_all_versions.txt
  • All version numbers are written clearly. The confusing "E.x.y" notation was stripped out
  • A table of contents at the top allows for jumping to each version (which has the release date next to it).
  • Every bulleted feature has the version number written right before it, so you never have to scroll up or down to see what version you are currently reading.
  • If a feature was applied to more than one version, all the versions are listed (the current version always appears first).
  • All CVE references are hyperlinks now.
  • All "mailtos" were removed, and other minor cleanups.
  • Replaced single-word names with the full names (e.g. "Massimo Dal Zotto" instead of "Massimo") (see below)

Here's a screenshot showing the bottom of the table of contents, and some of the items for Postgres 9.4:

The name replacements took the most time, as some required a good bit of detective work. Most were unambiguous: "Tom" became "Tom Lane", "Bruce" became "Bruce Momjian", and so on. For the final document, 3781 name replacements were performed! Some of the trickier ones were "Greg" - both myself ("Greg Sabino Mullane") and "Greg Stark" had single-name entries. Similar problems popped up with "Ryan", and the fact that "Peter" was *not* the familiar Peter Eisentraut (but Peter T. Mount) threw me off for a second. The only one I was never able to figure out was "Clark", who is attributed (via Bruce) with "Fix tutorial code" in version 6.5. Pointers or corrections welcome.

Hopefully this page will be of use to others. It's a very large page, but not remarkably wasteful of space, like many HTML pages these days. Perhaps some of the changes will make their way to the official docs over time.

Michael Paquier: Postgres 9.5 feature highlight: Compression of full-page writes in WAL

In Postgres, full-page writes, which are in short complete images of a page added to WAL after the first modification of that page following a checkpoint, can be a source of WAL bloat for applications manipulating many relation pages. Note that full-page writes are critical to ensure data consistency, particularly if a crash happens during a page write, which could leave the page made of both new and old data.

In Postgres 9.5, the following patch has landed to reduce this quantity of "recovery journal" data, by adding the possibility to compress full-page writes in WAL (the full commit message is shortened for this post and can be found here):

commit: 57aa5b2bb11a4dbfdfc0f92370e0742ae5aa367b
author: Fujii Masao <fujii@postgresql.org>
date: Wed, 11 Mar 2015 15:52:24 +0900
Add GUC to enable compression of full page images stored in WAL.

When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server
compresses a full page image written to WAL when full_page_writes is on or
during a base backup. A compressed page image will be decompressed during WAL
replay. Turning this parameter on can reduce the WAL volume without increasing
the risk of unrecoverable data corruption, but at the cost of some extra CPU
spent on the compression during WAL logging and on the decompression during
WAL replay.

[...]

Rahila Syed and Michael Paquier, reviewed in various versions by myself,
Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.

As described in this message, a new GUC parameter called wal_compression, disabled by default so as not to impact existing users, can be used for this purpose. The compression of full-page writes is done using PGLZ, which was moved to libpgcommon a couple of weeks back, the idea being to make it available particularly to frontend utilities of the type pg_xlogdump that decode WAL. Be careful though that compression has a CPU cost, in exchange for reducing the I/O caused by WAL written to disk, so this feature is really for I/O-bound environments or for people who want to reduce their amount of WAL on disk and have some CPU to spare for it. There are also a couple of benefits that can show up when using this feature:

  • WAL replay can speed up, meaning that a node in recovery can recover faster (after a crash, after creating a fresh standby node or whatever)
  • As synchronous replication is very sensitive to WAL length, particularly in the presence of multiple backends that need to wait for WAL flush confirmation from a standby, the write/flush position that a standby reports can be sent back faster because the standby recovers faster - meaning that synchronous replication response gets faster as well.

Note as well that this parameter can be changed without restarting the server, just with a reload (SIGHUP), and that it can be updated within a session; so, for example, if a given application knows that a given query is going to generate a bunch of full-page writes in WAL, wal_compression can be disabled temporarily on a Postgres instance that has it enabled. The contrary is true as well.
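
For example (a sketch - the parameter is superuser-settable), switching it cluster-wide or just for one session could look like:

-- cluster-wide, no restart required:
ALTER SYSTEM SET wal_compression = on;
SELECT pg_reload_conf();

-- or only for the current (superuser) session, e.g. around a FPW-heavy statement:
SET wal_compression = off;
-- ... run the statement ...
RESET wal_compression;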

Now let's have a look at what this feature can do with, for example, the two following tables holding close to 480MB of data each, on a server with 1GB of shared_buffers; the first table contains very repetitive data, and the second uses uuid data (see pgcrypto for more details):

=# CREATE TABLE int_tab (id int);
CREATE TABLE
=# ALTER TABLE int_tab SET (FILLFACTOR = 50);
ALTER TABLE
-- 484MB of repetitive int data
=# INSERT INTO int_tab SELECT 1 FROM generate_series(1,7000000);
INSERT 0 7000000
=# SELECT pg_size_pretty(pg_relation_size('int_tab'));
pg_size_pretty
----------------
 484 MB
(1 row)
=# CREATE TABLE uuid_tab (id uuid);
CREATE TABLE
=# ALTER TABLE uuid_tab SET (FILLFACTOR = 50);
ALTER TABLE
-- 484MB of UUID data
=# INSERT INTO uuid_tab SELECT gen_random_uuid() FROM generate_series(1, 5700000);
INSERT 0 5700000
=# SELECT pg_size_pretty(pg_relation_size('uuid_tab'));
pg_size_pretty
----------------
484 MB
(1 row)

The fillfactor is set to 50%, and each table will be updated, generating complete full-page writes with a minimal hole size to maximize the effects of compression.

Now that the data has been loaded, let's make sure that it is loaded in the database buffers (not mandatory here, but being maniacal costs nothing); the number of shared buffers used by those relations can be fetched at the same time (the two counts are not exactly the same, but such a small difference in pages does not really matter at this scale):

=# SELECT pg_prewarm('uuid_tab');
 pg_prewarm
------------
      61957
(1 row)
=# SELECT pg_prewarm('int_tab');
 pg_prewarm
------------
      61947
(1 row)

After issuing a checkpoint, let's see how this behaves with the following UPDATE commands:

UPDATE uuid_tab SET id = gen_random_uuid();
UPDATE int_tab SET id = 2;
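
As a sketch of how the WAL volume per statement can be measured (the psql variable handling here is just one way to do it):

SELECT pg_current_xlog_location() AS start_lsn \gset
UPDATE uuid_tab SET id = gen_random_uuid();
SELECT pg_size_pretty(
    pg_xlog_location_diff(pg_current_xlog_location(), :'start_lsn'));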

Before and after each command, pg_current_xlog_location() is used to get the XLOG position and evaluate the amount of WAL generated, combined with a trick to calculate the CPU used by a single backend. So, after running that with wal_compression enabled and disabled, I am getting the following results:

Case                      | WAL generated | User CPU | System CPU
--------------------------+---------------+----------+-----------
UUID tab, compressed      | 633 MB        | 30.64    | 1.89
UUID tab, not compressed  | 727 MB        | 17.05    | 0.51
int tab, compressed       | 545 MB        | 20.90    | 0.68
int tab, not compressed   | 727 MB        | 14.54    | 0.84

In short, WAL compression saves roughly 25% of WAL for the integer table, and 13% even with the largely incompressible UUID data!

Note as well that PGLZ is a CPU-eater, so one of the areas of improvement would be to plug in another compression algorithm such as lz4, or to add a hook in the backend code so that full-page writes can be compressed with something whose license is not necessarily compatible with PostgreSQL and therefore cannot be integrated into core code. Another area would be to make this parameter settable at relation level, as the benefit depends on how compressible a schema is. In any case, that's great stuff.

Rikard Pavelic: Fast Postgres from .NET

It's often said that abstractions slow down your program, since they add layers which make your application slower.
While this is generally correct, it's not always true.
Performance can be improved somewhat by removing layers, but the best way to improve performance is to change algorithms.

So let's see how we can beat the performance of writing SQL and doing object materialization by hand, as it is common (but wrong) knowledge that this is the fastest way to talk to the database.

First use case, simple single table access

CREATE TABLE Post
(
  id UUID PRIMARY KEY,
  title VARCHAR NOT NULL,
  created DATE NOT NULL
)

So – the standard pattern to access such a table would be:

SELECT * FROM Post

ignoring (for now) that it would probably be a better style to explicitly name columns. Alternatively, in Postgres we can also do:

SELECT p FROM Post p

which would return a tuple for each row in the table.

For the first query, without going too deep into the actual Postgres protocol, we would get three "columns" with length and content. Parsing such a response would look something like this:

IDataReader dr = ...
return new Post {
    id = dr.GetGuid(0),
    title = dr.GetString(1),
    created = dr.GetDateTime(2)
};

The second query, on the other hand, has only one "column" with length and content. Parsing such a response requires knowledge of Postgres rules for tuple assembly and is similar to parsing JSON. The code would look like this:

IDataReader dr = ...
return PostgresDriverLibrary.Parse<Post>(dr.GetValue(0));

In the TEXT protocol, the response from Postgres would look like this:

(f4d84c89-c179-4ae4-991a-e2e6bc12d879,"some text",2015-03-12)

So, now we can raise a couple of questions:

  • is it faster or slower for Postgres to return the second version?
  • can we parse the second response faster than the first response on the client side?

To make things more interesting, let's investigate how it would compare talking to Postgres using the BINARY protocol in the first case and the TEXT protocol in the second case. Common knowledge tells us that binary protocols are much faster than textual ones, but this also isn't always true:

chart

(DSL Platform – serialization benchmark)

Verdict: for such a simple table, performance of both approaches is similar


(DSL Platform DAL benchmark – single table)

Second use case, master-detail table access

A common pattern in DB access is reading two tables to reconstruct an object on the client side. While we could use several approaches, let's use the "standard" one, which first reads from one table and then from a second one. This can sometimes lead to reading inconsistent data, unless we change the isolation level.
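
A sketch of what "changing the isolation level" means here - wrapping both reads in one REPEATABLE READ transaction so they see the same snapshot:

BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM Invoice WHERE number IN ('invoice-1', 'invoice-2');
SELECT * FROM Item WHERE invoiceNumber IN ('invoice-1', 'invoice-2');
COMMIT;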

For this example, let's use Invoice and Item tables:

CREATE TABLE Invoice
(
  number VARCHAR(20) PRIMARY KEY,
  dueDate DATE NOT NULL,
  total NUMERIC NOT NULL,
  paid TIMESTAMPTZ,
  canceled BOOL NOT NULL,
  version BIGINT NOT NULL,
  tax NUMERIC(15,2) NOT NULL,
  reference VARCHAR(15),
  createdAt TIMESTAMPTZ NOT NULL,
  modifiedAt TIMESTAMPTZ NOT NULL
);

CREATE TABLE Item
(
  invoiceNumber VARCHAR(20) REFERENCES Invoice,
  _index INT,
  PRIMARY KEY (invoiceNumber, _index),
  product VARCHAR(100) NOT NULL,
  cost NUMERIC NOT NULL,
  quantity INT NOT NULL,
  taxGroup NUMERIC(4,1) NOT NULL,
  discount NUMERIC(6,2) NOT NULL
);

To make things more interesting, we'll also investigate how performance would compare if we used a type instead of a table for the items property. In that case we don't need a join or two queries to reconstruct the whole object.

So let's say that we want to read several invoices and their details. We would usually write something along the lines of:

SELECT * FROM Invoice
WHERE number IN ('invoice-1', 'invoice-2', ...)

SELECT * FROM Item
WHERE invoiceNumber IN ('invoice-1', 'invoice-2', ...)

and if we wanted to simplify materialization we could add ordering:

SELECT * FROM Invoice
WHERE number IN ('invoice-1', 'invoice-2', ...) ORDER BY number

SELECT * FROM Item
WHERE invoiceNumber IN ('invoice-1', 'invoice-2', ...) ORDER BY invoiceNumber, _index

While this is slightly more taxing on the database, if we did a more complicated search, it would be much easier to process stuff in order via the second version.

On the other hand, by combining records into one big object directly on the database, we can load it in a single query:

SELECT inv, ARRAY_AGG(SELECT it FROM Item it
                      WHERE it.invoiceNumber = inv.number
                      ORDER BY it._index) AS items
FROM Invoice inv
WHERE inv.number IN ('invoice-1', 'invoice-2', ...)

The above query actually returns two columns, but it could be changed to return only one column.
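
For example, one possible single-column variant (a sketch, not necessarily what the actual tooling generates) wraps everything into a single composite value:

SELECT ROW(inv, ARRAY(SELECT it FROM Item it
                      WHERE it.invoiceNumber = inv.number
                      ORDER BY it._index))
FROM Invoice inv
WHERE inv.number IN ('invoice-1', 'invoice-2');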

Materialization of such objects on the client for the first version would look like this:

IDataReader master = ...
IDataReader detail = ...
var memory = new Dictionary<string, Invoice>();
while (master.Read()) {
    var head = new Invoice {
        number = master.GetString(0),
        dueDate = master.GetDateTime(1),
        ...
    };
    ...
}
while (detail.Read()) {
    var invoice = memory[detail.GetString(0)];
    var item = new Item {
        product = detail.GetString(2),
        cost = detail.GetDecimal(3),
        ...
    };
    invoice.Items.Add(item);
}

The Postgres native format would be materialized as in the first example, along the lines of:

IDataReader dr = ...
return PostgresDriverLibrary.Parse<Invoice>(dr.GetValue(0));

Postgres response in TEXT protocol would start to suffer from nesting and escaping, and would look something like:

(invoice-1,2015-03-16,"{""(invoice-1,1,""""product name"""",...)...}",...)

With each nesting layer more and more space would be spent on escaping. By developing optimized parsers for this specific Postgres TEXT response we can parse such a response very quickly.

Verdict: manual coding of SQL and materialization has become non-trivial. Joins introduce a noticeable performance difference. The manual approach is losing ground.


(DSL Platform DAL benchmark – parent/child)

Third use case, master-child-detail table access

Sometimes we have nesting two levels deep. Since Postgres has rich type support, this is something we can leverage. So, how would our object-oriented modeling approach look if we had to store bank account data in a database?

CREATE TYPE Currency AS ENUM ('EUR', 'USD', 'Other');

CREATE TYPE Transaction AS (
  date DATE,
  description VARCHAR(200),
  currency Currency,
  amount NUMERIC(15,2)
);

CREATE TYPE Account AS (
  balance NUMERIC(15,2),
  number VARCHAR(40),
  name VARCHAR(100),
  notes VARCHAR(800),
  transactions Transaction[]
);

CREATE TABLE BankScrape
(
  id INT PRIMARY KEY,
  website VARCHAR(1024) NOT NULL,
  at TIMESTAMPTZ NOT NULL,
  info HSTORE NOT NULL,
  externalId VARCHAR(50),
  ranking INT NOT NULL,
  tags VARCHAR(10)[] NOT NULL,
  createdAt TIMESTAMPTZ NOT NULL,
  accounts Account[] NOT NULL
);

Our SQL queries and materialization code will look similar to before (although the complexity will have increased drastically). The escaping issue is even worse than before, and while reading transactions we are mostly skipping escaped chars. Not to mention that due to Large Object Heap (LOH) issues we can't just process a string; it must be done using a TextReader.

Verdict: manual coding of SQL and materialization is really complex. Joins introduce a noticeable performance difference. The manual approach is not comparable on any test:


(DSL Platform DAL benchmark – parent/child/child)

Conclusions

  • Although we have looked into simple reading scenarios here, insert/update performance is maybe even more interesting.
  • The approach taken by Revenj and its backing compiler is not something which can realistically be reproduced by manual coding.
  • Postgres suffers when parsing complex tuples – but with smart optimizations that can still yield a net win. There are also a few "interesting" behaviors of Postgres which required various workarounds.
  • It would be interesting to compare BINARY and TEXT protocol on deep nested aggregates.
  • JSON might have similar performance to Postgres native format, but it's probably more taxing on Postgres.