commit: c5660e0aa52d5df27accd8e5e97295cf0e64f7d4
author: Michael Paquier <michael@paquier.xyz>
date: Fri, 18 Jan 2019 09:21:44 +0900
Restrict the use of temporary namespace in two-phase transactions
Attempting to use a temporary table within a two-phase transaction is
forbidden for ages. However, there have been uncovered grounds for
a couple of other object types and commands which work on temporary
objects with two-phase commit. In short, trying to create, lock or drop
an object on a temporary schema should not be authorized within a
two-phase transaction, as it would cause its state to create
dependencies with other sessions, causing all sorts of side effects with
the existing session or other sessions spawned later on trying to use
the same temporary schema name.
Regression tests are added to cover all the grounds found, the original
report mentioned function creation, but monitoring closer there are many
other patterns with LOCK, DROP or CREATE EXTENSION which are involved.
One of the symptoms resulting in combining both is that the session
which used the temporary schema is not able to shut down completely,
waiting for being able to drop the temporary schema, something that it
cannot complete because of the two-phase transaction involved with
temporary objects. In this case the client is able to disconnect but
the session remains alive on the backend-side, potentially blocking
connection backend slots from being used. Other problems reported could
also involve server crashes.
This is back-patched down to v10, which is where 9b013dc has introduced
MyXactFlags, something that this patch relies on.
Reported-by: Alexey Bashtanov
Author: Michael Paquier
Reviewed-by: Masahiko Sawada
Discussion: https://postgr.es/m/5d910e2e-0db8-ec06-dd5f-baec420513c3@imap.cc
Backpatch-through: 10
In PostgreSQL, temporary objects are assigned to a temporary namespace which
gets cleaned up automatically when the session ends, consistently taking care
of any objects which are session-dependent. This can include any type of
object which can be schema-qualified: tables, functions, operators, or even
extensions (when linked with a temporary schema). The schema name is chosen
based on the position of the session in a backend array, prefixed with
“pg_temp_”, hence it is perfectly possible to end up with a different
temporary namespace name when reconnecting a session. There are a couple of
functions which can be used to check the status of this schema:
pg_my_temp_schema, to get the OID of the temporary schema in use,
useful when cast with “::regnamespace”.
pg_is_other_temp_schema, to check if a schema is from the existing
session or not.
To a certain degree, current_schema and current_schemas are also useful,
as they display respectively the current schema in use and the schemas
in “search_path”. Note that it is possible to include “pg_temp” directly
in “search_path” as an alias of the temporary schema, and that those
functions will return the effective temporary schema name. For example:
=# SET search_path = 'pg_temp';
SET
=# SELECT current_schema();
 current_schema
----------------
 pg_temp_3
(1 row)
=# SELECT pg_my_temp_schema()::regnamespace;
 pg_my_temp_schema
-------------------
 pg_temp_3
(1 row)
=# SELECT pg_is_other_temp_schema(pg_my_temp_schema());
 pg_is_other_temp_schema
-------------------------
 f
(1 row)
One thing to note in this particular case is that current_schema() may
end up creating a temporary schema, as it needs to return the real
temporary namespace associated with a session, and not an alias like
“pg_temp”, since in some cases the alias does not work with certain
commands. One example of that is CREATE EXTENSION when told to create its
objects in the session’s temporary schema (note that ALTER EXTENSION
cannot move an extension’s contents from a persistent schema to a temporary
one).
Another thing, essential to understand, is that all those temporary objects
are linked to a given session, but two-phase commit is not. Hence, it is
perfectly possible to run PREPARE TRANSACTION in one session, and COMMIT
PREPARED in a second session. The problem discussed in the thread mentioned
above is that one could associate temporary objects with a
two-phase transaction, which is logically incorrect. An effect of doing so
is that the drop of the temporary schema at the end of a session would block
until the two-phase transaction is committed, keeping a backend
slot occupied and potentially messing up upcoming sessions trying
to use the same temporary schema. If this effect accumulates and many
two-phase transactions are left uncommitted, this could bloat the shared
memory areas reserved for connections, preventing future connections.
Multiple object types may be involved, and there are other patterns like
LOCK on a temporary table within a transaction running two-phase commit,
or just the drop of a temporary object. One visible effect is for example a
session waiting for a lock to be released while the client thinks that
the session has actually finished, which can be reproduced with just
this:
=# CREATE TEMP TABLE temp_tab (a int);
CREATE TABLE
=# BEGIN;
BEGIN
=# LOCK temp_tab IN ACCESS EXCLUSIVE MODE;
LOCK TABLE
=# PREPARE TRANSACTION '2pc_lock_temp';
PREPARE TRANSACTION
-- Leave the session
=# \q
When patched, PREPARE TRANSACTION would just throw an error instead.
=# PREPARE TRANSACTION '2pc_lock_temp';
ERROR: 0A000: cannot PREPARE a transaction that has operated on temporary objects
LOCATION: PrepareTransaction, xact.c:2284
The fix here further restricts two-phase transactions that involve temporary
objects. For many years the restriction covered only the use of temporary
tables within such transactions, so this tightens the remaining corner
cases found.
Note that this fix only found its way down to Postgres 10, as it relies
on a session-level variable called MyXactFlags, which can be used in a
transaction to mark certain events. In the case of two-phase commit, the
flag is used to raise a proper error at PREPARE TRANSACTION time, so that
the state of the transaction does not mess up with the temporary namespace
and there are no after-effects for the existing session or a future session
trying to use the same temporary namespace. It could be possible to lower
the restriction, particularly for temporary tables which use ON COMMIT
DROP, but that would be rather tricky to achieve, as it would need
special handling of temporary objects which now happens at COMMIT PREPARED
phase.
Braintree Payments uses PostgreSQL as its primary datastore. We rely heavily on the data safety and consistency guarantees a traditional relational database offers us, but these guarantees come with certain operational difficulties. To make things even more interesting, we allow zero scheduled functional downtime for our main payments processing services.
Several years ago we published a blog post detailing some of the things we had learned about how to safely run DDL (data definition language) operations without interrupting our production API traffic.
Since that time PostgreSQL has gone through quite a few major upgrade cycles — several of which have added improved support for concurrent DDL. We’ve also further refined our processes. Given how much has changed, we figured it was time for a blog post redux.
For all code and database changes, we require that:
Live code and schemas be forward-compatible with updated code and schemas: this allows us to roll out deploys gradually across a fleet of application servers and database clusters.
New code and schemas be backward-compatible with live code and schemas: this allows us to roll back any change to the previous version in the event of unexpected errors.
For all DDL operations we require that:
Any exclusive locks acquired on tables or indexes be held for at most ~2 seconds.
Rollback strategies do not involve reverting the database schema to its previous version.
Transactionality
PostgreSQL supports transactional DDL. In most cases, you can execute multiple DDL statements inside an explicit database transaction and take an “all or nothing” approach to a set of changes. However, running multiple DDL statements inside a transaction has one serious downside: if you alter multiple objects, you’ll need to acquire exclusive locks on all of those objects in a single transaction. Because locking multiple tables creates the possibility of deadlock and increases exposure to long waits, we do not combine multiple DDL statements into a single transaction. PostgreSQL will still execute each separate DDL statement transactionally; each statement will be either cleanly applied or fail, and the transaction rolled back.
Note: Concurrent index creation is a special case. Postgres disallows executing CREATE INDEX CONCURRENTLY inside an explicit transaction; instead Postgres itself manages the transactions. If for some reason the index build fails before completion, you may need to drop the index before retrying, though the index will still never be used for regular queries if it did not finish building successfully.
Locking
PostgreSQL has many different levels of locking. We’re concerned primarily with the following table-level locks since DDL generally operates at these levels:
ACCESS EXCLUSIVE: blocks all usage of the locked table.
SHARE ROW EXCLUSIVE: blocks concurrent DDL against, and row modification in, the locked table (reads are still allowed).
SHARE UPDATE EXCLUSIVE: blocks concurrent DDL against the locked table.
Note: “Concurrent DDL” for these purposes includes VACUUM and ANALYZE operations.
All DDL operations generally necessitate acquiring one of these locks on the object being manipulated. For example, when you run a statement like the following (the specific column added here is just an illustration):
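-- an illustrative statement; any ALTER TABLE against foos behaves the same way
ALTER TABLE foos ADD COLUMN bar text;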
PostgreSQL attempts to acquire an ACCESS EXCLUSIVE lock on the table foos. Attempting to acquire this lock causes all subsequent queries on this table to queue until the lock is released. In practice your DDL operations can cause other queries to back up for as long as your longest running query takes to execute. Because arbitrarily long queueing of incoming queries is indistinguishable from an outage, we try to avoid any long-running queries in databases supporting our payments processing applications.
But sometimes a query takes longer than you expect. Or maybe you have a few special case queries that you already know will take a long time. PostgreSQL offers some additional runtime configuration options that allow us to guarantee query queueing backpressure doesn’t result in downtime.
Instead of relying on Postgres to lock an object when executing a DDL statement, we acquire the lock explicitly ourselves. This allows us to carefully control the time the queries may be queued. Additionally when we fail to acquire a lock within several seconds, we pause before trying again so that any queued queries can be executed without significantly increasing load. Finally, before we attempt lock acquisition, we query pg_locks¹ for any currently long running queries to avoid unnecessarily queueing queries for several seconds when it is unlikely that lock acquisition is going to succeed.
Starting with Postgres 9.3, you can adjust the lock_timeout parameter to control how long Postgres will wait for lock acquisition before giving up and returning without acquiring the lock. If you happen to be using 9.2 or earlier (those versions are unsupported; you should upgrade!), then you can simulate this behavior by using the statement_timeout parameter around an explicit LOCK <table> statement.
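A minimal sketch of that pattern on Postgres 9.3 or later, with an illustrative table name; retrying and backoff are left to the surrounding tooling:

SET lock_timeout TO '2s';
BEGIN;
-- fails with an error instead of queueing indefinitely if the lock
-- cannot be acquired within 2 seconds
LOCK TABLE foos IN ACCESS EXCLUSIVE MODE;
ALTER TABLE foos ADD COLUMN bar text;  -- the actual DDL, run while we hold the lock
COMMIT;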
In many cases an ACCESS EXCLUSIVE lock need only be held for a very short period of time, i.e., the amount of time it takes Postgres to update its "catalog" (think metadata) tables. Below we'll discuss the cases where a lower lock level is sufficient or alternative approaches for avoiding long-held locks that block SELECT/INSERT/UPDATE/DELETE.
Note: Sometimes holding even an ACCESS EXCLUSIVE lock for something more than a catalog update (e.g., a full table scan or even rewrite) can be functionally acceptable when the table size is relatively small. We recommend testing your specific use case against realistic data sizes and hardware to see if a particular operation will be "fast enough". On good hardware with a table easily loaded into memory, a full table scan or rewrite for thousands (possibly even 100s of thousands) of rows may be "fast enough".
Table operations
Create table
In general, adding a table is one of the few operations we don’t have to think too hard about since, by definition, the object we’re “modifying” can’t possibly be in use yet. :D
While most of the attributes involved in creating a table do not involve other database objects, including a foreign key in your initial table definition will cause Postgres to acquire a SHARE ROW EXCLUSIVE lock against the referenced table blocking any concurrent DDL or row modifications. While this lock should be short-lived, it nonetheless requires the same caution as any other operation acquiring such a lock. We prefer to split these into two separate operations: create the table and then add the foreign key.
Drop table
Dropping a table requires an exclusive lock on that table. As long as the table isn’t in current use you can safely drop the table. Before allowing a DROP TABLE ... to make its way into our production environments we require documentation showing when all references to the table were removed from the codebase. To double check that this is the case you can query PostgreSQL's table statistics view pg_stat_user_tables² confirming that the returned statistics don't change over the course of a reasonable length of time.
Rename table
While it’s unsurprising that a table rename requires acquiring an ACCESS EXCLUSIVE lock on the table, that's far from our biggest concern. Unless the table is no longer being read from or written to at all, it's very unlikely that your application code could safely handle a table being renamed underneath it.
We avoid table renames almost entirely. But if a rename is an absolute must, then a safe approach might look something like the following:
Create a new table with the same schema as the old one.
Backfill the new table with a copy of the data in the old table.
Use INSERT and UPDATE triggers on the old table to maintain parity in the new table.
Begin using the new table.
Other approaches involving views and/or RULEs may also be viable depending on the performance characteristics required.
Column operations
Note: For column constraints (e.g., NOT NULL) or other constraints (e.g., EXCLUDES), see Constraints.
Add column
Adding a column to an existing table generally requires holding a short ACCESS EXCLUSIVE lock on the table while catalog tables are updated. But there are several potential gotchas:
Default values: Introducing a default value at the same time as adding the column will cause the table to be locked while the default value is propagated to all rows in the table. Instead, you should:
Add the new column (without the default value).
Set the default value on the column.
Backfill all existing rows separately.
Note: In the recently released PostgreSQL 11, this is no longer the case for non-volatile default values. Instead, adding a new column with a default value only requires updating catalog tables, and any reads of rows without a value for the new column will magically have it “filled in” on the fly.
Not-null constraints: Adding a column with a NOT NULL constraint is only possible if there are no existing rows or a DEFAULT is also provided. If there are no existing rows, then the change is effectively equivalent to a catalog only change. If there are existing rows and you are also specifying a default value, then the same caveats apply as above with respect to default values.
Note: Adding a column will cause all SELECT * FROM ... style queries referencing the table to begin returning the new column. It is important to ensure that all currently running code safely handles new columns. To avoid this gotcha in our applications we require queries to avoid * expansion in favor of explicit column references.
Change column type
In the general case changing a column’s type requires holding an exclusive lock on a table while the entire table is rewritten with the new type.
There are a few exceptions:
Note: Even though one of the exceptions above was added in 9.1, changing the type of an indexed column would still always rewrite the index even if a table rewrite was avoided. Since 9.2, any column type change that avoids a table rewrite also avoids rewriting the associated indexes. If you’d like to confirm that your change won’t rewrite the table or any indexes, you can query pg_class³ and verify that the relfilenode column doesn't change.
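A sketch of that check with assumed table, column, and type names:

SELECT relfilenode FROM pg_class WHERE relname = 'foos';
ALTER TABLE foos ALTER COLUMN bar TYPE varchar(255);
SELECT relfilenode FROM pg_class WHERE relname = 'foos';  -- an unchanged value means no table rewrite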
If you need to change the type of a column and one of the above exceptions doesn’t apply, then the safe alternative is:
Add a new column new_<column>.
Dual write to both columns (e.g., with a BEFORE INSERT/UPDATE trigger).
Backfill the new column with a copy of the old column’s values.
Rename <column> to old_<column> and new_<column> to <column> inside a single transaction with an explicit LOCK <table> statement.
Drop the old column.
Drop column
It goes without saying that dropping a column is something that should be done with great care. Dropping a column requires an exclusive lock on the table to update the catalog but does not rewrite the table. As long as the column isn’t in current use you can safely drop the column. It’s also important to confirm that the column is not referenced by any dependent objects that could be unsafe to drop. In particular, any indexes using the column should be dropped separately and safely with DROP INDEX CONCURRENTLY since otherwise they will be automatically dropped along with the column under an ACCESS EXCLUSIVE lock. You can query pg_depend⁴ for any dependent objects.
Before allowing an ALTER TABLE ... DROP COLUMN ... to make its way into our production environments we require documentation showing when all references to the column were removed from the codebase. This process allows us to safely roll back to the release prior to the one that dropped the column.
Note: Dropping a column will require that you update all views, triggers, functions, etc. that rely on that column.
Index operations
Create index
The standard form of CREATE INDEX ... acquires an ACCESS EXCLUSIVE lock against the table being indexed while building the index using a single table scan. In contrast, the form CREATE INDEX CONCURRENTLY ... acquires a SHARE UPDATE EXCLUSIVE lock but must complete two table scans (and hence is somewhat slower). This lower lock level allows reads and writes to continue against the table while the index is built.
Caveats:
Multiple concurrent index creations on a single table will not return from either CREATE INDEX CONCURRENTLY ... statement until the slowest one completes.
CREATE INDEX CONCURRENTLY ... may not be executed inside of a transaction but does maintain transactions internally. Holding a transaction open means that no autovacuum (against any table in the system) will be able to clean up dead tuples introduced after the index build begins until it finishes. If you have a table with a large volume of updates (particularly bad if it is a very small table) this could result in extremely sub-optimal query execution.
CREATE INDEX CONCURRENTLY ... must wait for all transactions using the table to complete before returning.
Drop index
The standard form of DROP INDEX ... acquires an ACCESS EXCLUSIVE lock against the table with the index while removing the index. For small indexes this may be a short operation. For large indexes, however, file system unlinking and disk flushing can take a significant amount of time. In contrast, the form DROP INDEX CONCURRENTLY ... acquires a SHARE UPDATE EXCLUSIVE lock to perform these operations allowing reads and writes to continue against the table while the index is dropped.
Caveats:
DROP INDEX CONCURRENTLY ... cannot be used to drop any index that supports a constraint (e.g., PRIMARY KEY or UNIQUE).
DROP INDEX CONCURRENTLY ... may not be executed inside of a transaction but does maintain transactions internally. Holding a transaction open means that no autovacuum (against any table in the system) will be able to clean up dead tuples introduced after the operation begins until it finishes. If you have a table with a large volume of updates (particularly bad if it is a very small table) this could result in extremely sub-optimal query execution.
DROP INDEX CONCURRENTLY ... must wait for all transactions using the table to complete before returning.
Note: DROP INDEX CONCURRENTLY ... was added in Postgres 9.2. If you're still running 9.1 or prior, you can achieve somewhat similar results by marking the index as invalid and not ready for writes, flushing buffers with the pgfincore extension, and then dropping the index.
Rename index
ALTER INDEX ... RENAME TO ... requires an ACCESS EXCLUSIVE lock on the index blocking reads from and writes to the underlying table. However a recent commit expected to be a part of Postgres 12 lowers that requirement to SHARE UPDATE EXCLUSIVE.
Reindex
REINDEX INDEX ... requires an ACCESS EXCLUSIVE lock on the index blocking reads from and writes to the underlying table. Instead we use the following procedure:
Create a new index concurrently that duplicates the existing index definition.
Drop the original index concurrently.
Rename the new index to match the original index’s name.
Note: If the index you need to rebuild backs a constraint, remember to re-add the constraint as well (subject to all of the caveats we’ve documented.)
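A sketch of that procedure with assumed table and index names:

CREATE INDEX CONCURRENTLY foos_bar_idx_new ON foos (bar);  -- same definition as the existing foos_bar_idx
DROP INDEX CONCURRENTLY foos_bar_idx;
ALTER INDEX foos_bar_idx_new RENAME TO foos_bar_idx;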
Constraints
NOT NULL Constraints
Removing an existing not-null constraint from a column requires an exclusive lock on the table while a simple catalog update is performed.
In contrast, adding a not-null constraint to an existing column requires an exclusive lock on the table while a full table scan verifies that no null values exist. Instead you should:
Add a CHECK constraint requiring the column be not-null with ALTER TABLE <table> ADD CONSTRAINT <name> CHECK (<column> IS NOT NULL) NOT VALID;. The NOT VALID tells Postgres that it doesn't need to scan the entire table to verify that all rows satisfy the condition.
Manually verify that all rows have non-null values in your column.
Validate the constraint with ALTER TABLE <table> VALIDATE CONSTRAINT <name>;. With this statement PostgreSQL will block acquisition of other EXCLUSIVE locks for the table, but will not block reads or writes.
Bonus: There is currently a patch in the works (and possibly it will make it into Postgres 12) that will allow you to create a NOT NULL constraint without a full table scan if a CHECK constraint (like we created above) already exists.
Foreign keys
ALTER TABLE ... ADD FOREIGN KEY requires a SHARE ROW EXCLUSIVE lock (as of 9.5) on both the altered and referenced tables. While this won't block SELECT queries, blocking row modification operations for a long period of time is equally unacceptable for our transaction processing applications.
To avoid that long-held lock you can use the following process:
ALTER TABLE ... ADD FOREIGN KEY ... NOT VALID: Adds the foreign key and begins enforcing the constraint for all new INSERT/UPDATE statements but does not validate that all existing rows conform to the new constraint. This operation still requires SHARE ROW EXCLUSIVE locks, but the locks are only briefly held.
ALTER TABLE ... VALIDATE CONSTRAINT <constraint>: This operation checks all existing rows to verify they conform to the specified constraint. Validation requires a SHARE UPDATE EXCLUSIVE lock, so it may run concurrently with row reading and modification queries.
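A sketch of the two-step pattern with assumed table and constraint names:

ALTER TABLE foos ADD CONSTRAINT foos_bar_id_fkey
    FOREIGN KEY (bar_id) REFERENCES bars (id) NOT VALID;
ALTER TABLE foos VALIDATE CONSTRAINT foos_bar_id_fkey;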
Check constraints
ALTER TABLE ... ADD CONSTRAINT ... CHECK (...) requires an ACCESS EXCLUSIVE lock. However, as with foreign keys, Postgres supports breaking the operation into two steps:
ALTER TABLE ... ADD CONSTRAINT ... CHECK (...) NOT VALID: Adds the check constraint and begins enforcing it for all new INSERT/UPDATE statements but does not validate that all existing rows conform to the new constraint. This operation still requires an ACCESS EXCLUSIVE lock.
ALTER TABLE ... VALIDATE CONSTRAINT <constraint>: This operation checks all existing rows to verify they conform to the specified constraint. Validation requires a SHARE UPDATE EXCLUSIVE lock on the altered table, so it may run concurrently with row reading and modification queries. A ROW SHARE lock is held on the referenced table, which will block any operations requiring exclusive locks while validating the constraint.
Uniqueness constraints
ALTER TABLE ... ADD CONSTRAINT ... UNIQUE (...) requires an ACCESS EXCLUSIVE lock. However, Postgres supports breaking the operation into two steps:
Create a unique index concurrently. This step will immediately enforce uniqueness, but if you need a declared constraint (or a primary key), then continue to add the constraint separately.
Add the constraint using the already existing index with ALTER TABLE ... ADD CONSTRAINT ... UNIQUE USING INDEX <index>. Adding the constraint still requires an ACCESS EXCLUSIVE lock, but the lock will only be held for fast catalog operations.
Note: If you specify PRIMARY KEY instead of UNIQUE then any nullable columns in the index will be made NOT NULL. This requires a full table scan which currently can't be avoided. See NOT NULL Constraints for more details.
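A sketch with assumed table, column, and constraint names:

CREATE UNIQUE INDEX CONCURRENTLY foos_bar_key ON foos (bar);
ALTER TABLE foos ADD CONSTRAINT foos_bar_key UNIQUE USING INDEX foos_bar_key;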
Exclusion constraints
ALTER TABLE ... ADD CONSTRAINT ... EXCLUDE USING ... requires an ACCESS EXCLUSIVE lock. Adding an exclusion constraint builds the supporting index, and, unfortunately, there is currently no support for using an existing index (as you can do with a unique constraint).
Enum Types
CREATE TYPE <name> AS ENUM (...) and DROP TYPE <name> (after verifying there are no existing usages in the database) can both be done safely without unexpected locking.
Modifying enum values
ALTER TYPE <enum> RENAME VALUE <old> TO <new> was added in Postgres 10. This statement does not require locking tables which use the enum type.
Deleting enum values
Enums are stored internally as integers, and there is no support for gaps in the valid range; removing a value would currently require shifting the remaining values and rewriting all rows using those values. PostgreSQL therefore does not currently support removing values from an existing enum type.
Announcing Pg_ha_migrations for Ruby on Rails
We’re also excited to announce that we have open-sourced our internal library pg_ha_migrations. This Ruby gem enforces DDL safety in projects using Ruby on Rails and/or ActiveRecord with an emphasis on explicitly choosing trade-offs and avoiding unnecessary magic (and the corresponding surprises). You can read more in the project’s README.
Footnotes
[1] You can find active long-running queries and the tables they lock with the following query:
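A sketch of such a query (the column choices and the duration threshold here are assumptions, not the exact query from the post):

SELECT a.pid,
       now() - a.query_start AS duration,
       a.query,
       c.relname AS locked_relation,
       l.mode
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
JOIN pg_class c ON c.oid = l.relation
WHERE a.state <> 'idle'
  AND now() - a.query_start > interval '5 seconds'
ORDER BY duration DESC;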
On 19th of January 2019, Tomas Vondra committed patch: Allow COPY FROM to filter data using WHERE conditions. Extends the COPY FROM command with a WHERE condition, which allows doing various types of filtering while importing the data (random sampling, condition on a data column, etc.). Until now such filtering required either preprocessing of …
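The new clause looks roughly like this (a sketch against PostgreSQL 12; the table, columns, and file path are made up for illustration):

COPY measurements (city, reading, taken_at)
    FROM '/tmp/measurements.csv' WITH (FORMAT csv)
    WHERE reading > 0;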
That will not last very long if you have a busy database doing many writes over the day. MVCC keeps the new and old versions of a row in the table, and the TXID increases with every transaction. At some point the 4 billion transaction mark is reached, the TXID overruns, and counting starts again at the beginning. Because of the way transactions work in PostgreSQL, suddenly all data in your database would become invisible. No one wants that!
To limit this problem, PostgreSQL has a number of mechanisms in place:
Old, deleted row versions are eventually removed by VACUUM (or autovacuum), and their XIDs are no longer used.
Old row versions which are still live are marked as "frozen" in the table and assigned a special XID; the previously used XID is no longer needed. The problem here is that every single table in every database must be vacuumed before the 2 billion threshold is reached.
PostgreSQL uses lazy XIDs, where a "real" transaction id is only assigned if the transaction changes something on disk; if a transaction is read-only and does not change anything, no transaction id is consumed.
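To keep an eye on how close each database is to that threshold, you can check the age of its oldest unfrozen XID; a minimal sketch:

SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;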
Today’s an exciting day in the Victoria office of Crunchy Data: our local staff count goes from one to two, as Martin Davis joins the company!
This is kind of a big deal, because this year Martin and I will be spending much of our time on the core computational geometry library that powers PostGIS, the GEOS library, and the JTS library from which it derives its structure.
Why is that a big deal? Because GEOS, JTS and other language ports provide the computational geometry algorithms underneath most of the open source geospatial ecosystem – so improvements in our core libraries ripple out to help a huge swathe of other software.
JTS came first, initially as a project of the British Columbia government. GEOS is a C++ port of JTS. There are also Javascript and .Net ports (JSTS and NTS).
Each of those libraries has developed a rich downline of other libraries and projects that depend on them. On the desktop, on the web, in the middleware, JTS and GEOS power all of it.
So we know that work on JTS and GEOS on our side is going to benefit far more than just PostGIS.
I’ve already spent a decent amount of time on bringing the GEOS library up to date with the changes in JTS over the past few months, and trying to fulfill the “maintainer” role, merging pull requests and closing some outstanding tickets.
As Martin starts adding to JTS, I now feel more confident in my ability to bring those changes into the C++ world of GEOS as they land.
Without pre-judging what will get first priority, topics of overlay robustness, predicate performance, and geometry cleaning are near the top of our list.
Our spatial customers at Crunchy process a lot of geometry, so ensuring that PostGIS (GEOS) operations are robust and high performance is a big win for PostgreSQL and for our customers as well.
Yesterday someone on irc asked: i've a query that returns sequential numbers with gaps (generate_series + join) and my question is: can i somehow construct ranges out of the returned values? sort of range_agg or something? There was no further discussion, aside from me saying sure you can. not trivial task, but possible. you'd need …
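One common way to do it, sketched here with made-up values (this is the usual gaps-and-islands trick, not necessarily the approach from the full post):

WITH vals (v) AS (
    VALUES (1), (2), (3), (7), (8), (12)
)
SELECT min(v) AS range_start, max(v) AS range_end
FROM (
    SELECT v, v - row_number() OVER (ORDER BY v) AS grp
    FROM vals
) s
GROUP BY grp
ORDER BY range_start;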
Most often, there’s a trade off involved in optimizing
software. The cost of better performance is the opportunity
cost of the time that it took to write the optimization,
and the additional cost of maintenance for code that
becomes more complex and more difficult to understand.
Many projects prioritize product development over improving
runtime speed. Time is spent building new things instead of
making existing things faster. Code is kept simpler and
easier to understand so that adding new features and fixing
bugs stays easy, even as particular people rotate in and
out and institutional knowledge is lost.
But that’s certainly not the case in all domains. Game code
is often an interesting read because it comes from an
industry where speed is a competitive advantage, and it’s
common practice to optimize liberally even at some cost to
modularity and maintainability. One technique for that is
to inline code in critical sections even to the point of
absurdity. CryEngine, open-sourced a few years ago, has a
few examples of this, with “tick” functions like this
one that are 800+ lines long with 14 levels of
indentation.
Another common place to find optimizations is in databases.
While games optimize because they have to, databases
optimize because they’re an example of software that’s
extremely leveraged – if there’s a way to make running
select queries or building indexes 10% faster, it’s not an
improvement that affects just a couple users, it’s one
that’ll potentially invigorate millions of installations
around the world. That’s enough of an advantage that the
enhancement is very often worth it, even if the price is a
challenging implementation or some additional code
complexity.
Postgres contains a wide breadth of optimizations, and
happily they’ve been written conscientiously so that the
source code stays readable. The one that we’ll look at
today is SortSupport, a technique for localizing the
information needed to compare data into places where it can
be accessed very quickly, thereby making sorting data much
faster. Sorting for types that have had SortSupport
implemented usually gets twice as fast or more, a speedup
that transfers directly into common database operations
like ORDER BY, DISTINCT, and CREATE INDEX.
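If you want to get a feel for the effect on your own
hardware, a rough experiment might look like this
(gen_random_uuid() is assumed to be available, either from
the pgcrypto extension or natively in PostgreSQL 13+, and
the timings you see will vary):

CREATE TABLE uuid_sort_test AS
    SELECT gen_random_uuid() AS id
    FROM generate_series(1, 1000000);
\timing on
CREATE INDEX ON uuid_sort_test (id);  -- the index build sorts the column and benefits from abbreviated keys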
While sorting, Postgres builds a series of tiny structures
that represent the data set being sorted. These tuples have
space for a value the size of a native pointer (i.e. 64
bits on a 64-bit machine) which is enough to fit the
entirety of some common types like booleans or integers
(known as pass-by-value types), but not for others that are
larger than 64 bits or arbitrarily large. In their case,
Postgres will follow a reference back to the heap when
comparing values (they’re appropriately called
pass-by-reference types). Postgres is very fast, so that
still happens quickly, but it’s slower than comparing
values readily available in memory.
An array of sort tuples.
SortSupport augments pass-by-reference types by bringing a
representative part of their value into the sort tuple to
save trips to the heap. Because sort tuples usually don’t
have the space to store the entirety of the value,
SortSupport generates a digest of the full value called an
abbreviated key, and stores it instead. The contents of
an abbreviated key vary by type, but they’ll aim to store
as much sorting-relevant information as possible while
remaining faithful to pre-existing sorting rules.
Abbreviated keys should never produce an incorrect
comparison, but it’s okay if they can’t fully resolve one.
If two abbreviated keys look equal, Postgres will fall back
to comparing their full heap values to make sure it gets
the right result (called an “authoritative comparison”).
A sort tuple with an abbreviated key and pointer to the heap.
Implementing an abbreviated key is straightforward in many
cases. UUIDs are a good example of that: at 128 bits long
they’re always larger than the pointer size even on a
64-bit machine, but we can get a very good proxy of their
full value just by sampling their first 64 bits (or 32 on a
32-bit machine). Especially for V4 UUIDs which are almost
entirely random 1, the first 64 bits will be enough to
definitively determine the order for all but unimaginably
large data sets. Indeed, the patch that brought in
SortSupport for UUIDs made sorting them about
twice as fast!
String-like types (e.g. text, varchar) aren’t too much
harder: just pack as many characters from the front of the
string in as possible (although made somewhat more
complicated by locales). Adding SortSupport for them made
operations like CREATE INDEX about three times
faster. My only ever patch to Postgres was
implementing SortSupport for the macaddr type, which was
fairly easy because although it’s pass-by-reference, its
values are only six bytes long 2. On a 64-bit machine we
have room for all six bytes, and on 32-bit we sample the
MAC address’ first four bytes.
Some abbreviated keys are more complex. The implementation
for the numeric type, which allows arbitrary scale and
precision, involves excess-K coding and breaking
available bits into multiple parts to store sort-relevant
fields.
Let’s try to get a basic idea of how SortSupport is
implemented by examining a narrow slice of source code.
Sorting in Postgres is extremely complex and involves
thousands of lines of code, so fair warning that I’m going
to simplify some things and skip a lot of others.
A good place to start is with Datum, the pointer-sized type
(32 or 64 bits, depending on the CPU) used for sort
comparisons. It stores entire values for pass-by-value
types, abbreviated keys for pass-by-reference types that
implement SortSupport, and a pointer for those that don’t.
You can find it defined in postgres.h:
/*
* A Datum contains either a value of a pass-by-value type or a pointer
* to a value of a pass-by-reference type. Therefore, we require:
*
* sizeof(Datum) == sizeof(void *) == 4 or 8
*/
typedef uintptr_t Datum;
#define SIZEOF_DATUM SIZEOF_VOID_P
The format of abbreviated keys for the uuid type is one
of the easiest to understand, so let’s look at that. In
Postgres, the struct pg_uuid_t defines how UUIDs are
physically stored in the heap (from uuid.h):
You might be used to seeing UUIDs represented in string
format like 123e4567-e89b-12d3-a456-426655440000, but
remember that this is Postgres which likes to be as
efficient as possible! A UUID contains 16 bytes worth of
information, so pg_uuid_t above defines an array of
exactly 16 bytes. No wastefulness to be found.
SortSupport implementations define a conversion routine
which takes the original value and produces a datum
containing an abbreviated key. Here’s the one for UUIDs
(from uuid.c):
static Datum
uuid_abbrev_convert(Datum original, SortSupport ssup)
{
pg_uuid_t *authoritative = DatumGetUUIDP(original);
Datum res;
memcpy(&res, authoritative->data, sizeof(Datum));
...
/*
* Byteswap on little-endian machines.
*
* This is needed so that uuid_cmp_abbrev() (an unsigned integer 3-way
* comparator) works correctly on all platforms. If we didn't do this,
* the comparator would have to call memcmp() with a pair of pointers to
* the first byte of each abbreviated key, which is slower.
*/
res = DatumBigEndianToNative(res);
return res;
}
memcpy (“memory copy”) extracts a datum worth of bytes
from a pg_uuid_t and places it into res. We can’t take
the whole UUID, but we’ll be taking its 4 or 8 most
significant bytes, which will be enough information for
most comparisons.
Abbreviated key formats for the `uuid` type.
The call DatumBigEndianToNative is there to help with an
optimization. When comparing our abbreviated keys, we could
do so with memcmp (“memory compare”) which would compare
each byte in the datum one at a time. That’s perfectly
functional of course, but because our datums are the same
size as native integers, we can instead choose to take
advantage of the fact that CPUs are optimized to compare
integers really, really quickly, and arrange the datums in
memory as if they were integers. You can see this integer
comparison taking place in the UUID abbreviated key
comparison function:
static int
uuid_cmp_abbrev(Datum x, Datum y, SortSupport ssup)
{
if (x > y)
return 1;
else if (x == y)
return 0;
else
return -1;
}
However, pretending that some consecutive bytes in memory
are integers introduces some complication. Integers might
be stored like data in pg_uuid_t with the most
significant byte first, but that depends on the
architecture of the CPU. We call architectures that store
numerical values this way big-endian. Big-endian
machines exist, but the chances are that the CPU you’re
using to read this article stores bytes in the reverse
order of their significance, with the most significant at
the highest address. This layout is called
little-endian, and is in use by Intel’s X86, as well as
being the default mode for ARM chips like the ones in
Android and iOS devices.
If we left the big-endian result of the memcpy unchanged
on little-endian systems, the resulting integer would be
wrong. The answer is to byteswap, which reverses the order
of the bytes, and corrects the integer.
Example placement of integer bytes on little and big endian architectures.
You can see in pg_bswap.h that
DatumBigEndianToNative is defined as a no-op on a
big-endian machine, and is otherwise connected to a
byteswap (“bswap”) routine of the appropriate size:
Let’s touch upon one more feature of uuid_abbrev_convert.
In data sets with very low cardinality (i.e., many
duplicated items) SortSupport introduces some danger of
worsening performance. With so many duplicates, the
contents of abbreviated keys would often show equality, in
which case Postgres would often have to fall back to the
authoritative comparator. In effect, by adding SortSupport
we would have added a useless additional comparison that
wasn’t there before.
To protect against performance regression, SortSupport has
a mechanism for aborting abbreviated key conversion. If the
data set is found to be below a certain cardinality
threshold, Postgres stops abbreviating, reverts any keys
that were already abbreviated, and disables further
abbreviation for the sort.
Cardinality is estimated with the help of
HyperLogLog, an algorithm that estimates the
distinct count of a data set in a very memory-efficient
way. Here you can see the conversion routine adding new
values to the HyperLogLog if an abort is still possible:
And where it makes an abort decision (from uuid.c):
static bool
uuid_abbrev_abort(int memtupcount, SortSupport ssup)
{
...
abbr_card = estimateHyperLogLog(&uss->abbr_card);
/*
* If we have >100k distinct values, then even if we were
* sorting many billion rows we'd likely still break even,
* and the penalty of undoing that many rows of abbrevs would
* probably not be worth it. Stop even counting at that point.
*/
if (abbr_card > 100000.0)
{
uss->estimating = false;
return false;
}
/*
* Target minimum cardinality is 1 per ~2k of non-null inputs.
* 0.5 row fudge factor allows us to abort earlier on genuinely
* pathological data where we've had exactly one abbreviated
* value in the first 2k (non-null) rows.
*/
if (abbr_card < uss->input_count / 2000.0 + 0.5)
{
return true;
}
...
}
The abort logic also covers the case where we have a data set
that’s poorly suited to the abbreviated key format. For
example, imagine a million UUIDs that all shared a common
prefix in their first eight bytes, but were distinct in
their last eight 3. Realistically this will be extremely
unusual, so abbreviated key conversion will rarely abort.
Sort tuples are the tiny structures that Postgres sorts
in memory. They hold a reference to the “true” tuple, a
datum, and a flag to indicate whether or not the first
value is NULL (which has its own special sorting
semantics). The latter two are named with a 1 suffix as
datum1 and isnull1 because they represent only one
field worth of information. Postgres will need to fall back
to different values in the event of equality in a
multi-column comparison. From tuplesort.c:
/*
* The objects we actually sort are SortTuple structs. These contain
* a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
* which is a separate palloc chunk --- we assume it is just one chunk and
* can be freed by a simple pfree() (except during merge, when we use a
* simple slab allocator). SortTuples also contain the tuple's first key
* column in Datum/nullflag format, and an index integer.
*/
typedef struct
{
void *tuple; /* the tuple itself */
Datum datum1; /* value of first key column */
bool isnull1; /* is first key column NULL? */
int tupindex; /* see notes above */
} SortTuple;
In the code we’ll look at below, SortTuple may reference
a heap tuple, which has a variety of different struct
representations. One used by the sort algorithm is
HeapTupleHeaderData (from htup_details.h):
struct HeapTupleHeaderData
{
union
{
HeapTupleFields t_heap;
DatumTupleFields t_datum;
} t_choice;
ItemPointerData t_ctid; /* current TID of this or newer tuple (or a
* speculative insertion token) */
...
}
Heap tuples have a pretty complex structure which we won’t
cover, but you can see that it contains an
ItemPointerData value. This struct is what gives Postgres
the precise information it needs to find data in the heap
(from itemptr.h):
/*
* ItemPointer:
*
* This is a pointer to an item within a disk page of a known file
* (for example, a cross-link from an index to its parent table).
* blkid tells us which block, posid tells us which entry in the linp
* (ItemIdData) array we want.
*/
typedef struct ItemPointerData
{
BlockIdData ip_blkid;
OffsetNumber ip_posid;
}
The algorithm to compare abbreviated keys is duplicated in
the Postgres source in a number of places depending on the
sort operation being carried out. We’ll take a look at
comparetup_heap (from tuplesort.c) which
is used when sorting based on the heap. This would be
invoked for example if you ran an ORDER BY on a field
that doesn’t have an index on it.
static int
comparetup_heap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
{
SortSupport sortKey = state->sortKeys;
HeapTupleData ltup;
HeapTupleData rtup;
TupleDesc tupDesc;
int nkey;
int32 compare;
AttrNumber attno;
Datum datum1,
datum2;
bool isnull1,
isnull2;
/* Compare the leading sort key */
compare = ApplySortComparator(a->datum1, a->isnull1,
b->datum1, b->isnull1,
sortKey);
if (compare != 0)
return compare;
ApplySortComparator gets a comparison result between two
datum values. It’ll compare two abbreviated keys where
appropriate and handles NULL sorting semantics. The
return value of a comparison follows the spirit of C’s
strcmp: when comparing (a, b), -1 indicates a < b,
0 indicates equality, and 1 indicates a > b.
The algorithm returns immediately if inequality (!= 0)
was detected. Otherwise, it checks to see if abbreviated
keys were used, and if so applies the authoritative
comparison. Because space in abbreviated keys
is limited, two being equal doesn’t necessarily indicate
that the values that they represent are.
After finding abbreviated keys to be equal, full values to
be equal, and all additional sort fields to be equal, the
last step is to return 0, indicating in classic libc
style that the two tuples are really, fully equal.
SortSupport is a good example of the type of low-level
optimization that most of us probably wouldn’t bother with
in our projects, but which makes sense in an extremely
leveraged system like a database. As implementations are
added for it and Postgres’ tens of thousands of users like
myself upgrade, common operations like DISTINCT, ORDER
BY, and CREATE INDEX get twice as fast, for free.
Credit to Peter Geoghegan for some of the original
exploration of this idea and implementations for UUID and a
generalized system for SortSupport on variable-length
string types, Robert Haas and Tom Lane for adding the
necessary infrastructure, and Andrew
Gierth for a difficult implementation for
numeric. (I hope I got all that right.)
1 A note for the pedantic that V4 UUIDs usually have only
122 bits of randomness as four bits are used for the
version and two for the variant.
2 The new type macaddr8 was later introduced to handle
EUI-64 MAC addresses, which are 64 bits long.
3 A data set of UUIDs with common datum-sized prefixes is
a pretty unlikely scenario, but it’s a little more
realistic for variable-length string types, where users
are storing much more free-form data.
If you have looked at Postgres object permissions in the past, I bet you were confused. I get confused, and I have been at this for a long time.
The way permissions are stored in Postgres is patterned after the long directory listing of Unix-like operating systems, e.g., ls -l. Just like directory listings,
the Postgres system stores permissions using single-letter indicators. r is used for read (SELECT) in both systems, while w is used for write permission in
ls, and UPDATE in Postgres. The other letters used by Postgres don't correspond to any directory listing permission letters, e.g., d is
DELETE permission. The full list of Postgres permission letters is in the GRANT documentation
page; the other letters are:
a -- INSERT
D -- TRUNCATE
x -- REFERENCES
t -- TRIGGER
X -- EXECUTE
U -- USAGE
C -- CREATE
c -- CONNECT
T -- TEMPORARY
We’ve seen a lot of questions regarding the options available in PostgreSQL for rebuilding a table online. We created this blog post to explain the pg_repack extension, available in PostgreSQL for this requirement. pg_repack is a well-known extension that was created and is maintained as an open source project by several authors.
There are three main reasons why you need to use pg_repack in a PostgreSQL server:
Reclaim free space from a table back to disk after deleting a huge chunk of records.
Rebuild a table to re-order the records and shrink/pack them into a smaller number of pages. This may let a query fetch just one page (or fewer than n pages) instead of n pages from disk. In other words, less IO and more performance.
Reclaim free space from a table that has grown in size with a lot of bloat due to improper autovacuum settings.
You might have already read our previous articles that explained what bloat is, and discussed the internals of autovacuum. After reading these articles, you can see there is an autovacuum background process that removes dead tuples from a table and allows the space to be re-used by future updates/inserts on that table. Over a period of time, tables that receive the most updates or deletes may accumulate a lot of bloated space due to poorly tuned autovacuum settings. This leads to slow-performing queries on these tables. Rebuilding the table is the best way to avoid this.
Why is just autovacuum not enough for tables with bloat?
We have discussed several parameters that change the behavior of an autovacuum process in this blog post. There cannot be more than autovacuum_max_workers autovacuum processes running in a database cluster at a time. At the same time, due to untuned autovacuum settings and no manual vacuuming of the database as a weekly or monthly job, many tables can be skipped by autovacuum. We have discussed in this post that the default autovacuum settings run autovacuum on a table with ten records more often than on a table with a million records. So, it is very important to tune your autovacuum settings, set table-level customized autovacuum parameters, and enable automated jobs to identify tables with huge bloat and run manual vacuum on them as scheduled jobs during low peak times (after thorough testing).
VACUUM FULL
VACUUM FULL is the default option available with a PostgreSQL installation that allows us to rebuild a table. This is similar to ALTER TABLE in MySQL. However, this command acquires an exclusive lock and blocks both reads and writes on the table.
VACUUM FULL tablename;
pg_repack
pg_repack is an extension available for PostgreSQL that helps us rebuild a table online. This is similar to pt-online-schema-change for online table rebuild/reorg in MySQL. However, pg_repack works only for tables with a primary key or a NOT NULL unique key.
In RHEL/CentOS from PGDG repo
# yum install pg_repack11 (This works for PostgreSQL 11)
Similarly, for PostgreSQL 10,
# yum install pg_repack10
In Debian/Ubuntu from PGDG repo
Add certificates, repo and install pg_repack:
The following certificate may change. Please validate it before you perform these steps.
# sudo apt-get install wget ca-certificates
# wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
# sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
# sudo apt-get update
# apt-get install postgresql-server-dev-11
# apt-get install postgresql-11-repack
Loading and creating pg_repack extension
Step 1: You need to add pg_repack to shared_preload_libraries. For that, just set this parameter in the postgresql.conf or postgresql.auto.conf file.
shared_preload_libraries = 'pg_repack'
Setting this parameter requires a restart.
$ pg_ctl -D $PGDATA restart -mf
Step 2: In order to start using pg_repack, you must create this extension in each database where you wish to run it:
$ psql
\c percona
CREATE EXTENSION pg_repack;
Using pg_repack to Rebuild Tables Online
Similar to pt-online-schema-change, you can use the --dry-run option to see if a table can be rebuilt using pg_repack. When you rebuild a table using pg_repack, all its associated indexes are rebuilt automatically. You can also use -t instead of --table as an argument to rebuild a specific table.
This is the success message you see when a table satisfies the requirements for pg_repack:
$ pg_repack --dry-run -d percona --table scott.employee
INFO: Dry run enabled, not executing repack
INFO: repacking table "scott.employee"
And this is the error message when a table does not satisfy the requirements for pg_repack:
$ pg_repack --dry-run -d percona --table scott.sales
INFO: Dry run enabled, not executing repack
WARNING: relation "scott.sales" must have a primary key or not-null unique keys
Now, to execute the rebuild of the table scott.employee online, you can use the following command. It is just the previous command without the --dry-run option:
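$ pg_repack -d percona --table scott.employee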
Over the years many people have asked for “timetravel” or “AS OF” queries in PostgreSQL. Oracle has provided this kind of functionality for quite some time already. However, in the PostgreSQL world “AS OF timestamp” is not directly available. The question now is: how can we implement this vital functionality in user land and mimic Oracle’s behavior?
Implementing “AS OF” and timetravel in user land
Let us suppose we want to version a simple table consisting of just three columns: id, some_data1 and some_data2. To do this we first have to install the btree_gist module, which adds some valuable operators we will need to manage time travel. The table storing the data will need an additional column to handle the validity of a row. Fortunately PostgreSQL supports “range types”, which allow us to store ranges in an easy and efficient way. Here is how it works:
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE t_object
(
id int8,
valid tstzrange,
some_data1 text,
some_data2 text,
EXCLUDE USING gist (id WITH =, valid WITH &&)
);
Mind the last line here: “EXCLUDE USING gist” will ensure that if the “id” is identical, the period (“valid”) must not overlap. The idea is to ensure that the same “id” only has one entry at a time. PostgreSQL will automatically create a GiST index on that column. The feature is called an “exclusion constraint”. If you are looking for more information about this feature, consider checking out the official documentation (https://www.postgresql.org/docs/current/ddl-constraints.html).
If you want to filter on some_data1 and some_data2 consider creating indexes. Remember, missing indexes are in many cases the root cause of bad performance:
CREATE INDEX idx_some_index1 ON t_object (some_data1);
CREATE INDEX idx_some_index2 ON t_object (some_data2);
By creating a view, it should be super easy to extract data from the underlying tables:
CREATE VIEW t_object_recent AS
SELECT id, some_data1, some_data2
FROM t_object
WHERE current_timestamp <@ valid;
SELECT * FROM t_object_recent;
For the sake of simplicity I have created a view, which returns the most up to date state of the data. However, it should also be possible to select an old version of the data. To make it easy for application developers I decided to introduce a new GUC (= runtime variable), which allows users to set the desired point in time. Here is how it works:
SET timerobot.as_of_time = '2018-01-10 00:00:00';
Then you can create a second view, which returns the old data:
CREATE VIEW t_object_historic AS
SELECT id, some_data1, some_data2
FROM t_object
WHERE current_setting('timerobot.as_of_time')::timestamptz <@ valid;
SELECT * FROM t_object_historic;
It is of course also possible to do that with just one view. However, the code is easier to read if two views are used (for the purpose of this blog post). Feel free to adjust the code to your needs.
If you are running an application you usually don’t care what is going on behind the scenes – you simply want to modify a table and things should take care of themselves in an easy way. Therefore it makes sense to add a trigger to the t_object_recent view, which takes care of versioning. Here is an example:
CREATE FUNCTION version_trigger() RETURNS trigger AS
$$
BEGIN
IF TG_OP = 'UPDATE'
THEN
IF NEW.id <> OLD.id
THEN
RAISE EXCEPTION 'the ID must not be changed';
END IF;
UPDATE t_object
SET valid = tstzrange(lower(valid), current_timestamp)
WHERE id = NEW.id
AND current_timestamp <@ valid;
IF NOT FOUND THEN
RETURN NULL;
END IF;
END IF;
IF TG_OP IN ('INSERT', 'UPDATE')
THEN
INSERT INTO t_object (id, valid, some_data1, some_data2)
VALUES (NEW.id,
tstzrange(current_timestamp, TIMESTAMPTZ 'infinity'),
NEW.some_data1,
NEW.some_data2);
RETURN NEW;
END IF;
IF TG_OP = 'DELETE'
THEN
UPDATE t_object
SET valid = tstzrange(lower(valid), current_timestamp)
WHERE id = OLD.id
AND current_timestamp <@ valid;
IF FOUND THEN
RETURN OLD;
ELSE
RETURN NULL;
END IF;
END IF;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER object_trig
INSTEAD OF INSERT OR UPDATE OR DELETE
ON t_object_recent
FOR EACH ROW
EXECUTE PROCEDURE version_trigger();
The trigger ensures that INSERT, UPDATE, and DELETE are all handled properly.
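A quick usage sketch (the values and the timestamp are made up; all data modification goes through the t_object_recent view so that the trigger can maintain the history):

INSERT INTO t_object_recent VALUES (1, 'foo', 'bar');
UPDATE t_object_recent SET some_data1 = 'baz' WHERE id = 1;
-- look at the data as it was at an earlier point in time
SET timerobot.as_of_time = '2019-01-20 00:00:00';
SELECT * FROM t_object_historic WHERE id = 1;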
Finally: Timetravel made easy
It is obvious that versioning does have an impact on performance. You should also keep in mind that UPDATE and DELETE are more expensive than before. However, the advantage is that things are really easy from an application point of view. Implementing time travel can be done quite generically and most applications might not have to be changed at all. What is true, however, is that foreign keys will need some special attention and might not be easy to implement in general. It depends on your application whether this kind of restriction is a problem or not.
This latest release provides further feature enhancements designed to support users intending to deploy large-scale PostgreSQL clusters on Kubernetes, with enterprise high-availability and disaster recovery requirements.
When combined with the Crunchy PostgreSQL Container Suite, the PostgreSQL Operator provides an open source, Kubernetes-native PostgreSQL-as-a-Service capability.
Read on to see what is new in PostgreSQL Operator 3.5.
PostgreSQL supports SSL, and SSL private keys can be protected by a passphrase. Many people choose not to use passphrases with their SSL keys, and that’s perhaps fine. This blog post is about what happens when you do have a passphrase.
If you have SSL enabled and a key with a passphrase and you start the server, the server will stop to ask for the passphrase. This happens automatically from within the OpenSSL library. Stopping to ask for a passphrase obviously prevents automatic starts, restarts, and reboots, but we’re assuming here that you have made that tradeoff consciously.
When you run PostgreSQL under systemd, which is very common nowadays, there is an additional problem. Under systemd, the server process does not have terminal access, and so it cannot ask for any passphrases. By default, the startup will fail in such setups.
As of PostgreSQL 11, it is possible to configure an external program to obtain the SSL passphrase, using the configuration setting ssl_passphrase_command. As a simple example, you can set
ssl_passphrase_command = 'echo "secret"'
and it will apply the passphrase that the program prints out. You can use this to fetch the passphrase from a file or other secret store, for example.
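For instance, something along these lines would read the passphrase from a file (the path is just a placeholder; the file has to be readable by the operating system user running the server):
ssl_passphrase_command = 'cat /var/lib/postgresql/ssl-passphrase.txt'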
But what if you still want the security of having to manually enter the passphrase? Systemd has a facility that lets services prompt for passwords, and you can use it here as well.
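One way to do that, sketched here with an assumed prompt text, is to point ssl_passphrase_command at systemd-ask-password:
ssl_passphrase_command = 'systemd-ask-password "Enter the SSL key passphrase for PostgreSQL:"'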
Except that that doesn’t actually work, because non-root processes are not permitted to use the systemd password system; see this bug. But there are workarounds.
A more evil workaround (discussed in the above-mentioned bug report) is to override the permissions on the socket file underlying this mechanism. Add this to the postgresql service unit:
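A sketch of such an override could look roughly like this (the setfacl targets and the /run/systemd/ask-password path are assumptions for illustration, not the exact original unit file):
[Service]
ExecStartPre=+/usr/bin/setfacl -m u:postgres:rwx /run/systemd/ask-password
ExecStartPost=+/usr/bin/setfacl -x u:postgres /run/systemd/ask-password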
This enables access to the socket before the service starts and then removes it again. Note that the + in the Exec lines, which runs those lines as root, requires at least systemd version 231. So, for example, CentOS 7 won’t support it. Maybe don’t do this.
Anyway, if you have this set up and run the usual sudo systemctl start postgresql, you should see
Enter PEM pass phrase:
Then you can enter the passphrase and the service should then start normally.
Thanks to a comment by Kaarel on my previous blog post, the situation around simply displaying the Postgres permission letters is not quite as dire as I showed. There is a function, aclexplode(), which expands the access control list (ACL) syntax used by Postgres into a table with full-text descriptions. This function exists in all supported versions of Postgres. However, it was only recently documented in this commit based on this email thread, and will appear in the Postgres 12 documentation.
Since aclexplode() exists (undocumented) in all supported versions of Postgres, it can be used to provide more verbose output of the pg_class.relacl permission letters. Here it is used with the test table created in the previous blog entry:
SELECT relacl
FROM pg_class
WHERE relname = 'test';
relacl
--------------------------------------------------------
{postgres=arwdDxt/postgres,bob=r/postgres,=r/postgres}
SELECT a.*
FROM pg_class, aclexplode(relacl) AS a
WHERE relname = 'test'
ORDER BY 1, 2;
grantor | grantee | privilege_type | is_grantable
---------+---------+----------------+--------------
10 | 0 | SELECT | f
10 | 10 | SELECT | f
10 | 10 | UPDATE | f
10 | 10 | DELETE | f
10 | 10 | INSERT | f
10 | 10 | REFERENCES | f
10 | 10 | TRIGGER | f
10 | 10 | TRUNCATE | f
10 | 16388 | SELECT | f
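The grantor and grantee columns contain role OIDs (grantee 0 stands for PUBLIC). As a small addition that is not part of the original output, they can be resolved to role names, for example with pg_get_userbyid():
SELECT pg_get_userbyid(a.grantor) AS grantor,
       CASE WHEN a.grantee = 0 THEN 'PUBLIC'
            ELSE pg_get_userbyid(a.grantee) END AS grantee,
       a.privilege_type,
       a.is_grantable
FROM pg_class, aclexplode(relacl) AS a
WHERE relname = 'test'
ORDER BY 1, 2;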
2019 February 21 Meeting (Note: Back to third Thursday this month!)
Location:
PSU Business Accelerator
2828 SW Corbett Ave · Portland, OR
Parking is open after 5pm.
Speaker: Paul Jungwirth
Temporal databases let you record history: either a history of the database (what the table used to say), a history of the thing itself (what it used to be), or both at once. The theory of temporal databases goes back to the 90s, but standardization has only just begun with some modest recommendations in SQL:2011, and database products (including Postgres) are still missing major functionality.
This talk will cover how temporal tables are structured, how they are queried and updated, what SQL:2011 offers (and doesn’t), what functionality Postgres has already, and what remains to be built.
Paul started programming on a Tandy 1000 at age 8 and hasn’t been able to stop since. He helped build one of the Mac’s first web servers in 1994 and has founded software companies in politics and technical hiring. He works as an independent consultant specializing in Rails, Postgres, and Chef.
My colleague Payal came across an outage that happened to Mailchimp's Mandrill app yesterday; the link can be found HERE. Since this was PostgreSQL related, I wanted to post about the technical aspect of it.
According to the company: “Mandrill uses a sharded Postgres setup as one of our main datastores,” the email explains. “On Sunday, February 3, at 10:30pm EST, 1 of our 5 physical Postgres instances saw a significant spike in writes.” The email continues: “The spike in writes triggered a Transaction ID Wraparound issue. When this occurs, database activity is completely halted. The database sets itself in read-only mode until offline maintenance (known as vacuuming) can occur.”
So, let's see what that "transaction ID wraparound issue" is and how someone could prevent similar outages from ever happening. PostgreSQL uses MVCC to control transaction visibility, basically by comparing transaction IDs (XIDs): a row with an insert XID greater than the current transaction's XID should not be visible to the current transaction. But since transaction IDs are not unlimited, a cluster will eventually run out of them after about 2^32 (4+ billion) transactions, causing transaction ID wraparound: the transaction counter wraps around to zero, and all past transactions would suddenly appear to be in the future.
This is taken care of by vacuum, which marks rows as frozen, indicating that they were inserted by a transaction that committed far enough in the past to be visible to all current and future transactions. To control this behavior, Postgres has a configuration parameter called autovacuum_freeze_max_age, which defaults to 200,000,000 transactions, a very conservative default that must be tuned in larger production systems.
It sounds complicated, but it is relatively easy not to get to that point; for most people, just having autovacuum on will prevent this situation from ever happening. You can also schedule manual vacuums by getting a list of the tables "closest" to autovacuum_freeze_max_age with a simple query like this:
SELECT 'vacuum analyze ' || c.oid::regclass || ' /* manual_vacuum */ ;'
FROM pg_class c
     LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind = 'r'
ORDER BY greatest(age(c.relfrozenxid), age(t.relfrozenxid)) DESC
LIMIT 10;
Even if you want to avoid manual vacuums, you can simply create a report or a monitoring metric based on the age of relfrozenxid in pg_class, combined with pg_settings, e.g.:
SELECT oid::regclass::text AS table,
       pg_size_pretty(pg_total_relation_size(oid)) AS table_size,
       age(relfrozenxid) AS xid_age,
       (SELECT setting::int FROM pg_settings WHERE name = 'autovacuum_freeze_max_age')
           - age(relfrozenxid) AS tx_for_wraparound_vacuum
FROM pg_class
WHERE relfrozenxid != 0
  AND oid > 16384
ORDER BY tx_for_wraparound_vacuum;
But let's assume you got to the point where you started seeing "autovacuum: VACUUM tablename (to prevent wraparound)". This vacuum will most likely kick in when you don't want it to, and even if you kill it, it will keep respawning. If you have already set autovacuum_freeze_max_age to a more viable production setting (we usually set it at 1.5 billion), you can temporarily raise autovacuum_freeze_max_age even further (its upper limit is 2 billion) and immediately kick off a manual vacuum on the table. This vacuum will be a "normal" vacuum, hence live. If the table is so big that each manual vacuum takes days to complete, then you should have partitioned it... Proper schema design and proper tuning of autovacuum are really important, especially in write-heavy workloads.
And to get back to the outage: the email from the company insinuates that it was a problem with Postgres, but in reality it wasn't; it was clearly an ops oversight.
Thanks for reading
Vasilis Ventirozos
credativ LLC
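If you do end up having to run such a vacuum manually, the command itself is simple (the table name is hypothetical); the FREEZE option makes vacuum aggressively freeze old rows while the table stays fully available:
VACUUM (FREEZE, VERBOSE) my_big_table;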
One frequent complaint about connection poolers is the limited number of authentication methods they
support. While some of this is caused by the large amount of work required to support all 14 Postgres authentication methods, the bigger reason is that only a few
authentication methods allow for the clean passing of authentication credentials through an intermediate server.
Specifically, all the password-based authentication methods (scram-sha-256, md5, password) can easily pass credentials from the client through the pooler to the
database server. (This is not possible using SCRAM with channel binding.)
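As an illustration that is not from the original post, md5 pass-through with PgBouncer (one common external pooler) can be configured roughly like this; the file paths and the hash are placeholders:
; /etc/pgbouncer/pgbouncer.ini
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb
[pgbouncer]
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; /etc/pgbouncer/userlist.txt contains one line per user, e.g.
; "bob" "md5<md5 hash of password concatenated with username>"
Because the pooler stores the md5 hash, it can both verify the client and answer the database server's md5 challenge without ever seeing the password in clear text.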
Many of the other authentication methods, e.g. cert, are designed to prevent man-in-the-middle attacks and
therefore actively thwart passing through of credentials. For these, effectively, you have to set up two sets of credentials for each user — one for client to pooler,
and another from pooler to database server, and keep them synchronized.
A pooler built into Postgres would have fewer authentication pass-through problems, though internal poolers have some downsides too, as I already stated.