commit: c5660e0aa52d5df27accd8e5e97295cf0e64f7d4
author: Michael Paquier <michael@paquier.xyz>
date: Fri, 18 Jan 2019 09:21:44 +0900
Restrict the use of temporary namespace in two-phase transactions
Attempting to use a temporary table within a two-phase transaction is
forbidden for ages. However, there have been uncovered grounds for
a couple of other object types and commands which work on temporary
objects with two-phase commit. In short, trying to create, lock or drop
an object on a temporary schema should not be authorized within a
two-phase transaction, as it would cause its state to create
dependencies with other sessions, causing all sorts of side effects with
the existing session or other sessions spawned later on trying to use
the same temporary schema name.
Regression tests are added to cover all the grounds found, the original
report mentioned function creation, but monitoring closer there are many
other patterns with LOCK, DROP or CREATE EXTENSION which are involved.
One of the symptoms resulting in combining both is that the session
which used the temporary schema is not able to shut down completely,
waiting for being able to drop the temporary schema, something that it
cannot complete because of the two-phase transaction involved with
temporary objects. In this case the client is able to disconnect but
the session remains alive on the backend-side, potentially blocking
connection backend slots from being used. Other problems reported could
also involve server crashes.
This is back-patched down to v10, which is where 9b013dc has introduced
MyXactFlags, something that this patch relies on.
Reported-by: Alexey Bashtanov
Author: Michael Paquier
Reviewed-by: Masahiko Sawada
Discussion: https://postgr.es/m/5d910e2e-0db8-ec06-dd5f-baec420513c3@imap.cc
Backpatch-through: 10
In PostgreSQL, temporary objects are assigned to a temporary namespace which
gets cleaned up automatically when the session ends, consistently taking care
of any objects which are session-dependent. This can include any type of
object which can be schema-qualified: tables, functions, operators, or even
extensions (when linked with a temporary schema). The schema name is chosen
based on the position of the session in a backend array, prefixed with
“pg_temp_”, hence it is perfectly possible to end up with a different
temporary namespace name when reconnecting a session. There are a couple of
functions which can be used to check the status of this schema:
pg_my_temp_schema, to get the OID of the temporary schema in use,
useful when cast with “::regnamespace”.
pg_is_other_temp_schema, to check if a schema is from the existing
session or not.
To a certain degree, current_schema and current_schemas are also useful,
as they display respectively the current schema in use and the schemas
in “search_path”. Note that it is possible to include “pg_temp” directly
in “search_path” as an alias of the temporary schema, and that those
functions will return the effective temporary schema name. For example:
=# SET search_path = 'pg_temp';
SET
=# SELECT current_schema();
 current_schema
----------------
 pg_temp_3
(1 row)
=# SELECT pg_my_temp_schema()::regnamespace;
 pg_my_temp_schema
-------------------
 pg_temp_3
(1 row)
=# SELECT pg_is_other_temp_schema(pg_my_temp_schema());
 pg_is_other_temp_schema
-------------------------
 f
(1 row)
One thing to note in this particular case is that current_schema() may
end up creating a temporary schema, as it needs to return the real
temporary namespace associated with a session, and not an alias like
“pg_temp”, since in some cases the alias does not work with certain
commands. One example of that is CREATE EXTENSION when told to create its
objects in the session’s temporary schema (note that ALTER EXTENSION
cannot move an extension’s contents from a persistent schema to a temporary
one).
Another thing, essential to understand, is that all those temporary objects
are linked to a given session, but two-phase commit is not. Hence, it is
perfectly possible to run PREPARE TRANSACTION in one session, and COMMIT
PREPARED in a second session. The problem discussed in the thread mentioned
above is that one could associate temporary objects with a
two-phase transaction, which is logically incorrect. An effect of doing so
is that the drop of the temporary schema at the end of a session would block
until the two-phase transaction is committed, keeping a backend
slot occupied and potentially messing up upcoming sessions trying
to use the same temporary schema. If this effect accumulates and many
two-phase transactions are left uncommitted, this could bloat the shared
memory areas reserved for connections, preventing future connections.
Multiple object types may be involved, and there are other patterns like
LOCK on a temporary table within a transaction running two-phase commit,
or just the drop of a temporary object. One visible effect is for example a
session waiting for a lock to be released while the client thinks that
the session has actually finished, which can be reproduced with just
this:
=# CREATE TEMP TABLE temp_tab (a int);
CREATE TABLE
=# BEGIN;
BEGIN
=# LOCK temp_tab IN ACCESS EXCLUSIVE MODE;
LOCK TABLE
=# PREPARE TRANSACTION '2pc_lock_temp';
PREPARE TRANSACTION
-- Leave the session
=# \q
When patched, PREPARE TRANSACTION would just throw an error instead.
=# PREPARE TRANSACTION '2pc_lock_temp';
ERROR: 0A000: cannot PREPARE a transaction that has operated on temporary objects
LOCATION: PrepareTransaction, xact.c:2284
The fix here further restricts two-phase transactions that involve temporary
objects. For many years the restriction covered only the use of temporary
tables within such transactions, so this tightens the remaining corner
cases found.
Note that this fix only found its way down to Postgres 10, as it relies
on a session-level variable called MyXactFlags, which can be used in a
transaction to mark certain events. In the case of two-phase commit, the
flag is used to raise a proper error at PREPARE TRANSACTION time, so that
the state of the transaction does not mess up with the temporary namespace
and there are no after-effects for the existing session or a future session
trying to use the same temporary namespace. It could be possible to lower
the restriction, particularly for temporary tables which use ON COMMIT
DROP, but that would be rather tricky to achieve, as it would need
special handling of temporary objects which now happens at COMMIT PREPARED
phase.
Braintree Payments uses PostgreSQL as its primary datastore. We rely heavily on the data safety and consistency guarantees a traditional relational database offers us, but these guarantees come with certain operational difficulties. To make things even more interesting, we allow zero scheduled functional downtime for our main payments processing services.
Several years ago we published a blog post detailing some of the things we had learned about how to safely run DDL (data definition language) operations without interrupting our production API traffic.
Since that time PostgreSQL has gone through quite a few major upgrade cycles — several of which have added improved support for concurrent DDL. We’ve also further refined our processes. Given how much has changed, we figured it was time for a blog post redux.
For all code and database changes, we require that:
Live code and schemas be forward-compatible with updated code and schemas: this allows us to roll out deploys gradually across a fleet of application servers and database clusters.
New code and schemas be backward-compatible with live code and schemas: this allows us to roll back any change to the previous version in the event of unexpected errors.
For all DDL operations we require that:
Any exclusive locks acquired on tables or indexes be held for at most ~2 seconds.
Rollback strategies do not involve reverting the database schema to its previous version.
Transactionality
PostgreSQL supports transactional DDL. In most cases, you can execute multiple DDL statements inside an explicit database transaction and take an “all or nothing” approach to a set of changes. However, running multiple DDL statements inside a transaction has one serious downside: if you alter multiple objects, you’ll need to acquire exclusive locks on all of those objects in a single transaction. Because locking multiple tables creates the possibility of deadlock and increases exposure to long waits, we do not combine multiple DDL statements into a single transaction. PostgreSQL will still execute each separate DDL statement transactionally; each statement will be either cleanly applied or fail, and the transaction rolled back.
Note: Concurrent index creation is a special case. Postgres disallows executing CREATE INDEX CONCURRENTLY inside an explicit transaction; instead Postgres itself manages the transactions. If for some reason the index build fails before completion, you may need to drop the index before retrying, though the index will still never be used for regular queries if it did not finish building successfully.
Locking
PostgreSQL has many different levels of locking. We’re concerned primarily with the following table-level locks since DDL generally operates at these levels:
ACCESS EXCLUSIVE: blocks all usage of the locked table.
SHARE ROW EXCLUSIVE: blocks concurrent DDL against, and row modification in, the locked table (reads are still allowed).
SHARE UPDATE EXCLUSIVE: blocks concurrent DDL against the locked table.
Note: “Concurrent DDL” for these purposes includes VACUUM and ANALYZE operations.
All DDL operations generally necessitate acquiring one of these locks on the object being manipulated. For example, when you run a statement like the following (the specific column added here is just an illustration):
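-- an illustrative statement; any ALTER TABLE against foos behaves the same way
ALTER TABLE foos ADD COLUMN bar text;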
PostgreSQL attempts to acquire an ACCESS EXCLUSIVE lock on the table foos. Attempting to acquire this lock causes all subsequent queries on this table to queue until the lock is released. In practice your DDL operations can cause other queries to back up for as long as your longest running query takes to execute. Because arbitrarily long queueing of incoming queries is indistinguishable from an outage, we try to avoid any long-running queries in databases supporting our payments processing applications.
But sometimes a query takes longer than you expect. Or maybe you have a few special case queries that you already know will take a long time. PostgreSQL offers some additional runtime configuration options that allow us to guarantee query queueing backpressure doesn’t result in downtime.
Instead of relying on Postgres to lock an object when executing a DDL statement, we acquire the lock explicitly ourselves. This allows us to carefully control the time the queries may be queued. Additionally when we fail to acquire a lock within several seconds, we pause before trying again so that any queued queries can be executed without significantly increasing load. Finally, before we attempt lock acquisition, we query pg_locks¹ for any currently long running queries to avoid unnecessarily queueing queries for several seconds when it is unlikely that lock acquisition is going to succeed.
Starting with Postgres 9.3, you can adjust the lock_timeout parameter to control how long Postgres will wait for lock acquisition before giving up and returning without acquiring the lock. If you happen to be using 9.2 or earlier (those versions are unsupported; you should upgrade!), then you can simulate this behavior by using the statement_timeout parameter around an explicit LOCK <table> statement.
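A minimal sketch of that pattern on Postgres 9.3 or later, with an illustrative table name; retrying and backoff are left to the surrounding tooling:

SET lock_timeout TO '2s';
BEGIN;
-- fails with an error instead of queueing indefinitely if the lock
-- cannot be acquired within 2 seconds
LOCK TABLE foos IN ACCESS EXCLUSIVE MODE;
ALTER TABLE foos ADD COLUMN bar text;  -- the actual DDL, run while we hold the lock
COMMIT;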
In many cases an ACCESS EXCLUSIVE lock need only be held for a very short period of time, i.e., the amount of time it takes Postgres to update its "catalog" (think metadata) tables. Below we'll discuss the cases where a lower lock level is sufficient or alternative approaches for avoiding long-held locks that block SELECT/INSERT/UPDATE/DELETE.
Note: Sometimes holding even an ACCESS EXCLUSIVE lock for something more than a catalog update (e.g., a full table scan or even rewrite) can be functionally acceptable when the table size is relatively small. We recommend testing your specific use case against realistic data sizes and hardware to see if a particular operation will be "fast enough". On good hardware with a table easily loaded into memory, a full table scan or rewrite for thousands (possibly even 100s of thousands) of rows may be "fast enough".
Table operations
Create table
In general, adding a table is one of the few operations we don’t have to think too hard about since, by definition, the object we’re “modifying” can’t possibly be in use yet. :D
While most of the attributes involved in creating a table do not involve other database objects, including a foreign key in your initial table definition will cause Postgres to acquire a SHARE ROW EXCLUSIVE lock against the referenced table blocking any concurrent DDL or row modifications. While this lock should be short-lived, it nonetheless requires the same caution as any other operation acquiring such a lock. We prefer to split these into two separate operations: create the table and then add the foreign key.
Drop table
Dropping a table requires an exclusive lock on that table. As long as the table isn’t in current use you can safely drop the table. Before allowing a DROP TABLE ... to make its way into our production environments we require documentation showing when all references to the table were removed from the codebase. To double check that this is the case you can query PostgreSQL's table statistics view pg_stat_user_tables² confirming that the returned statistics don't change over the course of a reasonable length of time.
Rename table
While it’s unsurprising that a table rename requires acquiring an ACCESS EXCLUSIVE lock on the table, that's far from our biggest concern. Unless the table is no longer being read from or written to at all, it's very unlikely that your application code could safely handle a table being renamed underneath it.
We avoid table renames almost entirely. But if a rename is an absolute must, then a safe approach might look something like the following:
Create a new table with the same schema as the old one.
Backfill the new table with a copy of the data in the old table.
Use INSERT and UPDATE triggers on the old table to maintain parity in the new table.
Begin using the new table.
Other approaches involving views and/or RULEs may also be viable depending on the performance characteristics required.
Column operations
Note: For column constraints (e.g., NOT NULL) or other constraints (e.g., EXCLUDES), see Constraints.
Add column
Adding a column to an existing table generally requires holding a short ACCESS EXCLUSIVE lock on the table while catalog tables are updated. But there are several potential gotchas:
Default values: Introducing a default value at the same time as adding the column will cause the table to be locked while the default value is propagated to all rows in the table. Instead, you should:
Add the new column (without the default value).
Set the default value on the column.
Backfill all existing rows separately.
Note: In the recently released PostgreSQL 11, this is no longer the case for non-volatile default values. Instead, adding a new column with a default value only requires updating catalog tables, and any reads of rows without a value for the new column will magically have it “filled in” on the fly.
Not-null constraints: Adding a column with a NOT NULL constraint is only possible if there are no existing rows or a DEFAULT is also provided. If there are no existing rows, then the change is effectively equivalent to a catalog only change. If there are existing rows and you are also specifying a default value, then the same caveats apply as above with respect to default values.
Note: Adding a column will cause all SELECT * FROM ... style queries referencing the table to begin returning the new column. It is important to ensure that all currently running code safely handles new columns. To avoid this gotcha in our applications we require queries to avoid * expansion in favor of explicit column references.
Change column type
In the general case changing a column’s type requires holding an exclusive lock on a table while the entire table is rewritten with the new type.
There are a few exceptions:
Note: Even though one of the exceptions above was added in 9.1, changing the type of an indexed column would still always rewrite the index even if a table rewrite was avoided. Since 9.2, any column type change that avoids a table rewrite also avoids rewriting the associated indexes. If you’d like to confirm that your change won’t rewrite the table or any indexes, you can query pg_class³ and verify that the relfilenode column doesn't change.
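A sketch of that check with assumed table, column, and type names:

SELECT relfilenode FROM pg_class WHERE relname = 'foos';
ALTER TABLE foos ALTER COLUMN bar TYPE varchar(255);
SELECT relfilenode FROM pg_class WHERE relname = 'foos';  -- an unchanged value means no table rewrite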
If you need to change the type of a column and one of the above exceptions doesn’t apply, then the safe alternative is:
Add a new column new_<column>.
Dual write to both columns (e.g., with a BEFORE INSERT/UPDATE trigger).
Backfill the new column with a copy of the old column’s values.
Rename <column> to old_<column> and new_<column> to <column> inside a single transaction with an explicit LOCK <table> statement.
Drop the old column.
Drop column
It goes without saying that dropping a column is something that should be done with great care. Dropping a column requires an exclusive lock on the table to update the catalog but does not rewrite the table. As long as the column isn’t in current use you can safely drop the column. It’s also important to confirm that the column is not referenced by any dependent objects that could be unsafe to drop. In particular, any indexes using the column should be dropped separately and safely with DROP INDEX CONCURRENTLY since otherwise they will be automatically dropped along with the column under an ACCESS EXCLUSIVE lock. You can query pg_depend⁴ for any dependent objects.
Before allowing an ALTER TABLE ... DROP COLUMN ... to make its way into our production environments we require documentation showing when all references to the column were removed from the codebase. This process allows us to safely roll back to the release prior to the one that dropped the column.
Note: Dropping a column will require that you update all views, triggers, functions, etc. that rely on that column.
Index operations
Create index
The standard form of CREATE INDEX ... acquires an ACCESS EXCLUSIVE lock against the table being indexed while building the index using a single table scan. In contrast, the form CREATE INDEX CONCURRENTLY ... acquires a SHARE UPDATE EXCLUSIVE lock but must complete two table scans (and hence is somewhat slower). This lower lock level allows reads and writes to continue against the table while the index is built.
Caveats:
Multiple concurrent index creations on a single table will not return from either CREATE INDEX CONCURRENTLY ... statement until the slowest one completes.
CREATE INDEX CONCURRENTLY ... may not be executed inside of a transaction but does maintain transactions internally. Holding a transaction open means that no autovacuum (against any table in the system) will be able to clean up dead tuples introduced after the index build begins until it finishes. If you have a table with a large volume of updates (particularly bad if it is a very small table) this could result in extremely sub-optimal query execution.
CREATE INDEX CONCURRENTLY ... must wait for all transactions using the table to complete before returning.
Drop index
The standard form of DROP INDEX ... acquires an ACCESS EXCLUSIVE lock against the table with the index while removing the index. For small indexes this may be a short operation. For large indexes, however, file system unlinking and disk flushing can take a significant amount of time. In contrast, the form DROP INDEX CONCURRENTLY ... acquires a SHARE UPDATE EXCLUSIVE lock to perform these operations allowing reads and writes to continue against the table while the index is dropped.
Caveats:
DROP INDEX CONCURRENTLY ... cannot be used to drop any index that supports a constraint (e.g., PRIMARY KEY or UNIQUE).
DROP INDEX CONCURRENTLY ... may not be executed inside of a transaction but does maintain transactions internally. Holding a transaction open means that no autovacuum (against any table in the system) will be able to clean up dead tuples introduced after the operation begins until it finishes. If you have a table with a large volume of updates (particularly bad if it is a very small table) this could result in extremely sub-optimal query execution.
DROP INDEX CONCURRENTLY ... must wait for all transactions using the table to complete before returning.
Note: DROP INDEX CONCURRENTLY ... was added in Postgres 9.2. If you're still running 9.1 or prior, you can achieve somewhat similar results by marking the index as invalid and not ready for writes, flushing buffers with the pgfincore extension, and then dropping the index.
Rename index
ALTER INDEX ... RENAME TO ... requires an ACCESS EXCLUSIVE lock on the index blocking reads from and writes to the underlying table. However a recent commit expected to be a part of Postgres 12 lowers that requirement to SHARE UPDATE EXCLUSIVE.
Reindex
REINDEX INDEX ... requires an ACCESS EXCLUSIVE lock on the index blocking reads from and writes to the underlying table. Instead we use the following procedure:
Create a new index concurrently that duplicates the existing index definition.
Drop the original index concurrently.
Rename the new index to match the original index’s name.
Note: If the index you need to rebuild backs a constraint, remember to re-add the constraint as well (subject to all of the caveats we’ve documented.)
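A sketch of that procedure with assumed table and index names:

CREATE INDEX CONCURRENTLY foos_bar_idx_new ON foos (bar);  -- same definition as the existing foos_bar_idx
DROP INDEX CONCURRENTLY foos_bar_idx;
ALTER INDEX foos_bar_idx_new RENAME TO foos_bar_idx;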
Constraints
NOT NULL Constraints
Removing an existing not-null constraint from a column requires an exclusive lock on the table while a simple catalog update is performed.
In contrast, adding a not-null constraint to an existing column requires an exclusive lock on the table while a full table scan verifies that no null values exist. Instead you should:
Add a CHECK constraint requiring the column be not-null with ALTER TABLE <table> ADD CONSTRAINT <name> CHECK (<column> IS NOT NULL) NOT VALID;. The NOT VALID tells Postgres that it doesn't need to scan the entire table to verify that all rows satisfy the condition.
Manually verify that all rows have non-null values in your column.
Validate the constraint with ALTER TABLE <table> VALIDATE CONSTRAINT <name>;. With this statement PostgreSQL will block acquisition of other EXCLUSIVE locks for the table, but will not block reads or writes.
Bonus: There is currently a patch in the works (and possibly it will make it into Postgres 12) that will allow you to create a NOT NULL constraint without a full table scan if a CHECK constraint (like we created above) already exists.
Foreign keys
ALTER TABLE ... ADD FOREIGN KEY requires a SHARE ROW EXCLUSIVE lock (as of 9.5) on both the altered and referenced tables. While this won't block SELECT queries, blocking row modification operations for a long period of time is equally unacceptable for our transaction processing applications.
To avoid that long-held lock you can use the following process:
ALTER TABLE ... ADD FOREIGN KEY ... NOT VALID: Adds the foreign key and begins enforcing the constraint for all new INSERT/UPDATE statements but does not validate that all existing rows conform to the new constraint. This operation still requires SHARE ROW EXCLUSIVE locks, but the locks are only briefly held.
ALTER TABLE ... VALIDATE CONSTRAINT <constraint>: This operation checks all existing rows to verify they conform to the specified constraint. Validation requires a SHARE UPDATE EXCLUSIVE lock, so it may run concurrently with row reading and modification queries.
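A sketch of the two-step pattern with assumed table and constraint names:

ALTER TABLE foos ADD CONSTRAINT foos_bar_id_fkey
    FOREIGN KEY (bar_id) REFERENCES bars (id) NOT VALID;
ALTER TABLE foos VALIDATE CONSTRAINT foos_bar_id_fkey;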
Check constraints
ALTER TABLE ... ADD CONSTRAINT ... CHECK (...) requires an ACCESS EXCLUSIVE lock. However, as with foreign keys, Postgres supports breaking the operation into two steps:
ALTER TABLE ... ADD CONSTRAINT ... CHECK (...) NOT VALID: Adds the check constraint and begins enforcing it for all new INSERT/UPDATE statements but does not validate that all existing rows conform to the new constraint. This operation still requires an ACCESS EXCLUSIVE lock.
ALTER TABLE ... VALIDATE CONSTRAINT <constraint>: This operation checks all existing rows to verify they conform to the specified constraint. Validation requires a SHARE UPDATE EXCLUSIVE lock on the altered table, so it may run concurrently with row reading and modification queries. A ROW SHARE lock is held on the referenced table, which will block any operations requiring exclusive locks while validating the constraint.
Uniqueness constraints
ALTER TABLE ... ADD CONSTRAINT ... UNIQUE (...) requires an ACCESS EXCLUSIVE lock. However, Postgres supports breaking the operation into two steps:
Create a unique index concurrently. This step will immediately enforce uniqueness, but if you need a declared constraint (or a primary key), then continue to add the constraint separately.
Add the constraint using the already existing index with ALTER TABLE ... ADD CONSTRAINT ... UNIQUE USING INDEX <index>. Adding the constraint still requires an ACCESS EXCLUSIVE lock, but the lock will only be held for fast catalog operations.
Note: If you specify PRIMARY KEY instead of UNIQUE then any nullable columns in the index will be made NOT NULL. This requires a full table scan which currently can't be avoided. See NOT NULL Constraints for more details.
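A sketch with assumed table, column, and constraint names:

CREATE UNIQUE INDEX CONCURRENTLY foos_bar_key ON foos (bar);
ALTER TABLE foos ADD CONSTRAINT foos_bar_key UNIQUE USING INDEX foos_bar_key;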
Exclusion constraints
ALTER TABLE ... ADD CONSTRAINT ... EXCLUDE USING ... requires an ACCESS EXCLUSIVE lock. Adding an exclusion constraint builds the supporting index, and, unfortunately, there is currently no support for using an existing index (as you can do with a unique constraint).
Enum Types
CREATE TYPE <name> AS ENUM (...) and DROP TYPE <name> (after verifying there are no existing usages in the database) can both be done safely without unexpected locking.
Modifying enum values
ALTER TYPE <enum> RENAME VALUE <old> TO <new> was added in Postgres 10. This statement does not require locking tables which use the enum type.
Deleting enum values
Enums are stored internally as integers, and there is no support for gaps in the valid range; removing a value would currently require shifting the remaining values and rewriting all rows using those values. PostgreSQL therefore does not currently support removing values from an existing enum type.
Announcing Pg_ha_migrations for Ruby on Rails
We’re also excited to announce that we have open-sourced our internal library pg_ha_migrations. This Ruby gem enforces DDL safety in projects using Ruby on Rails and/or ActiveRecord with an emphasis on explicitly choosing trade-offs and avoiding unnecessary magic (and the corresponding surprises). You can read more in the project’s README.
Footnotes
[1] You can find active long-running queries and the tables they lock with the following query:
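A sketch of such a query (the column choices and the duration threshold here are assumptions, not the exact query from the post):

SELECT a.pid,
       now() - a.query_start AS duration,
       a.query,
       c.relname AS locked_relation,
       l.mode
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
JOIN pg_class c ON c.oid = l.relation
WHERE a.state <> 'idle'
  AND now() - a.query_start > interval '5 seconds'
ORDER BY duration DESC;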
On 19th of January 2019, Tomas Vondra committed patch: Allow COPY FROM to filter data using WHERE conditions. Extends the COPY FROM command with a WHERE condition, which allows doing various types of filtering while importing the data (random sampling, condition on a data column, etc.). Until now such filtering required either preprocessing of …
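The new clause looks roughly like this (a sketch against PostgreSQL 12; the table, columns, and file path are made up for illustration):

COPY measurements (city, reading, taken_at)
    FROM '/tmp/measurements.csv' WITH (FORMAT csv)
    WHERE reading > 0;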
That will not last very long if you have a busy database doing many writes over the day. MVCC keeps the new and old versions of a row in the table, and the TXID increases with every transaction. At some point the 4 billion transaction mark is reached, the TXID overruns, and counting starts again at the beginning. Because of the way transactions work in PostgreSQL, suddenly all data in your database would become invisible. No one wants that!
To limit this problem, PostgreSQL has a number of mechanisms in place:
Old, deleted row versions are eventually removed by VACUUM (or autovacuum), and their XIDs are no longer used.
Old row versions which are still live are marked as "frozen" in the table and assigned a special XID; the previously used XID is no longer needed. The problem here is that every single table in every database must be vacuumed before the 2 billion threshold is reached.
PostgreSQL uses lazy XIDs, where a "real" transaction id is only assigned if the transaction changes something on disk; if a transaction is read-only and does not change anything, no transaction id is consumed.
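To keep an eye on how close each database is to that threshold, you can check the age of its oldest unfrozen XID; a minimal sketch:

SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;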
Today’s an exciting day in the Victoria office of Crunchy Data: our local staff count goes from one to two, as Martin Davis joins the company!
This is kind of a big deal, because this year Martin and I will be spending much of our time on the core computational geometry library that powers PostGIS, the GEOS library, and the JTS library from which it derives its structure.
Why is that a big deal? Because GEOS, JTS and other language ports provide the computational geometry algorithms underneath most of the open source geospatial ecosystem – so improvements in our core libraries ripple out to help a huge swathe of other software.
JTS came first, initially as a project of the British Columbia government. GEOS is a C++ port of JTS. There are also Javascript and .Net ports (JSTS and NTS).
Each of those libraries has developed a rich downline of other libraries and projects that depend on them. On the desktop, on the web, in the middleware, JTS and GEOS power all of it.
So we know that work on JTS and GEOS on our side is going to benefit far more than just PostGIS.
I’ve already spent a decent amount of time on bringing the GEOS library up to date with the changes in JTS over the past few months, and trying to fulfill the “maintainer” role, merging pull requests and closing some outstanding tickets.
As Martin starts adding to JTS, I now feel more confident in my ability to bring those changes into the C++ world of GEOS as they land.
Without pre-judging what will get first priority, topics of overlay robustness, predicate performance, and geometry cleaning are near the top of our list.
Our spatial customers at Crunchy process a lot of geometry, so ensuring that PostGIS (GEOS) operations are robust and high performance is a big win for PostgreSQL and for our customers as well.
Yesterday someone on irc asked: i've a query that returns sequential numbers with gaps (generate_series + join) and my question is: can i somehow construct ranges out of the returned values? sort of range_agg or something? There was no further discussion, aside from me saying sure you can. not trivial task, but possible. you'd need …
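One common way to do it, sketched here with made-up values (this is the usual gaps-and-islands trick, not necessarily the approach from the full post):

WITH vals (v) AS (
    VALUES (1), (2), (3), (7), (8), (12)
)
SELECT min(v) AS range_start, max(v) AS range_end
FROM (
    SELECT v, v - row_number() OVER (ORDER BY v) AS grp
    FROM vals
) s
GROUP BY grp
ORDER BY range_start;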
Most often, there’s a trade off involved in optimizing
software. The cost of better performance is the opportunity
cost of the time that it took to write the optimization,
and the additional cost of maintenance for code that
becomes more complex and more difficult to understand.
Many projects prioritize product development over improving
runtime speed. Time is spent building new things instead of
making existing things faster. Code is kept simpler and
easier to understand so that adding new features and fixing
bugs stays easy, even as particular people rotate in and
out and institutional knowledge is lost.
But that’s certainly not the case in all domains. Game code
is often an interesting read because it comes from an
industry where speed is a competitive advantage, and it’s
common practice to optimize liberally even at some cost to
modularity and maintainability. One technique for that is
to inline code in critical sections even to the point of
absurdity. CryEngine, open-sourced a few years ago, has a
few examples of this, with “tick” functions like this
one that are 800+ lines long with 14 levels of
indentation.
Another common place to find optimizations is in databases.
While games optimize because they have to, databases
optimize because they’re an example of software that’s
extremely leveraged – if there’s a way to make running
select queries or building indexes 10% faster, it’s not an
improvement that affects just a couple users, it’s one
that’ll potentially invigorate millions of installations
around the world. That’s enough of an advantage that the
enhancement is very often worth it, even if the price is a
challenging implementation or some additional code
complexity.
Postgres contains a wide breadth of optimizations, and
happily they’ve been written conscientiously so that the
source code stays readable. The one that we’ll look at
today is SortSupport, a technique for localizing the
information needed to compare data into places where it can
be accessed very quickly, thereby making sorting data much
faster. Sorting for types that have had SortSupport
implemented usually gets twice as fast or more, a speedup
that transfers directly into common database operations
like ORDER BY, DISTINCT, and CREATE INDEX.
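If you want to get a feel for the effect on your own
hardware, a rough experiment might look like this
(gen_random_uuid() is assumed to be available, either from
the pgcrypto extension or natively in PostgreSQL 13+, and
the timings you see will vary):

CREATE TABLE uuid_sort_test AS
    SELECT gen_random_uuid() AS id
    FROM generate_series(1, 1000000);
\timing on
CREATE INDEX ON uuid_sort_test (id);  -- the index build sorts the column and benefits from abbreviated keys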
While sorting, Postgres builds a series of tiny structures
that represent the data set being sorted. These tuples have
space for a value the size of a native pointer (i.e. 64
bits on a 64-bit machine) which is enough to fit the
entirety of some common types like booleans or integers
(known as pass-by-value types), but not for others that are
larger than 64 bits or arbitrarily large. In their case,
Postgres will follow a reference back to the heap when
comparing values (they’re appropriately called
pass-by-reference types). Postgres is very fast, so that
still happens quickly, but it’s slower than comparing
values readily available in memory.
An array of sort tuples.
SortSupport augments pass-by-reference types by bringing a
representative part of their value into the sort tuple to
save trips to the heap. Because sort tuples usually don’t
have the space to store the entirety of the value,
SortSupport generates a digest of the full value called an
abbreviated key, and stores it instead. The contents of
an abbreviated key vary by type, but they’ll aim to store
as much sorting-relevant information as possible while
remaining faithful to pre-existing sorting rules.
Abbreviated keys should never produce an incorrect
comparison, but it’s okay if they can’t fully resolve one.
If two abbreviated keys look equal, Postgres will fall back
to comparing their full heap values to make sure it gets
the right result (called an “authoritative comparison”).
A sort tuple with an abbreviated key and pointer to the heap.
Implementing an abbreviated key is straightforward in many
cases. UUIDs are a good example of that: at 128 bits long
they’re always larger than the pointer size even on a
64-bit machine, but we can get a very good proxy of their
full value just by sampling their first 64 bits (or 32 on a
32-bit machine). Especially for V4 UUIDs which are almost
entirely random 1, the first 64 bits will be enough to
definitively determine the order for all but unimaginably
large data sets. Indeed, the patch that brought in
SortSupport for UUIDs made sorting them about
twice as fast!
String-like types (e.g. text, varchar) aren’t too much
harder: just pack as many characters from the front of the
string in as possible (although made somewhat more
complicated by locales). Adding SortSupport for them made
operations like CREATE INDEX about three times
faster. My only ever patch to Postgres was
implementing SortSupport for the macaddr type, which was
fairly easy because although it’s pass-by-reference, its
values are only six bytes long 2. On a 64-bit machine we
have room for all six bytes, and on 32-bit we sample the
MAC address’ first four bytes.
Some abbreviated keys are more complex. The implementation
for the numeric type, which allows arbitrary scale and
precision, involves excess-K coding and breaking
available bits into multiple parts to store sort-relevant
fields.
Let’s try to get a basic idea of how SortSupport is
implemented by examining a narrow slice of source code.
Sorting in Postgres is extremely complex and involves
thousands of lines of code, so fair warning that I’m going
to simplify some things and skip a lot of others.
A good place to start is with Datum, the pointer-sized type
(32 or 64 bits, depending on the CPU) used for sort
comparisons. It stores entire values for pass-by-value
types, abbreviated keys for pass-by-reference types that
implement SortSupport, and a pointer for those that don’t.
You can find it defined in postgres.h:
/*
* A Datum contains either a value of a pass-by-value type or a pointer
* to a value of a pass-by-reference type. Therefore, we require:
*
* sizeof(Datum) == sizeof(void *) == 4 or 8
*/
typedef uintptr_t Datum;
#define SIZEOF_DATUM SIZEOF_VOID_P
The format of abbreviated keys for the uuid type is one
of the easiest to understand, so let’s look at that. In
Postgres, the struct pg_uuid_t defines how UUIDs are
physically stored in the heap (from uuid.h):
You might be used to seeing UUIDs represented in string
format like 123e4567-e89b-12d3-a456-426655440000, but
remember that this is Postgres which likes to be as
efficient as possible! A UUID contains 16 bytes worth of
information, so pg_uuid_t above defines an array of
exactly 16 bytes. No wastefulness to be found.
SortSupport implementations define a conversion routine
which takes the original value and produces a datum
containing an abbreviated key. Here’s the one for UUIDs
(from uuid.c):
static Datum
uuid_abbrev_convert(Datum original, SortSupport ssup)
{
pg_uuid_t *authoritative = DatumGetUUIDP(original);
Datum res;
memcpy(&res, authoritative->data, sizeof(Datum));
...
/*
* Byteswap on little-endian machines.
*
* This is needed so that uuid_cmp_abbrev() (an unsigned integer 3-way
* comparator) works correctly on all platforms. If we didn't do this,
* the comparator would have to call memcmp() with a pair of pointers to
* the first byte of each abbreviated key, which is slower.
*/
res = DatumBigEndianToNative(res);
return res;
}
memcpy (“memory copy”) extracts a datum worth of bytes
from a pg_uuid_t and places it into res. We can’t take
the whole UUID, but we’ll be taking its 4 or 8 most
significant bytes, which will be enough information for
most comparisons.
Abbreviated key formats for the `uuid` type.
The call DatumBigEndianToNative is there to help with an
optimization. When comparing our abbreviated keys, we could
do so with memcmp (“memory compare”) which would compare
each byte in the datum one at a time. That’s perfectly
functional of course, but because our datums are the same
size as native integers, we can instead choose to take
advantage of the fact that CPUs are optimized to compare
integers really, really quickly, and arrange the datums in
memory as if they were integers. You can see this integer
comparison taking place in the UUID abbreviated key
comparison function:
static int
uuid_cmp_abbrev(Datum x, Datum y, SortSupport ssup)
{
if (x > y)
return 1;
else if (x == y)
return 0;
else
return -1;
}
However, pretending that some consecutive bytes in memory
are integers introduces some complication. Integers might
be stored like data in pg_uuid_t with the most
significant byte first, but that depends on the
architecture of the CPU. We call architectures that store
numerical values this way big-endian. Big-endian
machines exist, but the chances are that the CPU you’re
using to read this article stores bytes in the reverse
order of their significance, with the most significant at
the highest address. This layout is called
little-endian, and is in use by Intel’s X86, as well as
being the default mode for ARM chips like the ones in
Android and iOS devices.
If we left the big-endian result of the memcpy unchanged
on little-endian systems, the resulting integer would be
wrong. The answer is to byteswap, which reverses the order
of the bytes, and corrects the integer.
Example placement of integer bytes on little and big endian architectures.
You can see in pg_bswap.h that
DatumBigEndianToNative is defined as a no-op on a
big-endian machine, and is otherwise connected to a
byteswap (“bswap”) routine of the appropriate size:
Let’s touch upon one more feature of uuid_abbrev_convert.
In data sets with very low cardinality (i.e., many
duplicated items) SortSupport introduces some danger of
worsening performance. With so many duplicates, the
contents of abbreviated keys would often show equality, in
which case Postgres would often have to fall back to the
authoritative comparator. In effect, by adding SortSupport
we would have added a useless additional comparison that
wasn’t there before.
To protect against performance regression, SortSupport has
a mechanism for aborting abbreviated key conversion. If the
data set is found to be below a certain cardinality
threshold, Postgres stops abbreviating, reverts any keys
that were already abbreviated, and disables further
abbreviation for the sort.
Cardinality is estimated with the help of
HyperLogLog, an algorithm that estimates the
distinct count of a data set in a very memory-efficient
way. Here you can see the conversion routine adding new
values to the HyperLogLog if an abort is still possible:
And where it makes an abort decision (from uuid.c):
static bool
uuid_abbrev_abort(int memtupcount, SortSupport ssup)
{
...
abbr_card = estimateHyperLogLog(&uss->abbr_card);
/*
* If we have >100k distinct values, then even if we were
* sorting many billion rows we'd likely still break even,
* and the penalty of undoing that many rows of abbrevs would
* probably not be worth it. Stop even counting at that point.
*/
if (abbr_card > 100000.0)
{
uss->estimating = false;
return false;
}
/*
* Target minimum cardinality is 1 per ~2k of non-null inputs.
* 0.5 row fudge factor allows us to abort earlier on genuinely
* pathological data where we've had exactly one abbreviated
* value in the first 2k (non-null) rows.
*/
if (abbr_card < uss->input_count / 2000.0 + 0.5)
{
return true;
}
...
}
The abort logic also covers the case where we have a data set
that’s poorly suited to the abbreviated key format. For
example, imagine a million UUIDs that all shared a common
prefix in their first eight bytes, but were distinct in
their last eight 3. Realistically this will be extremely
unusual, so abbreviated key conversion will rarely abort.
Sort tuples are the tiny structures that Postgres sorts
in memory. They hold a reference to the “true” tuple, a
datum, and a flag to indicate whether or not the first
value is NULL (which has its own special sorting
semantics). The latter two are named with a 1 suffix as
datum1 and isnull1 because they represent only one
field worth of information. Postgres will need to fall back
to different values in the event of equality in a
multi-column comparison. From tuplesort.c:
/*
* The objects we actually sort are SortTuple structs. These contain
* a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
* which is a separate palloc chunk --- we assume it is just one chunk and
* can be freed by a simple pfree() (except during merge, when we use a
* simple slab allocator). SortTuples also contain the tuple's first key
* column in Datum/nullflag format, and an index integer.
*/
typedef struct
{
void *tuple; /* the tuple itself */
Datum datum1; /* value of first key column */
bool isnull1; /* is first key column NULL? */
int tupindex; /* see notes above */
} SortTuple;
In the code we’ll look at below, SortTuple may reference
a heap tuple, which has a variety of different struct
representations. One used by the sort algorithm is
HeapTupleHeaderData (from htup_details.h):
struct HeapTupleHeaderData
{
union
{
HeapTupleFields t_heap;
DatumTupleFields t_datum;
} t_choice;
ItemPointerData t_ctid; /* current TID of this or newer tuple (or a
* speculative insertion token) */
...
}
Heap tuples have a pretty complex structure which we won’t
cover, but you can see that it contains an
ItemPointerData value. This struct is what gives Postgres
the precise information it needs to find data in the heap
(from itemptr.h):
/*
* ItemPointer:
*
* This is a pointer to an item within a disk page of a known file
* (for example, a cross-link from an index to its parent table).
* blkid tells us which block, posid tells us which entry in the linp
* (ItemIdData) array we want.
*/
typedef struct ItemPointerData
{
BlockIdData ip_blkid;
OffsetNumber ip_posid;
}
The algorithm to compare abbreviated keys is duplicated in
the Postgres source in a number of places depending on the
sort operation being carried out. We’ll take a look at
comparetup_heap (from tuplesort.c) which
is used when sorting based on the heap. This would be
invoked for example if you ran an ORDER BY on a field
that doesn’t have an index on it.
static int
comparetup_heap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
{
SortSupport sortKey = state->sortKeys;
HeapTupleData ltup;
HeapTupleData rtup;
TupleDesc tupDesc;
int nkey;
int32 compare;
AttrNumber attno;
Datum datum1,
datum2;
bool isnull1,
isnull2;
/* Compare the leading sort key */
compare = ApplySortComparator(a->datum1, a->isnull1,
b->datum1, b->isnull1,
sortKey);
if (compare != 0)
return compare;
ApplySortComparator gets a comparison result between two
datum values. It’ll compare two abbreviated keys where
appropriate and handles NULL sorting semantics. The
return value of a comparison follows the spirit of C’s
strcmp: when comparing (a, b), -1 indicates a < b,
0 indicates equality, and 1 indicates a > b.
The algorithm returns immediately if inequality (!= 0)
was detected. Otherwise, it checks to see if abbreviated
keys were used, and if so applies the authoritative
comparison. Because space in abbreviated keys
is limited, two being equal doesn’t necessarily indicate
that the values that they represent are.
After finding abbreviated keys to be equal, full values to
be equal, and all additional sort fields to be equal, the
last step is to return 0, indicating in classic libc
style that the two tuples are really, fully equal.
SortSupport is a good example of the type of low-level
optimization that most of us probably wouldn’t bother with
in our projects, but which makes sense in an extremely
leveraged system like a database. As implementations are
added for it and Postgres’ tens of thousands of users like
myself upgrade, common operations like DISTINCT, ORDER
BY, and CREATE INDEX get twice as fast, for free.
Credit to Peter Geoghegan for some of the original
exploration of this idea and implementations for UUID and a
generalized system for SortSupport on variable-length
string types, Robert Haas and Tom Lane for adding the
necessary infrastructure, and Andrew
Gierth for a difficult implementation for
numeric. (I hope I got all that right.)
1 A note for the pedantic that V4 UUIDs usually have only
122 bits of randomness as four bits are used for the
version and two for the variant.
2 The new type macaddr8 was later introduced to handle
EUI-64 MAC addresses, which are 64 bits long.
3 A data set of UUIDs with common datum-sized prefixes is
a pretty unlikely scenario, but it’s a little more
realistic for variable-length string types, where users
are storing much more free-form data.
If you have looked at Postgres object permissions in the past, I bet you were confused. I get confused, and I have been at this for a long time.
The way permissions are stored in Postgres is patterned after the long directory listing of Unix-like operating systems, e.g., ls -l. Just like directory listings,
the Postgres system stores permissions using single-letter indicators. r is used for read (SELECT) in both systems, while w is used for write permission in
ls, and UPDATE in Postgres. The other letters used by Postgres don't correspond to any directory listing permission letters, e.g., d is
DELETE permission. The full list of Postgres permission letters is in the GRANT documentation
page; the other letters are:
a -- INSERT
D -- TRUNCATE
x -- REFERENCES
t -- TRIGGER
X -- EXECUTE
U -- USAGE
C -- CREATE
c -- CONNECT
T -- TEMPORARY
We’ve seen a lot of questions regarding the options available in PostgreSQL for rebuilding a table online. We created this blog post to explain the pg_repack extension, available in PostgreSQL for this requirement. pg_repack is a well-known extension that was created and is maintained as an open source project by several authors.
There are three main reasons why you need to use pg_repack in a PostgreSQL server:
Reclaim free space from a table back to disk after deleting a huge chunk of records.
Rebuild a table to re-order the records and shrink/pack them into a smaller number of pages. This may let a query fetch just one page (or fewer than n pages) instead of n pages from disk. In other words, less IO and more performance.
Reclaim free space from a table that has grown in size with a lot of bloat due to improper autovacuum settings.
You might have already read our previous articles that explained what bloat is, and discussed the internals of autovacuum. After reading these articles, you can see there is an autovacuum background process that removes dead tuples from a table and allows the space to be re-used by future updates/inserts on that table. Over a period of time, tables that receive the most updates or deletes may accumulate a lot of bloated space due to poorly tuned autovacuum settings. This leads to slow-performing queries on these tables. Rebuilding the table is the best way to avoid this.
Why is just autovacuum not enough for tables with bloat?
We have discussed several parameters that change the behavior of an autovacuum process in this blog post. There cannot be more than autovacuum_max_workers autovacuum processes running in a database cluster at a time. At the same time, due to untuned autovacuum settings and no manual vacuuming of the database as a weekly or monthly job, many tables can be skipped by autovacuum. We have discussed in this post that the default autovacuum settings run autovacuum on a table with ten records more often than on a table with a million records. So, it is very important to tune your autovacuum settings, set table-level customized autovacuum parameters, and enable automated jobs to identify tables with huge bloat and run manual vacuum on them as scheduled jobs during low peak times (after thorough testing).
VACUUM FULL
VACUUM FULL is the default option available with a PostgreSQL installation that allows us to rebuild a table. This is similar to ALTER TABLE in MySQL. However, this command acquires an exclusive lock and blocks both reads and writes on the table.
VACUUM FULL tablename;
pg_repack
pg_repack is an extension available for PostgreSQL that helps us rebuild a table online. This is similar to pt-online-schema-change for online table rebuild/reorg in MySQL. However, pg_repack works only for tables with a primary key or a NOT NULL unique key.
In RHEL/CentOS from PGDG repo
# yum install pg_repack11 (This works for PostgreSQL 11)
Similarly, for PostgreSQL 10,
# yum install pg_repack10
In Debian/Ubuntu from PGDG repo
Add certificates, repo and install pg_repack:
The following certificate may change. Please validate it before you perform these steps.
# sudo apt-get install wget ca-certificates
# wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
# sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
# sudo apt-get update
# apt-get install postgresql-server-dev-11
# apt-get install postgresql-11-repack
Loading and creating pg_repack extension
Step 1: You need to add pg_repack to shared_preload_libraries. For that, just set this parameter in the postgresql.conf or postgresql.auto.conf file.
shared_preload_libraries = 'pg_repack'
Setting this parameter requires a restart.
$ pg_ctl -D $PGDATA restart -mf
Step 2: In order to start using pg_repack, you must create this extension in each database where you wish to run it:
$ psql
\c percona
CREATE EXTENSION pg_repack;
Using pg_repack to Rebuild Tables Online
Similar to pt-online-schema-change, you can use the --dry-run option to see if a table can be rebuilt using pg_repack. When you rebuild a table using pg_repack, all its associated indexes are rebuilt automatically. You can also use -t instead of --table as an argument to rebuild a specific table.
This is the success message you see when a table satisfies the requirements for pg_repack:
$ pg_repack --dry-run -d percona --table scott.employee
INFO: Dry run enabled, not executing repack
INFO: repacking table "scott.employee"
And this is the error message when a table does not satisfy the requirements for pg_repack:
$ pg_repack --dry-run -d percona --table scott.sales
INFO: Dry run enabled, not executing repack
WARNING: relation "scott.sales" must have a primary key or not-null unique keys
Now, to execute the rebuild of the table scott.employee online, you can use the following command. It is just the previous command without the --dry-run option:
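$ pg_repack -d percona --table scott.employee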
Over the years many people have asked for “timetravel” or “AS OF” queries in PostgreSQL. Oracle has provided this kind of functionality for quite some time already. However, in the PostgreSQL world “AS OF timestamp” is not directly available. The question now is: how can we implement this vital functionality in user land and mimic Oracle’s behavior?
Implementing “AS OF” and timetravel in user land
Let us suppose we want to version a simple table consisting of just three columns: id, some_data1 and some_data2. To do this we first have to install the btree_gist module, which adds some valuable operators we will need to manage time travel. The table storing the data will need an additional column to handle the validity of a row. Fortunately PostgreSQL supports “range types”, which allow us to store ranges in an easy and efficient way. Here is how it works:
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE t_object
(
id int8,
valid tstzrange,
some_data1 text,
some_data2 text,
EXCLUDE USING gist (id WITH =, valid WITH &&)
);
Mind the last line here: “EXCLUDE USING gist” will ensure that if the “id” is identical, the period (“valid”) must not overlap. The idea is to ensure that the same “id” only has one entry at a time. PostgreSQL will automatically create a GiST index on that column. The feature is called an “exclusion constraint”. If you are looking for more information about this feature, consider checking out the official documentation (https://www.postgresql.org/docs/current/ddl-constraints.html).
If you want to filter on some_data1 and some_data2 consider creating indexes. Remember, missing indexes are in many cases the root cause of bad performance:
CREATE INDEX idx_some_index1 ON t_object (some_data1);
CREATE INDEX idx_some_index2 ON t_object (some_data2);
By creating a view, it should be super easy to extract data from the underlying tables:
CREATE VIEW t_object_recent AS
SELECT id, some_data1, some_data2
FROM t_object
WHERE current_timestamp <@ valid;
SELECT * FROM t_object_recent;
For the sake of simplicity I have created a view, which returns the most up to date state of the data. However, it should also be possible to select an old version of the data. To make it easy for application developers I decided to introduce a new GUC (= runtime variable), which allows users to set the desired point in time. Here is how it works:
SET timerobot.as_of_time = '2018-01-10 00:00:00';
Then you can create a second view, which returns the old data:
CREATE VIEW t_object_historic AS
SELECT id, some_data1, some_data2
FROM t_object
WHERE current_setting('timerobot.as_of_time')::timestamptz <@ valid;
SELECT * FROM t_object_historic;
It is of course also possible to do that with just one view. However, the code is easier to read if two views are used (for the purpose of this blog post). Feel free to adjust the code to your needs.
If you are running an application you usually don’t care what is going on behind the scenes – you simply want to modify a table and things should take care of themselves in an easy way. Therefore it makes sense to add a trigger to the t_object_recent view, which takes care of versioning. Here is an example:
CREATE FUNCTION version_trigger() RETURNS trigger AS
$$
BEGIN
IF TG_OP = 'UPDATE'
THEN
IF NEW.id <> OLD.id
THEN
RAISE EXCEPTION 'the ID must not be changed';
END IF;
UPDATE t_object
SET valid = tstzrange(lower(valid), current_timestamp)
WHERE id = NEW.id
AND current_timestamp <@ valid;
IF NOT FOUND THEN
RETURN NULL;
END IF;
END IF;
IF TG_OP IN ('INSERT', 'UPDATE')
THEN
INSERT INTO t_object (id, valid, some_data1, some_data2)
VALUES (NEW.id,
tstzrange(current_timestamp, TIMESTAMPTZ 'infinity'),
NEW.some_data1,
NEW.some_data2);
RETURN NEW;
END IF;
IF TG_OP = 'DELETE'
THEN
UPDATE t_object
SET valid = tstzrange(lower(valid), current_timestamp)
WHERE id = OLD.id
AND current_timestamp <@ valid;
IF FOUND THEN
RETURN OLD;
ELSE
RETURN NULL;
END IF;
END IF;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER object_trig
INSTEAD OF INSERT OR UPDATE OR DELETE
ON t_object_recent
FOR EACH ROW
EXECUTE PROCEDURE version_trigger();
The trigger ensures that INSERT, UPDATE, and DELETE are all handled properly.
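A quick usage sketch (the values and the timestamp are made up; all data modification goes through the t_object_recent view so that the trigger can maintain the history):

INSERT INTO t_object_recent VALUES (1, 'foo', 'bar');
UPDATE t_object_recent SET some_data1 = 'baz' WHERE id = 1;
-- look at the data as it was at an earlier point in time
SET timerobot.as_of_time = '2019-01-20 00:00:00';
SELECT * FROM t_object_historic WHERE id = 1;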
Finally: Timetravel made easy
It is obvious that versioning does have an impact on performance. You should also keep in mind that UPDATE and DELETE are more expensive than before. However, the advantage is that things are really easy from an application point of view. Implementing time travel can be done quite generically and most applications might not have to be changed at all. What is true, however, is that foreign keys will need some special attention and might not be easy to implement in general. It depends on your application whether this kind of restriction is a problem or not.
This latest release provides further feature enhancements designed to support users intending to deploy large-scale PostgreSQL clusters on Kubernetes, with enterprise high-availability and disaster recovery requirements.
When combined with the Crunchy PostgreSQL Container Suite, the PostgreSQL Operator provides an open source, Kubernetes-native PostgreSQL-as-a-Service capability.
Read on to see what is new in PostgreSQL Operator 3.5.
PostgreSQL supports SSL, and SSL private keys can be protected by a passphrase. Many people choose not to use passphrases with their SSL keys, and that’s perhaps fine. This blog post is about what happens when you do have a passphrase.
If you have SSL enabled and a key with a passphrase and you start the server, the server will stop to ask for the passphrase. This happens automatically from within the OpenSSL library. Stopping to ask for a passphrase obviously prevents automatic starts, restarts, and reboots, but we’re assuming here that you have made that tradeoff consciously.
When you run PostgreSQL under systemd, which is very common nowadays, there is an additional problem. Under systemd, the server process does not have terminal access, and so it cannot ask for any passphrases. By default, the startup will fail in such setups.
As of PostgreSQL 11, it is possible to configure an external program to obtain the SSL passphrase, using the configuration setting ssl_passphrase_command. As a simple example, you can set
ssl_passphrase_command = 'echo "secret"'
and it will apply the passphrase that the program prints out. You can use this to fetch the passphrase from a file or other secret store, for example.
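For instance, something along these lines would read the passphrase from a file (the path is just a placeholder; the file has to be readable by the operating system user running the server):
ssl_passphrase_command = 'cat /var/lib/postgresql/ssl-passphrase.txt'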
But what if you still want the security of having to manually enter the passphrase? Systemd has a facility that lets services prompt for passwords, and you can use it here as well.
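One way to do that, sketched here with an assumed prompt text, is to point ssl_passphrase_command at systemd-ask-password:
ssl_passphrase_command = 'systemd-ask-password "Enter the SSL key passphrase for PostgreSQL:"'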
Except that that doesn’t actually work, because non-root processes are not permitted to use the systemd password system; see this bug. But there are workarounds.
A more evil workaround (discussed in the above-mentioned bug report) is to override the permissions on the socket file underlying this mechanism. Add this to the postgresql service unit:
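A sketch of such an override could look roughly like this (the setfacl targets and the /run/systemd/ask-password path are assumptions for illustration, not the exact original unit file):
[Service]
ExecStartPre=+/usr/bin/setfacl -m u:postgres:rwx /run/systemd/ask-password
ExecStartPost=+/usr/bin/setfacl -x u:postgres /run/systemd/ask-password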
This enables access to the socket before the service starts and then removes it again. Note that the + in the Exec lines, which runs those lines as root, requires at least systemd version 231. So, for example, CentOS 7 won’t support it. Maybe don’t do this.
Anyway, if you have this set up and run the usual sudo systemctl start postgresql, you should see
Enter PEM pass phrase:
Then you can enter the passphrase and the service should then start normally.
Thanks to a comment by Kaarel on my previous blog post, the situation around simply displaying the Postgres permission letters is not quite as dire as I showed. There is a function, aclexplode(), which expands the access control list (ACL) syntax used by Postgres into a table with full-text descriptions. This function exists in all supported versions of Postgres. However, it was only recently documented in this commit based on this email thread, and will appear in the Postgres 12 documentation.
Since aclexplode() exists (undocumented) in all supported versions of Postgres, it can be used to provide more verbose output of the pg_class.relacl permission letters. Here it is used with the test table created in the previous blog entry:
SELECT relacl
FROM pg_class
WHERE relname = 'test';
relacl
--------------------------------------------------------
{postgres=arwdDxt/postgres,bob=r/postgres,=r/postgres}
SELECT a.*
FROM pg_class, aclexplode(relacl) AS a
WHERE relname = 'test'
ORDER BY 1, 2;
grantor | grantee | privilege_type | is_grantable
---------+---------+----------------+--------------
10 | 0 | SELECT | f
10 | 10 | SELECT | f
10 | 10 | UPDATE | f
10 | 10 | DELETE | f
10 | 10 | INSERT | f
10 | 10 | REFERENCES | f
10 | 10 | TRIGGER | f
10 | 10 | TRUNCATE | f
10 | 16388 | SELECT | f
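The grantor and grantee columns contain role OIDs (grantee 0 stands for PUBLIC). As a small addition that is not part of the original output, they can be resolved to role names, for example with pg_get_userbyid():
SELECT pg_get_userbyid(a.grantor) AS grantor,
       CASE WHEN a.grantee = 0 THEN 'PUBLIC'
            ELSE pg_get_userbyid(a.grantee) END AS grantee,
       a.privilege_type,
       a.is_grantable
FROM pg_class, aclexplode(relacl) AS a
WHERE relname = 'test'
ORDER BY 1, 2;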
2019 February 21 Meeting (Note: Back to third Thursday this month!)
Location:
PSU Business Accelerator
2828 SW Corbett Ave · Portland, OR
Parking is open after 5pm.
Speaker: Paul Jungwirth
Temporal databases let you record history: either a history of the database (what the table used to say), a history of the thing itself (what it used to be), or both at once. The theory of temporal databases goes back to the 90s, but standardization has only just begun with some modest recommendations in SQL:2011, and database products (including Postgres) are still missing major functionality.
This talk will cover how temporal tables are structured, how they are queried and updated, what SQL:2011 offers (and doesn’t), what functionality Postgres has already, and what remains to be built.
Paul started programming on a Tandy 1000 at age 8 and hasn’t been able to stop since. He helped build one of the Mac’s first web servers in 1994 and has founded software companies in politics and technical hiring. He works as an independent consultant specializing in Rails, Postgres, and Chef.
My colleague Payal came across an outage that happened to Mailchimp's Mandrill app yesterday; the link can be found HERE. Since this was PostgreSQL related, I wanted to post about the technical aspect of it.
According to the company: “Mandrill uses a sharded Postgres setup as one of our main datastores,” the email explains. “On Sunday, February 3, at 10:30pm EST, 1 of our 5 physical Postgres instances saw a significant spike in writes.” The email continues: “The spike in writes triggered a Transaction ID Wraparound issue. When this occurs, database activity is completely halted. The database sets itself in read-only mode until offline maintenance (known as vacuuming) can occur.”
So, let's see what that "transaction ID wraparound issue" is and how someone could prevent similar outages from ever happening. PostgreSQL uses MVCC to control transaction visibility, basically by comparing transaction IDs (XIDs): a row with an insert XID greater than the current transaction's XID should not be visible to the current transaction. But since transaction IDs are not unlimited, a cluster will eventually run out of them after about 2^32 (4+ billion) transactions, causing transaction ID wraparound: the transaction counter wraps around to zero, and all past transactions would suddenly appear to be in the future.
This is taken care of by vacuum, which marks rows as frozen, indicating that they were inserted by a transaction that committed far enough in the past to be visible to all current and future transactions. To control this behavior, Postgres has a configuration parameter called autovacuum_freeze_max_age, which defaults to 200,000,000 transactions, a very conservative default that must be tuned in larger production systems.
It sounds complicated, but it is relatively easy not to get to that point; for most people, just having autovacuum on will prevent this situation from ever happening. You can also schedule manual vacuums by getting a list of the tables "closest" to autovacuum_freeze_max_age with a simple query like this:
SELECT 'vacuum analyze ' || c.oid::regclass || ' /* manual_vacuum */ ;'
FROM pg_class c
     LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind = 'r'
ORDER BY greatest(age(c.relfrozenxid), age(t.relfrozenxid)) DESC
LIMIT 10;
Even if you want to avoid manual vacuums, you can simply create a report or a monitoring metric based on the age of relfrozenxid in pg_class, combined with pg_settings, e.g.:
SELECT oid::regclass::text AS table,
       pg_size_pretty(pg_total_relation_size(oid)) AS table_size,
       age(relfrozenxid) AS xid_age,
       (SELECT setting::int FROM pg_settings WHERE name = 'autovacuum_freeze_max_age')
           - age(relfrozenxid) AS tx_for_wraparound_vacuum
FROM pg_class
WHERE relfrozenxid != 0
  AND oid > 16384
ORDER BY tx_for_wraparound_vacuum;
But let's assume you got to the point where you started seeing "autovacuum: VACUUM tablename (to prevent wraparound)". This vacuum will most likely kick in when you don't want it to, and even if you kill it, it will keep respawning. If you have already set autovacuum_freeze_max_age to a more viable production setting (we usually set it at 1.5 billion), you can temporarily raise autovacuum_freeze_max_age even further (its upper limit is 2 billion) and immediately kick off a manual vacuum on the table. This vacuum will be a "normal" vacuum, hence live. If the table is so big that each manual vacuum takes days to complete, then you should have partitioned it... Proper schema design and proper tuning of autovacuum are really important, especially in write-heavy workloads.
And to get back to the outage: the email from the company insinuates that it was a problem with Postgres, but in reality it wasn't; it was clearly an ops oversight.
Thanks for reading
Vasilis Ventirozos
credativ LLC
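If you do end up having to run such a vacuum manually, the command itself is simple (the table name is hypothetical); the FREEZE option makes vacuum aggressively freeze old rows while the table stays fully available:
VACUUM (FREEZE, VERBOSE) my_big_table;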
One frequent complaint about connection poolers is the limited number of authentication methods they
support. While some of this is caused by the large amount of work required to support all 14 Postgres authentication methods, the bigger reason is that only a few
authentication methods allow for the clean passing of authentication credentials through an intermediate server.
Specifically, all the password-based authentication methods (scram-sha-256, md5, password) can easily pass credentials from the client through the pooler to the
database server. (This is not possible using SCRAM with channel binding.)
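As an illustration that is not from the original post, md5 pass-through with PgBouncer (one common external pooler) can be configured roughly like this; the file paths and the hash are placeholders:
; /etc/pgbouncer/pgbouncer.ini
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb
[pgbouncer]
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; /etc/pgbouncer/userlist.txt contains one line per user, e.g.
; "bob" "md5<md5 hash of password concatenated with username>"
Because the pooler stores the md5 hash, it can both verify the client and answer the database server's md5 challenge without ever seeing the password in clear text.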
Many of the other authentication methods, e.g. cert, are designed to prevent man-in-the-middle attacks and
therefore actively thwart passing through of credentials. For these, effectively, you have to set up two sets of credentials for each user — one for client to pooler,
and another from pooler to database server, and keep them synchronized.
A pooler built into Postgres would have fewer authentication pass-through problems, though internal poolers have some downsides too, as I already stated.